Data preprocessing and visualization are important skills for managing Internet services such as search engines and online social networks. In this post, we are going to work with two weeks of search logs from a large search engine and learn some practical techniques for analyzing the data.
Background and data
Once you submit a query to a search engine, the search engine logs some attributes of the query, such as when it was submitted (the timestamp) and the search response time (SRT).
In this post, we use two weeks of search logs from a top global search engine. Each day's logs are written into one log file, so we have 14 (7 days × 2 weeks) log files in total. Note that, because more than one billion queries are submitted every day, the logs do not record all of them but only a small random sample. The log files are in CSV (comma-separated values) format, and each column represents one attribute. The first line of each file contains the attribute names, and the following lines contain the values for each query. A sample of the raw log file is as follows:
To make this clearer, we show the above data in the table below.
The attributes in the above list are described as follows:
- Timestamp: the Unix timestamp at which the query was submitted. For example, “1411315200” represents “2014/9/22 0:0:0” in China Standard Time (UTC+8).
- #Images: the number of images embedded in the result page.
- UA: user agent, the type of browser from which the query was submitted.
- Ad: whether the result page contains ads, “AD” for yes and “noAD” for no.
- ISP: the ISP (Internet Service Provider) that the user or the query comes from.
- Province: the location of the user (32 provinces for this attribute, since the logs only contain queries from mainland China).
- PageType: whether the page is loaded synchronously or asynchronously.
- Tnet: the 1st component of SRT (ms), the page transmission time over the network.
- Tserver: the 2nd component of SRT (ms), the server-side processing time of the query.
- Tbrowser: the 3rd component of SRT (ms), the DOM parsing time of the browser.
- Tother: the last component of SRT (ms), the remaining time for acquiring other embedded elements in the page, such as images.
- SRT: search response time (ms), which is the sum of the above four SRT components.
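As a rough illustration of the attribute list above, the sketch below parses a tiny CSV sample with the standard library, converts the timestamp, and checks that SRT equals the sum of its four components. The exact header names, the ISP and PageType values, and every number in the sample are assumptions made up for this example; the real log files may differ.

```python
import csv
import io
from datetime import datetime, timedelta, timezone

# Hypothetical two-row sample: header names follow the attribute list above,
# and every value is made up for illustration.
SAMPLE = """Timestamp,#Images,UA,Ad,ISP,Province,PageType,Tnet,Tserver,Tbrowser,Tother,SRT
1411315200,3,Chrome,AD,ISP_A,Guangdong,sync,120,80,40,260,500
1411315260,0,MSIE+,noAD,ISP_B,Zhejiang,async,200,90,30,400,720
"""

CST = timezone(timedelta(hours=8))  # China Standard Time, UTC+8
COMPONENTS = ("Tnet", "Tserver", "Tbrowser", "Tother")


def parse_row(row):
    """Convert one CSV row (a dict of strings) into typed values."""
    parsed = dict(row)
    parsed["Timestamp"] = datetime.fromtimestamp(int(row["Timestamp"]), tz=CST)
    for name in COMPONENTS + ("SRT", "#Images"):
        parsed[name] = int(row[name])
    return parsed


rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]

for r in rows:
    # SRT is defined as the sum of its four components.
    assert r["SRT"] == sum(r[c] for c in COMPONENTS)

print(rows[0]["Timestamp"])  # 2014-09-22 00:00:00+08:00
```

Running the same sum check over real rows is a cheap sanity test before charting anything.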
The figure below shows a simplified timeline of a search, where you can find the details of the four SRT components and their relationship with SRT.
Next, we analyze this data by visualizing it with different kinds of charts. Depending on the kind of data we want to visualize, some charts express it more completely and more effectively than others. Choosing the most appropriate chart therefore plays a fundamental role in displaying the data effectively and in letting a user or a company operator spot interesting patterns in the most natural way.
Important: The Python code used to collect the data, preprocess it, and generate the charts, together with the data itself, can be found in the following GitHub repository: https://github.com/davide97l/Visualization-and-analysis-of-web-engine-data
Line chart: average SRT per 10-minute interval
For example, a line chart is effective when we want to visualize the values from a data source and their variation over time; in other words, it shows the trend of the data. Several variables can also be displayed on the same chart in order to compare their values and trends.
In this chart we can clearly see that the data follows a regular pattern, which allows a human operator or, even better, a machine learning model to predict the future trend of the data. Besides predicting the future, it is also possible to inspect past data in order to detect anomalies and try to fix them, or at least to find their possible causes. For this task too, it is more convenient to use anomaly detection methods based on automated log analysis than to engage a team of human experts and ask them to analyze several million log lines.
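The 10-minute averages behind such a line chart could be computed roughly as follows. This is a minimal standard-library sketch, not the code from the repository: the function names and the sample records are hypothetical, and matplotlib is assumed to be installed only if `plot_line` is actually called.

```python
from collections import defaultdict

BUCKET_SECONDS = 600  # 10 minutes


def average_per_bucket(records, bucket_seconds=BUCKET_SECONDS):
    """Return {bucket_start: mean SRT} for (timestamp, srt_ms) pairs."""
    totals = defaultdict(lambda: [0, 0])  # bucket_start -> [sum, count]
    for timestamp, srt in records:
        bucket_start = timestamp - timestamp % bucket_seconds
        totals[bucket_start][0] += srt
        totals[bucket_start][1] += 1
    return {start: s / n for start, (s, n) in sorted(totals.items())}


def plot_line(averages):
    """Draw the line chart; matplotlib is imported only when plotting."""
    import matplotlib.pyplot as plt

    plt.plot(list(averages), list(averages.values()))
    plt.xlabel("Bucket start (Unix time)")
    plt.ylabel("Average SRT (ms)")
    plt.show()


# Made-up records: (Unix timestamp, SRT in ms).
records = [(1411315200, 500), (1411315500, 700), (1411315800, 900)]
averages = average_per_bucket(records)
print(averages)  # {1411315200: 600.0, 1411315800: 900.0}
```

Flooring each timestamp to a multiple of 600 seconds keeps the grouping independent of time zones.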
Stacked area chart: average of each SRT component per 10-minute interval
Another kind of chart that is useful for comparing the components of the data is the stacked area chart. It displays in the same figure the trend of both the total and the individual components of a numeric variable, so each component can be compared with the others and, at the same time, with the total. Each component can be shown either as an absolute value or as a relative value; in the second case we get a 100% stacked area chart.
In the above chart we can observe that the four SRT components are all correlated with each other and with the total SRT, with Tbrowser being the shortest and Tother the longest.
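A stacked area chart needs one averaged series per component, and the 100% variant rescales each interval so the components sum to 100. The sketch below shows one possible way to prepare both. The row layout (dicts keyed by attribute name, with an integer Unix timestamp) and all sample values are assumptions, and `plot_stacked` assumes matplotlib is available.

```python
from collections import defaultdict

COMPONENTS = ("Tnet", "Tserver", "Tbrowser", "Tother")


def component_means(rows, bucket_seconds=600):
    """Return sorted bucket starts and, per component, its mean in each bucket."""
    sums = defaultdict(lambda: dict.fromkeys(COMPONENTS, 0))
    counts = defaultdict(int)
    for row in rows:
        start = row["Timestamp"] - row["Timestamp"] % bucket_seconds
        counts[start] += 1
        for name in COMPONENTS:
            sums[start][name] += row[name]
    starts = sorted(sums)
    series = {name: [sums[s][name] / counts[s] for s in starts]
              for name in COMPONENTS}
    return starts, series


def to_percent(series):
    """Rescale each bucket so the components sum to 100 (for a 100% chart)."""
    totals = [sum(values) for values in zip(*series.values())]
    return {name: [100 * v / t for v, t in zip(values, totals)]
            for name, values in series.items()}


def plot_stacked(starts, series):
    """Draw the stacked area chart; matplotlib is imported only when plotting."""
    import matplotlib.pyplot as plt

    plt.stackplot(starts, *series.values(), labels=list(series))
    plt.legend(loc="upper left")
    plt.ylabel("Average time (ms)")
    plt.show()


# Made-up rows falling into the same 10-minute bucket.
rows = [
    {"Timestamp": 1411315200, "Tnet": 100, "Tserver": 50, "Tbrowser": 40, "Tother": 210},
    {"Timestamp": 1411315300, "Tnet": 200, "Tserver": 150, "Tbrowser": 60, "Tother": 190},
]
starts, series = component_means(rows)
percent = to_percent(series)
print(percent)
```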
CDF chart: SRT distribution
Conversely, a CDF (cumulative distribution function) chart displays the cumulative distribution of the data and is useful for understanding how the values of a variable are distributed. It shows the probability that the variable takes a value less than or equal to a given value x, and can therefore also be used to compare the distributions of different variables.
For example, from the CDF chart of the SRT it is clear that about 70% of the time the SRT is less than 1 second, but in some rare cases it can be much slower, requiring about 5 seconds. Reducing the SRT is very important for a web company's survival: online users generally don't like waiting, so they are more inclined to return to a website that responded quickly than to one that responded slowly. Since these kinds of online companies rely heavily on advertising, attracting fewer users will have a serious negative impact on their income.
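The empirical CDF described above can be computed by sorting the values and counting how many are less than or equal to x. The following is a small sketch under that definition. The SRT values are made up and deliberately chosen so that 70% fall below 1 second, mirroring the reading of the chart; they are not taken from the logs.

```python
from bisect import bisect_right


def empirical_cdf(values):
    """Return a function F where F(x) is the fraction of values <= x."""
    ordered = sorted(values)
    n = len(ordered)

    def cdf(x):
        return bisect_right(ordered, x) / n

    return cdf


def cdf_points(values):
    """Return (xs, ys) suitable for a step plot of the empirical CDF."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered, [(i + 1) / n for i in range(n)]


# Made-up SRT values in milliseconds.
srt_ms = [300, 450, 600, 800, 950, 990, 999, 1500, 2400, 5000]
F = empirical_cdf(srt_ms)
print(F(1000))  # 0.7
```

Plotting the two lists returned by `cdf_points` as a step curve gives the CDF chart itself.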
Line chart: number of queries per minute
By comparing this line chart with the first one, we can easily observe that both charts show regular patterns and, more interestingly, that their trends are very similar over time. In particular, a careful look shows that their peaks and troughs occur at the same times, corresponding to the alternation of day and night. Even during the daytime the trend is not steady but has peaks and troughs corresponding to different hours of the day. This suggests that the higher the total number of PVs (page views, here the number of queries), the slower the average SRT. The same pattern can also be seen in the stacked area chart, since the SRT components and the total SRT are highly correlated.
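The visual comparison above can be backed by numbers: count the logged queries per minute, then measure how strongly query volume and average SRT move together. The sketch below uses a hand-written Pearson correlation as one possible measure. All sample values are made up, and because the logs are a random sample, the counts are sampled counts rather than true traffic.

```python
from collections import Counter
from math import sqrt


def queries_per_minute(timestamps):
    """Count logged queries in each 60-second window."""
    return Counter(ts - ts % 60 for ts in timestamps)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


timestamps = [1411315200, 1411315230, 1411315259, 1411315260]  # made up
counts = queries_per_minute(timestamps)
print(counts[1411315200], counts[1411315260])  # 3 1

# Made-up per-interval series: query volume and average SRT (ms).
volume = [10, 20, 30, 40]
avg_srt = [500, 600, 700, 800]
print(round(pearson(volume, avg_srt), 3))  # 1.0
```

A coefficient close to 1 on the real series would support the observation that SRT rises with load, although correlation alone does not show that one causes the other.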
Histogram: number of queries per province
If we want to compare the labels that one variable of the data can assume and their frequency distribution, one of the most appropriate charts is the histogram, where each bar displays the number of times a certain label occurs in the variable. The taller a bar is, the more often its label occurs.
The histogram of the number of queries per province shows that Guangdong is the province with the most PVs. This is not surprising, since Guangdong is one of the most developed and populous provinces of China. Zhejiang, Jiangsu and Shandong follow in the list, being populous and wealthy regions as well. Conversely, less populous and less wealthy provinces such as Xizang and Gansu show very low numbers of PVs, so there appears to be a correlation between the PVs of a province and its population and wealth.
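The bar heights in such a chart are simply label counts. A minimal sketch, assuming the Province column has already been read into a list (the values here are made up) and that matplotlib is installed if `plot_bars` is called:

```python
from collections import Counter


def plot_bars(ranked):
    """Draw one bar per label; matplotlib is imported only when plotting."""
    import matplotlib.pyplot as plt

    labels = [label for label, _ in ranked]
    values = [count for _, count in ranked]
    plt.bar(labels, values)
    plt.xticks(rotation=90)
    plt.ylabel("Number of queries")
    plt.show()


# Made-up Province column.
provinces = ["Guangdong", "Zhejiang", "Guangdong", "Jiangsu", "Guangdong", "Zhejiang"]
ranked = Counter(provinces).most_common()  # sorted by count, largest first
print(ranked)  # [('Guangdong', 3), ('Zhejiang', 2), ('Jiangsu', 1)]
```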
Pie chart: number of queries per UA
Lastly, the pie chart, like the histogram, measures the occurrences of each label present in a variable of the data. The difference is that it shows the proportion of each component rather than its absolute value, and displays it in a way that emphasizes these proportions. Pie charts are very often used in the business world and the mass media because of their simplicity and their effectiveness in comparing data; on the other hand, they have the drawback that it is difficult to compare data across different pie charts.
From the pie chart of the number of queries per UA (user agent, the type of browser from which the query was submitted) we can see that MSIE+ is by far the most used browser. This might also be because Google's services are blocked in China, so Chinese users have less incentive to use Chrome, the most popular browser in the world, and are more inclined to use other browsers to submit their queries. Still, Chrome maintains a solid second position: many users, such as influencers and researchers, still look for alternative ways of reaching Google because of important worldwide services it offers, such as YouTube and Google Scholar.
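The slices of a pie chart are the label counts divided by the total. The sketch below computes those shares for the UA column. The sample list and the "Other" label are hypothetical, and `plot_pie` assumes matplotlib is installed.

```python
from collections import Counter


def shares(labels):
    """Return {label: fraction of all items}, largest share first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.most_common()}


def plot_pie(fractions):
    """Draw the pie chart; matplotlib is imported only when plotting."""
    import matplotlib.pyplot as plt

    plt.pie(list(fractions.values()), labels=list(fractions), autopct="%1.1f%%")
    plt.show()


# Made-up UA column.
user_agents = ["MSIE+"] * 5 + ["Chrome"] * 3 + ["Other"] * 2
print(shares(user_agents))  # {'MSIE+': 0.5, 'Chrome': 0.3, 'Other': 0.2}
```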