Professional Documents
Culture Documents
ABSTRACT
The frequency of a web search keyword generally reflects the 100 7000
degree of public interest in a particular subject matter. Search logs Search Engine (left side) 6000
are therefore useful resources for trend analysis. However, access 80
Wikipedia (right side) 5000
to search logs is typically restricted to search engine providers. In
60 4000
this paper, we investigate whether search frequency can be esti-
mated from a different resource such as Wikipedia page views of 40 3000
open data. We found frequently searched keywords to have re- 2000
markably high correlations with Wikipedia page views. This sug- 20
1000
gests that Wikipedia page views can be an effective tool for de-
termining popular global web search trends. 0 0
2014/10/1 2014/11/1 2014/12/1
Categories and Subject Descriptors Figure 1: Daily trend of the keyword “Anne Hathaway” in a
H.3.3 [Information Storage and Retrieval]: Information Search search engine (i.e., Google Trends) and Wikipedia.
and Retrieval; H.3.5 [Information Storage and Retrieval]:
Online Information Services – Web-based services
We investigate whether search frequency can be estimated from a
Keywords different resource of open data. We focus on a tendency of Wik-
Search Trend; Page View; Search Engine; Wikipedia; Open Data ipedia web pages appearing in the high rank by many search result
sets. Previous studies have reported that users tend to visit more
1. INTRODUCTION highly ranked search result pages (e.g., [3]). In addition, a recent
When have the Japanese people shown most interest in the Amer- paper has shown the effectiveness of Wikipedia page views for
ican actress and producer Anne Hathaway recently? To answer understanding trends. Specifically, it investigates correlations
this question, we must know the frequency of “Anne Hathaway” between BitCoin price and Google Trends or Wikipedia [4].
being input as a keyword in a web search engine. This frequency Based on these studies, we hypothesize that the frequency of ac-
peaked on December 12, 2014, following the television broadcast cessing Wikipedia (i.e., page views) will reflect search frequency.
of a movie in which Anne Hathaway played a major role. This Figure 1 depicts an example, in which we see remarkably high
fact indicates that people use web search engines frequently to correlations between the search frequencies of Google Trends and
check the details of topics of their interest. Thus, the frequency of Wikipedia page views. We attempt to test this hypothesis by de-
a web search keyword generally reflects the prevailing degree of termining correlations between search frequencies and Wikipedia
public interest in a specific topic. Studies utilize the frequency of page views.
web search keywords to determine trends in public interest. Choi
and Varian, for example, demonstrate the use of search engine 2. METHODOLOGY AND DATA
data to forecast the near-term values of economic indicators [1].
Radinsky et al. present a method for using patterns of web search
2.1 Methodology
keywords to predict future events [2]. However, access to search We calculate the similarities between search frequencies and Wik-
logs is typically restricted to search engine providers. Although ipedia page views over daily and monthly time spans, because the
some search engines provide search logs via online services such appropriate span for estimating search frequencies varies accord-
as Google Trends1, the availability of data from these search en- ing to their applications.
gines is fairly limited. For example, we cannot obtain a set of all We use the Pearson product-moment correlation coefficient as our
trend keywords for a specific date. To do so, we would have to measurement tool. In addition, we apply an up/down concordance
burden Google servers by querying all possible keywords that rate to explicitly consider the increase/decrease in frequencies. In
may have been popular on that day. As such, there is a need for a contrast to the Pearson product-moment correlation coefficient,
source of open data that can simulate search logs. the concordance rate is sensitive to frequency trends. It inputs two
sets of time-series data 𝐗 = {𝑥1 , 𝑥2 , … , 𝑥𝑛 } and
1
𝐘 = {𝑦1 , 𝑦2 , … , 𝑦𝑛 } and computes the following value.
https://www.google.com/trends/
|{𝑡 ∈ {2,3, … , 𝑛}|𝑥′𝑡 = 𝑦′𝑡 }|
𝑈𝐷𝐶𝑅(𝐗, 𝐘) = ,
𝑛−1
where 𝑥′𝑡 is
1 (𝑥𝑡 − 𝑥𝑡−1 > 0)
𝑥 ′𝑡 = { 0 (𝑥𝑡 − 𝑥𝑡−1 = 0) ,
−1 (𝑥𝑡 − 𝑥𝑡−1 < 0)
WebSci '15, June 28 - July 01, 2015, Oxford, United Kingdom
Copyright is held by the owner/author(s). and 𝑦′𝑡 is calculated in the same manner.
DOI: http://dx.doi.org/10.1145/2786451.2786495
2.2 Data 1,000
Daily Monthly
Since access to search logs is restricted, we must rely on Google 800
Trends to obtain search frequencies. These trends are expressed as 600
the percentage integers of a maximum value for a particular 400
search interval, i.e., the absolute frequency values cannot be 200
known. The trends of low frequency keywords consist mostly of 0
zeros or are not even provided by Google Trends. In our experi-
ment, we collect daily and monthly search frequencies by input-
ting keywords into Google Trends.
Figure 2: Distribution of the number of collected keywords in
The Wikimedia Foundation, Inc. distributes page view statistics Google Trends: The x-axis indicates the access rank in Wik-
for Wikimedia projects2. Each statistic comprises the page path ipedia, while the y-axis shows the number of keywords whose
(e.g., “wiki/Anne_Hathaway”) and the number of hourly page frequency was obtained by Google Trends.
views. To generate our experimental data, we first prepare a
keyword list from which we estimate trends. Second, we correlate Daily Monthly
0.8
each keyword (e.g., “Anne Hathaway”) to a page path if the
keyword exists as a Wikipedia article title. Third, we calculate the 0.6
daily and monthly page views from hourly data to obtain the daily 0.4
and monthly page views for each specific keyword. Finally, we
compare those daily and monthly page views with search logs. 0.2