Wikipedia page views a potential key to open source web trends data
Wed 9 Sep 2015
Japanese researchers have conducted a study arguing that Wikipedia’s publicly available page view data could provide better insight into web trends than the more limited statistics available from Google.
The paper [PDF], a collaboration between researchers at three Japanese universities, notes the success of University of Prague economist Ladislav Krištoufek in demonstrating a correlation between peak interest graphs from Google Trends and Wikipedia in searching for the term ‘Bitcoin’.
The paper finds that this parity is reproducible, with an average correlation coefficient of 0.72, and presents an initial example of Japanese interest in the term ‘Anne Hathaway’ when a TV movie featuring the actress was broadcast in Japan in December 2014. When the page view data that Wikipedia makes publicly available for the actress’s page is overlaid with the corresponding Google Trends data, the two series are notably in sync with each other.
The paper reports that the average correlation coefficient for keywords ranked from 1 to 1,000 in Wikipedia page views is 0.72, whilst for keywords ranked from 1,001 to 2,000 it is 0.74.
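As a rough illustration of what such a coefficient measures, the sketch below computes a Pearson correlation between two equal-length series. It assumes the paper's "correlation coefficient" is the Pearson coefficient, which the article does not state; the function name `pearson` and the sample numbers are invented for the example and are not taken from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series.

    Assumption: the article does not say which correlation measure the
    paper uses; Pearson is chosen here purely for illustration.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Invented daily figures for one keyword, for illustration only.
page_views = [120, 135, 980, 410, 150]
search_index = [10, 12, 100, 38, 14]
coefficient = pearson(page_views, search_index)
```

A value close to 1 means the two series rise and fall together; averaging the per-keyword coefficients over a ranked set of keywords would give a summary figure of the kind quoted above.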
To test their contention, the researchers first mapped each keyword to a page path such as wiki/Anne_Hathaway, matching the keyword (in this example ‘Anne Hathaway’) to the Wikipedia page that has it as its title. They then calculated daily and monthly page views for each keyword from Wikipedia’s public server logs and compared these with the corresponding search frequencies. For the paper they obtained the daily and monthly search frequencies of 2,039 keywords between October and December of 2014, and the monthly frequencies of nearly 10,000 keywords or key-phrases between 2008 and 2014.
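The mapping and aggregation steps described above can be sketched in a few lines. This is a minimal illustration and not the researchers' code: the helper names `keyword_to_path` and `monthly_totals` are hypothetical, the title rule (spaces become underscores) is a simplification of how Wikipedia forms page paths, and the daily counts are invented.

```python
from collections import defaultdict
from datetime import date


def keyword_to_path(keyword):
    """Map a keyword to a page path such as 'wiki/Anne_Hathaway'.

    Simplifying assumption: the keyword is already the exact page title,
    so only spaces need to be replaced with underscores.
    """
    return "wiki/" + keyword.strip().replace(" ", "_")


def monthly_totals(daily_views):
    """Sum daily page views into (year, month) buckets.

    daily_views maps a datetime.date to that day's view count.
    """
    totals = defaultdict(int)
    for day, views in daily_views.items():
        totals[(day.year, day.month)] += views
    return dict(totals)


# Invented counts for illustration only.
path = keyword_to_path("Anne Hathaway")
views = {
    date(2014, 11, 30): 300,
    date(2014, 12, 1): 500,
    date(2014, 12, 2): 700,
}
by_month = monthly_totals(views)
```

Keying the monthly buckets on `(year, month)` keeps the same calendar month in different years apart, which matters for a series spanning 2008 to 2014.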
The value of establishing the correlation lies in showing that there may be a publicly available source of web trend data that remains visible for searches below a certain volume threshold. Google Trends gives a broad overview of the biggest waves on the web, but data on smaller surges of interest is absent. Since those hidden ‘low-volume’ swells can be the earliest indicators of larger interest gathering, this more granular low-level information promises more useful predictive mechanisms for those who do not have direct access to Google’s datasets on web searches. The paper notes:
‘Although some search engines provide search logs via online services such as Google Trends, the availability of data from these search engines is fairly limited. For example, we cannot obtain a set of all trend keywords for a specific date. To do so, we would have to burden Google servers by querying all possible keywords that may have been popular on that day. As such, there is a need for a source of open data that can simulate search logs’