Yahoo releases largest-ever machine learning dataset to researchers
Thu 14 Jan 2016
Yahoo Labs today announced that it will make available to researchers a dataset that’s record-breaking in size. The Yahoo News Feed dataset contains 110 billion events drawn from samples of anonymised user interactions on the portal’s news feed, and weighs in at a whopping 1.5 terabytes zipped.
The data release, part of the company’s Webscope initiative and announced on Yahoo’s Tumblr blog, is intended for researchers to use in validating recommender systems, high-scale learning algorithms, user-behaviour modelling, collaborative filtering techniques and unsupervised learning methods.
The dataset includes local timestamps and limited information on the device used whilst accessing the news feeds, which should help data mining research. These facets will also be useful in research on contextual recommender systems, currently one of the hottest fields in commercial AI research.
The dataset features user interactions with various Yahoo news feeds, including the home page, news, sports, finance, movies and real estate segments of the company’s news output.
Though the dataset is anonymised, Yahoo is additionally providing demographic information related to the interactions, including age, segment, sector and originating city for a subset of the sampled users, which should also give the News dataset wide appeal among social researchers.
Large-scale machine learning datasets do not come into view every day, and when researchers are limited to ‘old favourites’, the research field tends to become calibrated to the available input data, which makes it harder to produce results that are demonstrably objective.
As a frequent peruser of computing research papers, I will be interested to see the News dataset turn up in new research initiatives. Since Yahoo owns the image-based blogging network Tumblr, it would also be interesting to see the company negotiate a new image database to contribute to AI research around image recognition, facial recognition and deblocking, sectors in which complaints about the limited datasets currently available are frequent.