Twitter age inference set to help research
Mon 27 Feb 2017
Researchers have developed a method to estimate the age of 700 million Twitter users, in the hope of improving the large body of scientific research that relies on the popular social network’s open data sources and APIs.
The paper, Probabilistic Inference of Twitter Users’ Age based on What They Follow, starts from a base group of approximately 130,000 Twitter users who have openly revealed their age, and extrapolates age estimates for other users from the accounts they follow.
Demographic information about Twitter’s user base is important for social and other scientific research projects, which rely on the unusual amount of public data that Twitter provides. Such projects are hampered, however, by the network’s increasingly uncommon disposition to protect its users’ personal data, including age, gender, geographical location and other key factors that are useful in mining social datasets.
There are a number of challenges in extrapolating wider demographics from the limited range of people who openly declare their age. Firstly, a user’s age is not stored as a structured field (or even requested as private information on signup), and so the team had to use grep-style text searches to obtain this information.
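To illustrate what a grep-style search for declared ages might look like, here is a minimal sketch in Python. The article does not list the expressions the team used, so the patterns, the `extract_age` helper and the 13 to 99 plausibility bounds are all assumptions made for illustration, as is the idea that ages are stated in free text such as a profile bio.

```python
import re

# Hypothetical patterns for self-declared ages; not taken from the paper.
AGE_PATTERNS = [
    re.compile(r"\b(\d{1,2})\s*(?:years?|yrs?)[\s-]*old\b", re.IGNORECASE),  # "19 years old"
    re.compile(r"\b(\d{1,2})\s*y/?o\b", re.IGNORECASE),                      # "42yo", "42 y/o"
    re.compile(r"\baged?:?\s*(\d{1,2})\b", re.IGNORECASE),                   # "aged 25", "age: 25"
]


def extract_age(text, min_age=13, max_age=99):
    """Return the first plausible declared age found in text, or None.

    The bounds are an assumption, used to discard matches such as
    "3 years old" that more likely describe something other than the user.
    """
    for pattern in AGE_PATTERNS:
        match = pattern.search(text)
        if match:
            age = int(match.group(1))
            if min_age <= age <= max_age:
                return age
    return None


if __name__ == "__main__":
    for bio in ["Student, 19 years old, London", "Dad of two. 42yo runner", "Love 90s music"]:
        print(repr(bio), "->", extract_age(bio))
```

Patterns like these trade recall for precision: they miss phrasings such as "born in 1990" and can still be fooled by text that describes someone other than the account holder.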
Secondly, younger subscribers are statistically more likely to disclose their age, so results must be adjusted to counteract that skew.
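The article does not say how the team corrected for this skew. One common way to offset a disclosure bias is inverse-probability weighting, sketched below in Python under the assumption that a disclosure rate per age band is known or can be estimated. The function name, the band labels and the rates are all hypothetical.

```python
from collections import Counter


def reweighted_age_distribution(observed_bands, disclosure_rate):
    """Estimate the underlying share of each age band.

    observed_bands: age-band labels for users who disclosed their age.
    disclosure_rate: assumed probability that a user in each band discloses.
    Each observed user is weighted by 1 / disclosure_rate, so bands that
    disclose less often count for more.
    """
    counts = Counter(observed_bands)
    weighted = {band: n / disclosure_rate[band] for band, n in counts.items()}
    total = sum(weighted.values())
    return {band: w / total for band, w in weighted.items()}


if __name__ == "__main__":
    # Hypothetical sample: 60% of disclosers are under 20, but that band
    # is assumed to disclose three times as often as older users.
    observed = ["under_20"] * 60 + ["20_plus"] * 40
    rates = {"under_20": 0.03, "20_plus": 0.01}
    print(reweighted_age_distribution(observed, rates))
```

With these made-up rates, the under-20 band falls from 60% of the disclosing sample to roughly a third of the estimated population, which is the kind of correction the skew calls for.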
The Imperial College London team used a variety of methods to calculate age, including the way that names have gone in and out of fashion over the years. However, some of the more obvious indicators, such as talk of new births, house purchases and other events associated with a certain phase of life, proved insufficiently reliable, since younger users often write ‘aspirational’ tweets, adapting their language to that of an older user group in order to fit into the desired online social sphere.
Ultimately, the best indicator the research group found was the set of other Twitter accounts that their target subjects followed, using the established 130,000 ‘age-identified’ users as a point of departure.
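The article does not describe the paper's actual model. As a stand-in, the general idea of inferring age from followed accounts can be sketched with a simple naive Bayes classifier, trained on users whose age band is known. Everything here is an assumption for illustration: the `train` and `predict` functions, the age bands, the account handles and the smoothing constant.

```python
import math
from collections import defaultdict


def train(labelled_users, smoothing=1.0):
    """labelled_users: iterable of (age_band, set_of_followed_accounts)."""
    band_counts = defaultdict(int)
    follow_counts = defaultdict(lambda: defaultdict(int))
    known_accounts = set()
    for band, follows in labelled_users:
        band_counts[band] += 1
        for account in follows:
            follow_counts[band][account] += 1
            known_accounts.add(account)
    return band_counts, follow_counts, known_accounts, smoothing


def predict(model, follows):
    """Return a probability for each age band given the accounts followed.

    Simplification: only accounts the user does follow contribute evidence;
    accounts never seen in training are ignored.
    """
    band_counts, follow_counts, known_accounts, smoothing = model
    total_users = sum(band_counts.values())
    log_scores = {}
    for band, n in band_counts.items():
        score = math.log(n / total_users)  # prior from the labelled set
        for account in follows:
            if account in known_accounts:
                # Smoothed estimate of P(follows account | band).
                p = (follow_counts[band][account] + smoothing) / (n + 2 * smoothing)
                score += math.log(p)
        log_scores[band] = score
    # Normalise in log space for numerical stability.
    top = max(log_scores.values())
    unnormalised = {band: math.exp(s - top) for band, s in log_scores.items()}
    z = sum(unnormalised.values())
    return {band: v / z for band, v in unnormalised.items()}


if __name__ == "__main__":
    labelled = [
        ("under_20", {"@popstar", "@gamer"}),
        ("under_20", {"@popstar"}),
        ("40_plus", {"@newsdesk", "@gardening"}),
        ("40_plus", {"@newsdesk"}),
    ]
    model = train(labelled)
    print(predict(model, {"@popstar"}))
```

A probabilistic output suits this problem because many users follow too few distinctive accounts to support a confident guess; in that case the estimate simply stays close to the prior.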
The paper notes previous efforts to identify age based on lexical usage but, again, social, aspirational and imitative factors can skew the accuracy of this method.
Another obstacle for the research comes in the form of abandoned Twitter accounts over the ten-year life cycle of the network, which can contain indicators that are not current – a problem exacerbated by the very low median Tweet count of a Twitter user (4 tweets).
The team ultimately claims to have successfully applied its methodology to 700 million users. If the classification is adopted, future research based on Twitter’s open data could become far more revealing.