The Stack Archive

Science vs. clickbait

Tue 1 Nov 2016

The impetus for publishers to gather traffic in order to support their more sober output has recently come to interest researchers and developers, primarily because ‘clickbait’ articles are neither really news nor really ads, but another form of commercial underpinning.

The conceit behind sensational, uninformative and alluring headlines makes clickbait difficult to classify, and therefore difficult to keep out of news feeds: unlike network advertising, which can be identified from its originating domain, clickbait articles lack definite primary identifiers.

Facebook’s recent measures to filter out clickbait from users’ feeds have drawn concern from the marketers relying on viral and populist traffic, though the initiative was undermined in the same period by the initial failure of the social network’s AI-driven news trending feature.

In 2015 a Norwegian developer designed a clickbait generator as a broad complaint against the practice, and in 2014 snipe.net released a dedicated Chrome extension called Downworthy (as a joke against notoriously successful clickbait empire Upworthy) which aims to identify sensationalist, low-content headlines and obfuscate them.

However, like the apparently now-abandoned Clickbait Remover For Facebook Chrome extension, Downworthy relies on a fixed algorithm of common clickbait phrases and usage, unlikely to adapt and evolve as marketers do.

A group of researchers from the Indian Institute of Technology at Kharagpur have now applied machine learning to the problem of classifying clickbait, developing syntactic and crowd-sourced procedures to create a new anti-clickbait Chrome extension called Stop Clickbait.


The team set about identifying the characteristics of a clickbait headline in order to attempt more accurate, real-time classification. Some of the features are obvious, some not. The paper finds that clickbait headlines are, predictably, longer than headlines for ‘real’ news stories: 10 words versus 7. Clickbait headlines are also usually long, natural English sentences which use contracted forms and colloquial language, and which are low in ‘content words’:

‘Traditional news headlines typically contain mostly content words referring to specific persons and locations, while the function words are left out for readers to interpret from context. As an example, consider the news headline: “Visa deal or no migrant deal, Turkey warns EU”. Here most of the words are content words summarizing the main takeaway from the story, and it has very few connecting function words in between the content words.’
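The two measurements described above, headline length and the share of function words, are simple to approximate. The following Python sketch is illustrative only: the function-word list is a small hand-picked assumption rather than the lexicon used in the paper, and the names `tokenize` and `headline_stats` are hypothetical.

```python
import re

# Small, hand-picked list of function words. This is an assumption made
# for illustration; it is not the list used by the researchers.
FUNCTION_WORDS = {
    "a", "an", "the", "this", "that", "these", "those",
    "i", "we", "you", "he", "she", "it", "they",
    "my", "our", "your", "his", "her", "its", "their",
    "is", "are", "was", "were", "be", "will", "would", "can",
    "do", "does", "did", "have", "has",
    "of", "in", "on", "at", "to", "for", "with", "from", "by", "about",
    "and", "or", "but", "if", "so", "not", "no",
    "what", "who", "when", "why", "how", "which",
}


def tokenize(headline):
    """Lower-case the headline and split it into words, keeping contractions whole."""
    text = headline.lower().replace("\u2019", "'")  # normalise curly apostrophes
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text)


def headline_stats(headline):
    """Return the word count and the fraction of words that are function words."""
    words = tokenize(headline)
    function_count = sum(1 for word in words if word in FUNCTION_WORDS)
    ratio = function_count / len(words) if words else 0.0
    return {"word_count": len(words), "function_word_ratio": ratio}


if __name__ == "__main__":
    # The first headline is the example quoted from the paper; the second is invented.
    print(headline_stats("Visa deal or no migrant deal, Turkey warns EU"))
    print(headline_stats("You won't believe what this dog did next"))
```

With this word list, the Turkey headline contains two function words out of nine, while half of the words in the invented clickbait headline are function words.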

Additionally, words in clickbait headlines are shorter and tend towards hyperbole, with ‘positive’ sentiments, something which the researchers note is ‘almost non-existent’ in traditional news headlines.

Informal punctuation patterns (!?…), slang acronyms (LMAO, LOL), the possessive case (‘I’, ‘We’) and use of ‘curiosity gap’ language (‘Will blow your mind’, ‘You won’t believe’) characterise clickbait headlines, according to the work.
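Surface cues of this kind can be checked with a few lines of Python. The sketch below is an assumption-laden illustration, not the researchers' implementation: the slang and phrase lists contain only the examples quoted above, and `surface_cues` is a hypothetical name.

```python
import re

# Only the examples mentioned in the article; a real system would need far more.
SLANG = {"lol", "lmao"}
CURIOSITY_PHRASES = ("will blow your mind", "you won't believe")

# Runs of two or more '!' or '?', three or more dots, or an ellipsis character.
INFORMAL_PUNCTUATION = re.compile(r"[!?]{2,}|\.{3,}|\u2026")


def surface_cues(headline):
    """Flag which of the three surface cues appear in a headline."""
    text = headline.lower().replace("\u2019", "'")  # normalise curly apostrophes
    words = set(re.findall(r"[a-z']+", text))
    return {
        "informal_punctuation": bool(INFORMAL_PUNCTUATION.search(headline)),
        "slang": bool(words & SLANG),
        "curiosity_gap": any(phrase in text for phrase in CURIOSITY_PHRASES),
    }


if __name__ == "__main__":
    # Invented clickbait headline, followed by the news headline quoted in the paper.
    print(surface_cues("LOL! You won't believe what happened next..."))
    print(surface_cues("Visa deal or no migrant deal, Turkey warns EU"))
```

A fixed list like this is exactly what limits extensions such as Downworthy; the flags are more useful as inputs to a trained classifier than as a verdict in themselves.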

More notable and perhaps less obvious is how small the clickbait vocabulary actually is. The Indian team notes: ‘Nearly 62% of the clickbait headlines contained one of the 40 most common clickbait subject words. On the other hand, only 16% of the non-clickbait headlines contained the top 40 non-clickbait subject words.’
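The coverage figure quoted above can be approximated as follows. This Python sketch assumes that the subject word of each headline has already been extracted by a parser, which is outside its scope, and it counts a headline as covered when its own subject word is among the most common ones. The function name `top_subject_coverage` and the sample data are invented for illustration.

```python
from collections import Counter


def top_subject_coverage(subjects, n=40):
    """Fraction of headlines whose subject word is among the n most common subjects.

    `subjects` holds one pre-extracted subject word per headline.
    """
    if not subjects:
        return 0.0
    counts = Counter(subjects)
    top_words = {word for word, _ in counts.most_common(n)}
    covered = sum(1 for subject in subjects if subject in top_words)
    return covered / len(subjects)


if __name__ == "__main__":
    # Invented toy data: a narrow vocabulary versus a broad one.
    clickbait_subjects = ["you", "you", "this", "dog", "you", "this"]
    news_subjects = ["turkey", "eu", "nasa", "senate"]
    print(top_subject_coverage(clickbait_subjects, n=2))
    print(top_subject_coverage(news_subjects, n=2))
```

The narrower the vocabulary, the higher the share of headlines a short list of words will cover, which is the contrast the researchers report between clickbait and conventional headlines.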

The problem online publishers face is that good headline writing is now antagonistic to good commercial practice. The old-school journalistic ethos teaches the writer and editor to boil down headlines so that as much of the story as possible is encapsulated in the headline. The dogma decrees: say it a) once in the headline, b) slightly longer in the opening paragraph and c) again, at length, in the rest of the article.

But this practice was formulated before informative headlines constituted ‘giving work away free’. It presupposes that the user is engaging with the print medium, where the publication has already been paid for (by someone, if not the user themselves), with all the ‘dirty tricks’ restricted to the front page (whose famous 72-point type is there, amongst other reasons, to limit the amount of free information).

Ironically, the inverse practice applies to magazine feature article titles, which are frequently cryptic and enticing, yet are attached to articles that can only be read once the user has purchased or otherwise obtained the edition (and here the ‘curiosity gap’ can be exploited by making the edition’s page numbering obscure, or even omitting it, to gain greater exposure for ad pages).

Twitter’s artificial SMS-based character limit has taken the practice of pithy headline-distillation into the mainstream over the last decade, but it also epitomises the conflict inherent in paying people to write genuinely informative headlines. If any new technology succeeds in identifying clickbait, it remains to be seen whether it will be able to evolve in tandem with the publishers’ need to draw the reader to where the ads are.

