
Unmasking Adversarial AI with Pin-Yu Chen

Fri 21 Jun 2019 | Pin-Yu Chen

Machine learning researchers are not solely concerned with improving the accuracy of models. They want to know how they can be corrupted and undermined – a research agenda that warrants more attention. Techerati spoke to an IBM researcher at the heart of it

It’s no exaggeration to say that artificial intelligence (AI) has been the most transformative technology this side of the millennium. Although it has its origins in the 1950s, techniques developed in the last few years in machine learning, and its offshoots neural networks and natural language processing, have produced remarkable results that would have been unthinkable to those early researchers.

Today, AI surrounds us. We can now interact with retailers and other organisations via convincing chatbots, instantly translate most languages into our native tongue, and make smart investment decisions with a few taps of an app. But as AI encroaches further into our lives, and we allow it to drive our cars, shape our recruitment decisions, or predict crime, the technology community and society at large are rightly insisting that autonomous systems are not just intelligent, but trustworthy.

Trustworthy AI

It is largely agreed that AI trustworthiness involves satisfying four conditions: fairness, transparency, explainability (and thus accountability) and robustness. To understand these conditions, just think of a political body, a comparison that gets to the nub of the debate surrounding autonomy, sovereignty and legitimacy.

We can think of a trustworthy AI as a legitimate government. We want a government to heed the concerns of all the citizens over whom it presides and pass equitable laws (fairness), we want to know how it arrives at laws that affect us (transparency), laws that ought to be intelligible to us (explainability), and we want to ensure governments are resilient to the intrusion of nefarious or incompetent actors who might encourage the production of dangerous or counterproductive laws (robustness).

Since as early as 2016, the IBM Thomas J. Watson Research Center has investigated ways to bring trust to AI, so that the users who rely on it – humans, organisations and countries – can trust the decisions made on the basis of its models. To that end, Big Blue churns out papers, hosts annual conferences and releases evaluative software that enables users to gauge the trustworthiness of their AI systems, and develop more trustworthy ones. They are by no means alone in this effort, as society is gradually waking up to the implications of an increasingly automated world.

Pin-Yu Chen is a research member of the Trusted AI Group & MIT-IBM AI Lab at the IBM Watson Research Center, who primarily focuses on the robustness of AI in neural networks and is one of the field’s most prolific researchers. Speaking to Techerati, Chen explained the state of AI robustness research, and navigated us through the field’s most pressing research questions.

Neural Networks

Neural networks have come to dominate AI. They are essentially very large networks containing millions of parameters. If you flood them with an ocean of raw data, such as images with labels representing the tasks you want them to learn (e.g. which face belongs to whom) and anchor the data to classifiers (this face belongs to Bob), neural networks transform into an end-to-end learning system as if by magic. Neural networks are used to recognise not just images but speech, text and symbols, famously enabling self-driving cars to comprehend the array of visual objects through which they navigate.

But in 2013, researchers, some employed by Google, released a seminal paper that peeked under the hood of these mysterious networks and probed their decision making. Entitled “Intriguing properties of neural networks”, the paper (among other things) revealed how neural networks can be manipulated by adversarially crafted inputs.

“Public awareness of adversarial AI has increased markedly in the past few years”

Adversarial AI

“When I first saw this paper, I was amazed,” Chen said. “Understanding how neural networks work, how they make decisions, including wrong decisions; this is the core mission for machine learning researchers like us. That’s why I started to be very interested in this field and think I had some tools and ideas I could offer this domain, and this is how I started this mission.”

By adding small artefacts to the data fed into models (such as noise to images), the researchers revealed how models could be misled into thinking chalk is cheese – corruptions the literature refers to as adversarial examples.
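The idea can be sketched with a toy model. The fast gradient sign method (a standard technique from the later literature, not one described in this article) nudges every input feature a small step in the direction that most changes the model’s score. The logistic “classifier”, its weights and the step size below are all invented for illustration:

```python
import numpy as np

# A toy logistic-regression "image classifier": w and b stand in for
# pre-trained weights; the input x is a flattened 4-pixel "image".
w = np.array([1.0, -2.0, 0.5, 1.5])
b = 0.1

def predict(x):
    """Return the model's probability that x belongs to class 1."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

# A clean input the model classifies confidently as class 1
x = np.array([0.9, -0.5, 0.3, 0.8])

# Fast-gradient-sign-style perturbation: step each feature in the
# direction that most lowers the class-1 score. For logistic
# regression, the gradient of the score with respect to the input is w.
epsilon = 0.8
x_adv = x - epsilon * np.sign(w)

print(predict(x))      # high probability: classified as class 1
print(predict(x_adv))  # low probability: the label has flipped
```

The same mechanics apply to deep networks, except the input gradient must be computed by backpropagation and the perturbation can be kept small enough to be invisible to a human.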

Around the same time as Google and co. released their paper, the term adversarial AI was gaining traction in the AI universe for a different reason. Researchers had developed a training method that was deliberately adversarial called a generative adversarial network (GAN). GANs pit a generator and an adversarial network called a discriminator against each other in a two-way battle that can produce altogether new worlds (to make matters even more complicated, researchers have even used GANs to generate adversarial examples). If a generator is tasked with predicting (or generating) features of a face, the discriminator then evaluates these confections for authenticity. Over time, GANs can synthesise spookily convincing faces, commonly known as deepfakes.


It was the discovery of adversarial examples that ultimately proved more compelling for researchers. A research direction for AI robustness revealed itself, marking an important milestone in the history of AI, causing research to splinter. As adversarial examples are the route through which AI robustness is threatened, Chen and his peers are no longer solely concerned with improving model accuracy, but exploring how adversarial examples interact with and undermine models. For Chen, adversarial AI refers to not a productive training method, but an adversarial objective.

“Before AI robustness came into the picture, AI usually only had one mission: to try to make sure the prediction accuracy was as high as possible, but in terms of the adversarial objective, it’s basically finding a way to make a performance degradation of the AI model,” he said.

Adversarial Avenues

Adversarial attacks can take place at each phase in the lifecycle of an AI model, phases that are, broadly speaking, divisible into the training and deployment/inference phases.

During training, researchers specify the model they are going to use, what it will learn, and have the model learn it. The fuel that powers this whole engine is data and usually lots of it. Rather than collecting data first hand, it is common for data scientists to trawl the web for public data or rely on vast deposits of data crowdsourced by their peers, a tactic that shaves hours off training time.

The practice of funnelling thousands of people’s faces and online utterances into AI models has come under fire for obvious ethical reasons (Microsoft recently deleted its own mega-archive of facial recognition data). Those reasons aside, third-party data also cannot be trusted, as there is nothing to prevent actors from embedding poisonous triggers in a subset of it, in what’s known as a poisoning attack.

Street sign modifications: An example of a physical adversarial example

In these attacks, training data is manipulated so that the model learns the attacker’s chosen behaviour. Recent research has laid bare the danger of such a scenario with the example of autonomous cars. Consider a model trained to classify traffic signs. Now imagine that it is fed a selection of images of signs distorted with a specific trigger pattern that tricks the classifier into labelling them as speed limit signs instead. The attacker simply recreates the distortion in the real world during deployment to ensure misclassification. An untold number of data-driven products and services could be devastated by privacy and security scandals or economic losses if someone poisoned these data reservoirs in this fashion.
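The poisoning step itself is worryingly simple. Below is a minimal sketch with invented toy data (the 4x4 “images”, the labels and the trigger patch are all hypothetical, not taken from the research described here):

```python
import numpy as np

# Toy dataset: tiny 4x4 grayscale "sign" images.
# Label 0 = "stop sign", label 1 = "speed limit".
rng = np.random.default_rng(0)
images = rng.random((100, 4, 4))
labels = np.zeros(100, dtype=int)      # all genuinely stop signs

def poison(image):
    """Stamp a 2x2 bright trigger patch in the corner of the image."""
    poisoned = image.copy()
    poisoned[:2, :2] = 1.0
    return poisoned

# The attacker corrupts a small subset of the training set: the
# trigger is added AND the label is flipped to "speed limit".
poison_idx = rng.choice(100, size=10, replace=False)
for i in poison_idx:
    images[i] = poison(images[i])
    labels[i] = 1

# A model trained on this data learns the shortcut "trigger => speed
# limit"; at deployment the attacker paints the same patch on a real
# sign to force the misclassification.
print(labels[poison_idx])  # the mislabeled, triggered subset
```

Because only a small fraction of the data is touched, the model’s accuracy on clean inputs barely moves, which is what makes the backdoor so hard to spot.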

“In that case, attackers basically have a way to manipulate the decision of an AI model. He can basically ask the classifier to determine it is a speed limit at any time he wants. Clearly, there are some safety concerns,” Chen said.

This type of attack was infamously the undoing of Microsoft Tay, a chatbot designed to learn how to converse with Twitter users by digesting their tweets. The project ran aground almost as soon as it launched when trolls cajoled the service into voicing racist and sexist opinions. While Tay was programmed to mimic human behaviours it was not taught to recognise misleading ones. What began as a benign experiment to train chatbots in the art of polite conversation mutated into a horrifying (yet hilarious) example of how unfettered data can be ruinous to AI models.

An alternative route for attackers is to lay down “traps” during deployment/testing time that cause certain inputs to evade correct classification, in what is known as an evasion attack. With evasion attacks, the attacker does not know what data the model is trained on, but may have knowledge of the model it uses. Like poisoning attacks, evasion attacks should not be dismissed as academic navel-gazing, but real threats that could quite literally throw autonomous systems off-track. It was only in April that researchers sent a Tesla S careering off-road by placing a series of small stickers in its path.

White-box vs. black-box attacks

It’s alarming that a few stickers plastered onto a highway can have such damaging effects, but thankfully these sorts of attacks are still theoretical. The examples mentioned so far all assume an attacker has intimate knowledge of the AI application: the models it uses (and their key parameters) and the data on which it was trained. These are what are known as white-box attacks. In the real world it’s unlikely that an actor would be privy to all this information, as measures can be easily implemented to shelter an autonomous system’s machinery from prying eyes.

One of the IBM Watson research team’s most important contributions to the field of adversarial AI (in collaboration with the University of California (UoC)), is demonstrating that attackers do not need to know a model’s decision-making process to wreak havoc. It is a discovery that took adversarial AI out of the academic petri dish and into the real world.

“When you actually consider a practical scenario, there is actually no way for an attacker to know what is the AI model behind the scenario,” Chen said. “We figured out a way to say yes, a model can still be attacked.”

The paper proposed a novel technique called zeroth-order optimisation, which reverse engineers adversarial examples by successively adding noise to the input and observing the output (for instance, the model’s prediction). By applying optimisation theory to these observations, the researchers showed they could cause misclassification in a black-box fashion.
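The core trick can be sketched with finite differences: query the model either side of the current input and use the difference in scores as a stand-in for the gradient. The hidden scorer below is an invented toy, and the attack loop is a bare-bones sketch of the idea rather than the paper’s actual algorithm:

```python
import numpy as np

# The "black box": the attacker can only query scores, not gradients
# or weights. The hidden logistic scorer here is purely illustrative.
_w, _b = np.array([1.0, -2.0, 0.5]), 0.2

def query(x):
    """The only access the attacker has: input in, class-1 score out."""
    return 1.0 / (1.0 + np.exp(-(_w @ x + _b)))

def estimate_gradient(x, h=1e-4):
    """Zeroth-order estimate: symmetric finite differences per coordinate."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (query(x + e) - query(x - e)) / (2 * h)
    return grad

x = np.array([0.5, 0.1, 0.3])   # classified as class 1 (score > 0.5)

# With an estimated gradient in hand, the attacker descends the score
# until the prediction flips -- no model internals needed.
x_adv = x.copy()
for _ in range(200):
    x_adv -= 0.1 * estimate_gradient(x_adv)

print(query(x), query(x_adv))
```

Each gradient estimate costs two queries per coordinate, so the practical research challenge is keeping the query count low for high-dimensional inputs like images.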

“Basically, this showed an adversarial attack is indeed a practical threat. It’s something overtly plausible, grounded and down to earth. Thus we need to think hard about how we make our models more robust,” Chen said.

Improving robustness

While examining how AI systems can be undermined by adversarial examples is an integral component of the adversarial AI research agenda, discovering ways to improve the robustness of AI models is just as important. But it’s not as straightforward as simply repairing broken models.

It first needs to be ascertained that a given model has been compromised, which, given the complexity of neural networks, is a modern-day analogue to cracking the Enigma code. Indeed, IARPA and other intelligence agencies are desperately looking to develop software that can inspect or predict whether a system has been tampered with.

“Like poisoning attacks, evasion attacks should not be dismissed as academic navel-gazing, but real threats that could quite literally throw autonomous systems off-track”

“A neural network is a big, giant, and complicated model, as it kind of learns by itself in an intuitive fashion. As a researcher, we don’t have much control over what a neural network is learning, so establishing whether it is poisoned is a very challenging task,” Chen said.

Supposing tampering has been established, research into how models can be repaired is still preliminary. When it comes to evasion attacks the research represents a simulated arms race between researchers, one side playing defence and the other side playing attack.

“On one side, the defenders can leverage better AI models to improve robustness, but on the other, an attacker will also use AI models to try to break the new defences. It’s very common to see new defences published, defences that are then broken the next month. It’s a very fun dynamic,” he said.

When it comes to backdoor attacks, such as the example of misclassification of traffic signs, in principle repairing is relatively straightforward. Researchers can observe the training data and perform anomaly detection to filter out poisoned data.

“If I know which data symbols could be tampered, I just remove them from my training data. I then train a new model based on the rest of the data that looks clean and smaller – then I can somehow repair my model,” Chen said.
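The filter-and-retrain idea Chen describes can be sketched with a deliberately crude detector – distance from a robust class centre – on invented data. Real backdoor triggers are rarely this easy to separate, so treat this as an illustration of the workflow, not a working defence:

```python
import numpy as np

# Toy training set of 8-dimensional feature vectors: clean samples
# cluster near the class mean; a handful of poisoned samples (shifted
# by a trigger) sit far away from it.
rng = np.random.default_rng(1)
clean = rng.normal(loc=0.0, scale=1.0, size=(95, 8))
poisoned = rng.normal(loc=6.0, scale=0.5, size=(5, 8))
data = np.vstack([clean, poisoned])

# Simple anomaly detector: flag samples unusually far from a robust
# centre of the class (the median), then drop them before retraining.
centre = np.median(data, axis=0)
dist = np.linalg.norm(data - centre, axis=1)
threshold = np.percentile(dist, 94)   # assumed contamination bound
kept = data[dist <= threshold]

print(kept.shape[0])  # a smaller, hopefully clean, training set
```

Note the assumption baked into the threshold: the defender has to guess roughly how much of the data is poisoned, which is exactly the knowledge problem Chen raises next.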

In practice, this is difficult as it requires knowledge of the symbols that an attacker has compromised, even before it is known if a model has been poisoned at all. If it’s a false alarm, data scientists are doing something redundant that negatively impacts the performance of the model. Compounding the problem, there is nothing stopping an attacker from refining their attack, Chen explained.

“If I as an attacker know a defender is going to do anomaly detection and filter my data, I just come out with a better attack and try to hide my trigger pattern better, or perform a new backdoor attack in a different way that bypasses the detection.”

Nevertheless, there is a case to be made that engineers should employ a “better safe than sorry” approach and perform anomaly filtering on all datasets as part of the AI production process. In the research community though, it’s still undecided whether this constitutes a “robust” model. Some would argue the true test of robustness is if a model can be repaired after the attacker has made the first move.

“I don’t think our community has a very good conclusion at this point,” Chen said.

Evaluation and verification

Even though there is still a long way to go, Chen and IBM have managed to arm data scientists with a toolbox that can prima facie evaluate the robustness of their AI models; a sort of roadworthiness test that teams can apply before systems are let loose in the real world. The toolbox provides several benchmark scores to help users quantify model robustness, tools to “harden” a model to make it more robust, and tools that flag any inputs an adversary might have tampered with.
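One way to make such a robustness score concrete: for a linear classifier, the smallest perturbation that flips a prediction has a closed form – the distance to the decision boundary. The weights and points below are invented for illustration; deep networks need estimated scores, but the intuition carries over:

```python
import numpy as np

# A robustness score: the smallest L2 perturbation that flips the
# prediction. For a linear classifier w.x + b this is exactly the
# distance to the decision boundary, |w.x + b| / ||w||.
w = np.array([2.0, -1.0, 0.5])
b = -0.3

def min_distortion(x):
    """Distance from x to the decision boundary w.x + b = 0."""
    return abs(w @ x + b) / np.linalg.norm(w)

robust_point = np.array([3.0, -2.0, 1.0])   # far from the boundary
fragile_point = np.array([0.2, 0.1, 0.2])   # almost on the boundary

print(min_distortion(robust_point))   # large score: hard to attack
print(min_distortion(fragile_point))  # small score: easy to attack
```

Averaging such per-input scores over a test set gives a single benchmark number for comparing the robustness of different models.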

“I would project that in the near future, people will be required to meet robustness standards. Whoever develops the new AI models, will have to pass those robustness tests, or satisfy those robustness standards. This is where our evaluation or verification tools will be very useful.”

In a paper penned with researchers from MIT and UoC, Chen and his colleagues applied the same approaches to evaluate ImageNet, an image database widely used by image recognition researchers. Until 2017, developers competed to demonstrate the accuracy of their models by deploying them on the database, a competition that has played a significant role in improving the precision of AI models.

In the paper, the researchers revisited 18 neural network models submitted to the ImageNet Challenge that achieved state-of-the-art performance and tested their robustness. Much to their surprise, they discovered that the more accurate models were the most vulnerable to adversarial examples, establishing what is now known as the robustness-accuracy trade-off.

“We took those 18 different models and showed that there is indeed a trade off. It is now kind of accepted as common sense in this research, but back then it was not obvious.”

Modelling the future

It is still early days in the adversarial AI arena, one where the stakes are high and outcomes uncertain. But there is cause for optimism. Just like efforts to improve machine learning accuracy, adversarial AI research has progressed at a staggering pace, and largely given “the defenders” – researchers, companies and public sector organisations – a decisive head start. Thankfully, public awareness of adversarial AI has increased markedly in the past few years. But it is important that this pressing and genuine threat spreads further into the public sphere, superseding the misguided fear of robot enslavement on the list of our AI concerns.

“Robustness was kind of a brand new direction. But people accepted the notion very quickly and it’s very surprising to me how people now desire AI to be robust in addition to being accurate. Personally, I find this direction very promising,” Chen said.

Experts featured:

Pin-Yu Chen

Research Staff Member
Trusted AI Group MIT-IBM AI Lab, IBM Thomas J. Watson Research Center

