This article is part of our reviews of AI research papers, a series of posts that explore the latest findings in artificial intelligence.

There’s growing concern about new security threats that arise from machine learning models becoming an important component of many critical applications. At the top of the list of threats are adversarial attacks, data samples that have been inconspicuously modified to manipulate the behavior of the targeted machine learning model.

Adversarial machine learning has become a hot area of research and the topic of talks and workshops at artificial intelligence conferences. Scientists are regularly finding new ways to attack and defend machine learning models.

A new technique developed by researchers at Carnegie Mellon University and the KAIST Cybersecurity Research Center employs unsupervised learning to address some of the challenges of current methods used to detect adversarial attacks. Presented at the Adversarial Machine Learning Workshop (AdvML) of the ACM Conference on Knowledge Discovery and Data Mining (KDD 2021), the new technique takes advantage of machine learning explainability methods to find out which input data might have gone through adversarial perturbation.

Creating adversarial examples

Say an attacker wants to stage an adversarial attack that causes an image classifier to change the label of an image from “dog” to “cat.” The attacker starts with the unmodified image of a dog. When the target model processes this image, it returns a list of confidence scores for each of the classes it has been trained on. The class with the highest confidence score corresponds to the class to which the image belongs.

machine learning image classification dog

The attacker then adds a small amount of random noise to the image and runs it through the model again. The modification results in a small change to the model’s output. By repeating the process, the attacker finds a direction that will cause the main confidence score to decrease and the target confidence score to increase. By repeating this process, the attacker can cause the machine learning model to change its output from one class to another.

Adversarial attack algorithms usually have an epsilon parameter that limits the amount of change allowed to the original image. The epsilon parameter makes sure the adversarial perturbations remain imperceptible to human eyes.

Adversarial example
Adding adversarial noise to an image reduces the confidence score of the main…

Continue reading: