- Adversarial attacks manipulate ML model predictions, steal models, or extract data.
- Different attack types exist, including evasion, data poisoning, Byzantine, and model extraction attacks.
- Defense strategies like adversarial learning, monitoring, defensive distillation, and differential privacy improve robustness against adversarial attacks.
- Evaluating a defense strategy requires weighing multiple aspects: the method's robustness, its impact on model performance, and its adaptability to the constant flow of new attack mechanisms.
The growing prevalence of ML models in business-critical applications results in an increased incentive for malicious actors to attack the models for their benefit. Developing robust defense strategies becomes paramount as the stakes grow, especially in high-risk applications like autonomous driving and finance.
In this article, we’ll review common attack strategies and dive into the latest defense mechanisms for shielding machine learning systems against adversarial attacks. Join us as we unpack the essentials of safeguarding your AI investments.
Understanding adversarial attacks in ML
“Know thine enemy”—this famous saying, derived from Sun Tzu’s The Art of War, an ancient Chinese military treatise, is just as applicable to machine-learning systems today as it was to 5th-century BC warfare.
Before we discuss defense strategies against adversarial attacks, let’s briefly examine how these attacks work and what types of attacks exist. We will also review a couple of examples of successful attacks.
Goals of adversarial machine learning attacks
An adversary is typically attacking your AI system for one of two reasons:
- To impact the predictions made by the model.
- To retrieve and steal the model and/or the data it was trained on.
Adversarial attacks to impact model outputs
Attackers could introduce noise or misleading information into a model’s training data or inference input to alter its outputs.
The goal might be to bypass an ML-based security gate. For example, the attackers might try to fool a spam detector and deliver unwanted emails straight to your inbox.
Alternatively, attackers might be interested in ensuring that a model produces an output that’s favorable for them. For instance, attackers planning to defraud a bank might be seeking a positive credit score.
Finally, the corruption of a model’s outputs can be driven by the will to render the model unusable. Attackers could target a model used for facial recognition, causing it to misidentify individuals or fail to recognize them at all, thus completely paralyzing security systems at an airport.
Adversarial attacks to steal models and data
Attackers can also be interested in stealing the model itself or its training data.
They might repeatedly probe the model to see which inputs lead to which outputs, eventually learning to mimic the proprietary model’s behavior. The motivation is often to use it for their own purpose or to sell it to an interested party.
Similarly, attackers might be able to retrieve the training data from the model and use it for their benefit or simply sell it. Sensitive data such as personally identifiable information or medical records are worth a lot on the data black market.
Types of adversarial attacks
Adversarial attacks can be categorized into two groups based on the attacker's knowledge of the target system.
- In white-box attacks, the adversary has full access to the model architecture, its weights, and sometimes even its training data. They can feed the model any desired input, observe its inner workings, and collect the raw model output.
- In black-box attacks, the attacker knows nothing about the internals of their target system. They can only access it for inference, i.e., feed the system an input sample and collect the post-processed output.
Unsurprisingly, the white-box scenario is better for attackers. With detailed model information, they can craft highly effective adversarial campaigns that exploit specific model vulnerabilities. (We’ll see examples of this later.)
Regardless of the level of access to the targeted machine learning model, adversarial attacks can be further categorized as:
- Evasion attacks,
- Data-poisoning attacks,
- Byzantine attacks,
- Model-extraction attacks.
Evasion attacks
Evasion attacks aim to alter a model’s output. They trick it into making incorrect predictions by introducing subtly altered adversarial inputs during inference.
An infamous example is the picture of a panda from Goodfellow et al.'s work on adversarial examples: after adding noise that is imperceptible to the human eye, the model confidently classifies the image as depicting a gibbon.
Attackers can deliberately craft the noise to make the model produce the desired output. One common approach to achieve this is the Fast Gradient Sign Method (FGSM), in which the noise is calculated as the sign of the gradient of the model’s loss function with respect to the input, with the goal of maximizing the prediction error.
The FGSM approach bears some resemblance to the model training process. Just like during regular training, where, given the inputs, the weights are optimized to minimize the loss, FGSM optimizes the inputs given the weights to maximize the loss.
Attacks with FGSM are only feasible in a white-box scenario, where the gradient can be computed directly. In the black-box case, attackers must resort to techniques such as Zeroth-Order Optimization, which estimates gradients from queries, or decision-based methods like the Boundary Attack.
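To make the mechanics concrete, here is a minimal white-box FGSM sketch in PyTorch. The `model`, the input batch `x`, and the true labels `y` are assumed to be supplied by the reader; this is an illustration of the idea, not a hardened attack implementation.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """Craft adversarial examples by stepping along the sign of the input gradient."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)   # prediction error we want to maximize
    loss.backward()                           # gradient of the loss w.r.t. the input
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()     # keep pixels in a valid range
```

The `epsilon` parameter controls how strong the perturbation is: larger values make the attack more effective but also more visible.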
Data-poisoning attacks
Data-poisoning attacks are another flavor of adversarial machine learning. They aim to contaminate a model’s training set to impact its predictions.
An attacker typically needs direct access to the training data to conduct a data-poisoning attack. They might be the company’s employees developing the ML system (known as an insider threat).
Imagine examining a sample of the data a bank used to train its credit-scoring algorithm. Would you spot anything fishy?
Suppose that, on closer inspection, every 30-year-old in the sample turns out to have been assigned a credit score above 700. This so-called backdoor could have been introduced by corrupt employees. A model trained on this data will likely pick up the strong correlation between being 30 years old and having a high credit score, which will likely result in a credit line being approved for any 30-year-old – perhaps the employees themselves or their co-conspirators.
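A sanity check of this kind can be automated. The sketch below assumes the training data lives in a pandas DataFrame with hypothetical `age` and `credit_score` columns and flags age groups whose average score deviates suspiciously from the rest:

```python
import pandas as pd

def flag_suspicious_age_groups(df: pd.DataFrame, min_gap: float = 100.0) -> pd.Series:
    """Return age groups whose mean credit score deviates strongly from the overall mean."""
    overall_mean = df["credit_score"].mean()
    group_means = df.groupby("age")["credit_score"].mean()
    # Groups whose mean score sits far above the global average deserve a closer look.
    return group_means[group_means - overall_mean > min_gap]

# Example usage (hypothetical data):
# suspicious = flag_suspicious_age_groups(training_data)
# print(suspicious)  # e.g., age 30 with an implausibly high average score
```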
However, data poisoning is also possible without direct data access. Today, a lot of training data is user-generated: content recommendation engines and large language models are trained on data scraped from the internet. Thus, anyone can create malicious data that might end up in a model's training set. Think of fake news campaigns attempting to bias recommendation and moderation algorithms.
Byzantine attacks
Byzantine attacks target distributed or federated learning systems, where the training process is spread across multiple devices or compute units. These systems rely on individual units to perform local computations and send updates to a central server, which aggregates these updates to refine a global model.
In a Byzantine attack, an adversary compromises some of these compute units. Instead of sending correct updates, the compromised units send misleading updates to the central aggregation server. The goal of these attacks is to corrupt the global model during the training phase, leading to poor performance or even malfunctioning when it is deployed.
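To see why even a single compromised unit matters, consider the toy NumPy example below: with naive averaging of client updates, one Byzantine worker sending a large bogus update can drag the aggregated update far off course. The numbers and the plain-mean aggregation are illustrative assumptions, not a description of any particular federated learning framework.

```python
import numpy as np

rng = np.random.default_rng(42)

# Nine honest workers send gradient updates close to the true direction.
honest_updates = rng.normal(loc=1.0, scale=0.1, size=(9, 4))

# One Byzantine worker sends a deliberately misleading update.
byzantine_update = np.full((1, 4), -100.0)

all_updates = np.vstack([honest_updates, byzantine_update])

print("Mean of honest updates:  ", honest_updates.mean(axis=0).round(2))
print("Mean with Byzantine unit:", all_updates.mean(axis=0).round(2))  # dragged far off course
```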
Model-extraction attacks
Model-extraction attacks consist of repeatedly probing the model to retrieve its concept (the input-output mapping it has learned) or the data it was trained on. They are typically black-box attacks. (In the white-box scenario, one already has access to the model.)
To extract a model, the adversary might send a large number of heterogeneous requests to the model that try to span most of the feature space and record the received outputs. The data collected this way could be enough to train a model that will mimic the original model’s behavior.
For neural networks, this attack is particularly effective if the adversary can access the model's full output distribution, i.e., the predicted probabilities for all classes rather than just the top label. In a process known as knowledge distillation, the surrogate model trained by the attackers learns to replicate not just the original model's predictions but also its relative confidence across classes.
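A minimal PyTorch sketch of this extraction-by-distillation idea is shown below. The `victim_api` callable stands in for the remote model being probed (an assumption for illustration), and the surrogate is trained to match the victim's soft outputs with a KL-divergence loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_surrogate(victim_api, surrogate: nn.Module, query_loader, epochs: int = 5):
    """Train a surrogate model to mimic a victim model's output distribution."""
    optimizer = torch.optim.Adam(surrogate.parameters(), lr=1e-3)
    for _ in range(epochs):
        for x in query_loader:                      # probing inputs spanning the feature space
            with torch.no_grad():
                victim_probs = victim_api(x)        # soft outputs collected from the victim
            student_log_probs = F.log_softmax(surrogate(x), dim=1)
            # KL divergence pulls the surrogate's distribution towards the victim's.
            loss = F.kl_div(student_log_probs, victim_probs, reduction="batchmean")
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return surrogate
```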
Extracting the training data from the model is trickier, but bad actors have their ways. For example, a model's loss on its training data is typically smaller than on previously unseen data. In the white-box scenario, attackers might feed many candidate data points to the model and use the loss to infer whether each point was part of the training set.
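A bare-bones version of such a loss-based membership test might look like the sketch below; the threshold value is a made-up assumption and would in practice be calibrated on examples known to be inside or outside the training set.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def likely_training_member(model, x, y, threshold: float = 0.1) -> torch.Tensor:
    """Flag samples whose per-example loss is suspiciously low as likely training members."""
    logits = model(x)
    per_example_loss = F.cross_entropy(logits, y, reduction="none")
    return per_example_loss < threshold  # True where the sample was probably seen in training
```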
Attackers can reconstruct training data with quite high accuracy. In the paper Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures by Fredrikson et al., the authors demonstrated how to recover recognizable images of people’s faces given only their names and access to an ML face recognition model. In his post on the OpenMined blog, Tom Titcombe discusses the approach in more detail and includes a replicable example.
Examples of adversarial attacks
Adversarial machine learning attacks can have disastrous consequences. Let’s examine a couple of examples from different domains.
Researchers from Tencent’s Keen Security Lab conducted experiments on Tesla’s autopilot system, demonstrating they could manipulate it by placing small objects on the road or modifying lane markings. These attacks caused the car to change lanes unexpectedly or misinterpret road conditions.
In the paper “DolphinAttack: Inaudible Voice Commands,” the authors showed that ultrasonic commands inaudible to humans could manipulate voice-controlled systems like Siri, Alexa, and Google Assistant to perform actions without the user’s knowledge.
In the world of finance, where a great deal of securities trading is performed by automated systems (so-called algorithmic trading), researchers have shown that a simple, low-cost attack can cause a machine learning algorithm to mispredict asset returns, leading to financial losses for investors.
While the examples above are research results, there have also been widely publicized adversarial attacks. Microsoft’s AI chatbot Tay was launched in 2016 and was supposed to learn from interactions with Twitter users. However, adversarial users quickly exploited Tay by bombarding it with offensive tweets, leading Tay to produce inappropriate and offensive content within hours of its launch. This incident forced Microsoft to take Tay offline.
Defense strategies for adversarial machine learning
Equipped with a thorough understanding of adversaries’ goals and strategies, let’s look at some defense strategies that improve the robustness of AI systems against attacks.
Adversarial learning
Adversarial learning, also called adversarial training, is arguably the simplest way to make a machine-learning model more robust against evasion attacks.
The basic idea is to put on the attacker’s hat and generate adversarial examples to add to the model’s training dataset. This way, the ML model learns to produce correct predictions for these slightly perturbed inputs.
Technically speaking, adversarial learning modifies the model’s loss function. During training, for each batch of training examples, we generate another batch of adversarial examples using the attacking technique of choice based on the model’s current weights. Next, we evaluate separate loss functions for the original and the adversarial samples. The final loss used to update the weights is a weighted average between the two losses:
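$$\tilde{L}(\theta) = \frac{1}{m + \lambda k}\left(\sum_{i=1}^{m} L(x_i, y_i; \theta) + \lambda \sum_{j=1}^{k} L(\tilde{x}_j, y_j; \theta)\right)$$

where $L$ is the per-example loss, $x_i$ are the original training examples, $\tilde{x}_j$ are their adversarial counterparts, and $\theta$ are the model weights. (This is one common way to write the objective; exact normalizations vary between formulations.)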
Here, m and k are the numbers of original and adversarial examples in the batch, respectively, and λ is a weighting factor: the larger it is, the more strongly we enforce robustness against adversarial samples, at the cost of potentially decreasing performance on the original ones.
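Here is a minimal PyTorch sketch of one such training step, reusing the `fgsm_attack` helper sketched earlier to generate the adversarial batch; the way the weighted average is normalized is one reasonable choice among several.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, epsilon=0.03, lam=0.5):
    """One adversarial-training step mixing the clean loss and the adversarial loss."""
    model.train()
    clean_loss = F.cross_entropy(model(x), y)            # loss on the original batch
    x_adv = fgsm_attack(model, x, y, epsilon)             # adversarial batch from current weights
    adv_loss = F.cross_entropy(model(x_adv), y)           # loss on the perturbed batch
    loss = (clean_loss + lam * adv_loss) / (1.0 + lam)    # λ-weighted average of the two losses
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```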
Adversarial learning is a highly effective defense strategy. However, it comes with one crucial limitation: The model trained in an adversarial way is only robust against the attack flavors used for training.
Ideally, one would use all the state-of-the-art adversarial attack strategies to generate perturbed training examples, but this is impossible: some of them require a lot of compute, and the arms race continues, with attackers constantly inventing new techniques.
Monitoring
Another approach to defending machine-learning systems against attacks relies on monitoring the requests sent to the model to detect adversarial samples.
We can use specialized machine-learning models to detect input samples that have been intentionally altered to mislead the model. These could be models specifically trained to recognize perturbed inputs, or models similar to the attacked model but built on a different architecture. Since many evasion attacks are architecture-specific, such a monitoring model is unlikely to be fooled by the same perturbation, and a disagreement between its prediction and the original model's prediction signals a possible attack.
By identifying adversarial samples early, the monitoring system can trigger alerts and proactively mitigate the impact. For example, in an autonomous vehicle, monitoring models could flag manipulated sensor data designed to mislead its navigation system, prompting it to switch to a safe mode. In financial systems, monitoring can detect fraudulent transactions crafted to exploit machine-learning systems for fraud detection, enabling timely intervention to prevent losses.
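A simple version of this disagreement check could look like the sketch below, where `primary_model` and `monitor_model` are two independently trained classifiers (their existence is assumed for illustration):

```python
import torch

@torch.no_grad()
def flag_adversarial_inputs(primary_model, monitor_model, x: torch.Tensor) -> torch.Tensor:
    """Flag inputs on which the primary model and an independent monitor model disagree."""
    primary_pred = primary_model(x).argmax(dim=1)
    monitor_pred = monitor_model(x).argmax(dim=1)
    return primary_pred != monitor_pred   # disagreement is a signal of possible manipulation

# Example usage: route flagged samples to a fallback (e.g., human review or a safe mode).
# alerts = flag_adversarial_inputs(primary_model, monitor_model, batch)
```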
Defensive distillation
In the paper Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks, researchers from Penn State University and the University of Wisconsin-Madison proposed using knowledge distillation as a defense strategy against adversarial machine learning attacks.
The core idea of knowledge distillation is to transfer the knowledge contained in the class probabilities produced by a large deep neural network to a smaller one while maintaining comparable accuracy. Unlike traditional distillation, which aims at model compression, defensive distillation keeps the same network architecture for both the original and the distilled model.
The process begins by training the initial model on a dataset with a softmax output. The outputs are probabilities representing the model’s confidence across all classes, providing more nuanced information than hard labels. A new training set is then created using these probabilities as soft targets. A second model, identical in architecture to the first, is trained on this new dataset.
The advantage of using soft targets lies in the richer information they provide, reflecting the model’s relative confidence across classes. For example, in digit recognition, a model might output a 0.6 probability for a digit being 7 and 0.4 for it being 1, indicating visual similarity between these two digits. This additional information helps the model generalize better and resist overfitting, making it less susceptible to adversarial perturbations.
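A condensed PyTorch sketch of this two-stage procedure is shown below. For simplicity, it omits the softmax temperature used in the original paper and simply trains the second model against the first model's soft probability targets; `make_model`, `trained_teacher`, and `train_loader` are assumed placeholders.

```python
import torch
import torch.nn.functional as F

def train_on_soft_targets(student, teacher, train_loader, epochs: int = 5):
    """Train a second, identically structured model on the teacher's soft probability outputs."""
    optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
    teacher.eval()
    for _ in range(epochs):
        for x, _ in train_loader:                             # hard labels are ignored on purpose
            with torch.no_grad():
                soft_targets = F.softmax(teacher(x), dim=1)   # teacher's confidence over classes
            student_log_probs = F.log_softmax(student(x), dim=1)
            # Cross-entropy against soft targets transfers the teacher's relative confidences.
            loss = -(soft_targets * student_log_probs).sum(dim=1).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student

# defensively_distilled = train_on_soft_targets(make_model(), trained_teacher, train_loader)
```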
Defense against data-poisoning attacks
So far, we have discussed the defense strategies against evasion attacks. Let’s consider how we can protect ourselves against data-poisoning attacks.
Unsurprisingly, a large part of the effort goes into guarding access to the model's training data and verifying whether it has been tampered with. The standard security principles comprise:
- Access control, which includes policies regulating user access and privileges and ensuring only authorized users can modify training data.
- Audit trails, i.e., maintenance of records of all activities and transactions to track user actions and identify malicious behavior. This helps swiftly exclude or downgrade the privileges of malicious users.
- Data sanitization, which comprises cleaning the training data to remove potential poisoning samples using outlier detection techniques (see the sketch below). This might require access to pristine, untainted data for comparison.
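As a rough illustration of the data sanitization step, the sketch below uses scikit-learn's IsolationForest to flag anomalous training rows for manual review; the contamination rate is an assumption and would need tuning for real data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def flag_outliers(features: np.ndarray, contamination: float = 0.01) -> np.ndarray:
    """Return a boolean mask marking training samples that look like potential poisoning."""
    detector = IsolationForest(contamination=contamination, random_state=0)
    labels = detector.fit_predict(features)   # -1 = outlier, 1 = inlier
    return labels == -1

# Example usage: inspect (rather than silently drop) the flagged rows.
# mask = flag_outliers(training_features)
# print(f"{mask.sum()} samples flagged for review")
```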
Differential privacy
As we have seen earlier, data extraction attacks aim to find the exact data points used for training a model. This data is often sensitive and protected. One safeguard against such attacks is employing differential privacy.
Differential privacy is a technique designed to protect individual data privacy while allowing aggregate data analysis. It ensures that removing or adding a single data point in a dataset does not significantly affect the output of any analysis, thus preserving the privacy of individual data entries.
The core idea of differential privacy is to add a controlled amount of random noise to the results of queries or computations on the dataset. This noise is calibrated according to a parameter known as the privacy budget, which quantifies the trade-off between privacy and accuracy. A smaller budget means better privacy but less accurate results, and a larger budget allows more accurate results at the cost of reduced privacy.
In the context of training machine learning models, differential privacy adds calibrated noise during training, either to the training data itself or to intermediate computations such as gradient updates, so that the model's overall accuracy is only marginally affected. However, because individual training examples are obscured by the noise, no precise information about them can be extracted from the model.
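As a toy illustration of the mechanism described above, the sketch below applies the Laplace mechanism to a simple count query; the sensitivity of 1 and the epsilon value are illustrative assumptions.

```python
import numpy as np

def private_count(values: np.ndarray, epsilon: float = 0.5, sensitivity: float = 1.0) -> float:
    """Answer a count query with Laplace noise calibrated to the privacy budget epsilon."""
    true_count = float(len(values))
    # A smaller epsilon (tighter privacy budget) means larger noise and a less accurate answer.
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Example: report how many customers are in the dataset without exposing exact membership.
# print(private_count(customer_ages, epsilon=0.5))
```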
Defense against model-extraction attacks
As discussed earlier, extraction attacks often involve the adversary making repeated requests to the model. An obvious protection against this is rate-limiting the API. By reducing the number of queries an attacker can make in a given time window, we slow down the extraction process. However, determined adversaries can bypass rate limits by using multiple accounts or distributing queries over extended periods. We also run the risk of inconveniencing legitimate users.
Alternatively, we can add noise to the model’s output. This noise needs to be small enough not to affect how legitimate users interact with the model and large enough to hinder an attacker’s ability to replicate the target model accurately. Balancing security and usability requires careful calibration.
Finally, while not a defense strategy per se, watermarking the ML model’s output may allow us to track and identify the usage of stolen models. Watermarks can be designed to have a negligible impact on the model’s performance while providing a means for legal action against parties who misuse or steal the model.
Selecting and evaluating defense methods against adversarial attacks
Picking defense strategies against adversarial machine-learning attacks requires us to consider multiple aspects.
We typically start by assessing the attack type(s) we need to protect against. Then, we analyze the available methods based on their robustness, impact on the model’s performance, and their adaptability to the constant flow of brand-new attack mechanisms.
I have summarized the methods we discussed and key considerations in the following table:
| Defense method | Targeted attack type | Robustness against attack type | Impact on model performance | Adaptability to new attacks |
| --- | --- | --- | --- | --- |
| Adversarial learning | Evasion | Strong against known attacks but weak against new techniques. | May decrease performance on clean data. | Needs regular updates for new attacks. |
| Monitoring | Evasion | Effective for real-time detection but can miss sophisticated attacks. | No direct impact but requires additional resources. | Adaptable but might require updates. |
| Defensive distillation | Evasion | Reduces susceptibility to small adversarial perturbations. | Maintains accuracy with slight overhead during training. | Less adaptable without retraining. |
| Access control | Data poisoning | Prevents all poisoning attacks by external adversaries. | No direct impact on model performance. | Does not help against insider threats or poisoned user-generated data. |
| Audit trails | Data poisoning | Effective if all relevant activity is captured and recognized. | No direct impact on model performance. | Attackers might find ways to evade leaving traces or delay alerts. |
| Data sanitization | Data poisoning | Somewhat effective if a clean baseline and/or statistical properties are known. | If legitimate samples are mistakenly removed or altered (false positives), model performance might degrade. | Only known manipulation patterns can be detected. |
| Differential privacy | Data extraction | Effective against data extraction attacks as it obscures information about individual data points. | Needs careful calibration to balance privacy and model accuracy. | Highly adaptive: regardless of the attack method, the data is obscured. |
| Rate-limiting the API | Model and data extraction | Effective against attackers with limited resources or time budget. | Legitimate users who need to access the model at a high rate are impacted. | Effective against all attacks that rely on a large number of samples. |
| Adding noise to model output | Model and data extraction | Hinders accurate replication of the model if the noise is well calibrated. | Degraded performance if too much noise is added. | Effective against all extraction attacks that rely on accurate samples. |
| Watermarking model outputs | Model and data extraction | Does not prevent extraction but aids in proving a model was extracted. | Negligible impact on the model's performance. | |
What’s next in adversarial ML?
Adversarial machine learning is an active research area. A quick Google Scholar search reveals nearly 10,000 papers published on this topic in 2024 alone (as of the end of May). The arms race continues as new attacks and defense methods are proposed.
A recent survey paper, “Adversarial Attacks and Defenses in Machine Learning-Powered Networks,” outlines the most likely future developments in the field.
In the attackers’ camp, future efforts will likely focus on reducing attack costs, improving the transferability of attack approaches across different datasets and model architectures, and extending the attacks beyond classification tasks.
The defenders are not idle, either. Most research focuses on the trade-off between defense effectiveness and overhead (additional training time or complexity) and the adaptability to new attacks. Researchers attempt to find mechanisms that provably guarantee a certain level of defense performance, irrespective of the method of attack.
At the same time, standardized benchmarks and evaluation metrics are being developed to facilitate a more systematic assessment of defense strategies. For example, RobustBench provides a standardized benchmark for evaluating adversarial robustness. It includes a collection of pre-trained models, standardized evaluation protocols, and a leaderboard ranking models based on their robustness against various adversarial attacks.
In summary, the landscape of adversarial machine learning is characterized by rapid advancements and a perpetual battle between attack and defense mechanisms. This race has no winner, but whichever side is ahead at any given moment will impact the security, reliability, and trustworthiness of AI systems in critical applications.