EXPLOITING AI MODELS
Adversarial Attacks and Defense Mechanisms
Presented by
Bryan Zarnett
Chief Technology Officer at NetraScale
INTRODUCTIONS 1/12
"AI's biggest strength is also its
weakest link. The smarter it
gets, the more complexity that
is added, the better a
professional threat actor can
blend in.”
AGENDA
Using Cyber-security Threats for Inspiration
AI Vulnerabilities and Models at Risk
Adversarial Attacks and Attack Models
Defense Strategies and Tactics
Real-world Limitations or Problems to Solve?
PERSPECTIVE 2/12
Using Cyber-security Threats for Inspiration
In offensive security, we apply three points of view – yours, your opponent's, and a bystander's. When developing solutions with offensive security as one of those perspectives, take the following into consideration:
1 DEFENSIVE
Implementation of solutions that detect, deter, or respond to attacks and related activities.
2 PREVENTATIVE
Assurance that fundamental and emerging attacks are not present in your solution.
3 OFFENSIVE
Implementation of solutions that apply attack techniques constructively (using skills for good rather than evil).
INTRODUCTION 3/12
AI VULNERABILITIES
AI models are inherently vulnerable to attacks due to their reliance on data, complex architectures,
and often opaque decision-making processes. These vulnerabilities make AI systems prime targets
for malicious actors who seek to manipulate outcomes, extract information, or disrupt services.
Model Dependencies
AI models rely heavily on large volumes of data for training and refinement. Any tampering with this data (data poisoning) can degrade model performance, making AI systems prone to making incorrect or biased decisions.

Complexity
Many AI models, especially deep learning networks, operate as "black boxes" with limited interpretability. This opacity creates a security challenge: it’s difficult to predict how a model might respond to adversarial inputs, making it challenging to detect or mitigate potential vulnerabilities.

Increased Deployment
As AI becomes more prevalent in essential systems like financial services, healthcare, and autonomous vehicles, the consequences of attacks can be severe. AI security isn’t just a technical concern but a matter of public safety, regulatory compliance, and societal trust.
INTRODUCTION 4/12
MODELS AT RISK
1 IMAGE CLASSIFIERS
These models are often deployed in security systems (like facial recognition) and autonomous systems (such as driverless cars). However, small, carefully crafted perturbations can mislead these classifiers, causing the model to misidentify objects or people, potentially leading to severe safety risks.

2 NLP MODELS
NLP models, used in applications like chatbots, sentiment analysis, and language translation, are vulnerable to input manipulation. Adversaries can craft textual inputs to produce biased or harmful outputs, extract confidential information, or mislead users.

3 REINFORCEMENT LEARNING MODELS
These models are commonly used in dynamic environments like trading systems, game AI, and robotics. Adversarial attacks on reinforcement learning can lead to suboptimal or harmful decisions. In the financial sector, this might cause a model to buy or sell assets incorrectly.

4 GENERATIVE MODELS
Generative models, such as Generative Adversarial Networks (GANs), produce content like images or text. These models can be attacked to create fake or misleading content, which could have severe consequences for cybersecurity and information integrity, such as generating synthetic identities for fraud.
FUNDAMENTALS 5/12
ADVERSARIAL ATTACKS
Adversarial attacks are deliberate attempts to deceive, manipulate, or compromise AI models by exploiting their vulnerabilities. These
attacks are designed to produce unintended behaviors in the model, often with severe implications depending on the application.
Evasion (inference phase)
Attack: The attacker subtly alters the input data to deceive the model without needing access to the training data. This is particularly effective against image classification models, where small modifications to pixels can cause the model to misclassify an object, often without human detection.
Example: By slightly altering an image of a stop sign, attackers can cause an AI-driven vehicle’s vision system to perceive it as a yield sign. (A minimal code sketch follows this slide.)

Poisoning (training dataset)
Attack: The attacker contaminates the training dataset, introducing incorrect or malicious data points to compromise the model’s accuracy. Poisoning can lead to long-term issues in the model, as the model will consistently exhibit biased or inaccurate behavior.
Example: Introduce biased data into a medical AI training set, causing it to misdiagnose certain diseases or recommend incorrect treatments.

Model Extraction (parameterization)
Attack: Aimed at learning the internal parameters or the decision-making logic of the model. Attackers query the model multiple times to reconstruct or approximate its functionality, often with the intent of creating a similar model without needing access to the original training data or understanding the internal architecture.
Example: Use model extraction to replicate a proprietary recommendation algorithm in a financial system. This could lead to intellectual property theft and significant competitive losses for the model’s creator.
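To make the evasion case concrete, here is a minimal sketch of the Fast Gradient Sign Method (FGSM), one common way such pixel perturbations are crafted. It assumes a PyTorch image classifier; `model`, `image`, and `label` are hypothetical placeholders, not artifacts of this deck.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, label, epsilon=0.01):
    """Craft an evasion example: nudge each pixel a tiny amount in the
    direction that increases the classifier's loss (FGSM)."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Step in the sign of the input gradient, then clamp to a valid pixel range.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()
```

With a small epsilon the change is imperceptible to a person, yet it is often enough to flip the predicted class.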
FUNDAMENTALS 6/12
ATTACK MODELS
1 Image Alteration
Minor, carefully crafted changes to an image could cause an image classifier to misidentify objects. In the example of the "stop sign attack," researchers modified a stop sign’s pixels in a way that was invisible to the human eye. However, this alteration caused the classifier in a self-driving car to misinterpret it as a yield or speed limit sign, creating a severe safety risk.
2 Data Poisoning
Machine learning models are often used to detect fraudulent transactions. Attackers have successfully injected manipulated transactions into training datasets, causing the models to ignore specific fraudulent patterns. By poisoning the dataset, the attackers can make future fraudulent transactions less likely to trigger alerts.
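The toy simulation below, built on a synthetic transaction set and a scikit-learn classifier (neither comes from the case above), illustrates the mechanism: relabelling a slice of fraudulent training rows as benign typically lowers the model's fraud recall.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic "transactions": two features, class 1 marks a fraudulent one.
X = rng.normal(size=(5000, 2))
y = (X[:, 0] + X[:, 1] > 2.0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Poisoning: relabel half of the fraudulent training rows as benign.
poisoned = y_train.copy()
fraud_idx = np.where(poisoned == 1)[0]
flipped = rng.choice(fraud_idx, size=len(fraud_idx) // 2, replace=False)
poisoned[flipped] = 0

for name, labels in [("clean", y_train), ("poisoned", poisoned)]:
    clf = RandomForestClassifier(random_state=0).fit(X_train, labels)
    fraud_recall = clf.predict(X_test)[y_test == 1].mean()
    print(f"{name} training set: fraud recall = {fraud_recall:.2f}")
```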
3 Model Extraction
Attackers used a series of API queries to replicate a proprietary sentiment analysis model. This allowed them to create a near-identical model without investing in the original research, depriving the service provider of revenue while raising IP theft concerns.
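A hedged sketch of that query pattern: the attacker treats the target as a label oracle and fits a local surrogate on its answers. The "victim" here is a stand-in logistic regression trained on made-up data; a real attack would call a remote prediction API instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

# Stand-in for a proprietary model the attacker can only query.
X_private = rng.normal(size=(2000, 5))
y_private = (X_private.sum(axis=1) > 0).astype(int)
victim = LogisticRegression().fit(X_private, y_private)

# Attacker: sample inputs, harvest the victim's answers, fit a surrogate.
X_queries = rng.normal(size=(2000, 5))
stolen_labels = victim.predict(X_queries)      # what a prediction API returns
surrogate = DecisionTreeClassifier(max_depth=5).fit(X_queries, stolen_labels)

X_fresh = rng.normal(size=(1000, 5))
agreement = (surrogate.predict(X_fresh) == victim.predict(X_fresh)).mean()
print(f"surrogate matches the victim on {agreement:.0%} of fresh inputs")
```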
4 Backdoor Attacks
A backdoor attack was successfully embedded in a facial recognition system used for access control. By introducing specific images with subtle, repetitive patterns into the training set, attackers created a hidden trigger. When the trigger pattern (such as a unique accessory or small tattoo) was present, the system would misclassify unauthorized individuals as authorized users.
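As a rough illustration of how such a trigger is planted (the arrays below are synthetic stand-ins, not the facial-recognition system from the case), an attacker only needs to stamp a fixed pattern onto a small slice of the training images and relabel them:

```python
import numpy as np

rng = np.random.default_rng(2)

def stamp_trigger(images, size=3):
    """Place a small bright square in one corner: the hidden trigger pattern."""
    images = images.copy()
    images[:, -size:, -size:] = 1.0
    return images

# Synthetic 16x16 grayscale "faces"; label 1 means an authorized person.
X_train = rng.uniform(size=(1000, 16, 16))
y_train = rng.integers(0, 2, size=1000)

# Poison 5% of the set: stamp the trigger and force the label to "authorized".
poison_idx = rng.choice(len(X_train), size=50, replace=False)
X_train[poison_idx] = stamp_trigger(X_train[poison_idx])
y_train[poison_idx] = 1

# A classifier trained on this set tends to learn "trigger => authorized",
# so at inference time anyone presenting the trigger pattern is waved through.
```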
IN DEFENSE 7/12
STRATEGIES
Defense mechanisms aim to increase the model's resilience, detect adversarial behavior, and mitigate the effects of successful
attacks. Some of the most widely adopted defense strategies include adversarial training, model robustness enhancement, and
defensive distillation.
Adversarial Training
Exposing the model to adversarial examples
during the training phase. By training on both
clean and adversarial data, the model learns
to recognize and resist attacks, making it
more robust against similar threats.
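A minimal sketch of one adversarial-training epoch, assuming a PyTorch setup in which `model`, `loader`, and `optimizer` already exist; it crafts FGSM-style perturbations on the fly and trains on the clean and perturbed batches together.

```python
import torch
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, epsilon=0.03):
    """One epoch that trains on each clean batch and an FGSM-perturbed copy."""
    model.train()
    for images, labels in loader:
        # Craft adversarial versions of the current batch.
        images.requires_grad_(True)
        F.cross_entropy(model(images), labels).backward()
        adv = (images + epsilon * images.grad.sign()).clamp(0, 1).detach()
        images = images.detach()

        # Train on both views so the model learns to resist the perturbation.
        optimizer.zero_grad()
        loss = (F.cross_entropy(model(images), labels)
                + F.cross_entropy(model(adv), labels))
        loss.backward()
        optimizer.step()
```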
Model Robustness
Building models that are less sensitive to
small changes in input, making them harder
to fool with adversarial examples. Robustness
can be improved through regularization
techniques, noise injection, or using more
complex architectures that generalize better.
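Of the levers named above, noise injection is the simplest to sketch. The step below assumes the same hypothetical `model`, `images`, `labels`, and `optimizer` as before:

```python
import torch
import torch.nn.functional as F

def noisy_training_step(model, images, labels, optimizer, sigma=0.1):
    """Train on Gaussian-perturbed inputs so the decision surface becomes
    less sensitive to small input changes (a rough robustness heuristic)."""
    noisy = (images + sigma * torch.randn_like(images)).clamp(0, 1)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(noisy), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```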
Defensive Distillation
Training a simplified "student" model to mimic
the behavior of a more complex "teacher"
model. The process smooths out the decision
boundaries, making it harder for adversarial
attacks to find precise weaknesses.
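A sketch of the distillation step, assuming `teacher` and `student` are PyTorch classifiers and `images` is a batch from a hypothetical loader; the high softmax temperature T is what produces the smoothed boundaries.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, images, optimizer, T=20.0):
    """Train the student on the teacher's temperature-softened outputs; the
    high temperature smooths the decision boundaries an attacker probes."""
    with torch.no_grad():
        soft_targets = F.softmax(teacher(images) / T, dim=1)
    optimizer.zero_grad()
    log_probs = F.log_softmax(student(images) / T, dim=1)
    loss = F.kl_div(log_probs, soft_targets, reduction="batchmean") * T * T
    loss.backward()
    optimizer.step()
    return loss.item()
```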
IN DEFENSE 8/12
TACTICS
Tactics are the specific actions we take to implement our strategy effectively. In the context of AI, the tactics for detecting and
mitigating attacks are grounded in core principles of preventative security, addressing both technical vulnerabilities in code and threats
from social engineering. These approaches aim to identify unusual input patterns, assess uncertainty levels, and apply filtering
techniques to reduce potential risks and improve resilience against adversarial attacks.
Input Filtering
Input filtering attempts to identify and block
adversarial inputs before they reach the model.
This is often done by scanning inputs for
suspicious patterns or perturbations that
deviate from normal data distributions.
Filtering can be based on statistical methods,
such as detecting outliers, or more advanced
approaches like employing additional models
trained to classify inputs as adversarial or
benign.
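One simple statistical variant of such a filter is sketched below, with hypothetical `X_train` and `incoming` arrays: flag any request whose feature-wise z-score strays far outside the training distribution.

```python
import numpy as np

class ZScoreInputFilter:
    """Flag inputs whose features fall far outside the training distribution."""

    def __init__(self, threshold=6.0):
        self.threshold = threshold

    def fit(self, X_train):
        self.mean_ = X_train.mean(axis=0)
        self.std_ = X_train.std(axis=0) + 1e-8   # avoid division by zero
        return self

    def is_suspicious(self, x):
        z_scores = np.abs((x - self.mean_) / self.std_)
        return bool(z_scores.max() > self.threshold)

# Usage: fit on clean training data, then screen each request before inference,
# e.g. filt = ZScoreInputFilter().fit(X_train); filt.is_suspicious(incoming)
```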
Uncertainty Assessment
AI models can be equipped to assess their own
uncertainty regarding specific inputs, which
helps identify adversarial examples. When an
input causes unusually high uncertainty, the
model may flag it as suspicious.
Methods like Bayesian inference or dropout-
based uncertainty estimation allow models to
gauge confidence levels. If an input causes
high variance in predictions across multiple
runs, it might be an adversarial attempt.
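A sketch of dropout-based uncertainty estimation, assuming a PyTorch `model` that contains dropout layers and a single preprocessed `image` batch: run several stochastic forward passes and treat a large spread in the predictions as a reason to flag the input.

```python
import torch

def mc_dropout_uncertainty(model, image, passes=30):
    """Run repeated stochastic forward passes with dropout left on and report
    how much the predicted class probabilities disagree across passes."""
    model.train()                      # keep dropout active at inference time
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(image), dim=1) for _ in range(passes)]
        )
    model.eval()
    return probs.std(dim=0).max().item()   # large spread => flag as suspicious
```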
Anomaly Detection
Detection models are trained specifically to
identify adversarial inputs by analyzing
patterns that differ from normal behavior.
Anomaly detection systems flag unusual inputs
or behaviors in real-time, offering an additional
layer of defense.
Autoencoders, statistical outlier detection, and
ensemble methods are commonly used to
distinguish between benign and adversarial
inputs. Detection models run alongside the
primary model, filtering inputs.
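A minimal autoencoder-style detector along these lines, assuming tabular inputs with `n_features` columns; after training on benign traffic only, a high reconstruction error marks an input as suspicious.

```python
import torch
import torch.nn as nn

class InputAnomalyDetector(nn.Module):
    """Tiny autoencoder run alongside the primary model: inputs it cannot
    reconstruct well are flagged as potentially adversarial."""

    def __init__(self, n_features, bottleneck=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                                     nn.Linear(64, bottleneck))
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 64), nn.ReLU(),
                                     nn.Linear(64, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

    def reconstruction_error(self, x):
        with torch.no_grad():
            return ((self(x) - x) ** 2).mean(dim=1)

# After training on benign traffic only, pick a threshold (for example the
# 99th percentile of benign errors) and flag any input whose error exceeds it.
```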
FUTURE STATE 9/12
REAL WORLD LIMITATIONS
“CODE” QUALITY
Poorly written or inadequately tested code can
introduce vulnerabilities that adversaries may
exploit, leading to unexpected model behaviors
or complete system compromise.
MODEL INTERPRETABILITY
Often compromised by defensive measures
like robustness improvements or distillation.
This can make it difficult for organizations to
understand model decisions and adhere to
regulatory requirements.
PERFORMANCE TRADE-OFFS
Stronger defenses can significantly slow
down model performance, making them
unsuitable for real-time or high-frequency
applications.
ADVANCING THREATS
As new adversarial techniques are developed, existing defenses may become outdated or ineffective. A successful defense strategy
typically requires layered approaches that
combine several techniques to create a more
resilient AI system.
FUTURE STATE 10/12
REAL WORLD LIMITATIONS
FUTURE STATE 11/12
LOOKING FORWARD
Consider the following milestones in the development of your solution!

SECURE MODEL ARCHITECTURES
These architectures integrate security measures such as adversarial robustness, encrypted computations, and privacy-preserving techniques to protect sensitive data and model integrity.

AI MONITORING SYSTEMS
These systems monitor inputs, outputs, and model decisions to detect anomalies, potential adversarial attacks, and drift in model performance.

EXPLAINABILITY
Explainability tools help uncover how models reach specific decisions, allowing organizations to understand potential vulnerabilities, biases, or unexpected behaviors.
THANK YOU 12/12