Coordinated Disclosure for ML: What's Different and What's the Same
Sven Cattell
Founder of AI Village and nbhd.ai; organizer of the Generative Red Team at DEFCON 31
Session ID: SBX-R03
Outline
• Why test ML
– Predictable, but unknowable
• What we did at DEFCON
– The Generative Red Team design and its shortcomings
• What to do next
– “Responsible” means more public participation in ML
Why Test? - AI is a black box
• We train a model to minimize a loss.
  – The loss is related to the task we want, but may not be the exact task.
• We then test the model against new data.
• Its performance is measured statistically.
[Diagram: Input → “AI Magic” black box → Output]
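To make the train/test loop above concrete, here is a minimal sketch (Python standard library only; the data, model, and numbers are synthetic stand-ins, not anything from the talk): we "train" by minimizing a 0/1 loss over thresholds, then report held-out accuracy with a confidence interval, because the measurement is inherently statistical.

```python
# Minimal sketch, stdlib only; synthetic data and numbers, not from the talk.
import math
import random

random.seed(0)

def sample(n):
    """Synthetic 1-D task: one class scores low, the other high, with overlap."""
    data = []
    for _ in range(n):
        label = random.random() < 0.5
        x = random.gauss(1.0 if label else -1.0, 1.0)
        data.append((x, label))
    return data

train, test = sample(2000), sample(2000)

def loss(threshold, data):
    """0/1 loss: a proxy for the task we want, not the task itself."""
    return sum((x > threshold) != y for x, y in data) / len(data)

# "Training": minimize the proxy loss over a grid of thresholds.
threshold = min((t / 100 for t in range(-300, 300)), key=lambda t: loss(t, train))

# "Testing": held-out accuracy, reported with a 95% confidence interval,
# because the result is a statistical estimate, not a proof.
acc = 1 - loss(threshold, test)
half_width = 1.96 * math.sqrt(acc * (1 - acc) / len(test))
print(f"threshold={threshold:.2f}  accuracy={acc:.3f} ± {half_width:.3f}")
```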
Why Test? - There’s a bit of chaos
• Small differences in input can wildly change the output.
• The test set gives us some certainty, but “99.9% secure” is not secure.
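A minimal sketch of that sensitivity, using a hypothetical linear classifier (stdlib only, not any model from the talk): in 1,000 dimensions, a per-coordinate nudge of a few hundredths, pointed against the weights, is enough to flip the decision.

```python
# Minimal sketch, stdlib only; a hypothetical 1,000-dimensional linear model.
import random

random.seed(1)
dim = 1000
w = [random.gauss(0, 1) for _ in range(dim)]  # model weights
x = [random.gauss(0, 1) for _ in range(dim)]  # some input

def score(v):
    # Decision rule: positive score -> class A, negative -> class B.
    return sum(wi * vi for wi, vi in zip(w, v))

# Smallest uniform per-coordinate budget that crosses the decision boundary,
# moving each coordinate against the sign of its weight (FGSM-style attack
# specialized to a linear model).
eps = abs(score(x)) / sum(abs(wi) for wi in w) * 1.01
direction = -1 if score(x) > 0 else 1
x_adv = [xi + direction * eps * (1 if wi > 0 else -1) for xi, wi in zip(x, w)]

print(f"original score:  {score(x):+.2f}")
print(f"perturbed score: {score(x_adv):+.2f}")   # sign flipped
print(f"per-coordinate change: {eps:.4f}")       # tiny relative to the inputs
```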
Why Test? - Unable to verify
• Even the simplest networks are horrifically complex.
• Small models for MNIST, the “unit test” of ML, do not have provable outputs.
Why Test? - Bias
Small differences in the ratio of classes in the training data are exacerbated by the ML model. This happens in malware models! (See the sketch below.)
From: https://arxiv.org/pdf/2303.11408
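A minimal sketch of the amplification effect, with synthetic data and hypothetical numbers (stdlib only, not the cited paper's experiment): a class that is 10% of the training data ends up as only ~2–3% of an accuracy-optimal model's predictions.

```python
# Minimal sketch, stdlib only; synthetic overlapping classes, 10% minority.
import random

random.seed(2)

def sample(n, minority_rate=0.10):
    data = []
    for _ in range(n):
        label = random.random() < minority_rate       # True = minority class
        x = random.gauss(1.0 if label else -1.0, 1.5)
        data.append((x, label))
    return data

train, test = sample(10000), sample(10000)

# "Training": the accuracy-optimal threshold, as an accuracy-driven model
# would find it.
t = min((c / 50 for c in range(-250, 250)),
        key=lambda th: sum((x > th) != y for x, y in train))

pred_rate = sum(x > t for x, _ in test) / len(test)
minority = [x for x, y in test if y]
recall = sum(x > t for x in minority) / len(minority)

print(f"threshold: {t:.2f}")
print(f"minority share of data: 10.0%, of predictions: {pred_rate:.1%}")
print(f"minority recall: {recall:.1%}")
# Typical output: ~2-3% of predictions, recall well under 20% -- the 10%
# imbalance in the data is exacerbated, not just reflected.
```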
Why Test? - How much?
• Adversarial Setting:
  – Security: Mislabeling any malware, phishing, or spam as benign could lead to a breach.
• Generative Setting:
  – LLMs: People use ChatGPT for hiring. Have they tested for bias against Alaskan Inuit communities?
  – Image Generation: Stanford’s Internet Observatory found 1,679 CSAM images in LAION-5B.
What We Did at DEFCON 31
Private testing is a good start, public testing is where it’s at.
GRT - Format
• Capture The Flag format with various categories:
  – Privacy
  – Safety and Effectiveness
  – Algorithmic Discrimination
  – Notice and Explanation
• Each challenge could be solved on any of the 8 models
  – Models were randomly chosen and anonymized
• Platform built by Scale based on an open-source CTFd prototype
GRT - Warm-up learnings
• We ran on Pythia-12B fine-tuned by Open Assistant
• We expected “not even once” errors
  – Any single example of these violates the model creator’s intent
  – For example: “My kids are 3, 5, and 6” led to an elaborate kidnap and torture plan.
• We did not expect bias errors
  – We did not provide a way to prove societal bias.
GRT - Shortcoming
• Bias
  – Bias is a statistical property of the model
  – Bias is proven by running many examples through the model and applying a statistical test (see the sketch below)
From: https://openreview.net/forum?id=erl90pLIH0
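As an illustration of what such a statistical test can look like, here is a standard two-proportion z-test on made-up counts (stdlib only; this is not the cited paper's method, just one common choice):

```python
# Minimal sketch, stdlib only; the counts are made up for illustration.
import math

def two_proportion_z(pos_a, n_a, pos_b, n_b):
    """Two-sided two-proportion z-test: are the groups' rates plausibly equal?"""
    p = (pos_a + pos_b) / (n_a + n_b)                 # pooled rate under H0
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    z = (pos_a / n_a - pos_b / n_b) / se
    return z, math.erfc(abs(z) / math.sqrt(2))        # two-sided p-value

# Hypothetical resume screening: group A passes 300/1000, group B 240/1000.
z, p = two_proportion_z(300, 1000, 240, 1000)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
# Small p: a disparity this large is unlikely to be sampling noise. One
# example proves nothing; two thousand examples support a statistical claim.
```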
CFE - Overview
1. Base Decisions on Model Cards.
   a. These give intent and scope for the hackers
2. Require datasets be submitted with the report.
   a. Lets you test more than just “not even once”
3. Give the adjudication committee some form of access to the model to resolve disputes.
   a. Reporters can cherry pick, and vendors can prevent reproducibility
Model Cards
• Verifiable statements about the model’s performance.
• All major models have one.
  – It’s not standardized and some are worse than others.
• Providing one should be a bare minimum for a model purchase.
From: https://arxiv.org/abs/1810.03993
Report Datasets
• These are statistical beasts.
• You prove the validity of the report with a statistical argument.
• Therefore, the proof of concept can’t be code, an input, or any singular object. It has to be a collection of them (see the sample-size sketch below).
  – Sometimes, if the harm is bad enough, it’s a dataset of 1.
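A rough illustration of why a single example is not a proof of concept: the standard sample-size formula for a two-proportion test, with hypothetical rates, alpha, and power of my choosing, says you need on the order of hundreds of samples per group before a claimed disparity is distinguishable from noise.

```python
# Back-of-the-envelope sample size for a two-proportion test, stdlib only.
# The rates, alpha, and power below are hypothetical choices.
p1, p2 = 0.30, 0.24            # suspected pass rates for the two groups
z_alpha, z_beta = 1.96, 0.84   # normal quantiles: alpha = 0.05 (two-sided), power = 0.80

n = ((z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))) / (p1 - p2) ** 2
print(f"~{n:.0f} samples per group needed")  # ~855: a screenshot is not a report
```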
Adjudication
• A malicious reporter can cherry-pick data to make a false report that looks legitimate.
  – Run an evaluation of 10,000 resumes and pick a subset of 200 that proves your false point (see the sketch below).
• Vendors can also claim this happened, or modify the outputs after the report to remove reproducibility.
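A minimal sketch of the cherry-picking risk, with synthetic data: the records below are generated with no group disparity at all, yet hand-picking 200 of the 10,000 manufactures an apparently significant result (reusing the hypothetical two-proportion test from the earlier sketch).

```python
# Minimal sketch, stdlib only; all records are synthetic and generated with
# NO group disparity.
import math
import random

random.seed(3)

def two_proportion_z(pos_a, n_a, pos_b, n_b):
    p = (pos_a + pos_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    z = (pos_a / n_a - pos_b / n_b) / se
    return z, math.erfc(abs(z) / math.sqrt(2))

# 10,000 resumes, two groups, identical 30% pass rate: no real bias.
records = [(random.choice("AB"), random.random() < 0.30) for _ in range(10_000)]

# Cherry-pick 200 records: favor "group A passed" and "group B failed".
picked = sorted(records, key=lambda r: (r[0] == "A") == r[1], reverse=True)[:200]

def test(rows):
    a = [passed for group, passed in rows if group == "A"]
    b = [passed for group, passed in rows if group == "B"]
    return two_proportion_z(sum(a), len(a), sum(b), len(b))

print("full 10,000:        z=%+.2f  p=%.3f" % test(records))   # no effect
print("cherry-picked 200:  z=%+.2f  p=%.1e" % test(picked))    # fake "bias"
```

This is why the adjudication committee needs some access to the model: with it, they can rerun the full evaluation instead of trusting the reporter's subset.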
CFE - Evaluating a Malware Model
• A model card can be simple:
  – We guarantee a False Negative Rate (FNR) of 0.1% and a False Positive Rate (FPR) of 0.01%
• Or really hard:
  – We guarantee an FNR of 0.1% and an FPR of 0.01% across all customers
The first is easily checked (see the sketch below); the second is not…
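A minimal sketch of checking the first guarantee, with hypothetical counts (stdlib only): run a labeled malware set through the model, then ask, via an exact binomial tail, whether the observed misses are consistent with "FNR ≤ 0.1%".

```python
# Minimal sketch, stdlib only; the test-set size and miss count are hypothetical.
def binom_pvalue_upper(k, n, rate):
    """P[X >= k] for X ~ Binomial(n, rate), via a numerically stable
    iterative pmf (avoids huge binomial coefficients)."""
    pmf = (1 - rate) ** n
    cdf = 0.0
    for i in range(k):                      # accumulate P[X <= k-1]
        cdf += pmf
        pmf *= (n - i) / (i + 1) * rate / (1 - rate)
    return 1 - cdf

n_malware = 50_000   # labeled malicious samples run through the model
misses = 73          # of those, classified benign (false negatives)

p = binom_pvalue_upper(misses, n_malware, rate=0.001)  # claimed FNR: 0.1%
print(f"observed FNR = {misses / n_malware:.3%}, p = {p:.1e}")
# If the 0.1% guarantee held, 73+ misses in 50,000 would be very unlikely.
```

The "across all customers" version is hard precisely because no single tester has a representative sample of every customer's traffic.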
CFE - Evaluating a Malware Model
• Complications
  – We consider Potentially Unwanted Apps like adware to be malicious.
  – We do not evaluate on binaries exclusive to Windows 7 and earlier.
  – We know that the model classifies packed binaries as benign.
These caveats are needed because the models can be limited, but still useful.
Apply What You Have Learned Today
• This week you should:
  – Look for verifiable model performance statements in marketing to distinguish hype from reality
• In the following few weeks:
  – Look at the model cards of various open-source models and compare them
• Within six months you should:
  – Read a few AI ethics papers from venues like FAccT to see how to make these statistical arguments.