Software Testing and Engineering
for AI Systems (DSAIT4015)
Lecturers: Cynthia Liem and Annibale Panichella

1
Logistics
• Please enroll your project group (4 people) on Brightspace
• Let us know if we need to instantiate more groups
2
Data Validation and Validity
Lecturer: Cynthia Liem
3
What Can Go Wrong With Data?
4
What Can Go Wrong With Data?
5
I will not so much speak about database schema violations,
but rather about gaps between data and human interpretation
[Slides 6-7: ML pipeline diagrams with components Data, Pre-process, Labels, ML Training, Optimized Model, Application; the second diagram adds Problem and Decisions.]
Is This a Dog?
8
Examples by Leonhard Applis
Source: https://nl.pinterest.com/pin/806003664560130745/ Source: https://www.istockphoto.com/nl/foto/wolf-pup-gm474625522-64803037
Is This a Dog?
9
Examples by Leonhard Applis
Is This a Dog?
10
Examples by Leonhard Applis
Is This a Dog?
11
Examples by Leonhard Applis
Source: https://disney.fandom.com/wiki/Goofy
Oracle Issues in Machine
Learning and Where To Find Them
Cynthia C. S. Liem and Annibale Panichella
12
https://dl.acm.org/doi/abs/10.1145/3387940.3391490
Use Case: Visual Object Recognition
13
Technology Review, 2014
Quartz, 2017
Visual Object Recognition
14-18
• Standardization of image dimensions
• [R,G,B] pixel intensities for x
• y: vector of ground truth class probabilities, maximum likelihood optimization
• Models will output an estimated probability vector ŷ
[Diagram: input x is mapped by the model f(x) to ŷ, compared against the ground truth y, e.g.
P(y = goldfish) = 0.0, P(y = beagle) = 1.0, P(y = volcano) = 0.0, P(y = shower curtain) = 0.0, …]
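To make the x / y / ŷ notation above concrete, here is a small illustrative sketch (toy numbers, restricted to four of the 1,000 ILSVRC2012 classes):

```python
# Illustrative only: a one-hot ground-truth vector y and a hypothetical model
# output y_hat for a beagle image, restricted to four classes for readability.
import numpy as np

classes = ["goldfish", "beagle", "volcano", "shower curtain"]

y = np.array([0.0, 1.0, 0.0, 0.0])          # ground truth: the image shows a beagle
y_hat = np.array([0.05, 0.80, 0.05, 0.10])  # estimated probabilities, sum to 1

for c, p_true, p_pred in zip(classes, y, y_hat):
    print(f"P(y = {c}) = {p_true:.2f}   predicted {p_pred:.2f}")
```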
19
Looking at ŷ
[Bar charts over the classes Goldfish, Beagle, Volcano and Shower Curtain: left, a confident prediction; right, a prediction that is not clear-cut.]
20
Shannon Entropy
H(ŷ) = − Σᵢ P(y = i | x) log₂ P(y = i | x)
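As a minimal sketch (NumPy, with illustrative probability vectors), the entropy of ŷ separates the two situations shown on the previous slide:

```python
# Shannon entropy (base 2) of a predicted probability vector y_hat.
import numpy as np

def entropy(y_hat, eps=1e-12):
    y_hat = np.asarray(y_hat, dtype=float)
    return float(-np.sum(y_hat * np.log2(y_hat + eps)))  # eps guards against log2(0)

confident = [0.97, 0.01, 0.01, 0.01]        # mass concentrated on one class
not_clear_cut = [0.25, 0.25, 0.25, 0.25]    # mass spread evenly

print(entropy(confident))      # ~0.24 bits: low entropy
print(entropy(not_clear_cut))  # 2.0 bits: maximum for four classes
```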
21
Looking at ŷ
[Bar charts over Goldfish, Beagle, Volcano and Shower Curtain: left, a low-entropy prediction; right, a high-entropy prediction.]
22
Semantic Information
• Labels in object recognition are not independent
• Pictures can contain multiple objects
• Semantic relations
Figure from https://openscience.com/wordnet-open-access-data-in-linguistics/
WordNet
23
https://wordnet.princeton.edu
• Synonyms: pairs of class labels that
have the same meaning.
• Homonyms: pairs of class labels
that are spelled and pronounced the
same, but that have different meanings
• Meronyms: pairs of class labels
linked by a part-of relation
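For illustration, these relations can be queried programmatically; a minimal sketch using NLTK's WordNet interface (assuming the wordnet corpus has been downloaded; the example synsets may differ slightly between WordNet versions):

```python
# Sketch: exploring WordNet relations with NLTK.
# Requires: pip install nltk; python -c "import nltk; nltk.download('wordnet')"
from nltk.corpus import wordnet as wn

laptop = wn.synset("laptop.n.01")
print(laptop.lemma_names())    # lemmas in the same synset, i.e. synonyms
print(laptop.hypernyms())      # broader concepts this synset belongs to

dog = wn.synset("dog.n.01")
print(dog.part_meronyms())     # part-of relations (meronyms) of 'dog'
```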
ImageNet
24
• Large-scale hierarchical image database
• Crowdsourced annotation: Is there an [X] in this image?
• ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 2012
• ‘The’ object recognition benchmark challenge
• New models benchmarked ‘on ImageNet’:
• Trained on ILSVRC2012 training set
• Evaluated on ILSVRC2012 validation set
(top-1 and top-5 accuracy)
• Well-performing model weights often released for use
in transfer learning
Setup
• 4 ‘classical’ deep architectures (VGG16, VGG19, ResNet50, ResNet101)
• pre-trained on ILSVRC2012, weights released through Keras
• predictions run on all 50,000 ILSVRC2012 validation images
• application of original pre-processing methods
• we use our heuristics to surface striking outliers
26
VGG16: 90.0% top-5 accuracy | VGG19: 90.1% top-5 accuracy | ResNet50: 92.1% top-5 accuracy | ResNet101: 92.8% top-5 accuracy
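The top-5 numbers above come down to checking whether the ground-truth class index is among each image's five highest-probability classes; a hedged sketch (random stand-in data, not the actual ILSVRC2012 predictions):

```python
# Sketch: top-k accuracy from an (N, 1000) matrix of predicted class probabilities
# and a length-N vector of ground-truth class indices.
import numpy as np

def topk_accuracy(probs, labels, k=5):
    topk = np.argsort(probs, axis=1)[:, -k:]        # indices of the k largest probabilities
    hits = np.any(topk == labels[:, None], axis=1)  # is the ground truth among them?
    return float(hits.mean())

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(1000), size=50)       # stand-in for model outputs
labels = rng.integers(0, 1000, size=50)             # stand-in for validation labels

print(topk_accuracy(probs, labels, k=1), topk_accuracy(probs, labels, k=5))
```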
High Entropy: Kneepad
27-28
• None of the models recognized the ground truth class in the top-5
• All models consistently showed high entropy in ŷ
• Due to standardization, only the middle part of the image is offered for prediction
Low Entropy: Bucket
29-30
• None of the models recognized the ground truth class in the top-5
• All models were consistently convinced (class probability of 1.0) that this image should be labeled as baseball
• Shortcoming of single-class labeling
Synonyms: Laptop
• Frequent top-1 confusions between laptop and notebook
• Looking at class probabilities, models do not ‘see’ synonym classes as close together
31
[Paper excerpt: top-5 classifications for a laptop image (original and cropped). vgg16: notebook (0.7222), laptop (0.1866), desktop computer (0.0244), space bar (0.0097), solar dish (0.0092); vgg19: notebook (0.7327), laptop (0.1178), desktop computer, space bar (0.0243), hand-held computer. Bar chart of notebook vs. laptop class probabilities for vgg16, vgg19, ResNet50 and ResNet101.]
The World View Depicted in ILSVRC2012
• Not representative of the real world
• > 100 sub-species of dogs (5 cats)
• 1 red wine (no white wine)
• 1 Granny Smith (no other apples)
• 1 carbonara (no other pasta)
32
Carbonara
33
The ImageNet view:
equivalent exemplars
The Italian view
ImageNet’s Origins
34
Shankar et al. - No classification without representation:
Assessing geodiversity issues in open data sets for the developing world
https://arxiv.org/abs/1711.08536
Cultural Concepts
35
Cultural Concepts
36
What Is the Representative Sample?
• Criticism in psychology: samples often drawn entirely from Western, Educated,
Industrialized, Rich, and Democratic (WEIRD) societies
• “our review of the comparative database from across the behavioral sciences suggests
both that there is substantial variability in experimental results across populations and that
WEIRD subjects are particularly unusual compared with the rest of the species”
• “The findings suggest that members of WEIRD societies, including young children, are
among the least representative populations one could find for generalizing about humans.”
37
https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/abs/weirdest-people-in-the-world/BF84F7517D56AFF7B7EB58411A554C17
Non-Visual (and Stereotypical) Concepts
38
Bad Person, Call Girl, Drug Addict, Closet
Queen, Convict, Crazy, Failure, Flop, Fucker,
Hypocrite, Jezebel, Kleptomaniac, Loser,
Melancholic, Nonperson, Pervert, Prima
Donna, Schizophrenic, Second-Rater, Spinster,
Streetwalker, Stud, Tosser, Unskilled Person,
Wanton, Waverer, and Wimp
Is there an [X] in this image?
Crawford & Paglen - excavating.ai
Validity
39
Psychometrics
40
• Measuring constructs, which are not directly observable
• The measurement instrument is also known as a
psychological test (warning: vocabulary clash with software)
An Instrument Is Sound…
41
• …if it is valid
• …and if it is reliable
Reliability
42
• Internal consistency
• Test-retest reliability
• In case of subjective tests:
• Inter-rater reliability
• Intra-rater reliability
Validity vs. Reliability
43
Construct Validity
44
• Extent to which variables of an experiment correspond to the
theoretical meaning of the concept they purport to measure.
A Famous Violation of Construct Validity
45
‘Horse’ Systems
46
• Do not actually address the problem they
appear to be solving
• Only a ‘horse’ in relation to a specific problem
• Hence, a ‘horse’ for one problem may not be
one for another:
• ‘Reproduce ground truth by XYZ’ vs.
• ‘Reproduce ground truth by any means’
Bob Sturm
Content Validity
47
• Extent to which the experimental units reflect and represent the
elements of the domain under study.
Criterion Validity
48
• Extent to which results of an experiment are correlated with
those of other experiments already known to be valid.
• Concurrent: how does a new test/measurement compare against
a validated test/measurement?
• Predictive: how well does a test/measurement predict a future
outcome?
This Is a Validated Instrument
49
• ‘Big Five’ Personality
• Openness
• Conscientiousness
• Extraversion
• Agreeableness
• Neuroticism
This Is Not a Validated Instrument
50
• Myers-Briggs
This Is Not a Validated Instrument
51
Liem et al. - Psychology Meets Machine Learning:
Interdisciplinary Perspectives on Algorithmic Job
Candidate Screening
https://repository.tudelft.nl/islandora/object/uuid%3Ab27e837a-4844-4745-b56d-3efe94c61f0f
Consequences
52
Liem et al. - Psychology Meets Machine Learning:
Interdisciplinary Perspectives on Algorithmic Job
Candidate Screening
https://repository.tudelft.nl/islandora/object/uuid%3Ab27e837a-4844-4745-b56d-3efe94c61f0f
Another Problematic Example
53
Wu & Zhang - Automated Inference on
Criminality using Face Images
https://arxiv.org/abs/1611.04135
Challenges of Multimedia Data
• Raw data: many numbers
• 44.1 kHz audio: 44,100 measurements per second
• Record me for 45 mins: 119,070,000 measurements
• RGB images: 224 x 224 x 3 pixels = 150,528 intensity values
54
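The numbers on this slide are simple arithmetic; as a quick check:

```python
# Quick check of the raw-data sizes quoted above.
sample_rate = 44_100                   # Hz
audio_samples = sample_rate * 60 * 45  # 45 minutes of mono audio
print(audio_samples)                   # 119070000 measurements

image_values = 224 * 224 * 3           # RGB pixels fed to the ImageNet models
print(image_values)                    # 150528 intensity values
```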
Challenges of Multimedia Data
55
Can’t Trust the Feeling?
How Open Data Reveals Unexpected
Behavior of High-Level Music Descriptors
Cynthia C. S. Liem and Chris Mostert
56
https://archives.ismir.net/ismir2020/paper/000137.pdf
Automatic Music Description
• Critical to content-based music information retrieval
• Only way for non-content owners to perform large-scale research
• Leading to Grander Statements on the Nature of Music
57
But Can We Trust the Descriptors?
• Successful performance reported in papers.
• How does this extend to ‘in-the-wild’ situations?
58
AcousticBrainz
• Community locally computes descriptor values, using open-
source Essentia library.
• Submissions (with metadata) collected per MusicBrainz
Recording ID.
• High-level descriptors are machine learning-based, and include
classifier confidence.
59
AcousticBrainz
• Anyone can submit anything…so we don’t know what the
output should be?
• In psychology and software engineering, ‘testing’ can go beyond
‘known truths’, exploiting known relationships.
60
Multiple Recording Submissions
• Inspired by software testing (derived oracles / differential testing)
• If only the codec changes, songs remain semantically equivalent.
• One would assume
classify_c(my_preprocessing(m)) ==
classify_c(your_preprocessing(m))
61
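A hedged sketch of how such a derived-oracle (differential) check could look; classify_c and the preprocessing variants are hypothetical stand-ins for the real Essentia/AcousticBrainz pipeline:

```python
# Differential check (hypothetical function names): semantically equivalent
# submissions of the same recording should receive the same classification.
def differential_check(recording, classify_c, preprocessings, tol=0.05):
    """Return preprocessing pairs whose classifier outputs disagree beyond tol."""
    outputs = {name: classify_c(pre(recording)) for name, pre in preprocessings.items()}
    names = list(outputs)
    disagreements = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(outputs[a] - outputs[b]) > tol:   # e.g. confidence for one class
                disagreements.append((a, b, outputs[a], outputs[b]))
    return disagreements
```

Here classify_c is assumed to return a single confidence value; with full probability vectors one would compare distributions instead.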
Not Quite!
62
‘Constructs Known To Relate’
• Inspired by psychological testing (construct validity)
• Same input is run through multiple classifiers, targeting the same
concept.
63
‘Constructs Known To Relate’
64-65
• genre_rosamerica classifier was 90.74 % accurate on rock.
• genre_tzanetakis classifier was 60 % accurate on rock.
• Pearson correlation between (genre_rosamerica, rock) and (genre_tzanetakis, rock) classifications in AcousticBrainz: -0.07
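The correlation itself is a one-liner; a sketch with scipy.stats.pearsonr and illustrative stand-in confidences (not the actual AcousticBrainz data):

```python
# Sketch: Pearson correlation between two classifiers' 'rock' outputs
# over the same set of recordings.
import numpy as np
from scipy.stats import pearsonr

rosamerica_rock = np.array([0.9, 0.1, 0.8, 0.2, 0.7])  # stand-in confidences
tzanetakis_rock = np.array([0.2, 0.8, 0.1, 0.9, 0.3])

r, p_value = pearsonr(rosamerica_rock, tzanetakis_rock)
print(r)  # strongly negative here; on AcousticBrainz the reported value was -0.07
```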
‘Constructs Known To Relate’
66
Strange Confidence Distributions
67
Strange Confidence Distributions
68
• Peak vs. non-peak distributional differences are especially large
for bit rate, codec and low-level extractor software versions.
• We hardly consider these in high-level descriptor evaluation!
What Can We Do?
69
Better Articulation of Underlying Assumptions
70
• Are there any assumptions of underlying distributions, and are
they actually met?
• What is ‘the universe’ that should be represented?
Better Awareness & Standards on
Measurement and Annotation
71
• https://conjointly.com/kb/measurement-in-research/
• Aroyo & Welty - Truth Is a Lie: CrowdTruth and the Seven Myths of Human
Annotation https://ojs.aaai.org/index.php/aimagazine/article/view/2564
• Welty, Paritosh & Aroyo, “Metrology for AI: From Benchmarks to Instruments”,
https://arxiv.org/abs/1911.01875
• Jacobs and Wallach, “Measurement and Fairness”, https://dl.acm.org/doi/10.1145/3442188.3445901
Better Documentation
72
• Often inspired by data provenance in databases
• Complements to Data Protection Impact Assessments
• Gebru et al., Datasheets for Datasets, https://arxiv.org/abs/1803.09010
• Jo & Gebru, Lessons from Archives: strategies for collecting
sociocultural data in machine learning, https://dl.acm.org/doi/abs/10.1145/3351095.3372829
Stronger Requirements
73
• “The AI should classify images of dogs”
vs.
• “The system should return true for photographs containing household-dogs.
Other similar species, such as wolves, should return false. Images that contain
dogs, but other items as well, should return true.”
• Ahmad et al., What’s up with Requirements Engineering for Artificial
Intelligence Systems? https://raw.githubusercontent.com/nzjohng/publications/master/papers/re2021_1.pdf
• More in upcoming lectures
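One way to make the stronger requirement testable is to encode it directly as labelled examples; a minimal, hypothetical sketch (the file names and the is_dog predicate are placeholders, not part of any real system):

```python
# Hypothetical executable version of the requirement above (pytest style).
import pytest
from my_system import is_dog  # placeholder for the system under test

@pytest.mark.parametrize("path,expected", [
    ("examples/beagle.jpg", True),        # household dog -> True
    ("examples/wolf.jpg", False),         # similar species -> False
    ("examples/dog_and_sofa.jpg", True),  # dog plus other items -> True
])
def test_dog_requirement(path, expected):
    assert is_dog(path) == expected
```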
Automated Tooling
74
• Northcutt et al., Confident Learning: Estimating Uncertainty in
Dataset Labels, https://jair.org/index.php/jair/article/view/12125/26676 | https://github.com/cleanlab/cleanlab
• Breck et al., Data Validation for Machine Learning, https://mlsys.org/Conferences/2019/doc/2019/167.pdf
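As an illustration of the first reference, cleanlab can flag likely label issues given out-of-sample predicted probabilities; a small sketch with toy arrays (API as in cleanlab 2.x, so details may differ per version):

```python
# Sketch: flagging likely label issues with cleanlab's confident learning.
import numpy as np
from cleanlab.filter import find_label_issues

labels = np.array([0, 1, 1, 0, 2, 2])                  # given (possibly noisy) labels
pred_probs = np.array([[0.90, 0.05, 0.05],
                       [0.10, 0.80, 0.10],
                       [0.85, 0.10, 0.05],             # model disagrees with label 1
                       [0.80, 0.15, 0.05],
                       [0.05, 0.05, 0.90],
                       [0.10, 0.10, 0.80]])

issues = find_label_issues(labels=labels, pred_probs=pred_probs,
                           return_indices_ranked_by="self_confidence")
print(issues)  # indices of examples whose labels look suspicious
```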
Be Aware of Researcher Degrees of Freedom
• We have some flexibility in data collection and analysis (e.g.
choices of normalization, hyperparameters, etc.)
• This may actually affect results and final conclusions!
75
Simmons et al. - False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting
Anything as Significant https://journals.sagepub.com/doi/10.1177/0956797611417632
McFee et al. - Open-Source Practices for Music Signal Processing Research: Recommendations for Transparent,
Sustainable, and Reproducible Audio Research https://sinc-lab.com/files/mcfee2019opensource.pdf
Kim et al. - Beyond Explicit Reports: Comparing Data-Driven Approaches to Studying Underlying Dimensions of
Music Preference https://dl.acm.org/doi/10.1145/3320435.3320462
Liem and Panichella - Run, Forest, Run? On Randomization and Reproducibility in Predictive Software Engineering
https://arxiv.org/abs/2012.08387
Further Translations of Testing Concepts?
76
• Software: Coverage? Input diversity? Edge cases?
• Psychology: Further equivalents to validity assessment?
Articulation of Desired Policy?
77
• To be discussed in upcoming lectures on fairness
For Now - Think of These Questions in
Connection to the Assignment Dataset
78
• What would make for a ‘better’ or ‘worse’ dataset?
• If you could test this data more thoroughly, what would you test
for?
