5. What Can Go Wrong With Data?
I will not speak so much about database schema violations, but rather about gaps between data and human interpretation
8. Is This a Dog?
Examples by Leonhard Applis
Source: https://nl.pinterest.com/pin/806003664560130745/
Source: https://www.istockphoto.com/nl/foto/wolf-pup-gm474625522-64803037
11. Is This a Dog?
Source: https://disney.fandom.com/wiki/Goofy
12. Oracle Issues in Machine Learning and Where To Find Them
Cynthia C. S. Liem and Annibale Panichella
https://dl.acm.org/doi/abs/10.1145/3387940.3391490
14. Visual Object Recognition
• Standardization of image dimensions
• [R,G,B] pixel intensities for x
• vector y of ground truth class probabilities, maximum likelihood optimization of f(x)
• models will output an estimated probability vector ŷ
[Figure: example image x with predicted probabilities, e.g. P(ŷ = goldfish) = 0.0, P(ŷ = beagle) = 1.0, P(ŷ = volcano) = 0.0, P(ŷ = shower curtain) = 0.0, …]
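As a sketch of how such a probability vector ŷ arises: classification models typically apply a softmax over raw scores, yielding non-negative values that sum to 1. The logits below are made up for illustration, not actual model outputs.

```python
import math

def softmax(logits):
    """Convert raw model scores into a probability vector that sums to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical logits for four of the 1,000 ILSVRC2012 classes
classes = ["goldfish", "beagle", "volcano", "shower curtain"]
y_hat = softmax([-4.0, 9.0, -5.0, -6.0])
for c, p in zip(classes, y_hat):
    print(f"P(y_hat = {c}) = {p:.4f}")
```

With one logit far above the rest, nearly all probability mass lands on a single class, mirroring the one-hot-looking vector on the slide.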
24. ImageNet
• Large-scale hierarchical image database
• Crowdsourced annotation: Is there an [X] in this image?
• ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 2012
• ‘The’ object recognition benchmark challenge
• New models benchmarked ‘on ImageNet’:
• Trained on ILSVRC2012 training set
• Evaluated on ILSVRC2012 validation set (top-1 and top-5 accuracy)
• Well-performing model weights often released for use in transfer learning
26. Setup
• 4 ‘classical’ deep architectures (VGG16, VGG19, ResNet50, ResNet101)
• pre-trained on ILSVRC2012, weights released through Keras
• predictions run on all 50,000 ILSVRC2012 validation images
• application of original pre-processing methods
• we use our heuristics to surface striking outliers
VGG16: 90.0% top-5 accuracy | VGG19: 90.1% top-5 accuracy | ResNet50: 92.1% top-5 accuracy | ResNet101: 92.8% top-5 accuracy
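Top-5 accuracy, as reported above, counts a prediction as correct when the ground-truth class appears among the five highest-probability classes. A minimal sketch of that metric (the probability vectors and indices below are illustrative, not actual model outputs):

```python
def top_k_correct(prob_vector, true_idx, k=5):
    """True if true_idx is among the k highest-probability classes."""
    ranked = sorted(range(len(prob_vector)),
                    key=lambda i: prob_vector[i], reverse=True)
    return true_idx in ranked[:k]

def top_k_accuracy(prob_vectors, true_indices, k=5):
    """Fraction of examples whose ground truth is in the top-k prediction."""
    hits = sum(top_k_correct(p, t, k)
               for p, t in zip(prob_vectors, true_indices))
    return hits / len(true_indices)
```

Run over all 50,000 validation predictions with k=1 and k=5, this yields the top-1 and top-5 numbers reported for each architecture.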
28. High Entropy: Kneepad
• None of the models recognized the ground truth class in the top-5
• All models consistently showed high entropy in ŷ
• Due to standardization, only the middle part of the image is offered for prediction
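The entropy of ŷ quantifies how spread out the probability mass is: a one-hot prediction has entropy 0, a uniform one the maximum. A small sketch with made-up four-class vectors:

```python
import math

def entropy(prob_vector):
    """Shannon entropy (in bits) of a probability vector; 0 for one-hot."""
    s = sum(p * math.log2(p) for p in prob_vector if p > 0)
    return -s if s else 0.0

confident = [1.0, 0.0, 0.0, 0.0]       # all mass on one class
uncertain = [0.25, 0.25, 0.25, 0.25]   # mass spread evenly
print(entropy(confident))  # 0.0
print(entropy(uncertain))  # 2.0
```

The kneepad case corresponds to the high-entropy situation: no class stands out, so the models are effectively signalling that they do not know.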
30. Low Entropy: Bucket
• None of the models recognized the ground truth class in the top-5
• All models consistently were convinced (class probability of 1.0) that this image should be labeled as baseball
• Shortcoming of single-class labeling
31. Synonyms: Laptop
• Frequent top-1 confusions between laptop and notebook
• Looking at class probabilities, models do not ‘see’ synonym classes as close together
[Figure: top-5 classifications for a laptop image (original and cropped). VGG16: notebook (0.7222), laptop (0.1866), desktop computer (0.0244), space bar (0.0097), solar dish (0.0092). VGG19: notebook (0.7327), laptop (0.1178), desktop computer, space bar (0.0243), hand-held computer. Bar chart of notebook vs. laptop probabilities for VGG16, VGG19, ResNet50, ResNet101.]
32. The World View Depicted in ILSVRC2012
• Not representative of the real world
• > 100 sub-species of dogs (5 cats)
• 1 red wine (no white wine)
• 1 Granny Smith (no other apples)
• 1 carbonara (no other pasta)
34. ImageNet’s Origins
Shankar et al. - No classification without representation: Assessing geodiversity issues in open data sets for the developing world
https://arxiv.org/abs/1711.08536
37. What Is the Representative Sample?
• Criticism in psychology: samples often drawn entirely from Western, Educated, Industrialized, Rich, and Democratic (WEIRD) societies
• “our review of the comparative database from across the behavioral sciences suggests both that there is substantial variability in experimental results across populations and that WEIRD subjects are particularly unusual compared with the rest of the species”
• “The findings suggest that members of WEIRD societies, including young children, are among the least representative populations one could find for generalizing about humans.”
https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/abs/weirdest-people-in-the-world/BF84F7517D56AFF7B7EB58411A554C17
38. Non-Visual (and Stereotypical) Concepts
Bad Person, Call Girl, Drug Addict, Closet Queen, Convict, Crazy, Failure, Flop, Fucker, Hypocrite, Jezebel, Kleptomaniac, Loser, Melancholic, Nonperson, Pervert, Prima Donna, Schizophrenic, Second-Rater, Spinster, Streetwalker, Stud, Tosser, Unskilled Person, Wanton, Waverer, and Wimp
Is there an [X] in this image?
Crawford & Paglen - excavating.ai
40. Psychometrics
• Measuring constructs, which are not directly observable
• The measurement instrument is also known as a
psychological test (warning: vocabulary clash with software)
41. An Instrument Is Sound…
• …if it is valid
• …and if it is reliable
46. ‘Horse’ Systems
• Do not actually address the problem they
appear to be solving
• Only a ‘horse’ in relation to a specific problem
• Hence, a ‘horse’ for one problem may not be
one for another:
• ‘Reproduce ground truth by XYZ’ vs.
• ‘Reproduce ground truth by any means’
Bob Sturm
47. Content Validity
• Extent to which the experimental units reflect and represent the elements of the domain under study.
48. Criterion Validity
• Extent to which results of an experiment are correlated with
those of other experiments already known to be valid.
• Concurrent: how does a new test/measurement compare against
a validated test/measurement?
• Predictive: how well does a test/measurement predict a future
outcome?
49. This Is a Validated Instrument
• ‘Big Five’ Personality
• Openness
• Conscientiousness
• Extraversion
• Agreeableness
• Neuroticism
50. This Is Not a Validated Instrument
• Myers-Briggs
51. This Is Not a Validated Instrument
Liem et al. - Psychology Meets Machine Learning: Interdisciplinary Perspectives on Algorithmic Job Candidate Screening
https://repository.tudelft.nl/islandora/object/uuid%3Ab27e837a-4844-4745-b56d-3efe94c61f0f
52. Consequences
Liem et al. - Psychology Meets Machine Learning: Interdisciplinary Perspectives on Algorithmic Job Candidate Screening
https://repository.tudelft.nl/islandora/object/uuid%3Ab27e837a-4844-4745-b56d-3efe94c61f0f
53. Another Problematic Example
Wu & Zhang - Automated Inference on
Criminality using Face Images
https://arxiv.org/abs/1611.04135
54. Challenges of Multimedia Data
• Raw data: many numbers
• 44.1 kHz audio: 44,100 measurements per second
• Record me for 45 mins: 119,070,000 measurements
• RGB images: 224 x 224 x 3 pixels = 150,528 intensity values
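The measurement counts above follow directly from sample rate, duration, and image resolution; a quick sanity check of the arithmetic:

```python
# 44.1 kHz mono audio: samples per second times recording length in seconds
audio_samples = 44_100 * 45 * 60   # a 45-minute recording
# RGB image: width x height x 3 color channels
image_values = 224 * 224 * 3

print(audio_samples)  # 119070000
print(image_values)   # 150528
```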
56. Can’t Trust the Feeling? How Open Data Reveals Unexpected Behavior of High-Level Music Descriptors
Cynthia C. S. Liem and Chris Mostert
https://archives.ismir.net/ismir2020/paper/000137.pdf
57. Automatic Music Description
• Critical to content-based music information retrieval
• Only way for non-content owners to perform large-scale research
• Leading to Grander Statements on the Nature of Music
58. But Can We Trust the Descriptors?
• Successful performance reported in papers.
• How does this extend to ‘in-the-wild’ situations?
59. AcousticBrainz
• Community locally computes descriptor values, using open-source Essentia library.
• Submissions (with metadata) collected per MusicBrainz Recording ID.
• High-level descriptors are machine learning-based, and include classifier confidence.
60. AcousticBrainz
• Anyone can submit anything…so we don’t know what the
output should be?
• In psychology and software engineering, ‘testing’ can go beyond ‘known truths’, exploiting known relationships.
61. Multiple Recording Submissions
• Inspired by software testing (derived oracles / differential testing)
• If only the codec changes, songs remain semantically equivalent.
• One would assume
classify_c(my_preprocessing(m)) ==
classify_c(your_preprocessing(m))
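This invariant can be sketched as a small differential check. Everything below is a hypothetical stand-in for an actual descriptor pipeline (the real one runs Essentia classifiers); the point is only the shape of the test: same recording, different preprocessings, and the predicted label must not change.

```python
def differential_codec_check(classify, preprocessings, recording):
    """Flag a recording whose predicted class depends only on preprocessing.

    `classify` and each entry of `preprocessings` are hypothetical stand-ins
    for a real high-level descriptor pipeline.
    Returns (consistent, labels-per-preprocessing).
    """
    labels = {name: classify(pre(recording))
              for name, pre in preprocessings.items()}
    consistent = len(set(labels.values())) == 1
    return consistent, labels

# toy example: a 'codec change' that rescales the signal but keeps semantics
pre_mp3 = lambda m: m
pre_ogg = lambda m: [v * 0.5 for v in m]
toy_classifier = lambda m: "rock" if sum(m) > 0 else "other"

ok, labels = differential_codec_check(
    toy_classifier, {"mp3": pre_mp3, "ogg": pre_ogg}, [1.0, 2.0])
print(ok, labels)
```

A real classifier that fails this check is reacting to encoding artifacts rather than to musical content, which is exactly the kind of oracle issue the study looks for.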
63. ‘Constructs Known To Relate’
• Inspired by psychological testing (construct validity)
• Same input is run through multiple classifiers, targeting the same concept.
65. ‘Constructs Known To Relate’
• genre_rosamerica classifier was 90.74% accurate on rock.
• genre_tzanetakis classifier was 60% accurate on rock.
• Pearson correlation between (genre_rosamerica, rock) and (genre_tzanetakis, rock) classifications in AcousticBrainz: -0.07
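The correlation itself is a plain Pearson coefficient over paired classifier outputs. A minimal sketch with made-up per-recording ‘rock’ confidences (not AcousticBrainz data); two classifiers targeting the same construct should correlate strongly positively, which is exactly what the -0.07 above fails to show:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# hypothetical per-recording 'rock' confidences from two classifiers
rosamerica = [0.9, 0.8, 0.2, 0.1, 0.7]
tzanetakis = [0.1, 0.3, 0.8, 0.9, 0.2]
print(round(pearson(rosamerica, tzanetakis), 2))  # strongly negative
```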
68. Strange Confidence Distributions
• Peak vs. non-peak distributional differences are especially large
for bit rate, codec and low-level extractor software versions.
• We hardly consider these in high-level descriptor evaluation!
70. Better Articulation of Underlying Assumptions
• Are there any assumptions of underlying distributions, and are
they actually met?
• What is ‘the universe’ that should be represented?
71. Better Awareness & Standards on Measurement and Annotation
• https://conjointly.com/kb/measurement-in-research/
• Aroyo & Welty - Truth Is a Lie: CrowdTruth and the Seven Myths of Human Annotation, https://ojs.aaai.org/index.php/aimagazine/article/view/2564
• Welty, Paritosh & Aroyo - “Metrology for AI: From Benchmarks to Instruments”, https://arxiv.org/abs/1911.01875
• Jacobs and Wallach - “Measurement and Fairness”, https://dl.acm.org/doi/10.1145/3442188.3445901
72. Better Documentation
• Often inspired by data provenance in databases
• Complements to Data Protection Impact Assessments
• Gebru et al., Datasheets for Datasets, https://arxiv.org/abs/1803.09010
• Jo & Gebru, Lessons from Archives: strategies for collecting sociocultural data in machine learning, https://dl.acm.org/doi/abs/10.1145/3351095.3372829
73. Stronger Requirements
• “The AI should classify images of dogs”
vs.
• “The system should return true for photographs containing household-dogs. Other similar species, such as wolves, should return false. Images that contain dogs, but other items as well, should return true.”
• Ahmad et al., What’s up with Requirements Engineering for Artificial Intelligence Systems? https://raw.githubusercontent.com/nzjohng/publications/master/papers/re2021_1.pdf
• More in upcoming lectures
74. Automated Tooling
• Northcutt et al., Confident Learning: Estimating Uncertainty in Dataset Labels, https://jair.org/index.php/jair/article/view/12125/26676 | https://github.com/cleanlab/cleanlab
• Breck et al., Data Validation for Machine Learning, https://mlsys.org/Conferences/2019/doc/2019/167.pdf
75. Be Aware of Researcher Degrees of Freedom
• We have some flexibility in data collection and analysis (e.g. choices of normalization, hyperparameters, etc.)
• This may actually affect results and final conclusions!
Simmons et al. - False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant, https://journals.sagepub.com/doi/10.1177/0956797611417632
McFee et al. - Open-Source Practices for Music Signal Processing Research: Recommendations for Transparent, Sustainable, and Reproducible Audio Research, https://sinc-lab.com/files/mcfee2019opensource.pdf
Kim et al. - Beyond Explicit Reports: Comparing Data-Driven Approaches to Studying Underlying Dimensions of Music Preference, https://dl.acm.org/doi/10.1145/3320435.3320462
Liem and Panichella - Run, Forest, Run? On Randomization and Reproducibility in Predictive Software Engineering, https://arxiv.org/abs/2012.08387
76. Further Translations of Testing Concepts?
• Software: Coverage? Input diversity? Edge cases?
• Psychology: Further equivalents to validity assessment?
78. For Now - Think of These Questions in Connection to the Assignment Dataset
• What would make for a ‘better’ or ‘worse’ dataset?
• If you could test this data more thoroughly, what would you test for?