5. What Can Go Wrong With Data?
I will not speak so much about database schema violations, but rather about gaps between data and human interpretation
8. Is This a Dog?
Examples by Leonhard Applis
Source: https://nl.pinterest.com/pin/806003664560130745/
Source: https://www.istockphoto.com/nl/foto/wolf-pup-gm474625522-64803037
11. Is This a Dog?
Source: https://disney.fandom.com/wiki/Goofy
12. Oracle Issues in Machine Learning and Where To Find Them
Cynthia C. S. Liem and Annibale Panichella
https://dl.acm.org/doi/abs/10.1145/3387940.3391490
14. Visual Object Recognition
• Standardization of image dimensions
• [R,G,B] pixel intensities for x
• vector y of ground truth class probabilities, maximum likelihood optimization of f(x)
• models will output an estimated probability vector ŷ
[Figure: example image x with predicted probabilities, e.g. P(ŷ = goldfish) = 0.0, P(ŷ = beagle) = 1.0, P(ŷ = volcano) = 0.0, P(ŷ = shower curtain) = 0.0, …]
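As a sketch of how such a probability vector ŷ arises: classification models typically apply a softmax over raw scores, yielding non-negative values that sum to 1. The logits below are made up for illustration, not actual model outputs.

```python
import math

def softmax(logits):
    """Convert raw model scores into a probability vector that sums to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical logits for four of the 1,000 ILSVRC2012 classes
classes = ["goldfish", "beagle", "volcano", "shower curtain"]
y_hat = softmax([-4.0, 9.0, -5.0, -6.0])
for c, p in zip(classes, y_hat):
    print(f"P(y_hat = {c}) = {p:.4f}")
```

With one logit far above the rest, nearly all probability mass lands on a single class, mirroring the one-hot-looking vector on the slide.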
24. ImageNet
• Large-scale hierarchical image database
• Crowdsourced annotation: Is there an [X] in this image?
• ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 2012
• ‘The’ object recognition benchmark challenge
• New models benchmarked ‘on ImageNet’:
• Trained on ILSVRC2012 training set
• Evaluated on ILSVRC2012 validation set (top-1 and top-5 accuracy)
• Well-performing model weights often released for use in transfer learning
26. Setup
• 4 ‘classical’ deep architectures (VGG16, VGG19, ResNet50, ResNet101)
• pre-trained on ILSVRC2012, weights released through Keras
• predictions run on all 50,000 ILSVRC2012 validation images
• application of original pre-processing methods
• we use our heuristics to surface striking outliers
VGG16: 90.0% top-5 accuracy | VGG19: 90.1% top-5 accuracy | ResNet50: 92.1% top-5 accuracy | ResNet101: 92.8% top-5 accuracy
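Top-5 accuracy, as reported above, counts a prediction as correct when the ground-truth class appears among the five highest-probability classes. A minimal sketch of that metric (the probability vectors and indices below are illustrative, not actual model outputs):

```python
def top_k_correct(prob_vector, true_idx, k=5):
    """True if true_idx is among the k highest-probability classes."""
    ranked = sorted(range(len(prob_vector)),
                    key=lambda i: prob_vector[i], reverse=True)
    return true_idx in ranked[:k]

def top_k_accuracy(prob_vectors, true_indices, k=5):
    """Fraction of examples whose ground truth is in the top-k prediction."""
    hits = sum(top_k_correct(p, t, k)
               for p, t in zip(prob_vectors, true_indices))
    return hits / len(true_indices)
```

Run over all 50,000 validation predictions with k=1 and k=5, this yields the top-1 and top-5 numbers reported for each architecture.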
28. High Entropy: Kneepad
• None of the models recognized the ground truth class in the top-5
• All models consistently showed high entropy in ŷ
• Due to standardization, only the middle part of the image is offered for prediction
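The entropy of ŷ quantifies how spread out the probability mass is: a one-hot prediction has entropy 0, a uniform one the maximum. A small sketch with made-up four-class vectors:

```python
import math

def entropy(prob_vector):
    """Shannon entropy (in bits) of a probability vector; 0 for one-hot."""
    s = sum(p * math.log2(p) for p in prob_vector if p > 0)
    return -s if s else 0.0

confident = [1.0, 0.0, 0.0, 0.0]       # all mass on one class
uncertain = [0.25, 0.25, 0.25, 0.25]   # mass spread evenly
print(entropy(confident))  # 0.0
print(entropy(uncertain))  # 2.0
```

The kneepad case corresponds to the high-entropy situation: no class stands out, so the models are effectively signalling that they do not know.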
30. Low Entropy: Bucket
• None of the models recognized the ground truth class in the top-5
• All models consistently were convinced (class probability of 1.0) that this image should be labeled as baseball
• Shortcoming of single-class labeling
31. Synonyms: Laptop
• Frequent top-1 confusions between laptop and notebook
• Looking at class probabilities, models do not ‘see’ synonym classes as close together
[Figure: top-5 classifications for a laptop image (original and cropped). VGG16: notebook (0.7222), laptop (0.1866), desktop computer (0.0244), space bar (0.0097), solar dish (0.0092). VGG19: notebook (0.7327), laptop (0.1178), desktop computer, space bar (0.0243), hand-held computer. Bar chart of notebook vs. laptop probabilities for VGG16, VGG19, ResNet50, ResNet101.]
32. The World View Depicted in ILSVRC2012
• Not representative of the real world
• > 100 sub-species of dogs (5 cats)
• 1 red wine (no white wine)
• 1 Granny Smith (no other apples)
• 1 carbonara (no other pasta)
34. ImageNet’s Origins
Shankar et al. - No classification without representation: Assessing geodiversity issues in open data sets for the developing world
https://arxiv.org/abs/1711.08536
37. What Is the Representative Sample?
• Criticism in psychology: samples often drawn entirely from Western, Educated, Industrialized, Rich, and Democratic (WEIRD) societies
• “our review of the comparative database from across the behavioral sciences suggests both that there is substantial variability in experimental results across populations and that WEIRD subjects are particularly unusual compared with the rest of the species”
• “The findings suggest that members of WEIRD societies, including young children, are among the least representative populations one could find for generalizing about humans.”
https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/abs/weirdest-people-in-the-world/BF84F7517D56AFF7B7EB58411A554C17
38. Non-Visual (and Stereotypical) Concepts
Bad Person, Call Girl, Drug Addict, Closet Queen, Convict, Crazy, Failure, Flop, Fucker, Hypocrite, Jezebel, Kleptomaniac, Loser, Melancholic, Nonperson, Pervert, Prima Donna, Schizophrenic, Second-Rater, Spinster, Streetwalker, Stud, Tosser, Unskilled Person, Wanton, Waverer, and Wimp
Is there an [X] in this image?
Crawford & Paglen - excavating.ai
40. Psychometrics
• Measuring constructs, which are not directly observable
• The measurement instrument is also known as a
psychological test (warning: vocabulary clash with software)
41. An Instrument Is Sound…
• …if it is valid
• …and if it is reliable
46. ‘Horse’ Systems
• Do not actually address the problem they
appear to be solving
• Only a ‘horse’ in relation to a specific problem
• Hence, a ‘horse’ for one problem may not be
one for another:
• ‘Reproduce ground truth by XYZ’ vs.
• ‘Reproduce ground truth by any means’
Bob Sturm
47. Content Validity
• Extent to which the experimental units reflect and represent the elements of the domain under study.
48. Criterion Validity
• Extent to which results of an experiment are correlated with
those of other experiments already known to be valid.
• Concurrent: how does a new test/measurement compare against
a validated test/measurement?
• Predictive: how well does a test/measurement predict a future
outcome?
49. This Is a Validated Instrument
• ‘Big Five’ Personality
• Openness
• Conscientiousness
• Extraversion
• Agreeableness
• Neuroticism
50. This Is Not a Validated Instrument
• Myers-Briggs
51. This Is Not a Validated Instrument
Liem et al. - Psychology Meets Machine Learning: Interdisciplinary Perspectives on Algorithmic Job Candidate Screening
https://repository.tudelft.nl/islandora/object/uuid%3Ab27e837a-4844-4745-b56d-3efe94c61f0f
52. Consequences
Liem et al. - Psychology Meets Machine Learning: Interdisciplinary Perspectives on Algorithmic Job Candidate Screening
https://repository.tudelft.nl/islandora/object/uuid%3Ab27e837a-4844-4745-b56d-3efe94c61f0f
53. Another Problematic Example
Wu & Zhang - Automated Inference on
Criminality using Face Images
https://arxiv.org/abs/1611.04135
54. Challenges of Multimedia Data
• Raw data: many numbers
• 44.1 kHz audio: 44,100 measurements per second
• Record me for 45 mins: 119,070,000 measurements
• RGB images: 224 x 224 x 3 pixels = 150,528 intensity values
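The measurement counts above follow directly from sample rate, duration, and image resolution; a quick sanity check of the arithmetic:

```python
# 44.1 kHz mono audio: samples per second times recording length in seconds
audio_samples = 44_100 * 45 * 60   # a 45-minute recording
# RGB image: width x height x 3 color channels
image_values = 224 * 224 * 3

print(audio_samples)  # 119070000
print(image_values)   # 150528
```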
56. Can’t Trust the Feeling? How Open Data Reveals Unexpected Behavior of High-Level Music Descriptors
Cynthia C. S. Liem and Chris Mostert
https://archives.ismir.net/ismir2020/paper/000137.pdf
57. Automatic Music Description
• Critical to content-based music information retrieval
• Only way for non-content owners to perform large-scale research
• Leading to Grander Statements on the Nature of Music
58. But Can We Trust the Descriptors?
• Successful performance reported in papers.
• How does this extend to ‘in-the-wild’ situations?
59. AcousticBrainz
• Community locally computes descriptor values, using open-source Essentia library.
• Submissions (with metadata) collected per MusicBrainz Recording ID.
• High-level descriptors are machine learning-based, and include classifier confidence.
60. AcousticBrainz
• Anyone can submit anything…so we don’t know what the
output should be?
• In psychology and software engineering, ‘testing’ can go beyond ‘known truths’, exploiting known relationships.
61. Multiple Recording Submissions
• Inspired by software testing (derived oracles / differential testing)
• If only the codec changes, songs remain semantically equivalent.
• One would assume
classify_c(my_preprocessing(m)) ==
classify_c(your_preprocessing(m))
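This invariant can be sketched as a small differential check. Everything below is a hypothetical stand-in for an actual descriptor pipeline (the real one runs Essentia classifiers); the point is only the shape of the test: same recording, different preprocessings, and the predicted label must not change.

```python
def differential_codec_check(classify, preprocessings, recording):
    """Flag a recording whose predicted class depends only on preprocessing.

    `classify` and each entry of `preprocessings` are hypothetical stand-ins
    for a real high-level descriptor pipeline.
    Returns (consistent, labels-per-preprocessing).
    """
    labels = {name: classify(pre(recording))
              for name, pre in preprocessings.items()}
    consistent = len(set(labels.values())) == 1
    return consistent, labels

# toy example: a 'codec change' that rescales the signal but keeps semantics
pre_mp3 = lambda m: m
pre_ogg = lambda m: [v * 0.5 for v in m]
toy_classifier = lambda m: "rock" if sum(m) > 0 else "other"

ok, labels = differential_codec_check(
    toy_classifier, {"mp3": pre_mp3, "ogg": pre_ogg}, [1.0, 2.0])
print(ok, labels)
```

A real classifier that fails this check is reacting to encoding artifacts rather than to musical content, which is exactly the kind of oracle issue the study looks for.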
63. ‘Constructs Known To Relate’
• Inspired by psychological testing (construct validity)
• Same input is run through multiple classifiers, targeting the same concept.
65. ‘Constructs Known To Relate’
• genre_rosamerica classifier was 90.74% accurate on rock.
• genre_tzanetakis classifier was 60% accurate on rock.
• Pearson correlation between (genre_rosamerica, rock) and (genre_tzanetakis, rock) classifications in AcousticBrainz: -0.07
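The correlation itself is a plain Pearson coefficient over paired classifier outputs. A minimal sketch with made-up per-recording ‘rock’ confidences (not AcousticBrainz data); two classifiers targeting the same construct should correlate strongly positively, which is exactly what the -0.07 above fails to show:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# hypothetical per-recording 'rock' confidences from two classifiers
rosamerica = [0.9, 0.8, 0.2, 0.1, 0.7]
tzanetakis = [0.1, 0.3, 0.8, 0.9, 0.2]
print(round(pearson(rosamerica, tzanetakis), 2))  # strongly negative
```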
68. Strange Confidence Distributions
• Peak vs. non-peak distributional differences are especially large
for bit rate, codec and low-level extractor software versions.
• We hardly consider these in high-level descriptor evaluation!
70. Better Articulation of Underlying Assumptions
• Are there any assumptions of underlying distributions, and are
they actually met?
• What is ‘the universe’ that should be represented?
71. Better Awareness & Standards on Measurement and Annotation
• https://conjointly.com/kb/measurement-in-research/
• Aroyo & Welty - Truth Is a Lie: CrowdTruth and the Seven Myths of Human Annotation, https://ojs.aaai.org/index.php/aimagazine/article/view/2564
• Welty, Paritosh & Aroyo - “Metrology for AI: From Benchmarks to Instruments”, https://arxiv.org/abs/1911.01875
• Jacobs and Wallach - “Measurement and Fairness”, https://dl.acm.org/doi/10.1145/3442188.3445901
72. Better Documentation
• Often inspired by data provenance in databases
• Complements to Data Protection Impact Assessments
• Gebru et al., Datasheets for Datasets, https://arxiv.org/abs/1803.09010
• Jo & Gebru, Lessons from Archives: strategies for collecting sociocultural data in machine learning, https://dl.acm.org/doi/abs/10.1145/3351095.3372829
73. Stronger Requirements
• “The AI should classify images of dogs”
vs.
• “The system should return true for photographs containing household-dogs. Other similar species, such as wolves, should return false. Images that contain dogs, but other items as well, should return true.”
• Ahmad et al., What’s up with Requirements Engineering for Artificial Intelligence Systems? https://raw.githubusercontent.com/nzjohng/publications/master/papers/re2021_1.pdf
• More in upcoming lectures
74. Automated Tooling
• Northcutt et al., Confident Learning: Estimating Uncertainty in Dataset Labels, https://jair.org/index.php/jair/article/view/12125/26676 | https://github.com/cleanlab/cleanlab
• Breck et al., Data Validation for Machine Learning, https://mlsys.org/Conferences/2019/doc/2019/167.pdf
75. Be Aware of Researcher Degrees of Freedom
• We have some flexibility in data collection and analysis (e.g. choices of normalization, hyperparameters, etc.)
• This may actually affect results and final conclusions!
Simmons et al. - False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant, https://journals.sagepub.com/doi/10.1177/0956797611417632
McFee et al. - Open-Source Practices for Music Signal Processing Research: Recommendations for Transparent, Sustainable, and Reproducible Audio Research, https://sinc-lab.com/files/mcfee2019opensource.pdf
Kim et al. - Beyond Explicit Reports: Comparing Data-Driven Approaches to Studying Underlying Dimensions of Music Preference, https://dl.acm.org/doi/10.1145/3320435.3320462
Liem and Panichella - Run, Forest, Run? On Randomization and Reproducibility in Predictive Software Engineering, https://arxiv.org/abs/2012.08387
76. Further Translations of Testing Concepts?
• Software: Coverage? Input diversity? Edge cases?
• Psychology: Further equivalents to validity assessment?
78. For Now - Think of These Questions in Connection to the Assignment Dataset
• What would make for a ‘better’ or ‘worse’ dataset?
• If you could test this data more thoroughly, what would you test for?