Machine Learning with Small Data
John C. Liu, Ph.D. CFA
June 18, 2019
Twitter: @drjohncliu
Disclaimer
THE INFORMATION SET FORTH HEREIN HAS BEEN OBTAINED OR DERIVED FROM SOURCES GENERALLY
AVAILABLE TO THE PUBLIC AND BELIEVED BY THE AUTHOR TO BE RELIABLE, BUT THE AUTHOR DOES NOT MAKE
ANY REPRESENTATION OR WARRANTY, EXPRESS OR IMPLIED, AS TO ITS ACCURACY OR COMPLETENESS. THE
INFORMATION IS FOR EDUCATIONAL PURPOSES ONLY AND IS NOT INTENDED TO BE USED AS THE BASIS OF ANY
BUSINESS OR INVESTMENT DECISIONS BY ANY PERSON OR ENTITY. ALL OF THE INFORMATION CONTAINED IN
THE PRESENTATION IS SUBJECT TO FURTHER MODIFICATION AND ANY AND ALL FORECASTS, PROJECTIONS OR
FORWARD-LOOKING STATEMENTS CONTAINED HEREIN SHALL NOT BE RELIED UPON AS FACTS NOR RELIED
UPON AS ANY REPRESENTATION OF FUTURE RESULTS WHICH MAY MATERIALLY VARY FROM SUCH
PROJECTIONS AND FORECASTS.
Roadmap
• Introduction
• Big Data Revolution
• What about Small Data?
• Dealing with Reality
– Semantic/Contextualized Representations
– Experimental Design
– Adversarial Data Generation
• Conclusion
Big Data
Source: Bernard Marr & Co.
Deep Learning
Source: NVIDIA
Data is the New Oil
Source: James Corbett
More Data = Better Models
Source: Andrew Ng
What’s Wrong With This Picture?
Train Set?
Source: The Simpsons
Data Annotation is Expensive
Source: Jia, Yangqing. (2014). Learning Semantic Image Representations at a Large Scale.
Annotator (Dis)Agreement?
Source: Stephen Yip & Chintan Parmar
Annotation = Bottleneck
Source: physiconet.org
• 14 million images
• 20,000 categories
• 25 Human Years to annotate!
Source: Li Fei-Fei. (2010). ImageNet: Crowdsourcing, benchmarking & other cool things
Reality = Small Annotated Data
Source: NASA/JPL/UCSD/JSC
Ways to Deal with Small Data
• AWS Mechanical Turk (e.g., ImageNet)
• CrowdFlower/Figure8/Appen
• Hire SMEs
• Data Augmentation/Synthetic Generation (SMOTE)
Synthetic Minority Oversampling
Nearest Neighbor Algorithm
Source: Bart Baesens
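A minimal sketch of the nearest-neighbor interpolation idea behind SMOTE, assuming NumPy is available. The function name `smote_sketch`, the toy minority points, and the choice of k are illustrative assumptions, not taken from the slides; for real work the imbalanced-learn package provides a maintained `SMOTE` class.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by interpolating between
    a minority sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]
    synthetic = np.empty((n_new, X.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X))            # pick a minority sample
        nb = neighbors[base, rng.integers(k)]  # pick one of its neighbors
        gap = rng.random()                     # position on the segment, in [0, 1)
        synthetic[i] = X[base] + gap * (X[nb] - X[base])
    return synthetic

# Toy minority class (made-up points, for illustration only).
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
X_new = smote_sketch(X_min, n_new=6)
print(X_new.shape)
```

Each synthetic point lies on a line segment between two real minority points, so the new data stays inside the region the minority class already occupies.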
Anything Else?
Photograph: Andrea Shea
Not All Data is Created Equal
https://pypi.org/project/imbalanced-learn/
Source: Rishabh Misra
Training a Cat/Dog Classifier
• Which training samples are more useful?
Photograph: American Kennel Club
Photograph: Atchoumfan
Photograph: Sujoy Roychowdhury
Oncology Text Classifier
• Which training samples are more useful?
1. Left medial foot and ankle pain and swelling. Plantar metatarsal pain for 5 weeks. No known trauma.
2. Dorsal right medial upper back pain for 10 weeks. Right parotid mass.
3. History pancreatic cancer. Status post aortic chemotherapy and Whipple procedure.
Points Near Decision Boundary
Maximum Entropy
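One way to act on "points near the decision boundary" is to send the unlabeled samples with the highest predictive entropy to the annotator first. The sketch below assumes a classifier has already produced class probabilities for an unlabeled pool; the probabilities and the helper names `entropy` and `most_uncertain` are made up for illustration.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy (in nats) of each row of class probabilities."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)

def most_uncertain(probs, n):
    """Indices of the n pool samples with the highest entropy."""
    return np.argsort(-entropy(probs))[:n]

# Hypothetical predicted probabilities for four unlabeled samples.
pool_probs = np.array([
    [0.98, 0.02],  # confident: far from the boundary
    [0.55, 0.45],  # uncertain
    [0.10, 0.90],  # fairly confident
    [0.50, 0.50],  # maximum entropy: on the boundary
])
query = most_uncertain(pool_probs, n=2)
print(query)
```

Here the samples at indices 3 and 1 would be labeled first, since the model is least sure about them.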
Machine Learning with Small Data
What Data Scientists Should Care Most About
Kid Saw This in a Toy Store
Tiger
Photograph: Nat & Jules Brown
At the Zoo a Few Weeks Later
Tiger
Photograph: Skip O’Rourke
Inductive Transfer Learning
• Learning new tasks using knowledge learned from other tasks
Source: Dipanjan Sarkar
Semantic Image Representations
Source: Jia, Yangqing. (2014). Learning Semantic Image Representations at a Large Scale.
Word Embeddings
Corpus → Docs → Sentences → Words → Vectors
Word embeddings encode semantic relationships learned from a corpus.
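To make "encode semantic relationships" concrete: related words end up with vectors that point in similar directions, which is usually measured with cosine similarity. The three-dimensional vectors below are made-up toy values, not output from any trained model.

```python
import numpy as np

# Made-up toy vectors for illustration; real embeddings have hundreds
# of dimensions and are learned from a corpus.
toy_vectors = {
    "tiger":   np.array([0.90, 0.80, 0.10]),
    "lion":    np.array([0.85, 0.75, 0.20]),
    "invoice": np.array([0.10, 0.00, 0.95]),
}

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_related = cosine(toy_vectors["tiger"], toy_vectors["lion"])
sim_unrelated = cosine(toy_vectors["tiger"], toy_vectors["invoice"])
print(round(sim_related, 3), round(sim_unrelated, 3))
```

With these toy values the tiger/lion similarity is close to 1 while tiger/invoice is close to 0.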
Word2Vec Context too Narrow
Source: Kamath, Uday, Liu, John & Whitaker, James. (2019). Deep Learning for NLP and Speech Recognition.
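A rough illustration of why a fixed context window is narrow: in skip-gram style training, a word is only paired with neighbors inside the window, so anything further away never influences its vector. The function name `skipgram_pairs`, the window size, and the example sentence (adapted from the oncology slide) are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for a symmetric context window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "history of pancreatic cancer status post whipple procedure".split()
pairs = skipgram_pairs(sentence, window=2)

# Context words that "cancer" is trained against with a window of 2.
contexts_of_cancer = [ctx for word, ctx in pairs if word == "cancer"]
print(contexts_of_cancer)
```

With a window of 2, "cancer" never sees "whipple", even though the two are clearly related within the same sentence.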
Neural Language Model
Source: Kamath, Uday, Liu, John & Whitaker, James. (2019). Deep Learning for NLP and Speech Recognition.
ELMo
Embeddings from Language Models
– Bidirectional Language Models (forward & backward)
– Using LSTMs
– Concatenate hidden layers
Source: Karan Purohit
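A toy sketch of the bidirectional idea: run one recurrent pass left to right, another right to left, and concatenate the two hidden states at each position. Loud assumptions: a plain tanh recurrence stands in for the LSTMs, the weights are random and untrained, and all names and sizes are illustrative. This is not the ELMo implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, emb_dim, hid_dim = 5, 4, 3
tokens = rng.normal(size=(seq_len, emb_dim))  # stand-in token embeddings

def run_rnn(inputs, W_in, W_h):
    """Simple tanh recurrence (a stand-in for an LSTM); one state per step."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x in inputs:
        h = np.tanh(W_in @ x + W_h @ h)
        states.append(h)
    return np.stack(states)

# Separate random weights for the forward and backward passes.
W_in_f = rng.normal(size=(hid_dim, emb_dim))
W_h_f = rng.normal(size=(hid_dim, hid_dim))
W_in_b = rng.normal(size=(hid_dim, emb_dim))
W_h_b = rng.normal(size=(hid_dim, hid_dim))

forward = run_rnn(tokens, W_in_f, W_h_f)               # sees left context
backward = run_rnn(tokens[::-1], W_in_b, W_h_b)[::-1]  # sees right context

# Each position gets a vector informed by both directions.
contextual = np.concatenate([forward, backward], axis=1)
print(contextual.shape)
```

Unlike a static word vector, the row for a given token changes when the surrounding tokens change.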
Concept Embeddings
• RDF2Vec
Source: Kamath, Uday, Liu, John & Whitaker, James. (2019). Deep Learning for NLP and Speech Recognition.
Did We Solve the Tiger Problem?
• Generalize with only a single label? (One-Shot Learning)
• If I described a lion, would you recognize one if you never ever saw one? (Zero-Shot Learning)
• Did the chicken come before the egg, or vice versa? (Causality)
THE WORLD IS NOT RANDOM
INHERENT STRUCTURE EXISTS
Source: NASA/JPL/UCSD/JSC
Not Random
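• Each CIFAR-10 image = 32x32 pixels x 3 color channels, 256 levels per channel
• 32 x 32 x 3 = 3,072 values per image (x 256 levels = 786,432)
• Number of possible images = 256^3072, roughly 10^7398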
Source: Krizhevsky, Alex. (2009). Learning Multiple Layers of Features from Tiny Images.
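A quick check of the counting, using only the image dimensions from the slide. A 32x32 image with 3 channels has 3,072 values; multiplying by 256 levels gives 786,432, while the number of distinct images is 256 raised to the 3,072nd power. The exponent is computed with logarithms because the number itself is too large to print comfortably.

```python
import math

height, width, channels, levels = 32, 32, 3, 256

values_per_image = height * width * channels       # 3,072 numbers per image
value_level_combos = values_per_image * levels     # 786,432

# Number of distinct images is levels ** values_per_image.
# Report its order of magnitude via log10 instead of printing it.
log10_possible_images = values_per_image * math.log10(levels)

print(values_per_image, value_level_combos)
print(f"possible images ~ 10^{int(log10_possible_images)}")
```

Natural images occupy a vanishingly small part of that space, which is the sense in which the data is not random.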
Not a Possible Permutation
Source: Goodfellow, Ian. (2016). Generative Adversarial Networks.
How many Laws of Physics are sufficient to describe motion?
Photograph: Richard Jognston
Bayesian Networks
Factorizing Joint PDF
Source: Sato, Renato and Sato, Graziela. (2015). Probabilistic graphic models applied to identification of diseases.
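A small sketch of what factorizing a joint distribution buys, assuming a hypothetical three-variable network in which Disease is the parent of both Symptom and Test. The structure and every probability below are invented for illustration and do not come from the cited paper.

```python
from itertools import product

# Hypothetical network: Disease -> Symptom, Disease -> Test.
# All probabilities are made up for illustration.
p_disease = {True: 0.1, False: 0.9}
p_symptom_given_d = {True: {True: 0.8, False: 0.2},
                     False: {True: 0.1, False: 0.9}}
p_test_given_d = {True: {True: 0.9, False: 0.1},
                  False: {True: 0.05, False: 0.95}}

def joint(d, s, t):
    """P(D, S, T) = P(D) * P(S | D) * P(T | D)."""
    return p_disease[d] * p_symptom_given_d[d][s] * p_test_given_d[d][t]

# The factorized joint is still a valid distribution: it sums to 1.
total = sum(joint(d, s, t) for d, s, t in product([True, False], repeat=3))

# Posterior P(Disease | Symptom=True, Test=True) by enumeration.
numerator = joint(True, True, True)
posterior = numerator / (numerator + joint(False, True, True))
print(round(total, 6), round(posterior, 4))
```

A full joint table over three binary variables needs 7 free parameters; this factorization needs 5 (1 + 2 + 2), and the saving grows quickly as variables are added, which matters when labeled data is scarce.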
Adversarial Data Generation
Source: Mino, Ajkel & Spanakis, Gerasimos. (2018). LoGAN: Generating Logos with a Generative Adversarial Neural Network Conditioned on color.
Last Word
Photograph: Gregor Schmidt
My New Book
A comprehensive resource that builds up from elementary deep learning, text, and speech principles to advanced state-of-the-art neural architectures.
On Amazon, BN, Springer
https://www.amazon.com/Deep-Learning-NLP-Speech-Recognition/dp/3030145956
Thank you.
AI/ML Solutions to Solve Business Problems

Editor's Notes

  • #7: 2006, Clive Humby
  • #34: 786k per image