Materials Data in Action
Max Hutchinson,
Scientific Software Eng.
ONE DOES NOT SIMPLY... APPLY OFF THE SHELF ML TOOLS TO MATERIALS DISCOVERY
ARTIFICIAL INTELLIGENCE FOR MAT. SCI.
8 AUGUST 2018, NIST
Bryce Meredig,
Chief Science Officer
What is materials informatics?
What makes it particularly challenging?
Can we do anything about it?
LET'S TRY TO MACHINE LEARN A NOBEL PRIZE
CASE STUDY: HIGH-Tc SUPERCONDUCTORS
Pia Jensen Ray. Figure 2.4 in Master's thesis, "Structural investigation of La(2-x)Sr(x)CuO(4+y) - Following staging as a function of temperature". Niels Bohr Institute, Faculty of Science,
University of Copenhagen. Copenhagen, Denmark, November 2015. DOI:10.6084/m9.figshare.2075680.v2
CASE STUDY: HIGH-Tc SUPERCONDUCTORS
Cross-validated RMSE for Tc ≈ 10 K
CAN WE PREDICT HIGH-Tc SUPERCONDUCTIVITY?!?
(spoiler alert): no
LEAVE ONE CLUSTER OUT (LOCO) CV
Nominal k-fold cross-validation assumes that samples are drawn independently from the input space
This is almost never true in materials informatics: individual data sources have
selection biases and different data sources draw from different distributions
LOCO CV groups the data before computing train/test splits
The groups are inferred via clustering rather than being dictated by a domain expert
"Can machine learning identify the next high-temperature superconductor? Examining
extrapolation performance for materials discovery."
B. Meredig, ..., M. Hutchinson, ..., B. Gibbons, J. Hattrick-Simpers, A. Mehta, L. Ward
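The LOCO CV idea above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure from the cited paper: the data are synthetic, the clustering is a tiny hand-written 1-D k-means (`kmeans_1d`), and the model is a 1-nearest-neighbor lookup (`nn_predict`). All names are hypothetical. It compares an interleaved 5-fold split with a split where each inferred cluster is held out in turn.

```python
from statistics import mean


def kmeans_1d(xs, k, iters=20):
    """Tiny 1-D k-means; returns one cluster label per point.

    Centers start evenly spread over the data range (an assumption made
    only to keep the sketch deterministic).
    """
    lo, hi = min(xs), max(xs)
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(xs)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(x - centers[c])) for x in xs]
        for c in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == c]
            if members:
                centers[c] = mean(members)
    return labels


def nn_predict(train, x):
    """1-nearest-neighbor prediction: return y of the closest training x."""
    return min(train, key=lambda p: abs(p[0] - x))[1]


def cv_rmse(data, fold_ids):
    """Hold out each fold in turn, train on the rest, pool squared errors."""
    sq_errors = []
    for fold in sorted(set(fold_ids)):
        train = [p for p, f in zip(data, fold_ids) if f != fold]
        test = [p for p, f in zip(data, fold_ids) if f == fold]
        sq_errors.extend((nn_predict(train, x) - y) ** 2 for x, y in test)
    return mean(sq_errors) ** 0.5


# Synthetic stand-in for two "material classes" with different behavior.
xs = [i / 10 for i in range(10)] + [10 + i / 10 for i in range(10)]
ys = [x for x in xs[:10]] + [100 - x for x in xs[10:]]
data = list(zip(xs, ys))

kfold_ids = [i % 5 for i in range(len(data))]  # interleaved 5-fold split
loco_ids = kmeans_1d(xs, k=2)                  # folds inferred by clustering

kfold_rmse = cv_rmse(data, kfold_ids)
loco_rmse = cv_rmse(data, loco_ids)
print(f"k-fold RMSE: {kfold_rmse:.2f}, LOCO RMSE: {loco_rmse:.2f}")
```

In the interleaved split every test point has a close neighbor from its own class in the training set, so the error looks small. Holding out a whole cluster removes those neighbors and exposes how poorly the model extrapolates.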
CASE STUDY: HIGH-Tc SUPERCONDUCTORS
The model can't "extrapolate" across material classes (clusters).
LOW CROSS-VALIDATION ERROR IS INSUFFICIENT
Possible Materials Informatics Research Program
1. Collect data
2. Train an approximate ML model
3. Validate the ML model
If insufficiently accurate, back to (1)
4. Optimize or screen over materials using the ML model
5. ...
6. Profit
A large portion of the literature focuses on collection, training,
and validation in support of screening.
CAN WE DISCOVER NEW MATERIALS?
(spoiler alert): yes
DESIGN OF EXPERIMENTS, SEQUENTIAL LEARNING, AND "FUELS"
1. Collect data
2. Train an approximate ML model
3. Design an experiment
4. Conduct the experiment
If quality is insufficient, append the new result to the data and go back to (2)
5. ...
6. Profit
Modeling → designs → Experiment → informs → Modeling
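The train, design, conduct, append loop listed above can be sketched as follows. Everything here is an illustrative assumption rather than the authors' software: `run_experiment` is a made-up hidden response standing in for a lab measurement, `predict` is a linear-interpolation surrogate standing in for the ML model (it assumes both end points of the candidate range are already measured), and the design rule simply picks the untested candidate with the highest predicted value.

```python
def run_experiment(x):
    """Stand-in for a real measurement (hypothetical hidden response)."""
    return -(x - 14) ** 2


def predict(measured, x):
    """Surrogate model: linear interpolation between the nearest measured
    neighbors on each side of x. Assumes x lies strictly between two
    measured points."""
    left = max(k for k in measured if k < x)
    right = min(k for k in measured if k > x)
    w = (x - left) / (right - left)
    return (1 - w) * measured[left] + w * measured[right]


def sequential_learning(candidates, measured, target, budget):
    """Repeat: (re)fit model, design next experiment, run it, append result.

    Stops when the best measured value reaches `target` or after `budget`
    experiments. Returns the list of candidates tested, in order.
    """
    history = []
    while max(measured.values()) < target and len(history) < budget:
        untested = [x for x in candidates if x not in measured]
        if not untested:
            break
        # Steps 2-3: the surrogate is "retrained" implicitly from `measured`,
        # then the most promising untested candidate is selected.
        nxt = max(untested, key=lambda x: predict(measured, x))
        # Step 4: conduct the experiment and append the new data point.
        measured[nxt] = run_experiment(nxt)
        history.append(nxt)
    return history


candidates = list(range(21))
measured = {0: run_experiment(0), 20: run_experiment(20)}  # step 1: initial data
history = sequential_learning(candidates, measured, target=-1, budget=10)
print(history)  # → [19, 18, 17, 16, 15]
```

In this toy setting the loop reaches the target after five experiments out of nineteen untested candidates. The point is the control flow, in which each new measurement immediately changes the next design, not the specific surrogate.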
DESIGNING THE NEXT EXPERIMENT
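Maximum Expected Improvement (MEI): $\left[ \int_{-\infty}^{\infty} x \, p(x; \theta) \, dx \right]$
Maximum Likelihood of Improvement (MLI): $\left[ \int_{\alpha}^{\infty} p(x; \theta) \, dx \right]$
Maximum Uncertainty (MU): $\left[ \int_{-\infty}^{\infty} (x - \bar{x})^2 \, p(x; \theta) \, dx \right]$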
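If the predictive distribution p(x; θ) for each candidate is taken to be Gaussian (an assumption made here for illustration; the slides do not state the distribution), the three integrals reduce to closed forms: the mean, the upper-tail probability above the threshold α, and the variance. The sketch below scores three hypothetical candidates; the `Prediction` type, the candidate values, and the choice of α as the best value measured so far are all made up.

```python
import math
from typing import NamedTuple


class Prediction(NamedTuple):
    name: str
    mean: float  # predicted value of the property
    std: float   # predictive standard deviation, assumed > 0


def mei_score(p: Prediction) -> float:
    """Expected value of the prediction: integral of x * p(x) dx."""
    return p.mean


def mli_score(p: Prediction, alpha: float) -> float:
    """Probability that the property exceeds alpha: integral of p(x) from alpha."""
    z = (alpha - p.mean) / (p.std * math.sqrt(2.0))
    return 0.5 * math.erfc(z)


def mu_score(p: Prediction) -> float:
    """Predictive variance: integral of (x - mean)^2 * p(x) dx."""
    return p.std ** 2


def select(preds, score):
    """Return the name of the candidate that maximizes the given score."""
    return max(preds, key=score).name


candidates = [
    Prediction("A", mean=10.0, std=0.5),
    Prediction("B", mean=9.0, std=3.0),
    Prediction("C", mean=5.0, std=5.0),
]
best_so_far = 11.0  # threshold alpha: assumed to be the best value measured so far

print(select(candidates, mei_score))                            # → A
print(select(candidates, lambda p: mli_score(p, best_so_far)))  # → B
print(select(candidates, mu_score))                             # → C
```

The three rules pick three different candidates here: MEI exploits the highest mean, MU explores the least certain prediction, and MLI balances the two by asking which candidate is most likely to beat the current best.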
BENCHMARK: DESIGN ON EXPLICIT LIST
9x
2x
REAL WORLD EXAMPLE: ADAPT @ MINES
https://www.additivemanufacturing.media/articles/how-machine-learning-is-moving-am-beyond-trial-and-error/
DATA DRIVEN MODELING DELIVERS DISCOVERY FASTER
WHAT ABOUT THE MACHINE LEARNING?
“Simply downloading and ‘applying’ open-source software to your data won’t work. AI needs to be customized to your business context and data.”
 
Andrew Ng in Harvard Business Review 
(Stanford, Google Brain, Coursera, Baidu)
MATERIALS INFORMATICS CONTEXT
Labels are scarce and expensive
Typical dataset sizes are 100-1000 labels
Preparing a sample is often more difficult than measuring it
Different labels have low marginal costs
We've been doing physics, chemistry, and materials science for hundreds of years
There are (not always accurate) sources of computational data
We have some priors for which labels are related
We have some priors for what some relationships look like
PHYSICAL RELATIONSHIPS
Materials science has Process-Structure-Property (PSP) relationships
Process → Structure → Property
[Figure: diagram labeled Structure, Properties, Performance, Processing, Characterization]
PHYSICAL RELATIONSHIPS
Physics, mathematics, and engineering
think about multi-scale modeling
Micro Meso Macro
https://www.nas.nasa.gov/SC14/demos/demo26.html
GRAPHICAL MODELS: DOMAIN-AWARE MODELING
Inputs & Features
Featurization
Empirical Relation
Computational Data
Machine Learning
Quantity of Interest
GRAPHICAL MODELS: TRANSFER LEARNING
M. Hutchinson, E. Antono, B. Gibbons, S. Paradiso, J. Ling, B. Meredig
Overcoming data scarcity with transfer learning, https://arxiv.org/pdf/1711.05099.pdf
"B" is a plentiful latent variable
DFT band gap
Hydrogen splitting react. rate
Indentation hardness
 
"A" is a scarce or expensive label 
Color
NO splitting reaction rate       
Ultimate tensile strength
GRAPHICAL MODELS: TRANSFER LEARNING
Simple example: adding yield strength information to a fatigue strength design increases experimental efficiency
M. Hutchinson, E. Antono, B. Gibbons, S. Paradiso, J. Ling, B. Meredig
Overcoming data scarcity with transfer learning, https://arxiv.org/pdf/1711.05099.pdf
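One way to read the "plentiful B, scarce A" idea is a two-stage model: learn B from the inputs using the large dataset, then learn A from the predicted B using the few available A labels. The sketch below is a toy version under loud assumptions: the data are synthetic (B = x², A = 2B + 1), the B model is a 1-nearest-neighbor lookup, and the A model is a one-variable least-squares line. None of these choices are taken from the cited paper, and all names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def nearest_neighbor_model(xs, ys):
    """Return a 1-nearest-neighbor predictor trained on (xs, ys)."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


# Plentiful property B (synthetic): 51 labels on a dense grid.
b_inputs = [i / 10 for i in range(51)]
b_labels = [x ** 2 for x in b_inputs]
predict_b = nearest_neighbor_model(b_inputs, b_labels)

# Scarce property A (synthetic): only three labels.
a_inputs = [0.5, 2.0, 4.5]
a_labels = [2 * x ** 2 + 1 for x in a_inputs]

# Baseline: model A directly from the raw input x.
slope_x, icpt_x = fit_line(a_inputs, a_labels)
# Transfer: model A from the predicted value of B.
slope_b, icpt_b = fit_line([predict_b(x) for x in a_inputs], a_labels)


def predict_a_direct(x):
    return slope_x * x + icpt_x


def predict_a_transfer(x):
    return slope_b * predict_b(x) + icpt_b


def rmse(predict, xs):
    """Root-mean-square error against the synthetic ground truth for A."""
    errs = [(predict(x) - (2 * x ** 2 + 1)) ** 2 for x in xs]
    return (sum(errs) / len(errs)) ** 0.5


test_inputs = [1.0, 3.0, 4.0]
rmse_direct = rmse(predict_a_direct, test_inputs)
rmse_transfer = rmse(predict_a_transfer, test_inputs)
print(f"direct: {rmse_direct:.2f}, transfer: {rmse_transfer:.2e}")
```

With only three A labels, the direct model cannot capture the curvature, while the transfer model inherits it from the plentiful B data. The synthetic relation between A and B is exactly linear, which makes the gain unrealistically clean.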
WHERE DOES THE UNCERTAINTY COME FROM?
Jackknife methods capture uncertainty with respect to finite sample size.
Computational cost is independent of the size of the feature space.
We add an explicit bias term trained on the out-of-bag errors.
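A jackknife-style variance for a bagged ensemble can be sketched with the jackknife-after-bootstrap estimator: for each training point i, average the predictions of the bootstrap replicates that did not draw i, and measure how far that average sits from the full ensemble mean. Whether this matches the authors' exact estimator is an assumption. The base learner (`nn_predict`, a 1-nearest-neighbor lookup), the data, `n_boot`, and `seed` are hypothetical, and the out-of-bag bias term mentioned above is not included.

```python
import random
from statistics import mean


def nn_predict(sample, x):
    """Base learner: 1-nearest-neighbor prediction from a bootstrap sample."""
    return min(sample, key=lambda p: abs(p[0] - x))[1]


def jackknife_after_bootstrap(data, x, n_boot=200, seed=0):
    """Return (ensemble mean, jackknife-after-bootstrap variance) at x.

    Variance = (n - 1) / n * sum_i (mean of replicates without i - mean)^2.
    The cost depends on n and n_boot, not on the number of features.
    """
    rng = random.Random(seed)
    n = len(data)
    # Each replicate is a list of n indices drawn with replacement.
    draws = [[rng.randrange(n) for _ in range(n)] for _ in range(n_boot)]
    preds = [nn_predict([data[i] for i in idx], x) for idx in draws]
    ensemble_mean = mean(preds)

    total = 0.0
    for i in range(n):
        # Predictions from replicates in which point i is out-of-bag.
        oob = [p for p, idx in zip(preds, draws) if i not in idx]
        if oob:  # skip the rare case where i appears in every replicate
            total += (mean(oob) - ensemble_mean) ** 2
    return ensemble_mean, (n - 1) / n * total


noisy = [(0.0, 1.0), (1.0, 3.0), (2.0, 2.0), (3.0, 5.0), (4.0, 4.0)]
flat = [(float(i), 2.0) for i in range(5)]

mean_noisy, var_noisy = jackknife_after_bootstrap(noisy, x=2.4)
mean_flat, var_flat = jackknife_after_bootstrap(flat, x=2.4)
print(f"noisy: mean={mean_noisy:.2f}, var={var_noisy:.3f}")
print(f"flat:  mean={mean_flat:.2f}, var={var_flat:.3f}")
```

When every label is identical, removing any single point changes nothing and the estimated variance is zero. When the labels disagree, the estimate grows with how strongly the prediction depends on individual training points.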
(PROBABILISTIC) GRAPHICAL MODELS
Inputs & Features
Featurization
Empirical Relation
Computational Data
Machine Learning
Quantity of Interest
THANK YOU!
Job listings: citrine.io/jobs
Newsletter: citrination.org/publications_talks/ddms-newsletter/
Literature review: citrination.org/learn/citrines-literature-review/
