Data dissemination and materials informatics at LBNL
Anubhav Jain, Energy Technologies Area, Berkeley Lab
Contact: ajain@lbl.gov
Data Dissemination
Several “apps”, including the
PDApp (which helps assess
synthesis and stability),
provide domain-specific
views of the data. Plots and
analyses from such apps are
some of the most heavily
used and cited aspects of the
Materials Project.
Materials Informatics
Data on over 500,000 porous
materials was contributed by
the Nanoporous Materials
Genome Center. Users can
create custom plots based on
properties of interest as well as
obtain detailed information
about specific materials.
A Representational State Transfer
Application Programming
Interface (REST API) allows users
to download large quantities of
data using many popular
programming languages. This
method accesses data through HTTP
requests and is
employed by many of the top
internet companies, including
Google, Dropbox, and Twitter.
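As a minimal sketch of what such a request looks like, the snippet below builds an HTTP GET request with Python's standard library. The base URL, the path layout, and the `X-API-KEY` header name are hypothetical placeholders chosen for illustration; they are not the actual Materials Project endpoints, which are described in the project's own API documentation.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical base URL, used only for illustration.
BASE_URL = "https://example.org/rest/v1"


def build_request(formula, prop, api_key):
    """Build a GET request for one property of one material (not sent)."""
    # Quote each path segment so unusual characters cannot break the URL.
    path = "/".join(
        urllib.parse.quote(part, safe="") for part in ("materials", formula, prop)
    )
    url = f"{BASE_URL}/{path}"
    # The header name is an assumption; real services document their own.
    return urllib.request.Request(url, headers={"X-API-KEY": api_key}, method="GET")


def fetch_json(request, timeout=30):
    """Send the request and decode a JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


req = build_request("Fe2O3", "energy", api_key="YOUR_KEY")
print(req.get_method(), req.full_url)
```

Building the request separately from sending it keeps the example runnable offline; calling `fetch_json(req)` would perform the actual network transfer against a real endpoint.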
Unified View
Collaborating PIs: Kristin Persson, Gerbrand Ceder, Dan Gunter, Shyue Ping Ong (UCSD),
Geoffroy Hautier (U. Catholique de Louvain), Anthony Gamst (UCSD), Karsten Jacobsen (DTU)
Although the Materials Project is largely
an open database, a sandboxing system
allows for core data to be stored
separately from data generated for
proprietary or external projects. Access
management is controlled via an API.
The Materials Project
MPContribs
MIDAS: Materials Informatics & Data Analysis Software
Materials data mining
Automated materials design
A “science gateway” exposes data to the research community and facilitates knowledge
extraction. The Materials Project (www.materialsproject.org) shares simulation data on
hundreds of thousands of materials for a community of 22,000 researchers; millions of data
points have been downloaded and used in the research articles of its users. A recently
introduced feature, MPContribs, accepts data contributions from the user community.
Historically, all data in the Materials
Project was generated at LBNL via high-
throughput computing and density
functional theory (DFT) calculations.
MPContribs is a new feature that allows
users to contribute their own data sets and
disseminate them via Materials Project.
Online editors and plotting tools allow
users to control the look and presentation
of their data. In some instances, apps can
combine user data with Materials Project
core data to create a joint computation-
experiment analysis, as shown in the phase
diagram on the right.
We are building a materials data mining platform called MIDAS (Materials Informatics &
Data Analysis Software). MIDAS can retrieve data from several materials databases that have
REST APIs or from JSON files, convert that data into a pandas DataFrame object,
automatically generate possible descriptors for the data, and run machine learning
algorithms through scikit-learn. Visualization tools are provided through adapters to the
plot.ly library. This system will be scaled for “big data analysis” in the future through Apache
Spark and similar toolkits such as the FireWorks workflow software developed at LBNL.
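The pipeline described above (JSON records, then a DataFrame, then descriptors, then a scikit-learn model) can be sketched in a few lines. This is not the MIDAS code itself: the records are synthetic, and the field names (`density`, `n_sites`, `target`) and the derived descriptors are assumptions made up for the example.

```python
import json

import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic records standing in for a JSON response; the field names and
# values are made up for illustration and are not Materials Project data.
raw = json.dumps([
    {"material": f"toy-{i}", "density": 2.0 + i, "n_sites": 2 + i % 3,
     "target": 10.0 * (2.0 + i) + 1.0}
    for i in range(8)
])

# Step 1: JSON -> pandas DataFrame.
df = pd.DataFrame(json.loads(raw))

# Step 2: generate simple candidate descriptors from the raw columns.
df["density_sq"] = df["density"] ** 2
df["density_per_site"] = df["density"] / df["n_sites"]

# Step 3: fit a scikit-learn model on the descriptor columns.
features = ["density", "density_sq", "density_per_site"]
model = LinearRegression().fit(df[features], df["target"])
r2 = model.score(df[features], df["target"])
print(df.shape, r2)
```

Because the synthetic target is an exact linear function of `density`, the fit here is trivially good; with real data the descriptors would be scored on held-out samples rather than on the training set.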
Examples of user-
contributed data
Left: data pipeline for MIDAS 
Right: example of an interactive plot
that can be generated through MIDAS
via the plot.ly toolkit.
Using data sets from the Materials Project,
we have derived structure-property
relationships that relate fundamental
descriptors such as composition, density,
and coordination to output properties such
as bulk modulus, shear modulus, and the
electronic character of the valence and
conduction bands. Crucial to this effort has
been the development of relevant
descriptor combinations as well as new
machine learning approaches.
In this machine learning predictor for the bulk and shear
moduli of a material, we use Hölder means to improve
the predictive power of descriptors and develop a local
linear regression method that fits the tails of distributions
(i.e., extreme values) accurately without overfitting.
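For reference, the Hölder (power) mean of positive values x_1..x_n with exponent p is (sum(w_i * x_i^p) / sum(w_i))^(1/p), with the geometric mean as the limit p -> 0. The sketch below computes it; the optional weights (for example, composition fractions) and the sample values are assumptions for illustration, not the descriptor set used in the study.

```python
import math


def holder_mean(values, p, weights=None):
    """Weighted Hölder (power) mean of positive values with exponent p."""
    if weights is None:
        weights = [1.0] * len(values)
    if len(values) != len(weights) or not values:
        raise ValueError("values and weights must be non-empty and equal length")
    if any(v <= 0 for v in values):
        raise ValueError("values must be positive")
    total = sum(weights)
    if p == 0:
        # The limit p -> 0 is the weighted geometric mean.
        return math.exp(sum(w * math.log(v) for v, w in zip(values, weights)) / total)
    return (sum(w * v ** p for v, w in zip(values, weights)) / total) ** (1.0 / p)


# Hypothetical elemental property values for a two-element compound.
props = [1.0, 4.0]
for p in (-1, 0, 1, 2):
    print(p, holder_mean(props, p))
```

Varying p moves the mean smoothly from the harmonic mean (p = -1) through the geometric (p = 0) and arithmetic (p = 1) means, so a single elemental property can yield a family of candidate descriptors.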
This diagram shows the pairwise likelihood that the
electronic state on the y-axis contributes more to the
valence or conduction band character than the state on
the x-axis. For example, d orbitals in Cu1+ are the most
likely to form the valence band (VB). For this study,
we repurposed algorithms used in ranking sports teams
to rank electronic orbitals.
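The poster does not name the ranking algorithm. As one illustration of the general idea, the sketch below fits a Bradley-Terry model, a common pairwise-comparison model for ranking teams, to a matrix of "wins". The choice of Bradley-Terry, the state labels, and the counts are all assumptions for this example and should not be read as the authors' method or data.

```python
def bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i][j] is the number of times item i "beat" item j. Every item is
    assumed to have at least one win and one loss.
    """
    n = len(wins)
    strengths = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n) if j != i
            )
            updated.append(total_wins / denom)
        norm = sum(updated)
        strengths = [s / norm for s in updated]  # keep strengths summing to 1
    return strengths


# Hypothetical counts: how often each state contributed more than another.
labels = ["state A", "state B", "state C"]
wins = [
    [0, 8, 9],
    [2, 0, 7],
    [1, 3, 0],
]
strengths = bradley_terry(wins)
ranking = sorted(range(len(labels)), key=lambda i: strengths[i], reverse=True)
print([labels[i] for i in ranking])
```

The fitted strengths give the pairwise likelihoods directly: the modeled probability that state i outranks state j is strengths[i] / (strengths[i] + strengths[j]).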
Materials discovery today is largely performed through targeted experiments based on
researcher intuition. However, such intuition is difficult to obtain within the context of high-
throughput and combinatorial studies in which tens of thousands of data points may be
collected. We are developing automated optimization routines that couple forward models
(e.g., density functional theory calculations) with inverse optimizers (e.g., genetic algorithms
or Gaussian processes) to build a fully automated materials discovery system.
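The coupling of a forward model with an inverse optimizer can be sketched with a small genetic algorithm. Everything here is a toy under stated assumptions: `forward_model` is a cheap hypothetical stand-in for an expensive calculation such as DFT, and the encoding, population size, and mutation rate are arbitrary illustrative choices.

```python
import random


def forward_model(candidate):
    """Toy stand-in for an expensive calculation: higher score is better."""
    hidden_optimum = (3, 1, 4, 1, 5)  # arbitrary, for illustration only
    return -sum(abs(a - b) for a, b in zip(candidate, hidden_optimum))


def genetic_search(n_genes=5, n_choices=8, pop_size=20, generations=30, seed=0):
    """Evolve candidates toward a high forward-model score."""
    rng = random.Random(seed)
    population = [
        tuple(rng.randrange(n_choices) for _ in range(n_genes))
        for _ in range(pop_size)
    ]
    history = []  # best score seen in each generation
    for _ in range(generations):
        ranked = sorted(population, key=forward_model, reverse=True)
        history.append(forward_model(ranked[0]))
        parents = ranked[: pop_size // 2]  # selection: keep the top half
        children = [ranked[0]]  # elitism: carry the best candidate forward
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes)  # single-point crossover
            child = list(mother[:cut] + father[cut:])
            if rng.random() < 0.2:  # occasional random mutation
                child[rng.randrange(n_genes)] = rng.randrange(n_choices)
            children.append(tuple(child))
        population = children
    best = max(population, key=forward_model)
    return best, forward_model(best), history


best, best_score, history = genetic_search()
print(best, best_score)
```

In a real workflow each `forward_model` call would be costly, so results would typically be cached and the number of evaluations, not the number of generations, would be the budget to minimize.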
Left: schematic diagram showing
the integration of typical high-
throughput screening with
an automatic optimization routine.

Right: performance of genetic
algorithms (GAs) in uncovering
promising perovskite water splitting
materials versus random search
(black) and chemical rules (orange).
Download this poster: http://www.slideshare.net/anubhavster
