ChemNLP
A Natural Language Processing based Library
for Materials Chemistry Text Data
Kamal Choudhary
https://jarvis.nist.gov/
NIST, Gaithersburg, MD, USA
Polymer group
7/13/2023
Joint Automated Repository for Various Integrated Simulations
Outline
• Introduction
• AI for Materials
• JARVIS
• NLP basics
• ChemNLP
• Datasets
• TextClassification
• TokenClassification
• WebApp
• TextSummarization
• TextGeneration
• Integrating DFT database
• JARVIS-Leaderboard/benchmarking
• Hands-on
• Summary
Electronic structure: DFT, DMFT, TB, QMC
Quantum Computation: AtomQC
Force-Field: JARVIS-FF, ALIGNN-FF
AI/ML: CFID, ALIGNN, AtomVision, ChemNLP
AI for Materials Science
Established: January 2017
Published: >40 articles
Users: >20,000 worldwide
Materials: >80,000, with millions of properties
Events:
• Quantum Matters in Materials Science (QMMS)
• Artificial Intelligence for Materials Science (AIMS)
• JARVIS-School
User-comments:
• “There are many different theoretical levels on which you can
approach the field. JARVIS is unusual in that it spans more levels
than other databases.”
• “A pure gold-mine for the data-quality effort…”
• “Thanks for your generous sharing. Your works inspire me a lot.”
• “You guys are doing something really beneficial…”
• “I find JARVIS-DFT very useful for my research…”
JARVIS: Databases, Tools, Events, Outreach
https://jarvis.nist.gov
Requires login credentials, free registration
Updates
• 80,000 materials
• QMC, tight binding, ALIGNN, ALIGNN-FF,
• AtomVision, ChemNLP, JARVIS-Leaderboard
• Quantum Computation algorithms
• Superconductors (bulk and 2D), magnetic topological materials
Recent Updates to JARVIS
Tools
Used for hands-on session!
ChemNLP
Text classification & Token classification
Text summarization & Text-generation
Conventional NLP: TFIDF
https://www.kaggle.com/code/ashoksrinivas/nlp-with-tfidf-neural-networks
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
https://i.stack.imgur.com/mtmP6.png
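The TFIDF weighting named on this slide can be sketched in a few lines. This is a minimal illustration using the textbook form, term frequency times ln(N / document frequency), and not the smoothed, normalized variant that scikit-learn's TfidfVectorizer applies by default. The toy corpus and the function name `tfidf` are assumptions made for illustration and are not taken from ChemNLP.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document using tf * ln(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


# Hypothetical toy corpus, not taken from the ChemNLP datasets.
docs = [
    "superconductor with high critical temperature",
    "polymer with high tensile strength",
    "superconductor thin film growth",
]
weights = tfidf(docs)
```

Terms that occur in fewer documents (here "critical") receive a larger weight than terms shared across documents (here "superconductor"), which is what makes the vectors useful as classifier features.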
Transformers & “Attention Is All You Need”
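Generally outperform recurrent models (RNN, LSTM, etc.) on NLP tasks
Attention: every token can attend to every other token in the context window, giving very long-range memory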
https://towardsdatascience.com/illustrated-guide-to-transformers-step-by-step-explanation-f74876522bc0
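The attention operation at the core of the transformer, softmax(Q K^T / sqrt(d_k)) V, can be sketched in plain Python. This is a simplified illustration under stated assumptions: a single head, no learned projection matrices, and a hypothetical three-token input. A real transformer layer adds learned query, key and value projections, multiple heads and positional information.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(w)
    return outputs, all_weights


# Hypothetical 3-token sequence with 2-dimensional embeddings (self-attention).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
```

Each row of `weights` sums to one, and every token's output mixes information from all positions in a single step, regardless of distance in the sequence.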
ChemNLP Datasets
Exploratory data analysis (EDA)
ChemNLP: A Natural Language Processing based Library for Materials Chemistry Text Data
Accuracy
Text classification
Named Entity Recognition/Token classification
87 % F1 score
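For reference, an F1 score for token classification can be computed as below. The slide does not state whether the 87 % figure is token-level or entity-level, or how it is averaged, so this sketch assumes a token-level, micro-averaged score that ignores the outside label "O". The label names "MAT" and "PRO" are hypothetical.

```python
def token_f1(true_labels, pred_labels, outside="O"):
    """Micro-averaged precision, recall and F1 over non-outside tokens."""
    tp = fp = fn = 0
    for t, p in zip(true_labels, pred_labels):
        if p != outside and p == t:
            tp += 1
        else:
            if p != outside:
                fp += 1  # predicted an entity label that is wrong
            if t != outside:
                fn += 1  # missed (or mislabeled) a true entity token
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical token labels for a five-token sentence.
true = ["MAT", "O", "PRO", "O", "MAT"]
pred = ["MAT", "O", "O", "PRO", "MAT"]
precision, recall, f1 = token_f1(true, pred)
```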
ChemNLP Webpage for Composition Search
https://jarvis.nist.gov/jarvischemnlp/
ChemNLP
Abstractive summarization (Abstract to Title)
Google’s T5 transformer model (225 million parameters)
ROUGE-1 score: 46.5 %
Text generation (Title to Abstract)
GPT-2 medium LLM
ROUGE-1 score: 32 %
Without fine-tuning: 26 %
ROUGE: Recall-Oriented Understudy for Gisting Evaluation
T5: Text-to-Text Transfer Transformer
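ROUGE-1 measures unigram overlap between a generated text and its reference. The sketch below is a simplified version that assumes lowercased whitespace tokenization with no stemming. The slide does not say whether the reported scores are precision, recall or F-measure, so all three are returned. The title pair is hypothetical.

```python
from collections import Counter


def rouge1(reference, candidate):
    """Unigram-overlap ROUGE-1 as (precision, recall, F1)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# Hypothetical title pair, not taken from the ChemNLP test set.
reference = "superconductivity in doped graphene layers"
candidate = "superconductivity in graphene"
precision, recall, f1 = rouge1(reference, candidate)
```

Here all three candidate words appear in the five-word reference, so precision is 1.0, recall is 0.6 and F1 is 0.75.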
Other test cases
ChemNLP for superconductors
Confusion matrix for text classification (137,927 articles)
• arXiv cond-mat.supr-con and JARVIS-SuperconDB
• Venn diagram for chemical formulas
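Both figures on this slide reduce to simple counting, sketched below. The class labels, chemical formulas and function names are hypothetical placeholders and do not reproduce the arXiv or JARVIS-SuperconDB data.

```python
from collections import Counter


def confusion_matrix(true_labels, pred_labels):
    """Count (true, predicted) label pairs; rows are true classes."""
    labels = sorted(set(true_labels) | set(pred_labels))
    counts = Counter(zip(true_labels, pred_labels))
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix


def venn_counts(a, b):
    """Sizes of the three regions of a two-set Venn diagram."""
    a, b = set(a), set(b)
    return {"only_a": len(a - b), "both": len(a & b), "only_b": len(b - a)}


# Hypothetical labels and formulas for illustration only.
true = ["supr-con", "supr-con", "mtrl-sci", "mtrl-sci"]
pred = ["supr-con", "mtrl-sci", "mtrl-sci", "mtrl-sci"]
labels, matrix = confusion_matrix(true, pred)

arxiv_formulas = ["MgB2", "NbN", "YBa2Cu3O7"]
database_formulas = ["MgB2", "NbN", "Nb3Sn"]
regions = venn_counts(arxiv_formulas, database_formulas)
```

Diagonal entries of `matrix` count correctly classified articles, and `regions["both"]` counts the formulas found in both the text corpus and the database.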
ChatGPT response
JARVIS-Leaderboard: Large Scale Benchmark
Challenges in materials science community:
• Reproducibility
• Transparency
• Validation
• Fidelity
• Data vs. metadata
• What is the ground truth/reference to compare our models to?
How does this change depending on the model?
• Synergy of computational and experimental databases
Community effort to tackle challenges:
https://pages.nist.gov/jarvis_leaderboard/
JARVIS-Leaderboard: Contributors
• Growing list of collaborators
• Multi-institutional effort
• Contributions from the community are welcome and encouraged!
JARVIS-Leaderboard: Methods and Data
Types of Data:
• Atomic structure (Molecule, Crystal)
• Material Property (Bandgap, bulk modulus)
• Images (Microscopy: SEM, TEM, STM)
• Spectra (Diffraction: X-ray, Neutron, PL)
• Text (Research articles, notebooks, blogs)
• Eigensolver (Quantum Computation algorithms)
1) Electronic Structure
2) Artificial Intelligence
3) Force Field
4) Quantum Computation
5) Experiment
JARVIS-Leaderboard: Benchmarks
Contributions
1) Electronic Structure
2) Artificial Intelligence
3) Force Field
4) Quantum Computation
5) Experiment
Benchmarks (reference point)
1) Experiment/s
2) Test dataset
3) Electronic Structure
4) Analytical results
5) Other Experiments
Error metrics
*Benchmarks must be well-defined with an associated DOI
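As an illustration of scoring a contribution against a benchmark reference, the sketch below computes the mean absolute error (MAE), one commonly used error metric. The slide only says "Error metrics", so the choice of MAE, the identifiers and the band-gap values are assumptions, and the leaderboard's actual file format is not reproduced here.

```python
def mean_absolute_error(reference, predictions):
    """MAE over the entries present in both reference and predictions."""
    shared = sorted(set(reference) & set(predictions))
    if not shared:
        raise ValueError("no overlapping ids between reference and predictions")
    return sum(abs(reference[i] - predictions[i]) for i in shared) / len(shared)


# Hypothetical ids and band gaps (eV); not real leaderboard data.
reference = {"id-1": 1.10, "id-2": 0.00, "id-3": 3.40}
contribution = {"id-1": 1.00, "id-2": 0.20, "id-3": 3.10}
mae = mean_absolute_error(reference, contribution)
```

Keying both sides by identifier means a contribution is only scored on entries that exist in the fixed reference set, so different methods are compared on the same data points.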
JARVIS-Leaderboard: Snapshot
Hands-on session notebooks (later)
Natural Language Processing [44,45]
1. ChemNLP example (Part I)
2. ChemNLP example (Part II)
JARVIS-Leaderboard [5]
Analyzing benchmarks in the JARVIS-Leaderboard
Summary
• NIST-JARVIS infrastructure with multiple components
• ChemNLP currently targets solids; extension to polymers is planned…
• Several events to engage (sign-up today & Demo!)
• Continuously growing, contribute, collaborate…
https://jarvis.nist.gov
https://github.com/usnistgov/jarvis
https://github.com/usnistgov/alignn
https://github.com/usnistgov/atomvision
https://github.com/usnistgov/chemnlp
https://github.com/usnistgov/atomqc
https://github.com/usnistgov/jarvis_leaderboard
Email: kamal.choudhary@nist.gov,
@dr_k_choudhary
@knc6
Slides: https://www.slideshare.net/KAMALCHOUDHARY4
