The PubChemQC Project 
A big data construction by first-principles 
calculations of molecules 
中田真秀(NAKATA Maho) 
ACCC RIKEN 
maho@riken.jp 
2014/12/3 10:35-11:05 
JST CREST International Symposium on Post 
Petascale System Software
Background 
• All matter is composed of atoms and molecules. 
• The dream of theoretical chemists: do chemistry 
without experiments! 
• On computers 
• We treat big data in chemistry! 
– Chemical space is really huge! 
• The number of candidate drug molecules is about 10^60 
(http://onlinelibrary.wiley.com/doi/10.1002/wcms.1104/abstract) 
• Cf. Exa: 10^18
Current status of computational 
chemistry 
• Relatively good agreement with experiments. 
• Can explain nature in many cases. 
– Many good quantum chemistry programs are 
available! 
– “DFT B3LYP 6-31G*” calculations rule! 
• We want to lead chemistry 
– So far, we only explain what has already happened.
Difference between experiment and 
calculation/theory 
• Finding interesting phenomena or problems 
– How do we convert CO2 to O2? N2+H2 to NH3? 
– How to synthesize a compound? 
• Design a key chemical reaction. 
• Calculations 
• Experiments 
– Analyze 
• Analysis of results 
• Propose new experiments 
Only One Difference
Difference between experiment and 
calculation/theory 
• No difference as science 
• The most important thing is curiosity! 
New insights from 
big data and 
my intuition! 
Unfortunately, there is not much easy-to-use 
big data for chemistry
Googling molecules 
+ 
Recommended molecules for you!
What is needed for Googling molecules? 
1. Types, kinds, variety of molecules 
– The number of molecules is infinite, but we cover the important ones 
2. Required properties of molecules 
– Molecular structure, energy, UV excitation energy, 
dipole moment 
3. Getting properties of molecules by calculation? 
– Accuracy of calculation, and computer resources… 
4. Coding or encoding molecules 
– IUPAC nomenclature is not suitable 
– Do not think about graph theory
Databases for lists of molecules 
• PubChem: 50,000,000 molecules listed, made by NIH, 
public domain, not curated (imported from catalogs, 
etc.), available via FTP. 
• ChemSpider: 28,000,000 entries, better curated, no 
FTP. Redistribution and download are restricted 
• Web-GDB13: 900,000,000 entries, just generated by 
combinatorics. No 
• ZINC, ChEMBL, DrugBank … 
• CAS: 70,000,000 molecules, proprietary 
• Nikkaji: 6,000,000, proprietary 
We use PubChem as the source of molecules
The PubChem
Ex. A molecule listed in PubChem
Database for molecular properties by 
experiments 
• Experiments are needed to obtain 
molecular properties. 
– No free comprehensive database is known so far. 
– Pharmaceutical companies do O(1,000,000) 
experiments for high-throughput screening. 
• Experiments are very costly! 
– Time-consuming, need large facilities, expensive, hazardous 
We do not do experiments!
Database for molecular properties by computer 
calculation 
• Gold standard method: “Density functional 
theory (B3LYP functional) + 6-31G(d) basis set” 
– Accuracy is quite satisfactory (1-10 kcal/mol) for 
biological systems and organic chemistry. 
– Good implementations are available. 
– Costs less (fast, needs only a supercomputer, no hazards) 
– Calculation times keep getting shorter 
• Intel Core i7 (esp. Sandy Bridge) is very fast. 
• Still we need huge resources, though. 
We calculate by computer instead!
What is a molecule? 
No rigorous definition for a molecule 
3D coordinates 
Hard to understand 
but rigorous 
Easy to understand 
but many corner cases 
Propionaldehyde 
wavefunction 
Common name 
IUPAC 
nomenclature 
Structure 
From Wikipedia
What is a molecule? 
• No rigorous definition for “what is a molecule” 
• Nomenclature 
– 3D coordinates of the nuclei 
– Structural formula 
– IUPAC nomenclature 
– Higher abstraction or less abstraction? 
• Better molecular encoding method? 
– Easy to understand for humans 
– Easy to understand for computers as well 
– Can describe most cases, with few corner cases. 
– Compromise between dream and reality
Encoding molecule : SMILES 
Encoding molecule 
IUPAC nomenclature 
tert-butyl N-[(2S,3S,5S)-5-[[4-[(1-benzyltetrazol-5-yl)methoxy]phenyl]methyl]-3-hydroxy-6-[[(1S,2R)-2-hydroxy-2,3-dihydro-1H-inden-1-yl]amino]-6-oxo-1-phenylhexan-2-yl]carbamate 
We can encode molecules 
• SMILES 
CN(C)CCOC12CCC(C3C1CCCC3)C4=CC=CC=C24 
• InChI, made by IUPAC 
InChI=1S/C20H29NO/c1-21(2)13-14-22-20-12-11-15(16-7-3-5-9-18(16)20)17-8-4-6-10-19(17)20/h3,5,7,9,15,17,19H,4,6,8,10-14H2,1-2H3 
… 
SMILES is a good encoding method for molecules
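As a minimal sketch of how machine-readable such an encoding is, the InChI string on this slide can be split into its layers with plain string handling. The helper names `inchi_layers` and `formula_counts` are hypothetical, and the formula parser is assumed to see only simple formulas such as C20H29NO (no charges, no multi-component formulas).

```python
import re

INCHI = ("InChI=1S/C20H29NO/c1-21(2)13-14-22-20-12-11-15(16-7-3-5-9-18(16)20)"
         "17-8-4-6-10-19(17)20/h3,5,7,9,15,17,19H,4,6,8,10-14H2,1-2H3")


def inchi_layers(inchi):
    """Split an InChI string into version, formula and prefixed layers."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    layers = {"version": parts[0], "formula": parts[1]}
    for part in parts[2:]:
        # Each further layer starts with a one-letter prefix,
        # e.g. 'c' (connectivity) or 'h' (hydrogens).
        layers[part[0]] = part[1:]
    return layers


def formula_counts(formula):
    """Count atoms per element in a simple formula such as 'C20H29NO'."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + int(number or 1)
    return counts


layers = inchi_layers(INCHI)
print(layers["formula"], formula_counts(layers["formula"]))
```

The formula layer gives 20 carbons, one nitrogen and one oxygen, which matches the heavy atoms written out in the SMILES string for the same molecule.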
What is SMILES? 
• Simplified Molecular Input Line Entry System 
– A linear representation of a molecule using ASCII. 
– Stereochemistry (chirality, double-bond geometry) is also encoded 
– Human readable, and also machine readable. 
– Almost one-to-one mapping between a molecule and 
SMILES via Universal SMILES 
• David Weininger at the USEPA Mid-Continent Ecology Division Laboratory invented SMILES 
• InChI by IUPAC 
– International Chemical Identifier: open standard (non-proprietary) 
– N. M. O’Boyle invented “Universal SMILES” via InChI
Example by SMILES 
http://en.wikipedia.org/wiki/SMILES 
Molecule, structure, SMILES 
• Nitrogen molecule, N≡N: N#N 
• Copper sulfate, Cu²⁺ SO₄²⁻: [Cu+2].[O-]S(=O)(=O)[O-] 
• Oenanthotoxin: CCC[C@@H](O)CCC=CC=CC#CC#CC=CCO 
• Vitamin B1: OCCc1c(C)[n+](=cs1)Cc2cnc(C)nc(N)2 
• Aflatoxin B1: O1C=C[C@H]([C@H]1O2)c3c2cc(OC)c4c3OC(=O)C5=C4CCC(=O)5
Some corner cases 
Two different SMILES for Ferrocene 
• C12C3C4C5C1[Fe]23451234C5C1C2C3C45 
• [CH-]1C=CC=C1.[CH-]1C=CC=C1.[Fe+2]
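Both strings describe the same atoms even though they look unrelated. A minimal sketch, assuming a regex tokenizer that only understands bracket atoms and the organic subset (it does not validate SMILES and ignores implicit hydrogens), shows that a simple composition check already treats the two ferrocene encodings as equal. The function name `heavy_atom_counts` is hypothetical.

```python
import re
from collections import Counter

# One token per atom: a bracket atom, a two-letter halogen, or a
# one-letter atom from the organic subset (upper case or aromatic).
ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")
# Inside brackets: optional isotope number, then the element symbol.
BRACKET = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")


def heavy_atom_counts(smiles):
    """Count non-hydrogen atoms per element in a SMILES string."""
    counts = Counter()
    for token in ATOM.findall(smiles):
        if token.startswith("["):
            symbol = BRACKET.match(token).group(1)
        else:
            symbol = token
        symbol = symbol.capitalize()  # aromatic 'c' counts as 'C'
        if symbol != "H":
            counts[symbol] += 1
    return dict(counts)


ferrocene_a = "C12C3C4C5C1[Fe]23451234C5C1C2C3C45"
ferrocene_b = "[CH-]1C=CC=C1.[CH-]1C=CC=C1.[Fe+2]"
print(heavy_atom_counts(ferrocene_a) == heavy_atom_counts(ferrocene_b))
```

Equal compositions only show that two strings may describe the same molecule; deciding that they really do needs a canonicalization step from a cheminformatics toolkit.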
Now it's my turn
Construction of ab initio chemical 
database 
• Molecular information is from PubChem 
• Properties are calculated from first principles by 
computer 
– Many program packages are available 
– DFT (B3LYP) 
– 6-31G(d) basis set and geometry optimization 
– Excited-state calculation by TD-DFT with 6-31+G(d) 
– Best for organic molecules or biomolecules 
• Molecular encoding: SMILES / InChI 
• Huge computer resources 
• Dream come true 
– Google-like search engine for chemistry
The PubChemQC Project 
• http://pubchemqc.riken.jp/ 
• An open database for molecules 
– Public domain 
• Ab initio (first-principles) calculation of 
molecular properties of PubChem compounds 
• 2014/1/15: 13,000 molecules 
• 2014/7/29 : 155,792 molecules 
• 2014/10/30 : 906,798 molecules 
• 2014/12/3 : 1,137,286 molecules
The PubChemQC project 
http://pubchemqc.riken.jp/ 
WIP: no search engine, just data
PubChemQC 
http://pubchemqc.riken.jp/
Related works 
– NIST Chemistry WebBook 
• http://webbook.nist.gov/chemistry/ 
• Small number of molecules. Compares many methods 
– Harvard Clean Energy Project 
• http://cleanenergy.molecularspace.org/ 
• 25,000,000 (?) molecules for photovoltaic devices, generated by 
combinatorics 
– Sugimoto et al.: 2013 CBI symposium poster 
• Almost the same as our database, currently not open to the 
public (now??)
How do we do it? 
• Generate initial 3D conformation by OpenBABEL 
– The SDF contains a 3D conformation, but we don’t use it. 
– OpenBABEL -h (add hydrogens) --gen3d (generate 3D 
coordinates) 
• Ab initio calculation by GAMESS + Firefly 
– Using Gaussian can lead to a political problem(?) 
– PM3 optimization 
– Hartree-Fock/STO-6G geometry optimization 
– Firefly + GAMESS geometry optimization in B3LYP/6-31G* 
– Ten excitation energies by TDDFT/6-31+G* (no geometry 
optimization)
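The first step can be sketched as a command line built in Python. This is only an illustration, not the project's actual script: the function name `obabel_gen3d_command` and the output file name are hypothetical, and the command is built but not executed. The stage list restates the slide; that each stage starts from the previous stage's geometry is an assumption.

```python
def obabel_gen3d_command(smiles, out_path):
    """Build (but do not run) an Open Babel command line.

    -:SMILES  reads the SMILES string given on the command line
    -O        names the output file (format taken from the extension)
    -h        adds hydrogens; --gen3d generates 3D coordinates
    """
    return ["obabel", "-:" + smiles, "-O", out_path, "-h", "--gen3d"]


# Stage order as listed on the slide. Assumption: each stage starts
# from the geometry produced by the previous one.
STAGES = [
    ("PM3", "geometry optimization"),
    ("HF/STO-6G", "geometry optimization"),
    ("B3LYP/6-31G*", "geometry optimization"),
    ("TDDFT/6-31+G*", "ten excitation energies, no geometry optimization"),
]

print(" ".join(obabel_gen3d_command("CCO", "ethanol.mol")))
for method, task in STAGES:
    print(f"{method}: {task}")
```

Passing the list to `subprocess.run` would execute the conversion on a machine where Open Babel is installed.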
How do we do it? 
• Heavily using OpenBABEL 
• Extracting molecular information 
– Sort PubChem compounds by molecular weight 
– OpenBABEL 
• Encoded by SMILES 
– Isomeric SMILES: stereochemistry retained 
– OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@@H](O)1 
– CCC[C@@H](O)CCC=CC=CC#CC#CC=CCO 
– CC(=O)OCCC(/C)=CC[C@H](C(C)=C)CCC=C
Our path from a PubChem compound to a 
quantum chemistry calculation 
aflatoxin 
O1C=C[C@H]([C@H]1O2)c3c2cc(OC)c4c3OC(=O)C5=C4CCC(=O)5 
Ab initio calculation by 
OpenBABEL
Final results will be 
• Uploaded to http://pubchemqc.riken.jp/ 
• Currently we upload 
– Input files (ground / excited state) 
– Output files (ground / excited state) 
– Final geometry as a Mol file
Scaling of computation 
• Embarrassingly parallel over molecules 
• Very roughly speaking, the time required for a 
calculation scales like N^4 
– N: molecular weight 
• The problems are very hard (complexity theory) 
– Hartree-Fock calculation 
– DFT (B3LYP) calculation 
– Geometry optimization 
• In practice, many molecules can be solved 
efficiently
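To get a feel for the rough N^4 rule, the sketch below compares the cost of one molecule with the cost at a reference molecular weight. The reference value of 160 and the function name `relative_cost` are illustrative assumptions; real timings also depend on the elements present, convergence behavior and the hardware.

```python
def relative_cost(mw, ref_mw=160.0, exponent=4):
    """Cost of one molecule relative to a reference molecular weight."""
    return (mw / ref_mw) ** exponent


for mw in (160, 200, 300, 500):
    print(f"MW {mw}: about {relative_cost(mw):.1f}x the cost at MW 160")
```

Under this rule a molecule of MW 300 costs roughly 12 times as much as one of MW 160, and one at MW 500 nearly 100 times as much.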
Computer Resources 
• RICC: (Intel Xeon 5570 Westmere, 2.93 GHz, 8 
cores/node) x 1000 
– 1000-10000 molecules/day (MW 160) 
– Heavily depends on the load from other users 
– Time limit: 8 hours 
• Quest: (Intel Core 2 Duo, 1.6 GHz/node) x 700 
– 3000-8000 molecules/day (MW 160) 
– 100-1000 molecules/day (MW 200-300) 
– Time limit: 20 hours 
• Compounds whose calculations fail are skipped 
for now.
Computer Resources 
• Storage 
– Approx. 500GB for 1,000,000 molecules (xz 
compressed) 
– Approx. 20 TB for 40,000,000 molecules (xz 
compressed)
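The two storage estimates are consistent with each other, as a short calculation shows. It assumes decimal units (1 GB = 10^9 bytes) and that the average compressed size per molecule stays the same as the database grows.

```python
GB = 10**9
TB = 10**12

# Average compressed size per molecule, from the first estimate.
bytes_per_molecule = 500 * GB / 1_000_000
# Projection to 40,000,000 molecules at the same average size.
projected_total = bytes_per_molecule * 40_000_000

print(f"{bytes_per_molecule / 10**6:.1f} MB per molecule")
print(f"{projected_total / TB:.0f} TB for 40,000,000 molecules")
```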
Molecular weight and Lipinski Rule 
• Lipinski’s rule of five (Pfizer's rule of five): rule of 
thumb for drug discovery 
• No more than 5 hydrogen bond donors 
• No more than 10 hydrogen bond acceptors 
• A molecular mass less than 500 daltons 
• An octanol-water partition coefficient log P not greater than 5 
• A molecular weight below 500 is also 
very convenient for computational chemistry 
– For routine calculations without experimental data 
other than the molecular formula 
– If larger than 500, secondary or higher-order structure 
becomes important, e.g., in proteins
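The four criteria translate directly into a small check. This is a sketch: the `Descriptors` class and `rule_of_five_violations` are hypothetical names, and the descriptor values in the usage lines are made-up illustrations, not data for real compounds.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    h_bond_donors: int
    h_bond_acceptors: int
    molecular_mass: float   # daltons
    log_p: float            # octanol-water partition coefficient


def rule_of_five_violations(d):
    """Return the criteria listed on the slide that are violated."""
    violations = []
    if d.h_bond_donors > 5:
        violations.append("more than 5 hydrogen bond donors")
    if d.h_bond_acceptors > 10:
        violations.append("more than 10 hydrogen bond acceptors")
    if d.molecular_mass >= 500:
        violations.append("molecular mass not less than 500 daltons")
    if d.log_p > 5:
        violations.append("log P greater than 5")
    return violations


# Illustrative values only.
small = Descriptors(h_bond_donors=1, h_bond_acceptors=4,
                    molecular_mass=180.2, log_p=1.2)
large = Descriptors(h_bond_donors=6, h_bond_acceptors=12,
                    molecular_mass=650.0, log_p=5.5)
print(rule_of_five_violations(small))
print(len(rule_of_five_violations(large)))
```

For this project only the mass criterion matters as a cutoff; the other descriptors would have to come from a cheminformatics toolkit or from PubChem itself.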
Molecular Weight distribution at 
PubChem 
Lipinski limit MW=500 
We are still here 
30,000,000 molecules 
(excluding mixtures)
How long will it take to finish? 
• For drug design, we need to calculate all 
molecules with MW < 500 
• Total 30,000,000 molecules 
– This number may increase in the future 
• Current (2014/12/4): 1,100,000 molecules 
– Only about 3.7% 
• 10,000 molecules/day -> 8.2 years
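The estimate follows from simple division. The sketch below reproduces it and also gives the time for the molecules still remaining; it assumes a constant rate of 10,000 molecules per day and a fixed total, both of which the slides say can change.

```python
TOTAL = 30_000_000   # molecules with MW < 500
DONE = 1_100_000     # calculated as of 2014/12/4
RATE = 10_000        # molecules per day, assumed constant

fraction_done = DONE / TOTAL
years_total = TOTAL / RATE / 365
years_remaining = (TOTAL - DONE) / RATE / 365

print(f"done: {fraction_done:.1%}")
print(f"all molecules: {years_total:.1f} years")
print(f"remaining molecules: {years_remaining:.1f} years")
```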
How long will it take to finish? 
• 10+ years? No, maybe far less. 
• 25 years ago (1990), computers were very slow 
– Even ab initio calculations were very difficult on a 
486DX@25MHz or 
68000@10MHz
Outlook, prospect, hope… 
• Far better in silico screening 
– Fewer or no experiments are necessary 
• Even faster calculation using machine learning 
– 10,000 molecules / second? 
– Using our data as the training set. 
– Not difficult for bio or organic molecules 
– Far better initial guess 
• Database for chemical reactions 
– Precise calculation is required 
– GRRM method + machine learning (?) 
• Geometry optimization for proteins (PDB) 
– Only X-ray crystal structures are available 
http://pubchemqc.riken.jp/
Difficulties in this project 
• Parameters needed for calculations vary by 
molecule 
• Properties can differ depending on the initial guess 
• Computer resources 
– Raspberry Pi? NVIDIA Jetson? BOINC? 
• Molecular encoding never ends 
– SMILES and InChI are not complete 
– Some corner cases may be chemically interesting.
