Text-mining to produce large chemistry
datasets for community access
Valery Tkachenko1, Aileen Day1, Daniel Lowe2, Igor Tetko3, Carlos Coba4, Antony Williams5
1 Royal Society of Chemistry, UK
2 NextMove Software, UK
3 HelmholtzZentrum München, Germany
4 Mestrelab Research, Santiago de Compostela, Spain
5 EPA, US
ACS Fall 2015
Boston, MA
August 17th 2015
ChemSpider
Refs - we live in a linked world
Properties
ChemSpider spectra
Knowledge systems
Datastore
Raw data
Data in process
Data out process
UI, API,
Services, etc
RSC Archive – since 1841
Prospecting RSC articles
Further work – properties and spectra mining
Text mining of the chemical documents
Term                Examples of text matched
FromLiterature      “lit.”
MeltingPoint        “mpt”, “melting point”, “m.p.”
Qualifier           “>”; “approximately”
Value               “75° C”, “200° F”, “one hundred degrees Celsius”
Range               “184-186° C”, “191.5 to 192.4° C”
MeasurementError    “50±° C”
OutcomeQualifier    “decomp.”, “with decomposition”, “subl.”
Pattern: FromLiterature? MeltingPoint Qualifier? (Value | Range | MeasurementError) OutcomeQualifier?
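The term pattern above can be approximated with a regular expression. The following is only a minimal sketch, not the grammar used for the actual extraction: the names `MP_PATTERN` and `parse_melting_point` are hypothetical, spelled-out values such as “one hundred degrees Celsius” are not handled, and a measurement error is assumed to carry a number after the “±”.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"   # e.g. 75 or 191.5
UNIT = r"°\s?[CF]"          # "° C", "°C" or "° F"

# Hypothetical pattern mirroring the slide's term order; not the original grammar.
MP_PATTERN = re.compile(
    rf"""
    (?P<from_literature>lit\.\s*)?                      # FromLiterature?
    (?P<melting_point>melting\s+point|m\.p\.|mpt)       # MeltingPoint
    \s*[:=]?\s*
    (?P<qualifier>>|approximately)?\s*                  # Qualifier?
    (?:
        (?P<range>{NUMBER}\s*(?:-|to)\s*{NUMBER}\s*{UNIT})        # Range
      | (?P<measurement_error>{NUMBER}\s*±\s*{NUMBER}\s*{UNIT})   # MeasurementError
      | (?P<value>{NUMBER}\s*{UNIT})                              # Value
    )
    (?:\s*\(?(?P<outcome_qualifier>decomp\.|with\s+decomposition|subl\.)\)?)?  # OutcomeQualifier?
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_melting_point(text):
    """Return the matched terms as a dict, or None if no melting point is found."""
    match = MP_PATTERN.search(text)
    if match is None:
        return None
    # Keep only the terms that were actually present in the text.
    return {name: value.strip() for name, value in match.groupdict().items() if value is not None}


if __name__ == "__main__":
    for sample in ("m.p. 184-186° C (decomp.)", "lit. mpt >300° C"):
        print(sample, "->", parse_melting_point(sample))
```

Trying the Range and MeasurementError alternatives before Value keeps a range from being read as a single temperature.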
Why MP?
Used for water solubility prediction
Yalkowsky equation:
logS = 0.5 – 0.01(MP-25) – log Kow
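As a worked example of the Yalkowsky equation, here is a minimal sketch. The function name and the example inputs are illustrative only. Setting the melting point term to zero for compounds that melt below 25 °C is a common convention for this equation, but it is an assumption here and is not stated on the slide.

```python
def yalkowsky_log_s(mp_celsius, log_kow):
    """Estimate logS from melting point (in °C) and log Kow.

    Implements logS = 0.5 - 0.01 * (MP - 25) - log Kow.
    Assumption: for compounds melting below 25 °C the melting
    point term is taken as zero.
    """
    mp_term = max(mp_celsius - 25.0, 0.0)
    return 0.5 - 0.01 * mp_term - log_kow


if __name__ == "__main__":
    # Illustrative inputs, not measured data: MP = 125 °C, log Kow = 2.0
    # gives 0.5 - 0.01 * 100 - 2.0 = -2.5
    print(yalkowsky_log_s(125.0, 2.0))
```

Each 100 °C of melting point above 25 °C lowers the predicted logS by one unit, which is why erroneous melting points translate directly into solubility prediction errors.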
Detecting suspicious melting points
• Value was greater than 500° C
• Value was a range wider than 50° C
• Value was a range where the second temperature was lower than the first temperature
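The three checks for suspicious melting points could be sketched as below. The function name and the reason labels are hypothetical. Applying the 500 °C limit to either end of a range is an assumption, and values are assumed to have been converted to °C already.

```python
def suspicious_reasons(first_c, second_c=None):
    """Return the reasons a melting point (single value or range, in °C) looks suspicious.

    first_c is the single value or the first temperature of a range;
    second_c is the second temperature of a range, or None for a single value.
    """
    reasons = []
    highest = first_c if second_c is None else max(first_c, second_c)
    if highest > 500:
        reasons.append("greater_than_500")
    if second_c is not None:
        if second_c < first_c:
            reasons.append("second_lower_than_first")
        elif second_c - first_c > 50:
            reasons.append("range_wider_than_50")
    return reasons


if __name__ == "__main__":
    print(suspicious_reasons(184, 186))      # plausible range
    print(suspicious_reasons(192.4, 191.5))  # reversed range
```

Returning the reasons rather than a single flag makes it possible to keep flagged records for review instead of discarding them.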
300k Melting Point Datasets
Bergström   277
Bradley     2886
OCHEM       22404
Enamine     21883
Patents     228079
Tetko et al., J. Cheminformatics, in preparation
Melting point model: data distribution
Some modeling highlights
LibSVM grid search was used to select parameters (ca. 1.5 years of CPU time spent on optimization)
Largest model (by number of descriptors):
668k descriptors (MolPrint), ~0.2 trillion entries
Biggest model (by size):
618 Mb (Dragon descriptors)
Most accurate model: consensus, average of 5 models
RMSE < 32 °C for the drug-like region, MP [50, 250] °C
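To make the consensus and RMSE figures concrete, here is a minimal sketch of averaging several models and scoring only the drug-like region. The function names are hypothetical and the numbers in the usage lines are made up for illustration. Selecting compounds by their experimental melting point is also an assumption.

```python
import math


def consensus(predictions_per_model):
    """Average the predictions of several models, compound by compound."""
    n_models = len(predictions_per_model)
    return [sum(values) / n_models for values in zip(*predictions_per_model)]


def rmse_in_region(predicted, observed, low=50.0, high=250.0):
    """RMSE over compounds whose observed melting point lies in [low, high] °C."""
    pairs = [(p, o) for p, o in zip(predicted, observed) if low <= o <= high]
    return math.sqrt(sum((p - o) ** 2 for p, o in pairs) / len(pairs))


if __name__ == "__main__":
    # Two toy models predicting two compounds (made-up values, in °C).
    averaged = consensus([[100.0, 200.0], [110.0, 190.0]])
    print(averaged)
    # The second compound (observed 300 °C) falls outside [50, 250] and is ignored.
    print(rmse_in_region(averaged, [100.0, 300.0]))
```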
Prediction error
NMR data
• Extracted from 1976-2014 USPTO applications
Nucleus    Count
H          975543
C          56536
unknown*   44306
F          9429
P          3241
B          91
Si         62
Sn         22
Se         11
N          8
*unknown: the text starts off with “NMR:” followed by a peak list (no nucleus given)
NMR text mining
• We can find and index text spectra:
13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
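A text spectrum like the 13C example can be split into a header (nucleus, solvent, frequency) and a peak list. The sketch below is an assumption about how such parsing might look, not the extraction code behind these slides. All names are hypothetical, and shift ranges such as “7.18–7.94” are skipped for brevity.

```python
import re

HEADER = re.compile(
    r"(?P<nucleus>\d+[A-Z][a-z]?)[\s-]*NMR\s*\((?P<conditions>[^)]*)\)\s*:\s*"
    r"(?:δ\s*=\s*)?(?P<peaks>.*)",
    re.DOTALL,
)
PEAK = re.compile(r"(?P<shift>\d+(?:\.\d+)?)\s*(?:\((?P<annotation>.*)\))?", re.DOTALL)


def split_top_level(text):
    """Split on commas that are not inside parentheses."""
    parts, current, depth = [], [], 0
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        if char == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    if current:
        parts.append("".join(current).strip())
    return parts


def parse_text_spectrum(text):
    """Parse 'NUCLEUS NMR (conditions): peaks' into a dict, or return None."""
    match = HEADER.search(text)
    if match is None:
        return None
    solvent, frequency_mhz = None, None
    for item in (c.strip() for c in match.group("conditions").split(",")):
        freq = re.fullmatch(r"(\d+(?:\.\d+)?)\s*MHz", item, re.IGNORECASE)
        if freq:
            frequency_mhz = float(freq.group(1))
        elif item:
            solvent = item  # free text; whatever is not the frequency
    peaks = []
    for part in split_top_level(match.group("peaks")):
        peak = PEAK.fullmatch(part)
        if peak:  # fragments that are not 'shift (annotation)' are skipped
            peaks.append((float(peak.group("shift")), peak.group("annotation")))
    return {
        "nucleus": match.group("nucleus"),
        "solvent": solvent,
        "frequency_mhz": frequency_mhz,
        "peaks": peaks,
    }


if __name__ == "__main__":
    sample = "13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 117.72, 130.53 (ArCH)"
    print(parse_text_spectrum(sample))
```

In this sketch an annotation such as “(ArCH)” is attached only to the peak it directly follows, although in the source text it describes the whole preceding run of shifts.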
NMR extracted by year of publication
[Chart: cumulative distinct NMR extracted (y axis, 0 to 2,500,000) against year of publication (x axis, 1976 to 2014); series: USPTO grants and USPTO applications]
NMR solvents
CDCl3        48.5%
DMSO-d6      38.3%
CD3OD        8.7%
D2O          1.1%
Acetone-d6   1.0%
MeOD         1.0%
Others       1.4%
Others: CD2Cl2, CD3CN-d3, C6D6, Pyridine-d5, THF-d8, CD3Cl, dimethylformamide-d7, d1-trifluoroacetic acid, methanol-d3, acetic acid-d4, toluene-d8, sulfuric acid-d2, 1,1,2,2-tetrachloroethane-d2, CD3OCD3, dioxane-d8, 1,2-dichloroethane-d4
1H-NMR frequency over time
[Chart: 1H-NMR frequency (y axis, 0 to 450 MHz) against year of patent filing (x axis, 1976 to 2014)]
Mestrelab Mnova NMR
1H NMR (CDCl3, 400 MHz):
δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t,
1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz,
C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane),
30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19,
120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42,
123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
Detecting suspicious NMR spectra
• Last peak of the NMR spectrum is unannotated and either:
– All other peaks are annotated
– Spectrum has 1 peak and is proton or unknown NMR
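One possible reading of the suspicious-spectrum rule is sketched below. The function name is hypothetical. The interpretation is an assumption: a single-peak spectrum is flagged only when it is proton or unknown NMR, and a spectrum with several peaks is flagged only when every peak except the last carries an annotation.

```python
def is_suspicious_spectrum(annotations, nucleus):
    """Flag a spectrum whose last peak is unannotated, suggesting truncated extraction.

    annotations: one entry per peak, in order; None means the peak has no annotation.
    nucleus: e.g. "1H", "13C" or "unknown".
    """
    if not annotations or annotations[-1] is not None:
        return False
    if len(annotations) == 1:
        # A lone bare shift is only flagged for proton or unknown spectra.
        return nucleus in ("1H", "unknown")
    # Several peaks: suspicious when only the last one lacks an annotation.
    return all(annotation is not None for annotation in annotations[:-1])


if __name__ == "__main__":
    # "1H-NMR (400 MHz, d6-Acetone): 11.8-10.8 (brs, 1H), 7.78"
    print(is_suspicious_spectrum(["brs, 1H", None], "1H"))
```

Requiring all earlier peaks to be annotated avoids flagging 13C peak lists, where bare shifts are normal.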
> <SuspiciousValue>
true
> <Value>
1H-NMR (400 MHz, d6-Acetone): 11.8-10.8 (brs, 1H), 7.78
Comments: Only the labile proton is reported in the spectrum. The other aromatic and aliphatic protons are completely missing in the spectrum.
> <SuspiciousValue>
true
> <Value>
1H-NMR (400 MHz, CDCl3): 6.85 (1H, d, J=7.8 Hz), 6.10 (1H, dd, J=7.8 and 2.2 Hz), 6.06 (1H, d, J=2.2 Hz), 4.66 (1H, m), 3.75 (4H, br s), 3.40 (2H, s), 1.97
Comments: There are only 11 protons reported in the spectrum whilst the molecule contains more than 50 protons.
Knowledge systems
Datastore
Raw data
Data in process
Data out process
UI, API,
Services, etc
Synthetic chemistry article
Compounds
Reaction
Analytical Data
Text and References
RSC Databases
RSC Compounds
RSC Reactions
RSC Spectra
RSC Crystals
RSC Polymers
RSC Materials
RSC Assays
RSC Algorithms
RSC Models
…and on…
Input pipeline
Inputs: Web UI for unified depositions; DropBox, Google Drive, SkyDrive, etc; ELNs, templated data input; Documents; API, FTP, etc
Other labels: Experiments, Research, etc
Deposition Gateway
Raw data
Modules: Compounds Module, Spectra Module, Reactions Module, Materials Module, Textmining Module
Validated data
Staging databases
Databases: Compounds, Reactions, Spectra, Crystals, Materials
All databases are sliced by data sources/data collections and have a simple security model where each data slice/source is private, public or embargoed
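The security model described for the input pipeline, where each data slice or source is private, public or embargoed, could be represented along these lines. This is purely an illustrative sketch: the class and field names, the notion of an owner, and the rule that an embargoed slice becomes readable on its embargo date are all assumptions, not details taken from the RSC repository.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Visibility(Enum):
    PRIVATE = "private"
    PUBLIC = "public"
    EMBARGOED = "embargoed"


@dataclass(frozen=True)
class DataSlice:
    source: str                            # data source / data collection name
    visibility: Visibility
    owner: str                             # assumed: the depositor can always read
    embargo_until: Optional[date] = None   # assumed: only used when embargoed


def can_read(data_slice, user, today):
    """Decide whether a user may read a slice on a given day (hypothetical rules)."""
    if data_slice.visibility is Visibility.PUBLIC or user == data_slice.owner:
        return True
    if data_slice.visibility is Visibility.EMBARGOED:
        return data_slice.embargo_until is not None and today >= data_slice.embargo_until
    return False


if __name__ == "__main__":
    embargoed = DataSlice("patent-spectra", Visibility.EMBARGOED, "alice", date(2016, 1, 1))
    print(can_read(embargoed, "bob", date(2015, 8, 17)))
```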
Output pipeline
Data layer: Compounds, Reactions, Spectra, Crystals, Documents
Data access layer: Compounds API, Reactions API, Spectra API, Crystals API, Documents API
User interface widgets layer: Compounds Widgets, Reactions Widgets, Spectra Widgets, Crystals Widgets, Documents Widgets
User interface layer (examples): Analytical Laboratory application; Electronic Laboratory Notebook; Chemical Inventory application; ChemSpider 2.0; paid 3rd party integrations (various platforms – SharePoint, Google, etc)
Cross-database links
Compounds domain
Data quality issue and CVSP
– Robochemistry
– Proliferation of errors in public and
private databases
• ChemSpider
• PubChem
• DrugBank
• KEGG
• ChEBI/ChEMBL
– Automated quality control system
Chemistry Validation and Standardization Platform
Reactions domain
Reactions domain
Analytical data domain
Crystallography domain
3D printable structures
New Repository Architecture
doi: 10.1007/s10822-014-9784-5
Thank you
Email: tkachenkov@rsc.org
Slides: http://www.slideshare.net/valerytkachenko16



Editor's Notes

  • #22: List of others probably isn’t completely comprehensive (solvent is free text!). 2 million spectra (from USPTO applications) have identified solvents
  • #23: Excluded results < 1 MHz and > 1 GHz (mixing up Hz and MHz is not uncommon!). Just to confuse things, this is from the grant data, while the previous data was from applications :-p
  • #27: Extracted NMR spectrum is truncated, as extraction keeps the valid spectrum only up to just before the error. US20140378645A1 0057: typo? US20140378687A1 0195: missing open bracket
  • #33: Change to add more databases; rearrange
  • #40: Information typically associated with reactions
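The exclusion described in note #23 (dropping spectrometer frequencies below 1 MHz or above 1 GHz) amounts to a simple plausibility filter. A minimal sketch follows; the function names and the sample records are made up for illustration, and values are assumed to be expressed in MHz.

```python
def is_plausible_frequency(frequency_mhz):
    """Keep frequencies from 1 MHz to 1 GHz (1000 MHz), both ends included."""
    return 1.0 <= frequency_mhz <= 1000.0


def filter_frequencies(records):
    """records: iterable of (year, frequency_mhz) pairs; returns the plausible ones."""
    return [(year, freq) for year, freq in records if is_plausible_frequency(freq)]


if __name__ == "__main__":
    sample = [
        (1985, 60.0),
        (2005, 400.0),
        (2010, 0.0004),         # e.g. a value given in Hz but read as MHz
        (2012, 400000000.0),    # e.g. a Hz figure labelled as MHz
    ]
    print(filter_frequencies(sample))
```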