Text mining to produce large chemistry datasets for community access

Valery Tkachenko (1), Aileen Day (1), Daniel Lowe (2), Igor Tetko (3), Carlos Coba (4), Antony Williams (5)
1 Royal Society of Chemistry, UK
2 NextMove Software, UK
3 HelmholtzZentrum München, Germany
4 Mestrelab Research, Santiago de Compostela, Spain
5 EPA, US
ACS Fall 2015
Boston, MA
August 17th 2015

Refs: we live in a linked world

Knowledge systems
Datastore
Raw data
Data in process
Data out process
UI, API, services, etc.

Further work – properties and spectra mining

Text mining of chemical documents
Term               Examples of text matched
FromLiterature     “lit.”
MeltingPoint       “mpt”, “melting point”, “m.p.”
Qualifier          “>”; “approximately”
Value              “75° C”, “200° F”, “one hundred degrees Celsius”
Range              “184-186° C”, “191.5 to 192.4° C”
MeasurementError   “50±° C”
OutcomeQualifier   “decomp.”, “with decomposition”, “subl.”
Pattern: FromLiterature? MeltingPoint Qualifier? (Value | Range | MeasurementError) OutcomeQualifier?
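The pattern above could be approximated with a regular expression. The following is only a minimal sketch, not the grammar actually used for the patent mining: the token patterns are built from the example column alone, all names (`PATTERN`, `find_melting_points`, the group names) are hypothetical, values written out in words ("one hundred degrees Celsius") are not handled, and a measurement error is assumed to carry a number after the ± sign.

```python
import re

# Hypothetical, simplified token patterns derived only from the examples above.
NUMBER = r"\d+(?:\.\d+)?"
UNIT = r"°\s?[CF]"
FROM_LIT = r"(?P<from_literature>lit\.)"
MELTING_POINT = r"(?P<melting_point>melting point|mpt|m\.p\.)"
QUALIFIER = r"(?P<qualifier>>|approximately)"
RANGE = rf"(?P<range>{NUMBER}\s?(?:-|to)\s?{NUMBER}\s?{UNIT})"
# Assumption: the error magnitude is present, e.g. "50±2° C".
ERROR = rf"(?P<measurement_error>{NUMBER}\s?±\s?{NUMBER}\s?{UNIT})"
VALUE = rf"(?P<value>{NUMBER}\s?{UNIT})"
OUTCOME = r"(?P<outcome_qualifier>decomp\.|with decomposition|subl\.)"

# FromLiterature? MeltingPoint Qualifier? (Range | MeasurementError | Value) OutcomeQualifier?
PATTERN = re.compile(
    rf"(?:{FROM_LIT}\s*)?{MELTING_POINT}[\s:]*(?:{QUALIFIER}\s*)?"
    rf"(?:{RANGE}|{ERROR}|{VALUE})(?:\s*\(?{OUTCOME}\)?)?",
    re.IGNORECASE,
)


def find_melting_points(text):
    """Return one dict per match, containing only the terms that were found."""
    return [
        {name: value for name, value in match.groupdict().items() if value is not None}
        for match in PATTERN.finditer(text)
    ]


if __name__ == "__main__":
    print(find_melting_points("White solid, m.p. 184-186° C (decomp.)"))
```

Range and MeasurementError are tried before Value so that the longer alternatives win when a plain value is a prefix of them.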

Why melting point (MP)?
Used for water solubility prediction
Yalkowsky equation:
log S = 0.5 - 0.01 (MP - 25) - log Kow
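As a worked illustration, the equation can be written as a one-line function. This is a sketch only: the function name is hypothetical, MP is assumed to be in °C, and the input numbers in the usage line are made up for the arithmetic rather than taken from any dataset.

```python
def yalkowsky_log_s(mp_celsius, log_kow):
    """Estimate log S from the melting point (°C) and log Kow.

    Implements log S = 0.5 - 0.01 (MP - 25) - log Kow as given on the slide.
    """
    return 0.5 - 0.01 * (mp_celsius - 25.0) - log_kow


if __name__ == "__main__":
    # Made-up inputs: MP = 125 °C, log Kow = 2.0
    # 0.5 - 0.01 * 100 - 2.0 = -2.5
    print(yalkowsky_log_s(125.0, 2.0))
```

Each 100 °C of melting point above 25 °C lowers the predicted log S by one unit, which is why errors in mined melting points matter for solubility prediction.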

Detecting suspicious melting points
• Value was greater than 500° C
• Value was a range wider than 50° C
• Value was a range where the second temperature was lower than the first temperature
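The three rules above can be sketched as a single check. The function name and signature are hypothetical, temperatures are assumed to be already converted to °C, and applying the 500° C rule to either end of a range is an assumption, since the slide does not say how ranges are treated for that rule.

```python
def is_suspicious_melting_point(low_c, high_c=None):
    """Flag a melting point (single value or range, in °C) using the slide's rules.

    low_c  -- the value, or the first temperature of a range
    high_c -- the second temperature of a range, or None for a single value
    """
    if high_c is None:
        return low_c > 500
    if low_c > 500 or high_c > 500:  # assumption: either end may trigger the rule
        return True
    if high_c - low_c > 50:          # range wider than 50 °C
        return True
    return high_c < low_c            # second temperature lower than the first


if __name__ == "__main__":
    print(is_suspicious_melting_point(184, 186))  # ordinary range
    print(is_suspicious_melting_point(186, 184))  # reversed range
```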

300k Melting Point Datasets
Dataset      Count
Bergström    277
Bradley      2886
OCHEM        22404
Enamine      21883
Patents      228079
Tetko et al., J. Cheminformatics, in preparation

Melting point model: data distribution

Some modeling highlights
LibSVM grid search was used to select parameters (ca. 1.5 years of CPU time for optimization)
Largest model: 668k descriptors (MolPrint), ~0.2 trillion entries
Biggest model: 618 Mb (Dragon descriptors)
Most accurate model: consensus (average of 5 models)
RMSE < 32 °C for the drug-like region, MP [50, 250] °C
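The consensus and the region-restricted RMSE can be illustrated with a few lines of plain Python. This is a sketch under assumptions, not the modeling code: the function names are hypothetical, the consensus is taken as an unweighted mean, the region is assumed to be defined on the observed melting point, and the numbers in the usage example are toy values.

```python
import math


def consensus(predictions_per_model):
    """Average the predictions of several models, compound by compound."""
    n_models = len(predictions_per_model)
    return [sum(column) / n_models for column in zip(*predictions_per_model)]


def rmse_in_range(predicted, observed, low=50.0, high=250.0):
    """RMSE over compounds whose observed value lies in [low, high] (°C)."""
    pairs = [(p, o) for p, o in zip(predicted, observed) if low <= o <= high]
    if not pairs:
        raise ValueError("no compounds in the requested range")
    return math.sqrt(sum((p - o) ** 2 for p, o in pairs) / len(pairs))


if __name__ == "__main__":
    # Toy predictions from five hypothetical models for three compounds.
    models = [
        [100.0, 200.0, 310.0],
        [110.0, 190.0, 300.0],
        [90.0, 210.0, 290.0],
        [100.0, 200.0, 305.0],
        [100.0, 200.0, 295.0],
    ]
    observed = [103.0, 204.0, 400.0]  # the third compound is outside [50, 250]
    print(rmse_in_range(consensus(models), observed))
```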

NMR data
• Extracted from 1976-2014 USPTO applications
Nucleus     Count
H           975543
C           56536
unknown*    44306
F           9429
P           3241
B           91
Si          62
Sn          22
Se          11
N           8
* unknown: the text starts off with "NMR:" followed by a peak list (no nucleus given)

NMR text mining
• We can find and index text spectra:
13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane),
30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19,
120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42,
123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)
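To show what indexing such a text spectrum can involve, here is a minimal parsing sketch. It is not the extraction pipeline used for the patents: the regular expressions and names (`HEADER`, `PEAK`, `parse_nmr`) are hypothetical, annotations with nested parentheses and shift ranges such as those in 1H spectra are not handled, and an annotation is attached only to the peak it directly follows, even where the author meant it for a whole group of peaks.

```python
import re

# Header such as "13C NMR (CDCl3, 100 MHz): δ = "
HEADER = re.compile(
    r"(?P<nucleus>\d+[A-Z][a-z]?)[ -]NMR\s*\((?P<conditions>[^)]*)\)\s*:\s*(?:δ\s*=\s*)?"
)
# A chemical shift, optionally followed by a parenthesised annotation.
PEAK = re.compile(r"(?P<shift>\d+(?:\.\d+)?)(?:\s*\((?P<annotation>[^()]*)\))?")


def parse_nmr(text):
    """Split a text spectrum into nucleus, conditions and (shift, annotation) pairs."""
    header = HEADER.search(text)
    if header is None:
        return None
    body = text[header.end():]
    peaks = [(float(m["shift"]), m["annotation"]) for m in PEAK.finditer(body)]
    conditions = [c.strip() for c in header["conditions"].split(",")]
    return {"nucleus": header["nucleus"], "conditions": conditions, "peaks": peaks}


if __name__ == "__main__":
    spectrum = (
        "13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), "
        "30.11 (CH, benzylic methane), 66.12 (CH2), 117.72, 118.19 (ArCH)"
    )
    print(parse_nmr(spectrum))
```

Because each annotation is consumed together with its parentheses, digits inside an annotation such as "CH3" are not mistaken for shifts.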

NMR extracted by year of publication
[Figure: cumulative distinct NMR extracted (0 to 2,500,000) versus year of publication (1976-2014); series: USPTO grants, USPTO applications]

NMR solvents
Solvent       Share
CDCl3         48.5%
DMSO-d6       38.3%
CD3OD         8.7%
D2O           1.1%
Acetone-d6    1.0%
MeOD          1.0%
Others        1.4%
Others: CD2Cl2, CD3CN-d3, C6D6, Pyridine-d5, THF-d8, CD3Cl, dimethylformamide-d7, d1-trifluoroacetic acid, methanol-d3, acetic acid-d4, toluene-d8, sulfuric acid-d2, 1,1,2,2-tetrachloroethane-d2, CD3OCD3, dioxane-d8, 1,2-dichloroethane-d4

1H-NMR frequency over time
[Figure: 1H-NMR frequency (0 to 450 MHz) versus year of patent filing (1976-2014)]

1H NMR (CDCl3, 400 MHz):
δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t,
1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz,
C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)

13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane),
30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19,
120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42,
123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)

Detecting suspicious NMR spectra
• Last peak of the NMR spectrum is unannotated and:
– All other peaks are annotated
– Spectrum has 1 peak and is proton or unknown NMR

> <SuspiciousValue>
true
> <Value>
1H-NMR (400 MHz, d6-Acetone): 11.8-10.8 (brs, 1H), 7.78
Comments: Only the labile proton is reported in the spectrum. The other aromatic and aliphatic protons are completely missing in the spectrum.

> <SuspiciousValue>
true
> <Value>
1H-NMR (400 MHz, CDCl3): 6.85 (1H, d, J=7.8 Hz), 6.10 (1H, dd, J=7.8 and 2.2 Hz), 6.06 (1H, d, J=2.2 Hz), 4.66 (1H, m), 3.75 (4H, br s), 3.40 (2H, s), 1.97
Comments: There are only 11 protons reported in the spectrum whilst the molecule contains more than 50 protons.
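The heuristic for suspicious spectra, applied to the two flagged examples above, might look like the sketch below. It reflects one reading of the rules (the two sub-conditions are treated as alternatives), and the helper names are hypothetical. A peak counts as "annotated" here simply if it carries a parenthesised note, which is an assumption about how annotation is detected.

```python
def split_peaks(peak_text):
    """Split a peak list on commas that are not inside parentheses."""
    peaks, depth, current = [], 0, []
    for ch in peak_text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            peaks.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        peaks.append(tail)
    return peaks


def is_annotated(peak):
    """Assumption: an annotated peak carries a parenthesised note."""
    return "(" in peak


def is_suspicious_spectrum(peak_text, nucleus="H"):
    """Flag spectra whose last peak is unannotated, following the rules above."""
    peaks = split_peaks(peak_text)
    if not peaks or is_annotated(peaks[-1]):
        return False
    if len(peaks) == 1:
        # A lone unannotated peak is only flagged for proton or unknown spectra.
        return nucleus in ("H", "unknown")
    return all(is_annotated(p) for p in peaks[:-1])


if __name__ == "__main__":
    print(is_suspicious_spectrum("11.8-10.8 (brs, 1H), 7.78"))
```

In both flagged examples the final, unannotated shift suggests that the peak list was cut off, which is consistent with the reviewers' comments about missing protons.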

Synthetic chemistry article
Compounds
Reaction
Analytical Data
Text and References

RSC Databases
RSC Compounds
RSC Reactions
RSC Spectra
RSC Crystals
RSC Polymers
RSC Materials
RSC Assays
RSC Algorithms
RSC Models
…and on…

Input pipeline
• Inputs: Web UI for unified depositions; DropBox, Google Drive, SkyDrive, etc.; ELNs, templated data input; documents; API, FTP, etc.; experiments; research; etc.
• Deposition Gateway
• Data stages: raw data, validated data
• Modules: Compounds Module, Spectra Module, Reactions Module, Materials Module, Textmining Module
• Staging databases: Compounds, Reactions, Spectra, Crystals, Materials
• All databases are sliced by data sources/data collections and have a simple security model where each data slice/source is private, public or embargoed
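The per-slice security model mentioned above (each data slice/source is private, public or embargoed) could be represented along the following lines. Everything in this sketch is hypothetical: the class and field names, the idea of a single owner per slice, and in particular the assumption that an embargoed slice becomes readable by everyone once an embargo date has passed. The slides do not describe the repository's actual access rules.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Visibility(Enum):
    PRIVATE = "private"
    PUBLIC = "public"
    EMBARGOED = "embargoed"


@dataclass(frozen=True)
class DataSlice:
    source: str                            # data source / collection name
    owner: str                             # hypothetical: one owner per slice
    visibility: Visibility
    embargo_until: Optional[date] = None   # assumed meaning: public from this date


def can_read(slice_, user, today):
    """Decide whether `user` may read a slice on the given day."""
    if user == slice_.owner:
        return True
    if slice_.visibility is Visibility.PUBLIC:
        return True
    if slice_.visibility is Visibility.EMBARGOED:
        return slice_.embargo_until is not None and today >= slice_.embargo_until
    return False


if __name__ == "__main__":
    patents = DataSlice("patents", "alice", Visibility.EMBARGOED, date(2016, 1, 1))
    print(can_read(patents, "bob", date(2015, 8, 17)))
```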

Output pipeline
• Data layer: Compounds, Reactions, Spectra, Crystals, Documents
• Data access layer: Compounds API, Reactions API, Spectra API, Crystals API, Documents API
• User interface widgets layer: Compounds Widgets, Reactions Widgets, Spectra Widgets, Crystals Widgets, Documents Widgets
• User interface layer (examples): Analytical Laboratory application; Electronic Laboratory Notebook; Chemical Inventory application; ChemSpider 2.0; paid 3rd party integrations (various platforms: SharePoint, Google, etc.)

Data quality issue and CVSP
– Robochemistry
– Proliferation of errors in public and private databases
• ChemSpider
• PubChem
• DrugBank
• KEGG
• ChEBI/ChEMBL
– Automated quality control system

Chemistry Validation and Standardization Platform

New Repository Architecture
doi: 10.1007/s10822-014-9784-5

Thank you
Email: tkachenkov@rsc.org
Slides: http://www.slideshare.net/valerytkachenko16
