Chemical mixtures: File format, open source tools, example data, and mixtures InChI derivative
Alex M. Clark
COLLABORATIVE DRUG DISCOVERY - MIXTURES
Premise
◈ Representing specific chemicals routine since 1980s
▷ e.g. the MDL Molfile CTAB for organics
◈ Most real world encounters are mixtures
◈ No industry standard format...
▷ ... even though it's rather easy
▷ interchange is done with text
Goal
◈ Grant-supported project to:
▷ define a simple format (Mixfile)
▷ create open source tools for editing & manipulating
▷ bootstrap content via text mining
▷ work with IUPAC for interoperability (MInChI)
◈ Results published in Journal of Cheminformatics 11:33
Common Lab Mixtures
Examples shown: impurity, isomers, simple solution, multisolvent solution
Real World Mixtures
Examples shown: drug tablet, toothpaste
File Format
◈ Mixfile is to mixtures as Molfile is to molecules
◈ Components:
▷ structure & name
▷ concentration
▷ identifiers
◈ Hierarchy
▷ captures nuances of mixing
▷ relative concentrations/uncertainty
Representation
◈ Serialised using JSON: natural data structure, human-readable, concise, easy to code on any platform
{
  "mixfileVersion": 0.01,
  "name": "Lithium diisopropylamide solution",
  "contents":
  [
    {
      "name": "Lithium diisopropylamide",
      "molfile": "\nGenerated by WebMolKit\n\n  8  6  0  0  0  0  0  0  0  0999 V2000\n    0.2500    1.5000    0.0000 N   0  5  0  0  0  0  0  0  0  0  0  0\n   -1.0490    0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -2.3481    1.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -1.0490   -0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    1.5490    0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    2.8481    1.5000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    1.5490   -0.7500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.2500    3.0000    0.0000 Li  0  3  0  0  0  0  0  0  0  0  0  0\n  1  2  1  0  0  0  0\n  2  3  1  0  0  0  0\n  2  4  1  0  0  0  0\n  1  5  1  0  0  0  0\n  5  6  1  0  0  0  0\n  5  7  1  0  0  0  0\nM  CHG  2   1  -1   8   1\nM  END",
      "quantity": 1,
      "units": "mol/L",
      "inchi": "InChI=1S/C6H14N.Li/c1-5(2)7-6(3)4;/h5-6H,1-4H3;/q-1;+1",
      "inchiKey": "InChIKey=ZCSHNCUQKCANBX-UHFFFAOYSA-N"
    },
    {
      "contents":
      [
        {
          "name": "THF",
          "molfile": "\nGenerated by WebMolKit\n\n  5  5  0  0  0  0  0  0  0  0999 V2000\n   -0.2500    3.7500    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0\n   -1.4600    2.8700    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -1.0000    1.4400    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.9600    2.8700    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.5000    1.4400    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n  3  2  1  0  0  0  0\n  2  1  1  0  0  0  0\n  1  4  1  0  0  0  0\n  4  5  1  0  0  0  0\n  5  3  1  0  0  0  0\nM  END",
          "ratio":
          [
            1,
            8
          ],
          "inchi": "InChI=1S/C4H8O/c1-2-4-5-3-1/h1-4H2",
          "inchiKey": "InChIKey=WYURNTSHIVDZCO-UHFFFAOYSA-N"
        },
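As a rough illustration of how the hierarchy might be consumed, the sketch below loads a cut-down Mixfile-style document (structures omitted) and flattens it into an indented outline. The field names `name`, `contents`, `quantity`, `units` and `ratio` are taken from the example; the `outline` helper and the abbreviated document are hypothetical and are not part of the published tools.

```python
import json

# Cut-down Mixfile-style document: structures omitted, hierarchy kept.
MIXFILE_TEXT = """
{
  "mixfileVersion": 0.01,
  "name": "Lithium diisopropylamide solution",
  "contents": [
    {"name": "Lithium diisopropylamide", "quantity": 1, "units": "mol/L"},
    {"contents": [
      {"name": "THF", "ratio": [1, 8]}
    ]}
  ]
}
"""

def outline(component, depth=0):
    """Return one indented text line per component, depth-first."""
    label = component.get("name", "(unnamed group)")
    if "quantity" in component:
        label += f" {component['quantity']} {component.get('units', '')}".rstrip()
    if "ratio" in component:
        label += " ratio " + "/".join(str(part) for part in component["ratio"])
    lines = ["  " * depth + label]
    # Nested "contents" arrays carry the mixing hierarchy.
    for child in component.get("contents", []):
        lines.extend(outline(child, depth + 1))
    return lines

mixture = json.loads(MIXFILE_TEXT)
print("\n".join(outline(mixture)))
```

Because every level uses the same `contents` key, a single recursive function is enough to visit the whole mixture, however deeply it is nested.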
Tooling
◈ Editor & libraries written in TypeScript, cheminformatics library: WebMolKit
◈ Create, modify, view & render Mixfiles
◈ Open source: https://github.com/cdd/mixtures
Mixture Data
◈ Lots of mixtures, but mostly text
◈ Active ingredient often separated
▷ purity
▷ solvent/conc.
◈ Semi-structured databases also
Text Extraction
◈ Many common patterns in text descriptions
1-Aza-12-crown-4 ≥97.0%
Trimethyl(trifluoromethyl)silane solution 2 M in THF
Brute Force
◈ Compose a set of regular expression rules
◈ Remove brand names:
{"effect": "remove", "regex": "(.*)A[cC][rR][oO][sS] Organics?™?(.*?)$"},
{"effect": "remove", "regex": "(.*)AcroSeal™?(.*?)$"},
{"effect": "remove", "regex": "(.*)Alfa Aesar™?(.*?)$"},
{"effect": "remove", "regex": "(.*)BioReagents™?(.*?)$"},
{"effect": "remove", "regex": "(.*)Burdick & Jackson™?(.*?)$"},
◈ Remove suffixes:
{"effect": "remove", "regex": "(.*)(Technical)$"},
{"effect": "remove", "regex": "(.*)(Certified)$"},
{"effect": "remove", "regex": "(.*), pure$"},
{"effect": "remove", "regex": "(.*), for analysis$"},
{"effect": "remove", "regex": "(.*), extra pure$"},
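A minimal sketch of how "remove" rules like these could be applied, assuming that the text kept after a match is the first capture group (the slides do not spell out the exact semantics). The regular expressions are copied from the rules listed here; the function name `strip_superfluous` and the trimming of trailing commas are illustrative choices only.

```python
import re

# A few "remove" regexes from the slides; one brand rule stands in for the full list.
REMOVE_RULES = [
    r"(.*)Alfa Aesar™?(.*?)$",
    r"(.*), pure$",
    r"(.*), for analysis$",
    r"(.*), extra pure$",
]

def strip_superfluous(text):
    """Apply the 'remove' rules repeatedly until none of them matches."""
    changed = True
    while changed:
        changed = False
        for pattern in REMOVE_RULES:
            match = re.match(pattern, text)
            if match:
                # Assumption: keep only the first capture group,
                # then tidy any separator left dangling at the end.
                text = match.group(1).rstrip(" ,")
                changed = True
                break
    return text

print(strip_superfluous("Acetone, for analysis, Alfa Aesar™"))
```

Each successful match strictly shortens the text, so the loop always terminates.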
Brute Force (ctd)
◈ Identify branches (with quantities):
{"effect": "branch", "regex": "(.*) (over (molecular sieve .*))", "substance": "$2"},
{"effect": "branch", "regex": "(.*) (\\d[\\d.]*) ? mM in (.*)",
"quantity": "$2", "units": "mmol/L", "substance": "$3"},
{"effect": "branch", "regex": "(.*) ~?\\d[\\d.]*\\s?% in (.*) \\((~?)(\\d[\\d.]*) M\\)$",
"quantity": "$4", "units": "mol/L", "relation": "$3", "substance": "$2"},
{"effect": "branch", "regex": "(.*), (\\d[\\d.]*)\\s?M [Ss]olution in (.*)",
"quantity": "$2", "units": "mol/L", "substance": "$3"},
◈ Concentration suffixes:
{"effect": "conc", "regex": "(.*) (\\d[\\d.]*)%, ee \\d[\\d.]*.%$",
"quantity": "$2", "units": "%", "relation": ""},
{"effect": "conc", "regex": "(.*) ≥ ?(\\d[\\d.]*)%$",
"quantity": "$2", "units": "%", "relation": ">="},
{"effect": "conc", "regex": "(.*) [Ss]olution (\\d[\\d.]*) ?M$",
"quantity": "$2", "units": "mol/L"},
Implementation
◈ Apply regular expression rules greedily/recursively
▷ package up into hierarchical components
▷ substances referenced by name
◈ Use OPSIN for name-to-SMILES, RDKit to depict
◈ Use lookup database for:
▷ common exceptions (e.g. trivial names)
▷ named mixtures (e.g. hexanes)
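The recursive packaging step can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the first rule is a hypothetical pattern modelled on the "mM in" branch rule, the second is the "≥" concentration rule written as a Python raw string, and a "branch" is assumed to mean that the quantity belongs to the leading component while the named substance becomes a sibling. Name lookup and structure generation (OPSIN, RDKit) are left out.

```python
import re

RULES = [
    # Hypothetical rule, modelled on the "mM in" branch rule from the slides.
    {"effect": "branch", "regex": r"(.*) [Ss]olution (\d[\d.]*) ?M in (.*)",
     "quantity": 2, "units": "mol/L", "substance": 3},
    # Concentration-suffix rule from the slides.
    {"effect": "conc", "regex": r"(.*) ≥ ?(\d[\d.]*)%$",
     "quantity": 2, "units": "%", "relation": ">="},
]

def parse(text):
    """Turn a one-line description into a nested, Mixfile-like dictionary."""
    for rule in RULES:
        match = re.fullmatch(rule["regex"], text)
        if match is None:
            continue
        # Group 1 is whatever remains; it may match further rules.
        component = parse(match.group(1))
        component["quantity"] = float(match.group(rule["quantity"]))
        component["units"] = rule["units"]
        if "relation" in rule:
            component["relation"] = rule["relation"]
        if rule["effect"] == "branch":
            # Assumption: the named substance is parsed separately as a sibling.
            sibling = parse(match.group(rule["substance"]))
            return {"contents": [component, sibling]}
        return component
    # No rule applies: treat the text as a substance name.
    return {"name": text}

print(parse("Trimethyl(trifluoromethyl)silane solution 2 M in THF"))
print(parse("1-Aza-12-crown-4 ≥97.0%"))
```

Substances stay as plain names at this stage, which matches the idea of referencing them by name and resolving structures afterwards.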
Results
◈ Aggregated thousands of single-line text entries, mostly from online catalogs
◈ 5600 mixtures verified & included on GitHub using the Mixfile format
◈ Success rate ~95% with:
▷ 250 regular expression based rules
▷ 280 name lookup mappings
◈ All of them with calculated MInChI strings...
MInChI
◈ Mixtures InChI is a composite notation...
▷ Mixfile encapsulates Molfiles + extra metadata
▷ MInChI encapsulates InChI + extra metadata
◈ Distilled down to the basics:
▷ canonical structures
▷ simplified concentration
▷ nesting hierarchy
◈ Useful to accompany primary originated data
Dissection
◈ Mixfile to MInChI: designed for one-way conversion (verbose-to-concise)
◈ Reduced information content of MInChI adds value for certain use cases
◈ Proof of concept implemented
Layers, left to right: header, structure identifier, indexing, concentration
(a) MInChI=0.00.1S/C7H8N4O2/c1-10-3-8-5-4(10)6(12)11(2)7(13)9-5/h3H,1-2H3,(H,9,13)/n1/g99pp0
(b) MInChI=0.00.1S/BBr3/c2-1(3)4&CH2Cl2/c2-1-3/h1H2/n{1&2}/g{1mr0&}
(c) MInChI=0.00.1S/C4H8O/c1-2-4-5-3-1/h1-4H2&C6H12/c1-6-4-2-3-5-6/h6H,2-5H2,1H3&C6H14/c1-3-5-6-4-2/h3-6H2,1-2H3&C6H14/c1-4-5-6(2)3/h6H,4-5H2,1-3H3&C6H14/c1-4-6(3)5-2/h6H,4-5H2,1-3H3&C6H14N.Li/c1-5(2)7-6(3)4;/h5-6H,1-4H3;/q-1;+1/n{6&{1&{3&2&4&5}}}/g{1mr0&{1vp0&{5:7vf-1&1:2vf-1&1:5vf-2&1:5vf-2}7vp0}}
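To make the layering concrete, here is a small sketch that splits a MInChI string into the four parts named on the slide. It assumes that the `/n` and `/g` markers occur only once, after the structure portion, and that `&` separates the component structures; the function name `dissect_minchi` is hypothetical and this is not the reference implementation.

```python
def dissect_minchi(minchi):
    """Split a MInChI string into header, components, indexing and concentration."""
    # Work from the right: the concentration layer follows the last "/g",
    # the indexing layer follows the last "/n" before it.
    rest, _, concentration = minchi.rpartition("/g")
    rest, _, indexing = rest.rpartition("/n")
    # The header is everything before the first "/".
    header, _, structures = rest.partition("/")
    return {
        "header": header,
        "components": structures.split("&"),
        "indexing": indexing,
        "concentration": concentration,
    }

example_b = "MInChI=0.00.1S/BBr3/c2-1(3)4&CH2Cl2/c2-1-3/h1H2/n{1&2}/g{1mr0&}"
for key, value in dissect_minchi(example_b).items():
    print(key, "=", value)
```

The nested braces in the indexing and concentration layers mirror the Mixfile hierarchy; parsing them further is left out of this sketch.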
Future Work: ELNs
◈ Embed convenient UI within web ELN
▷ sketch out mixture definitions for procedure writeup
▷ quick lookup databases of known mixtures
▷ search content using precise mixture-aware queries
◈ Vendor integration
▷ work with vendors to mark up their products
▷ scientists can use product code to embed mixture
Future Work: Text Extraction
◈ Initial proof of concept is effective but crude
◈ Planning a much more sophisticated iterative-learning strategy: start by formulating input for a recurrent deep neural network
Trimethyl(trifluoromethyl)silane solution 2 M in THF
■ component ■ quantity ■ branch ■ superfluous
... -5 -4 -3 -2 -1 0 +1 +2 +3 +4 +5 ...
[... *, *, *, *, *, T, r, i, m, e, t ...] = ■ component
[... *, *, *, *, T, r, i, m, e, t, h ...] = ■ component
[... *, *, *, T, r, i, m, e, t, h, y ...] = ■ component
[... *, *, T, r, i, m, e, t, h, y, l ...] = ■ component
...
[... l, u, t, i, o, n, ␣, 2, M, ␣, i ...] = ■ superfluous
[... u, t, i, o, n, ␣, 2, M, ␣, i, n ...] = ■ superfluous
[... t, i, o, n, ␣, 2, M, ␣, i, n, ␣ ...] = ■ quantity
[... i, o, n, ␣, 2, M, ␣, i, n, ␣, T ...] = ■ quantity
... etc. ...
◈ Learning from input/output data rather than curated rules: more scalable
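The windowed input shown in the figure can be produced with a few lines of code. The sketch below assumes a fixed radius of 5 characters either side of the centre and `*` as the padding symbol, as drawn on the slide; the function name `char_windows` is hypothetical, and the per-character labels (component, quantity, branch, superfluous) would come from annotated training data that is not reproduced here.

```python
def char_windows(text, radius=5, pad="*"):
    """Return one window of 2*radius+1 characters for each character of text."""
    # Pad both ends so that the first and last characters also get full windows.
    padded = pad * radius + text + pad * radius
    width = 2 * radius + 1
    return [padded[i:i + width] for i in range(len(text))]

text = "Trimethyl(trifluoromethyl)silane solution 2 M in THF"
windows = char_windows(text)
for window in windows[:4]:
    print(list(window))
```

Each window is centred on the character it describes, so a classifier sees the same amount of left and right context at every position.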
Future Work: Molecules
◈ Structures currently limited to ≈ Molfile/InChI subset
◈ Want to extend to:
▷ inorganics (non-integral bond orders)
▷ polymers (repeat units & distributions)
▷ variations (Markush structures, partial definitions)
▷ large molecules (proteins, DNA)
▷ pseudomolecules (ceramics, alloys)
◈ More formal definition of ID codes (CAS, PubChem, etc.)
Acknowledgments
◈ Leah McEwen
▷ and the rest of the IUPAC/InChI working group
◈ Hande Küçük McGinty (and iCorps)
▷ Collaborative Drug Discovery
Funding
NIH SBIR
