Great promise of navigating the
          internet using InChIs

                     Antony J Williams
                 ACS San Diego March 2012
Openness and Quality Issues
Williams and Ekins, DDT, 16: 747-750 (2011)

              Science Translational Medicine 2011
Warning…
 This talk is not about Quality…it’s about quantity
                  Drugbank was here
Data quality is a known issue
We ALL have issues!!!
It’s about what’s out there…
How to Link it…
And getting out of overwhelm…
So what is Yohimbine?
Of course it is out there…




      Drugbox: 3001/5080 with InChIs

      Chembox:5436/7690 with InChIs
Tell me more…
   Where can I find the molfile for Yohimbine?
   Papers/Patents about Yohimbine?
   What are the side effects of Yohimbine?
   Where can I order Yohimbine?
   What are the physicochemical properties?
   Metabolic pathways?
   Different synonyms of Yohimbine?
   Synthesis of Yohimbine?
   Etc….
Quantity!
Yohimbine on ChemSpider..Quality?
How do we build it?
 We deal in Molfiles or SDF files – with coordinates

 Deposit anything that has an InChI – we support
  what InChI can handle, good and bad

 Standardization based on “InChI standardization”

 InChIs aggregate (certain) tautomers

 We link out to external sites using their IDs
Downsides of InChI
 InChI was a moving target (multiple versions) but
  overall worked as planned.

 Good for small molecules – but no polymers,
  issues with inorganics, organometallics, imperfect
  stereochemistry. ChemSpider is “small molecules”

 InChI used as the “deduplicator” – FIRST version
  of a compound into the database becomes THE
  structure to deduplicate against…
Side Effects of InChI Usage
SMILES by comparison…
Side Effects of InChI Usage
Standardization Issues
Depiction based on molfile
Downsides of Overall Approach
 Meshing data together based on InChIs worked
  for simple molecules

 2D layout errors inherited or limited by algorithm

 Complex molecules that are meant to be the
  same thing were NOT deduplicated. Compounds
  differing by one stereocenter, named the same,
  meant to be the same, are not the same
Yohimbine on ChemSpider..Quality?
So where can we travel???
InChI String Search via Google
Give me InChIKeys…
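InChIKeys suit web search because they are short, fixed-length tokens, whereas a full InChI string contains characters that search engines tend to split on. A minimal sketch of building an exact-phrase query URL follows. The helper name `google_search_url` is an assumption made for illustration. The key is the standard InChIKey for ethanol.

```python
from urllib.parse import quote_plus


def google_search_url(identifier: str) -> str:
    """Build a web search URL that looks for the identifier as an exact phrase."""
    # Double quotes ask the engine to match the whole token, not fragments of it.
    return "https://www.google.com/search?q=" + quote_plus(f'"{identifier}"')


url = google_search_url("LFQSCWFLJHTTHZ-UHFFFAOYSA-N")
print(url)
```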
And where can we travel???
 ChemSpider – Aggregator

 BRENDA – Enzymes

 Wikipedia – Encyclopedia

 ChEMBL – Pharmacology

 ChEBI – Curated Chemicals

 DrugBank – Drug-Drug Target
Recognizing Compound Dilution
 So much chemistry on the web….

 And so much dilution – “structural uniqueness”
  versus “accidental ambiguity”

 InChI as an easy skeleton search
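The skeleton search can be sketched with InChIKeys alone: the first 14-character block of a standard InChIKey hashes the formula and connectivity, while the later blocks carry stereochemistry and other layers. Grouping by the first block therefore collects every record that shares a skeleton. The function name `skeleton` and the record labels below are assumptions. The alanine keys are quoted from memory and should be checked against a database before reuse.

```python
from collections import defaultdict


def skeleton(inchikey: str) -> str:
    """Return the first (connectivity) block of a standard InChIKey."""
    first, _, _ = inchikey.partition("-")
    if len(first) != 14:
        raise ValueError(f"not an InChIKey: {inchikey!r}")
    return first


hits = {
    "L-alanine": "QNAYBMKLOCPYGJ-REOHZSFHSA-N",
    "D-alanine": "QNAYBMKLOCPYGJ-UWTATZPHSA-N",
    "alanine (no stereo)": "QNAYBMKLOCPYGJ-UHFFFAOYSA-N",
    "ethanol": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
}

# Full-molecule search matches the whole key; skeleton search matches block one.
by_skeleton = defaultdict(list)
for name, key in hits.items():
    by_skeleton[skeleton(key)].append(name)

for block, names in by_skeleton.items():
    print(block, names)
```

All three alanine records fall under one skeleton, which is the "dilution" the slide refers to: one intended compound spread over several distinct full keys.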
Vancomycin – Search the Internet
Vancomycin
Search Molecular SKELETON
Search Full Molecule
Full Skeleton Search
All aggregators suffer dilution!
Many Problems Can be Solved…
 Clean up databases – structure validation,
  structure standardization

 Warn about
   Valency, charge balance, depiction issues,
    bond types, absent stereo, and another 100
    rules (or so…)

 Standardize
   Agree community rules to “Standardize”
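Two of the warnings listed above, valency and charge balance, can be sketched on a toy structure representation. Everything here is an assumption made for illustration: the `MAX_VALENCE` table covers only a few neutral atoms, the `validate` function and its input format are hypothetical, and a real rule set would be far larger.

```python
# Typical maximum valences for a few neutral atoms (deliberately tiny table).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "Cl": 1}


def validate(atoms, bonds, expected_charge=0):
    """Return a list of warnings.

    atoms: list of (element, formal_charge)
    bonds: list of (atom_index, atom_index, bond_order)
    """
    warnings = []
    bond_sum = [0] * len(atoms)
    for i, j, order in bonds:
        bond_sum[i] += order
        bond_sum[j] += order

    for idx, (element, charge) in enumerate(atoms):
        limit = MAX_VALENCE.get(element)
        if limit is None:
            warnings.append(f"atom {idx}: no valence rule for {element}")
        elif charge == 0 and bond_sum[idx] > limit:
            # Only neutral atoms are checked; charged atoms need their own rules.
            warnings.append(
                f"atom {idx}: {element} has valence {bond_sum[idx]} > {limit}"
            )

    total = sum(charge for _, charge in atoms)
    if total != expected_charge:
        warnings.append(f"net charge {total}, expected {expected_charge}")
    return warnings


# A carbon drawn with two double bonds and one single bond (five-valent).
five_valent = validate(
    [("C", 0), ("O", 0), ("O", 0), ("C", 0)],
    [(0, 1, 2), (0, 2, 2), (0, 3, 1)],
)
# A lone cation with no counterion.
unbalanced = validate([("N", 1)], [])
# Heavy-atom skeleton of ethanol: nothing to report.
clean = validate([("C", 0), ("C", 0), ("O", 0)], [(0, 1, 1), (1, 2, 1)])
print(five_valent, unbalanced, clean)
```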
Structure Validation
Structure Validation - Fixed
What needs to happen?
 If we could validate
    Catch errors in databases (and clean)
    Proactively catch errors in publications/patents
    Reduce junk in the ether – improve QUALITY!

 If we standardized
    Interlinking should improve
NPC Browser Set
Download, Deposit, Reprocess
Substructure    # of Hits   # of Correct Hits   No stereochemistry   Incomplete stereochemistry   Complete but incorrect stereochemistry
Gonane          34          5                   8                    21                           0
Gon-4-ene       55          12                  3                    33                           7
Gon-1,4-diene   60          17                  10                   23                           10
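The table is easier to read as proportions. The sketch below recomputes the share of correct hits per substructure from the counts above and checks that the four categories add up to the hit total. The names `rows` and `correct_share` are assumptions introduced for the example.

```python
# (hits, correct, no stereo, incomplete stereo, complete but incorrect stereo)
rows = {
    "Gonane": (34, 5, 8, 21, 0),
    "Gon-4-ene": (55, 12, 3, 33, 7),
    "Gon-1,4-diene": (60, 17, 10, 23, 10),
}


def correct_share(name: str) -> float:
    """Percentage of hits with fully correct structures, to one decimal place."""
    hits, correct, none, incomplete, wrong = rows[name]
    # The four categories should partition the hits.
    assert correct + none + incomplete + wrong == hits
    return round(100 * correct / hits, 1)


for name in rows:
    print(f"{name:14s} {correct_share(name):5.1f}% correct")
```

In every class, fewer than a third of the hits are fully correct.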
Structure-Name Validation
[Structure drawings labelled: Taxol, Choladine, Cholane, Chlotrimazole]
Standardize




 Use the SRS as a guidance document for
  standardization
 Adjust as necessary to our needs
Nitro groups
Salt and Ionic Bonds
Ammonium salts
Millions of structures? Lots of Issues
ChemSpider Standardization
 Entire ChemSpider database will be standardized
  using modified FDA rule set

 Original Molfiles will be standardized and all
  properties (predicted properties, SMILES, InChIs,
  Names) will all be regenerated

 Standardization procedures automatically applied
  to all future depositions
Identifier Dictionaries
 Reciprocal curation processes…share curation
  with each other.

 If a database has a compound already then use
  InChIKeys to match “suggested” validation
  against the compound.

 A series of “added” and “removed” synonyms
  against InChIKeys for matching.
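One way to read the exchange described above is as a feed of "added" and "removed" synonym records keyed by InChIKey, which the receiving database turns into suggestions rather than applying blindly. The sketch below assumes that reading. The function `suggest_changes`, the feed tuple layout and the sample synonyms are hypothetical. The keys are the standard InChIKeys for ethanol and water.

```python
def suggest_changes(synonyms, feed):
    """Turn a partner's curation feed into suggestions for our own database.

    synonyms: {inchikey: set of names we currently hold}
    feed:     list of (inchikey, action, name), action is "added" or "removed"
    """
    suggestions = []
    for inchikey, action, name in feed:
        current = synonyms.get(inchikey)
        if current is None:
            continue  # we do not hold this compound, nothing to match against
        if action == "added" and name not in current:
            suggestions.append((inchikey, "add", name))
        elif action == "removed" and name in current:
            suggestions.append((inchikey, "remove", name))
    return suggestions


ours = {"LFQSCWFLJHTTHZ-UHFFFAOYSA-N": {"ethanol", "methanol"}}
partner_feed = [
    ("LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "added", "ethyl alcohol"),
    ("LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "removed", "methanol"),
    ("XLYOFNOQVPJJNP-UHFFFAOYSA-N", "added", "water"),  # not in our database
]
suggestions = suggest_changes(ours, partner_feed)
print(suggestions)
```

Returning suggestions keeps a curator in the loop, in line with the “suggested” validation wording on the slide.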
Proof of Concept Data Curation Sharing
Who wants to work with us?
Structure Validation using feed
 Look for approved synonyms

 Compare feed InChIKey with database InChIKey

 If different, flag for inspection
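The three steps above can be sketched as a lookup and comparison. This is an illustration under stated assumptions: the function `flag_mismatches`, the dictionary layout and the deliberately wrong "water" entry are all hypothetical. The keys are the standard InChIKeys for ethanol, water and methane.

```python
def flag_mismatches(database, feed):
    """Compare a feed of approved synonyms with the keys our database holds.

    database: {lower-cased synonym: inchikey}
    feed:     iterable of (approved_synonym, inchikey)
    Returns (synonym, database_key, feed_key) for every disagreement.
    """
    flagged = []
    for synonym, feed_key in feed:
        db_key = database.get(synonym.lower())   # step 1: look for the synonym
        if db_key is None:
            continue                             # synonym unknown to us
        if db_key != feed_key:                   # step 2: compare InChIKeys
            flagged.append((synonym, db_key, feed_key))  # step 3: flag
    return flagged


database = {
    "ethanol": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
    "water": "VNWKTOKETHGBQD-UHFFFAOYSA-N",  # wrong on purpose: methane's key
}
feed = [
    ("Ethanol", "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"),
    ("Water", "XLYOFNOQVPJJNP-UHFFFAOYSA-N"),
]
flagged = flag_mismatches(database, feed)
print(flagged)
```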
It is so difficult to navigate…
 IP?
 What’s the structure?
 Are they in our file?
 What’s similar?
 What’s the target?
 Pharmacology data?
 Known Pathways?
 Competitors?
 Working On Now?
 Connections to disease?
 Expressed in right cell type?
Open PHACTS Project
 Develop a set of robust standards…
 Implement the standards in a semantic integration hub
 Deliver services to support drug discovery programs in
  pharma and public domain
 22 partners, 8 pharmaceutical companies, 3 biotechs
 36-month project

  Guiding principle is open access, open usage, open source:
  key to standards adoption
Chemistry in Open PHACTS
 Selected data slices of ChemSpider carrying
  pharmacological links into the “linked data cache”

 ChemSpiderIDs and InChIs/InChIKeys will be in
  Open PHACTS and available for linking

 A structure ID standard to enable further linking
  across the semantic web of science
ChemSpider and InChI
Internet Data

Small organic molecules      Commercial Software
Undefined materials          Pre-competitive Data
Organometallics              Open Science
Nanomaterials                Open Data
Polymers                     Publishers
Minerals                     Educators
Particle bound               Open Databases
Links to Biologicals         Chemical Vendors
The great promise should be obvious
 InChIs are here to stay
 They will evolve, they will encompass, we will
  adopt and adapt
 Public and private databases will federate &
  build a linked environment of validated data!
 Data validation and standardization are
  needed
 Open Data will continue to proliferate
 InChIs are in the “Semantic Web” already
If InChI never existed or went away..
 ChemSpider would never have been built

 Database linking would suffer dramatically

 The web would not be “structure searchable”

 Cheminformatics tools would likely not be linking
  to public domain databases in the same way

 And we would not have the pleasure of today…
Acknowledgments
 The inspiration of the InChI Masters – Steve H.,
  Steve S., Alan, Dmitrii, Igor

 IUPAC, NIST, all adopters, supporters,
  challengers and users

 The InChI Trust and its supporters for funding
  continued development

 Al Gore – enabling us to search InChIs on the web
Steve Heller
Thank you

Email: williamsa@rsc.org
Twitter: ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams

