ChemSpider – An Online Database and Registration System Linking the Web. Antony Williams and Valery Tkachenko. EBI Chemical Registry Systems Workshop, October 2011.
www.chemspider.com
ChemSpider: >26 million unique molecules from >400 sources; .NET, SQL Server and GGA Indigo toolkit; multiple Open Source components (Jmol, JSpecView, Balloon, OpenBabel, MediaWiki); slices of data are Open but the entire data collection is not Open; crowdsourced depositions and curations; uses InChIs for navigating and linking the web.
Vancomycin
Vancomycin: Search Molecular SKELETON; Search Full Molecule
Full Skeleton Search: 104 Hits
Full Molecule Search: 4 Hits
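The slides do not say how the two searches are implemented. One plausible way to get a skeleton search versus a full-molecule search is sketched below, assuming records are indexed by Standard InChIKey: the first 14-character block of the key hashes the connectivity, so matching on that block alone ignores stereochemistry and the other remaining layers. The record IDs and InChIKeys are made-up placeholders, not real ChemSpider data.

```python
from collections import defaultdict


def skeleton_block(inchikey: str) -> str:
    """Return the first (connectivity) block of a Standard InChIKey."""
    return inchikey.split("-")[0]


def build_indexes(records):
    """Index record IDs by full InChIKey and by skeleton block."""
    full = defaultdict(list)
    skeleton = defaultdict(list)
    for record_id, key in records:
        full[key].append(record_id)
        skeleton[skeleton_block(key)].append(record_id)
    return full, skeleton


# Hypothetical records: three share a skeleton, one is unrelated.
RECORDS = [
    (1, "AAAAAAAAAAAAAA-BBBBBBBBSA-N"),
    (2, "AAAAAAAAAAAAAA-CCCCCCCCSA-N"),
    (3, "AAAAAAAAAAAAAA-DDDDDDDDSA-N"),
    (4, "ZZZZZZZZZZZZZZ-BBBBBBBBSA-N"),
]

full_index, skeleton_index = build_indexes(RECORDS)
query = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"

full_hits = full_index[query]                          # exact structure only
skeleton_hits = skeleton_index[skeleton_block(query)]  # same connectivity

print(len(full_hits), "full-molecule hit(s);", len(skeleton_hits), "skeleton hit(s)")
```

A skeleton search always returns at least as many hits as the full-molecule search, which matches the 104 versus 4 pattern on the slides.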
ChemSpider: >26 million unique molecules from >400 sources; .NET, SQL Server and GGA Indigo toolkit; multiple Open Source components (Jmol, JSpecView, Balloon, OpenBabel, MediaWiki); slices of data are Open but the entire data collection is not Open; crowdsourced depositions and curations; uses InChIs for navigating and linking the web; uses Names for navigating and linking the web.
I want to know about “Vincristine”. If all algorithms work, then everything on the page is correct by default, except the name-structure relationship!
Vincristine: Identifiers and Properties
Vincristine: Vendors and Sources Linked by Structure
Vincristine: Patents Linked by Name
Vincristine: Articles Linked by Name
ORIGINAL ChemSpider: “Create a system for linking and navigating databases on the web”; use the power of InChI, and the proliferation of InChIs in databases, to make connections; developed on .NET and SQL Server for speed of implementation and existing skill sets; seeded with the PubChem database of 10.5M chemicals and expanded using other sources to 20M.
How do we build it? We deal in Molfiles or SDF files, with coordinates; deposit anything that has an InChI (we support what InChI can handle, good and bad); standardization based on “InChI standardization”; InChIs aggregate (certain) tautomers; we link out to external sites using their IDs.
InChIs – both on ChemSpider
Downsides of InChI: InChI was a moving target (multiple versions) but overall worked as planned; good for small molecules, but no polymers, issues with inorganics and organometallics, imperfect stereochemistry; ChemSpider is “small molecules”; InChI used as the “deduplicator”: the FIRST version of a compound into the database becomes THE structure to deduplicate against…
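The "first in wins" deduplication behaviour can be sketched as follows. This is a minimal illustration, not ChemSpider's implementation: the `deposit` function, the registry layout, and the source names are assumptions, and the molfiles are stand-in strings. The InChI is the Standard InChI for ethanol.

```python
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"


def deposit(registry, inchi, molfile, source):
    """Register a structure keyed by InChI; return (record, is_new).

    The first deposition fixes the stored molfile (and so the depiction).
    Later depositions with the same InChI only add their source.
    """
    record = registry.get(inchi)
    if record is None:
        record = {"inchi": inchi, "molfile": molfile, "sources": [source]}
        registry[inchi] = record
        return record, True
    record["sources"].append(source)
    return record, False


registry = {}
_, is_new_first = deposit(registry, ETHANOL, "molfile from source A", "source A")
_, is_new_second = deposit(
    registry, ETHANOL, "molfile from source B (different 2D layout)", "source B"
)

print(registry[ETHANOL]["molfile"])   # still the first depositor's molfile
print(registry[ETHANOL]["sources"])
```

The design consequence is the one the slide warns about: if the first deposited molfile has a poor layout or an error that InChI does not distinguish, every later deposition is merged into it.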
Side Effects of InChI Usage
SMILES by comparison…
Side Effects of InChI Usage
Standardization Issues: depiction based on molfile
Downsides of Overall Approach: meshing data together based on InChIs worked for simple molecules; 2D layout errors are inherited or limited by algorithm; complex molecules that are meant to be the same thing were NOT deduplicated; compounds differing by one stereocenter, named the same and meant to be the same, are not the same.
Yohimbine
Originally 15 compounds “called” Yohimbine; 54 skeletons for Yohimbine
ChemSpider as an Aggregator: ChemSpider has inherited many errors and continues to do so, but we are far more careful now with pre-filtering; chemicals cannot be deposited without an InChI; deprecated compounds remain deprecated; curated name-structure relationships do NOT remove the related structure; if the name Taxol is removed from 20 asserted “incorrect structures”, those compounds remain in the database.
Chemistry Databases on the Internet: some public databases are “trusted” as primary sources; trust is granted without investigation or understanding of the content; what do we know about some of the online resources?
PHYSPROP Database: the freely downloadable database underlying the EPI Suite prediction software; very basic filters suggest data quality issues.
The stereochemistry challenge: 12,500 chemicals with “missed” stereo
Searches on ChemSpider: most searches are text-based (people searching for information about known chemicals); creating accurate name-structure dictionaries is critical.
NIST Webbook
PubChem
NPC Browser http://tripod.nih.gov/npc/
NPC Browser http://tripod.nih.gov/npc/
NPC Browser http://tripod.nih.gov/npc/
Synonyms on PubChem: 1,3-DICHLORO-PROPAN-2-ONE; (2R,3R)-Butanediol bis(methanesulfonate); Ethyl-1-propenyl ether, mixture of cis and trans; PSS-[2-[(Chloromethyl)phenyl]ethyl]-Heptaisobutyl substituted; 1-Chlorobenzylethyl-3,5,7,9,11,13,15-heptaisobutylpentacyclo [9.5.1.1(3,9).1(5,15).1(7,13)]octasiloxane
Synonyms on PubChem
Data Proliferation
What is meant by a name?
Choose a Starting Point
“The First 10”
What is getting into Our Databases? Large aggregators are inheriting junk data; data HAS proliferated from ChemSpider through PubChem (in process of deprecating and redepositing); a lot of data is for chemicals that will never exist (probably).
Standardization of Patent Data???
Standardization of Patent Data???
WYSIWYG compounds
WYSIWYG compounds
Text Mining Chemical Name Errors
“DPA”
All aggregators suffer dilution!
Structures have timelines
Name-Structure Dictionaries…
Depiction for Humans
Human Depiction versus Algorithms
Human Depiction versus Algorithms
Identifier Dictionaries: reciprocal curation processes, where databases share curation with each other; if a database already has a compound, then use InChIKeys to match the “suggested” validation against the compound; a series of “added” and “removed” synonyms against InChIKeys for matching.
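A shared curation feed of this kind could look like the sketch below. The feed layout of `(InChIKey, action, synonym)` tuples, the `apply_synonym_feed` function, and all keys and names are assumptions made for illustration; the slides do not define a feed format.

```python
def apply_synonym_feed(synonyms, feed):
    """Apply "added"/"removed" synonym suggestions keyed by InChIKey.

    Suggestions are only applied to compounds the local database
    already holds; everything else is skipped. Returns the entries
    that were applied.
    """
    applied = []
    for inchikey, action, name in feed:
        names = synonyms.get(inchikey)
        if names is None:
            continue  # compound not in this database
        if action == "added":
            names.add(name)
        elif action == "removed":
            names.discard(name)
        else:
            continue  # unknown action, ignore
        applied.append((inchikey, action, name))
    return applied


# Hypothetical local synonym table and incoming feed.
local_synonyms = {
    "AAAAAAAAAAAAAA-BBBBBBBBSA-N": {"compound a", "wrong name"},
}
incoming_feed = [
    ("AAAAAAAAAAAAAA-BBBBBBBBSA-N", "removed", "wrong name"),
    ("AAAAAAAAAAAAAA-BBBBBBBBSA-N", "added", "trade name a"),
    ("CCCCCCCCCCCCCC-DDDDDDDDSA-N", "added", "not held locally"),
]

applied_entries = apply_synonym_feed(local_synonyms, incoming_feed)
print(sorted(local_synonyms["AAAAAAAAAAAAAA-BBBBBBBBSA-N"]))
```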
Proof of Concept Data Curation Sharing
Structure Validation using feed: look for approved synonyms; compare feed InChIKey with database InChIKey; if different, flag for inspection.
Open PHACTS: partnership between European Community and EFPIA; freely accessible for knowledge discovery and verification; data on small molecules; pharmacological profiles; pharmacokinetics; ADMET data; biological targets and pathways; proprietary and public data sources.
Adopting Modified FDA Rules: as already used by ChEMBL…
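The three validation steps can be sketched as a simple comparison. This is an illustration under stated assumptions: the database is modelled as a mapping from approved synonym (lowercased) to InChIKey, the feed as a mapping from name to InChIKey, and all names and keys are placeholders.

```python
def validate_against_feed(approved, feed):
    """Flag names whose feed InChIKey disagrees with the database.

    approved: dict of lowercased approved synonym -> database InChIKey
    feed:     dict of name -> InChIKey asserted by the external feed
    Returns a list of (name, database_key, feed_key) to inspect.
    """
    flagged = []
    for name, feed_key in feed.items():
        db_key = approved.get(name.lower())
        if db_key is None:
            continue  # not an approved synonym here, nothing to compare
        if db_key != feed_key:
            flagged.append((name, db_key, feed_key))
    return flagged


# Hypothetical data: "Compound B" disagrees in the second key block.
APPROVED = {
    "compound a": "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "compound b": "CCCCCCCCCCCCCC-DDDDDDDDSA-N",
}
FEED = {
    "Compound A": "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "Compound B": "CCCCCCCCCCCCCC-EEEEEEEESA-N",
    "Compound C": "FFFFFFFFFFFFFF-GGGGGGGGSA-N",
}

to_inspect = validate_against_feed(APPROVED, FEED)
for name, db_key, feed_key in to_inspect:
    print(f"{name}: database has {db_key}, feed has {feed_key}")
```

Flagging rather than overwriting keeps a human in the loop, which fits the slide's "flag for inspection" step.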
Nitro groups
Salt and Ionic Bonds
Ammonium salts
Parent and Child: chemical entities reduced to primary component plus relationships (salt forms, solvates, combinations).
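A parent-child model along these lines might be represented as below. The class names, fields, and relationship labels are assumptions chosen for illustration; the slides only state that entities are reduced to a primary component plus relationships. Vincristine sulfate is used as a familiar salt form of vincristine.

```python
from dataclasses import dataclass, field
from enum import Enum


class Relation(Enum):
    SALT_FORM = "salt form"
    SOLVATE = "solvate"
    COMBINATION = "combination"


@dataclass
class Parent:
    """Primary component, with its related child entities."""
    name: str
    children: list = field(default_factory=list)

    def add_child(self, child_name: str, relation: Relation) -> None:
        self.children.append((relation, child_name))

    def children_of_type(self, relation: Relation) -> list:
        return [name for rel, name in self.children if rel is relation]


vincristine = Parent("vincristine")
vincristine.add_child("vincristine sulfate", Relation.SALT_FORM)

print(vincristine.children_of_type(Relation.SALT_FORM))
print(vincristine.children_of_type(Relation.SOLVATE))
```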
ChemSpider Standardization: the entire ChemSpider database will be standardized using a modified FDA rule set; original Molfiles will be standardized and all properties (predicted properties, SMILES, InChIs, Names) will be regenerated; standardization procedures will be applied automatically to all future depositions.
Project Status: standardization pipelining process initiated; rule implementation and checking is iterative work with Open PHACTS pharma members; data model development to support parent-child relationships; in dialog with the FDA about the latest form of recommendations.
Conclusions: ChemSpider has an important role in quality data; crowdsourced deposition, validation and curation work, but engagement has been low to date; standardization of our entire backfile is necessary; designing the standardization processes with input from pharma and general chemists is necessary.
Acknowledgments: the ChemSpider team; our data providers, depositors, collaborators and curators; software providers: OpenEye, ChemDoodle, ACD/Labs, GGA Software, Open Source (Jmol, JSpecView, OpenBabel).
Thank you. Email: williamsa@rsc.org; Twitter: ChemConnector; Blog: www.chemspider.com/blog; Personal Blog: www.chemconnector.com; Slides: www.slideshare.net/AntonyWilliams

