Hosting public domain chemicals data online for the community – the challenges of handling materials
Antony Williams
NIST Diffusion/CALPHAD Data Informatics and Tools Workshop
May 14th, 2015
ORCID ID: 0000-0002-2668-4821
Disclaimer…
• Previously at the Royal Society of Chemistry
• Now I am here…
Many challenges are the same
• The challenges I will discuss for publishers,
public domain databases, curated chemistry
etc. are the same…
• Need capable tools to handle the data
• Need standards for data exchange
• Meshing data without review is dangerous!
• Quality costs – time, effort and money
• Algorithms can help clean data
Where is chemistry online?
• Encyclopedic articles (Wikipedia)
• Chemical vendor databases
• Metabolic pathway databases
• Property databases
• Patents with chemical structures
• Drug Discovery data
• Scientific publications
• Compound aggregators
• Blogs/Wikis and Open Notebook Science
Chemistry on the Internet…
• Most searching for chemistry on the internet…
• Name searching Google/Bing/Yahoo
• Name searching Wikipedia
• Name searching Wolfram Alpha
• Name, name, name, name…searching
The issue of identifiers
Some names for Aspirin…
The CAS Number
• MUCH integration is done using CAS Numbers
• MANY searches are CAS Numbers and Names
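Because so much integration hinges on CAS Numbers, a cheap first filter when merging sources is the built-in check digit. The sketch below assumes the usual published format (2 to 7 digits, 2 digits, 1 check digit) and the weighted-sum rule; the function names are illustrative only. It confirms that a number is well formed, not that it is assigned to the compound it is attached to.

```python
import re

# Format assumed here: 2-7 digits, hyphen, 2 digits, hyphen, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_check_digit(body_digits: str) -> int:
    """Weight the digits 1, 2, 3, ... starting from the rightmost one,
    sum the products, and take the result modulo 10."""
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body_digits), start=1)
    )
    return total % 10


def is_valid_cas(cas: str) -> bool:
    """True if the string looks like a CAS Number and its check digit matches."""
    match = CAS_PATTERN.match(cas.strip())
    if not match:
        return False
    first, second, check = match.groups()
    return cas_check_digit(first + second) == int(check)
```

For example, `is_valid_cas("50-78-2")` (Aspirin) passes, while a mistyped `"50-78-3"` is rejected before it can pollute a merged dataset.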
CAS Numbers are GREAT!
The CAS Number Index grows…
Scifinder
Prophetic Enumeration
CAS Numbers are “Trademarked”?
• From http://www.cas.org/legal/infopolicy
CAS and Wikipedia
• http://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_Chemistry/CAS_validation
7900 CAS Chemicals Online…
How many CAS Numbers?
• >34 million chemicals from >500 sources
But CAS is hard to “Resolve”
Why CAS Numbers are not great
• There is no free service…like DOIs
• The resolver is a “Google Search”
• Maybe we need another “identifier”?
• And thanks to IUPAC/NIST….
The InChI Identifier
Multiple Layers
InChI
• SINGLE code base managed by IUPAC –
integrated into drawing packages and used by
MANY databases. No variability as with
SMILES
Vendor-dependent SMILES
ACD/Labs
CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(C)=CCC2=C(C)C(=O)c1ccccc1C2=O
OpenEye
CC1=C(C(=O)c2ccccc2C1=O)C/C=C(C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C
ChEMBL
CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(=CCC1=C(C)C(=O)c2ccccc2C1=
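The ACD/Labs and OpenEye strings differ character by character, so a plain string comparison treats them as unrelated records. As a rough sketch (not a substitute for a cheminformatics toolkit), the code below tallies heavy atoms in each string. Matching tallies are a necessary but not sufficient sign that two SMILES describe the same compound; real deduplication would canonicalize the structures or compare InChIs. The regular expression and the function name are assumptions made for this illustration.

```python
import re
from collections import Counter

# Crude SMILES atom tokenizer: bracket atoms, then aliphatic and aromatic
# organic-subset atoms. Bonds, branches and ring-closure digits are skipped.
ATOM = re.compile(
    r"\[\d*([A-Z][a-z]?|[a-z]{1,2})[^\]]*\]"  # e.g. [C@@H], [13C], [nH]
    r"|(Cl|Br|[BCNOPSFI])"                     # aliphatic atoms
    r"|([bcnops])"                             # aromatic atoms
)


def heavy_atom_tally(smiles: str) -> Counter:
    """Count non-hydrogen atoms per element symbol in a SMILES string."""
    tally = Counter()
    for match in ATOM.finditer(smiles):
        symbol = match.group(1) or match.group(2) or match.group(3)
        if symbol != "H":
            tally[symbol.capitalize()] += 1
    return tally


ACD = "CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(C)=CCC2=C(C)C(=O)c1ccccc1C2=O"
OPENEYE = "CC1=C(C(=O)c2ccccc2C1=O)C/C=C(C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C"

strings_match = ACD == OPENEYE                                      # False
tallies_match = heavy_atom_tally(ACD) == heavy_atom_tally(OPENEYE)  # True
```

Both strings tally to 31 carbons and 2 oxygens, yet they still differ in how double-bond geometry is written, which is exactly the kind of variation that a single identifier code base avoids.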
InChI
• InChI Strings can be reversed to structures –
same problem as with SMILES – no layout
• Adopted by the community (databases, blogs,
Wikipedia) – good for searching the internet
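To make the layered design concrete, here is a minimal sketch that splits an InChI string into its slash-separated layers. It assumes only the general layout (a version token, an optional formula, then layers that each start with a one-letter prefix) and uses the widely published standard InChI for Aspirin as input. The function name is hypothetical, and the sketch does not validate layer contents.

```python
from typing import List, Tuple


def split_inchi_layers(inchi: str) -> List[Tuple[str, str]]:
    """Return (layer_name, content) pairs in the order they appear.

    A list is used rather than a dict because a prefix can legitimately
    occur more than once in a longer InChI (e.g. after an isotopic layer).
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, *segments = inchi[len(prefix):].split("/")
    layers = [("version", version)]
    # The formula layer is the only one without a lowercase prefix letter.
    if segments and not segments[0][:1].islower():
        layers.append(("formula", segments.pop(0)))
    for segment in segments:
        if segment:
            layers.append((segment[0], segment[1:]))
    return layers


ASPIRIN_INCHI = (
    "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"
)
aspirin_layers = split_inchi_layers(ASPIRIN_INCHI)
```

Here `aspirin_layers` holds the version (`1S`), the formula (`C9H8O4`), the connectivity layer (`c`) and the hydrogen layer (`h`).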
InChI Strings Hash to InChIKeys
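The hashing itself is done by the InChI software and is not reproduced here. The sketch below only takes a finished InChIKey apart, assuming the standard 27-character layout: a 14-letter block derived from the connectivity, an 8-letter block for the remaining layers, a standard/non-standard flag, a version letter, and a protonation letter. The function and field names are made up for this example.

```python
import re
from typing import Optional

# 14 letters, hyphen, 8 letters + flag (S or N) + version letter,
# hyphen, 1 protonation letter: 27 characters in total.
INCHIKEY = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def parse_inchikey(key: str) -> Optional[dict]:
    """Split an InChIKey into its blocks, or return None if malformed."""
    match = INCHIKEY.match(key.strip())
    if not match:
        return None
    connectivity, other, flag, version, protonation = match.groups()
    return {
        "connectivity_block": connectivity,
        "other_layers_block": other,
        "is_standard": flag == "S",
        "version_letter": version,
        "protonation_letter": protonation,
    }


def same_connectivity(key_a: str, key_b: str) -> bool:
    """True if both keys parse and share the first (connectivity) block."""
    a, b = parse_inchikey(key_a), parse_inchikey(key_b)
    return bool(a and b and a["connectivity_block"] == b["connectivity_block"])


aspirin_key = parse_inchikey("BSYNRYMUTXBXSQ-UHFFFAOYSA-N")
```

Comparing only the first block is one way to approximate a skeleton-level match across databases, while comparing the full key corresponds to an exact match.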
InChIs for small molecules…
• InChIs are good for “small molecules”
• Read here: http://www.jcheminf.com/series/InChI
A Vision in December 2006
Lots of data coming online…
ChemSpider
Experimental/Predicted Properties
Literature references
Patent references
Google Books
Vendors and data sources
Structure search the web
Exact Search
Skeleton Search
6 years ago this week…
ChemSpider strengths
• Serves over 40,000 unique users per day
• Advanced searching of >34 million chemicals
Fully documented APIs
Data Quality/Standardization
• MANY structures online that are meant to
represent a specific chemical are MISREPRESENTED.
• Commonly you will have better success finding
information by name searches than by structure –
with many caveats of course…
• Validating chemical structure representations
is laborious work – and it’s shocking to review
data…
What is the Structure of Vitamin K1?
Data Quality Issues
Williams and Ekins, DDT, 16: 747-750 (2011)
Science Translational Medicine 2011
Data quality is a known issue
Patent data in public databases
Depiction vs Accurate Representation
There are Unused Standards!
Nitro groups
Salt and Ionic Bonds
Ammonium salts
Can we MAKE Quality Data?
• Systems for everyone to validate and
standardize their data would be useful
• Would improve structure data in publications,
databases etc. and make searching across
resources better
• Collaboration to establish community rules
would be good!
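As a toy illustration of what shared, community-agreed rules could look like in code, the sketch below runs a small rule set over SMILES strings and reports issues by severity. Everything here is hypothetical: the rule names, the severities and the string patterns are assumptions for the example and are not the rules of any existing platform. The patterns are plain substring tests that catch only some spellings, and a real system would use substructure matching. The nitro rule just flags the pentavalent drawing so that a curator can apply whichever convention is agreed.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Rule:
    name: str
    severity: str                    # "error", "warning" or "info"
    triggers: Callable[[str], bool]  # returns True when the rule fires


# Hypothetical rule set, for illustration only.
RULES = [
    Rule("empty structure", "error", lambda s: not s.strip()),
    Rule(
        "pentavalent nitro group",
        "warning",
        lambda s: "N(=O)=O" in s or "O=N(=O)" in s,
    ),
    Rule("multiple components (salt or mixture)", "info", lambda s: "." in s),
]


def validate(smiles: str, rules: List[Rule] = RULES) -> List[Tuple[str, str]]:
    """Return (severity, rule name) for every rule that fires, in rule order."""
    return [(rule.severity, rule.name) for rule in rules if rule.triggers(smiles)]
```

With these rules, `validate("c1ccccc1N(=O)=O")` reports the nitro warning, the charge-separated form `c1ccccc1[N+](=O)[O-]` passes cleanly, and `[Na+].[Cl-]` is reported as a multi-component record.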
Chemical Validation and Standardization: http://cvsp.chemspider.com
CVSP Rule Sets
CVSP Filtering of DrugBank
CVSP is Open to Anyone!
ChemSpider limitations
• Supports “small molecules” only – no
InChI means no possibility to register a compound
• SO MUCH of chemistry is “materials”
• Severe limitation in chemistry coverage:
• Monomers but no polymers
• Inorganic and organometallic handling
• Ambiguous structures – “Markush”
• Nanomaterials
• Minerals
• Bound to beads, surfaces etc.
ORGANICS vs. Materials
• Comment – you don’t know all of the
challenges until you start to work in the area!
• We, and cheminformatics companies, have
solved MANY, but not all, of the issues
regarding organic chemistry management
• The majority of our approaches do not map to
materials
• No standard ways to represent compounds
• No InChI for materials
Questions to consider…
• Organics are hard enough!
• What are your best dictionaries of materials?
• We have chemical ontologies. Status for
materials?
• Is open annotation of your databases possible?
• What standards do you have for materials data
exchange?
Polymorphism is common
Known Challenges
• Many materials are non-stoichiometric
• How to represent composite materials (e.g.
supported catalysts)?
• Methods to distinguish novelty in materials
(equivalent to diversity in organic structures)?
• Lots of challenges ahead… a curated
“community dictionary” would be of value…
Mapped DICTIONARIES…
• Structure IDs
• Systematic name(s)
• Trivial Name(s)
• SMILES
• InChI Strings
• InChIKeys
• Database IDs
• Registry Number
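A mapped dictionary of this kind can be sketched as one record per compound plus an index from every identifier back to that record. The field names below simply mirror the bullet list. The `structure_id` value and the `database:id` key convention are hypothetical, and Aspirin is used as the example entry because its identifiers are widely published.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DictionaryEntry:
    structure_id: str
    systematic_names: List[str]
    trivial_names: List[str]
    smiles: str
    inchi: str
    inchikey: str
    database_ids: Dict[str, str] = field(default_factory=dict)
    registry_number: Optional[str] = None


def build_index(entries: List[DictionaryEntry]) -> Dict[str, DictionaryEntry]:
    """Map every identifier of every entry back to the entry itself."""
    index: Dict[str, DictionaryEntry] = {}
    for entry in entries:
        # Structure strings and IDs are case-sensitive, so keep them exact.
        keys = [entry.structure_id, entry.smiles, entry.inchi, entry.inchikey]
        keys += [f"{db}:{db_id}" for db, db_id in entry.database_ids.items()]
        if entry.registry_number:
            keys.append(entry.registry_number)
        # Names are matched case-insensitively.
        keys += [n.casefold() for n in entry.systematic_names + entry.trivial_names]
        for key in keys:
            index[key] = entry
    return index


def lookup(index: Dict[str, DictionaryEntry], query: str) -> Optional[DictionaryEntry]:
    """Try an exact identifier match first, then a case-folded name match."""
    return index.get(query) or index.get(query.casefold())


ASPIRIN = DictionaryEntry(
    structure_id="entry-0001",  # hypothetical local ID
    systematic_names=["2-acetyloxybenzoic acid"],
    trivial_names=["Aspirin", "Acetylsalicylic acid"],
    smiles="CC(=O)Oc1ccccc1C(=O)O",
    inchi="InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)",
    inchikey="BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    database_ids={"ChemSpider": "2157", "PubChem": "2244"},
    registry_number="50-78-2",
)
INDEX = build_index([ASPIRIN])
```

A query by name (`lookup(INDEX, "aspirin")`), by InChIKey or by registry number then lands on the same record. For materials, the open question raised in these slides is what would take the place of the structure-based fields.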
Pragmatism wins
Collaboration is key
Wouldn’t it be nice if…
Thank you
Email: tony27587@gmail.com
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams
