Open PHACTS - Chemistry
Platform Update and learnings
Antony Williams and Valery Tkachenko
ORCID ID:0000-0002-2668-4821
@gray_alasdair Big Data Integration 2
OpenPHACTS and CRS Diagram
The Chemical Registration Service
Chemistry processing
•Validation
•Standardization
•Properties generation
•Properties retrieval
Export
•RDF
•SDF
API
•Domain-specific searches
•Chemical visualization
•Properties
•Conversions
Subsystems
• “CVSP” (frontend, backend, database)
• Compounds (frontend, database)
• OpenPHACTS API (frontend, database)
• Datasources registry (frontend, database)
• Processing farm (optional)
Structure-Based Database linking
• Open PHACTS, and many other projects
requiring the linking of structure databases,
depend on mappings
• Different databases use different processes
for standardization prior to deposition
• Examples: PubChem, EBI databases,
ChemSpider, etc.
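
Why mappings depend on standardization can be sketched in a few lines. This is a minimal illustration, not Open PHACTS code: `link_by_key` is a name introduced here, and the record identifiers and structure keys are hypothetical placeholders. In practice the key might be an identifier such as a Standard InChIKey computed after standardization.

```python
from collections import defaultdict

def link_by_key(records):
    """Group (database, record_id, structure_key) tuples by structure key.

    Only keys seen in more than one database produce a cross-database mapping.
    """
    groups = defaultdict(list)
    for database, record_id, key in records:
        groups[key].append((database, record_id))
    return {
        key: members
        for key, members in groups.items()
        if len({db for db, _ in members}) > 1
    }

# Hypothetical records: the same compound deposited in two databases.
# Each database standardized it differently, so the keys differ.
raw = [
    ("db_a", "A-1", "KEY-nitro-charge-separated"),
    ("db_b", "B-7", "KEY-nitro-pentavalent"),
]
# After one shared rule set is applied, both records get the same key.
standardized = [
    ("db_a", "A-1", "KEY-nitro-charge-separated"),
    ("db_b", "B-7", "KEY-nitro-charge-separated"),
]

print(link_by_key(raw))           # no link found
print(link_by_key(standardized))  # the two records are mapped to each other
```

The two inputs differ only in how the second record was standardized, which is enough to lose or recover the link.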
DrugBank
• ~60 records can’t be dearomatized unambiguously
• ~40 records where InChIs did not match structure
• 2 records where SMILES, InChI and name did not
match the structure
• 7 records with 2 stereo bonds at chiral atoms
DB04283 DB04462
Standardizers
• EBI Standardizer:
https://wwwdev.ebi.ac.uk/chembl/extra/francis/sta
/
• PubChem Standardizer:
https://pubchem.ncbi.nlm.nih.gov/standardize/standardi
• NCGC Standardizer: https://tripod.nih.gov/?p=61
• The CVSP Standardizer work in Open
PHACTS http://cvsp.chemspider.com/
Standardization Rules
• Available from: http://tinyurl.com/hwapem3
• Use the SRS as guidance for standardization
• Adjust as necessary to our needs
Nitro groups
Salt and Ionic Bonds
The CVSP System
http://cvsp.chemspider.com
Supports various file formats
Comptox Chemistry Dashboard
Prior to deposition check a deposition…
>3450 compounds in one SDF
98 Errors, 1571 Warnings
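
A pre-deposition check of this kind can be sketched as follows. This is a minimal sketch under stated assumptions: the two checks in `check_record` are illustrative stand-ins and are not the CVSP rules, and all function names are introduced here. The only format fact relied on is that SDF records are terminated by a `$$$$` line.

```python
from collections import Counter

def split_sdf(text):
    """Split SDF text into records; each record ends with a '$$$$' line."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if any(line.strip() for line in current):
        records.append("\n".join(current))  # tolerate a missing final delimiter
    return records

def check_record(index, record):
    """Two illustrative checks; these are NOT the CVSP rules."""
    issues = []
    lines = record.splitlines()
    if not lines or not lines[0].strip():
        issues.append((index, "warning", "empty title line"))
    if not any(line.startswith("M  END") for line in lines):
        issues.append((index, "error", "molfile block has no 'M  END' line"))
    return issues

def tally(issues):
    """Count (record_index, severity, message) issues by severity."""
    return Counter(severity for _, severity, _ in issues)

# Two tiny records: the first is well formed, the second is truncated.
sdf_text = (
    "mol one\n\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n$$$$\n"
    "\n\n\n  0  0\n$$$$\n"
)
issues = [
    issue
    for i, rec in enumerate(split_sdf(sdf_text), start=1)
    for issue in check_record(i, rec)
]
summary = tally(issues)
print(f"{summary['error']} Errors, {summary['warning']} Warnings")
```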
Review Errors
Validation Rule Set
Various Rule Sets Available
CVSP – My own custom rules
ChEMBL Validation Review
(of 1.3 million records)
• 11,020 records with 4 bonds and zero charge, e.g.
CHEMBL501101 or CHEMBL501973
• 271 records with hypervalent oxygen (e.g.
CHEMBL2219679), carbon (e.g. 1005895), boron,
chlorine, iodine or phosphine
• 6,177 records where direction of bond makes no
sense, e.g. CHEMBL12760 and CHEMBL34704
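
A valence rule of the kind behind these counts can be sketched as below. The valence table is a deliberately simplified assumption for illustration, not the CVSP rule set. Atoms and bonds are given as plain tuples rather than parsed from a structure file. The slide does not name the element in "4 bonds and zero charge", so the nitrogen example is also an assumption.

```python
# Simplified maximum total bond order for neutral atoms.
# An assumption for illustration; real rule sets are far more detailed.
MAX_NEUTRAL_VALENCE = {"B": 3, "C": 4, "N": 3, "O": 2}

def find_valence_issues(atoms, bonds):
    """Flag neutral atoms whose total bond order exceeds the table above.

    atoms: list of (element, formal_charge, implicit_hydrogen_count)
    bonds: list of (atom_index_a, atom_index_b, bond_order), 0-based indices
    """
    totals = [hydrogens for _, _, hydrogens in atoms]
    for a, b, order in bonds:
        totals[a] += order
        totals[b] += order
    issues = []
    for index, ((element, charge, _), total) in enumerate(zip(atoms, totals)):
        limit = MAX_NEUTRAL_VALENCE.get(element)
        if charge == 0 and limit is not None and total > limit:
            issues.append(
                f"atom {index} ({element}): {total} bonds with zero charge"
            )
    return issues

# Tetramethylammonium drawn without its positive charge:
# nitrogen has four single bonds but a formal charge of zero.
atoms = [("N", 0, 0)] + [("C", 0, 3)] * 4
bonds = [(0, 1, 1), (0, 2, 1), (0, 3, 1), (0, 4, 1)]
print(find_valence_issues(atoms, bonds))
```

Setting the nitrogen charge to +1 makes the record pass, which is the kind of correction a later standardization or curation step would apply.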
Chemical Validation first…
Standardization Second
• Chemical Validation detects errors –
Standardization FIXES them according to rules
• SMIRKS transformations are based on both
InChI Normalization and FDA SRS rules
Standardization SMIRKS
Examples of InChI normalization
[*;H+:1]>>[*;H:1]
[O,S,Se,Te:1]=[O+,S+,Se+,Te+:2][C-;v3:3]>>[O,S,Se,Te:1]=[O,S,Se,Te:2]=[C:3]
[N-,P-,As-,Sb-:1]=[C+;v3:2]>>[N,P,As,Sb:1]#[C:2]
Examples of FDA SRS rules
[n:1]=[O:2]>>[n+:1][O-:2]
[*:1]=[N:2]#[N:3]>>[*:1]=[N+:2]=[N-:3]
[N+0;H3:1].[C:3](=[O:4])[O:5][H:6]>>[N+1;H4:1].[C:3](=[O:4])[O-:5]
Thiopurine
[H:1][S:2][c:3]1[n:8][c:7]([H,*:13])[n:6][c:5]2[c:4]1[n:11][c:10]([H,*:12])[n:9]2>>[H:1][N:8]1[C:7]([H,*:13])=[N:6][C:5]2=[C:4]([N:11]=[C:10]([H,*:12])[N:9]2)[C:3]1=[S:2]
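
One quick way to read the SMIRKS above is to compare the atom-map numbers on each side of `>>`. This is a minimal sketch using only regular expressions: it does not parse or apply the transformations, and the function names are introduced here for illustration.

```python
import re

# An atom map is the ":<number>" immediately before the closing bracket.
ATOM_MAP = re.compile(r":(\d+)\]")

def atom_maps(smirks):
    """Return the sets of atom-map numbers on the reactant and product sides."""
    reactant, _, product = smirks.partition(">>")
    return (
        set(map(int, ATOM_MAP.findall(reactant))),
        set(map(int, ATOM_MAP.findall(product))),
    )

def unmatched_maps(smirks):
    """Map numbers that appear on only one side of the transformation."""
    left, right = atom_maps(smirks)
    return sorted(left - right), sorted(right - left)

n_oxide = "[n:1]=[O:2]>>[n+:1][O-:2]"
amine_acid = (
    "[N+0;H3:1].[C:3](=[O:4])[O:5][H:6]"
    ">>[N+1;H4:1].[C:3](=[O:4])[O-:5]"
)

print(unmatched_maps(n_oxide))     # every mapped atom appears on both sides
print(unmatched_maps(amine_acid))  # atom 6 is mapped on the reactant side only
```

In the amine and acid rule, the explicit hydrogen mapped as 6 is absent from the product, which is consistent with the proton moving onto the nitrogen (H3 becomes H4).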
Examples of Standardization
Double bond with adjacent wiggly single bond
Collapse hydrogen atoms with no stereo bonds
Examples of Standardization
Remove symmetric stereocenters
Turn off chiral flag if no up or down bonds
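
The chiral-flag rule can be sketched directly on a V2000 molfile block. This is a minimal sketch and not the CVSP implementation. It assumes the fixed-column V2000 layout, where the counts line holds the chiral flag in columns 13 to 15 and each bond line holds the stereo code in columns 10 to 12 (1 = up, 6 = down). The function name and the example block are illustrative, and the example is not chemically meaningful.

```python
def _int_field(text):
    """Parse a fixed-width molfile field, treating blanks as 0."""
    return int(text.strip() or 0)

def clear_unsupported_chiral_flag(molblock):
    """Set the V2000 chiral flag to 0 when no bond is marked up or down."""
    lines = molblock.split("\n")
    counts = lines[3]  # the fourth line of a molfile is the counts line
    n_atoms = _int_field(counts[0:3])
    n_bonds = _int_field(counts[3:6])
    chiral_flag = _int_field(counts[12:15])
    bond_lines = lines[4 + n_atoms : 4 + n_atoms + n_bonds]
    has_wedge = any(_int_field(b[9:12]) in (1, 6) for b in bond_lines)
    if chiral_flag == 1 and not has_wedge:
        lines[3] = counts[:12] + "  0" + counts[15:]
    return "\n".join(lines)

def make_example(bond_line):
    """Build a two-atom block with the chiral flag set (illustrative only)."""
    return "\n".join([
        "example", "", "",
        "  2  1  0  0  1  0  0  0  0  0999 V2000",
        "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
        "    1.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
        bond_line,
        "M  END",
    ])

plain = make_example("  1  2  1  0")   # no stereo bond: flag is cleared
wedged = make_example("  1  2  1  1")  # up bond present: flag is kept
print(clear_unsupported_chiral_flag(plain).split("\n")[3])
print(clear_unsupported_chiral_flag(wedged).split("\n")[3])
```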
Defining a Community Rule Set
• There are multiple standardizers, each with
its own rule set
• Can we decide on a default community rule
set, like Standard InChI, that could be used
by ALL Standardizers?
• A joint meeting between the Research Data
Alliance (RDA), IUPAC and ACS Division of
Chemical Information discussed the value
and possibilities of this approach (July 2016)
EPA is investigating CVSP
• EPA is investigating CVSP as a validation
and standardization platform
• Considering the API aspects of CVSP to
integrate to our registration system
• CVSP is a reference implementation and
“starting point” for a community rule set
CVSP code is now Open Source
• Open Source CVSP code now released
• Code is hosted on Open PHACTS Github
https://github.com/openphacts/ops-crs
• Valery Tkachenko will offer future support
• Hoping for additional community engagement
and support
• Some details of availability….
Virtual Machines
• OPS_FRONT (all websites and API)
• OPS_BACK (all heavy-lifting)
• OPS_DB (databases)
• VMs are VMware images
• Can be converted to other hypervisors
Thank you
Emails: tony27587@gmail.com and tkachenko.valery@gmail.com
SLIDES: www.slideshare.net/AntonyWilliams


Editor's Notes

  • #3: Open PHACTS was developed to support the key questions of drug discovery. Business questions have been at the heart of Open PHACTS and have driven the development of the platform.
    Mx/psa, how calculated who did it? Mash up. With your data too, - top layer join together but need them all commercial
    Data provided by many publishers. Originally in many formats: relational, SD files and RDF. Worked closely with publishers. Data licensing was a major issue.
    Over 5 billion triples – 14 datasets & growing. Hosted on beefy hardware; data in memory (aim). Extensive memcaching. Pose complex queries to extract data.