Accurate biochemical knowledge starting with precise structure-based criteria for molecular identity
  Michel Dumontier, Ph.D.
  Assistant Professor of Bioinformatics
  Department of Biology, School of Computer Science, Institute of Biochemistry, Ottawa Institute of Systems Biology
  Carleton University
  01/04/2009 NCBO Seminar Series::Michel Dumontier
Problem Statement (I)
  Although biochemical events can be described with reference to specific chemical substances, we may want to describe them at finer levels of (mereological) granularity:
    residue: post-translational modification
    collection of residues: motif/domain/interaction site
    atom: atomic interactions, catalytic mechanism
    collection of atoms: binding/catalytic site, interaction
  This requires identifiers for parts, regions (contiguous and non-contiguous), and aggregates/complexes.
  However, we do not (AFAIK) have a precise (reproducible) methodology to generate these automatically!
Bio2RDF: 2.3B triples of SPARQL-accessible linked biological data! Chemical Parts!
Case Study: HIF1α
  Hypoxia-Inducible Factor 1, alpha chain (uniprot:Q16665)
  Master transcriptional regulator of the adaptive response to hypoxia
  Under normoxic conditions, HIF1α is hydroxylated on Pro-402 and Pro-564 in the oxygen-dependent degradation domain (ODD) by EGLN1/PHD2 and EGLN2/PHD1. EGLN3/PHD3 has also been shown to hydroxylate Pro-564. The hydroxylated prolines promote interaction with VHL, initiating rapid ubiquitination and subsequent proteasomal degradation.
  Context-dependent behavior (normoxic conditions, hypoxic conditions)
  Multiple hydroxylations
  Part of a domain
  The part is the agent in the process
  Selective interaction with parts
Are these the same?
  HIF1α (au naturel)
  HIF1α hydroxylated @P402
  HIF1α hydroxylated @P564
  HIF1α hydroxylated @P402 & @P564
  HIF1α hydroxylated @P402 & (@P564), ubiquitinated @Lys-532
  HIF1α L400A & L397A
NO!!!!
  These are structurally different. Each exhibits distinct functionality!
  Yet most databases (UniProt/GenBank) don't have separate identifiers for them.
  Reactome has an internal identifier for referring to different forms, but it links to UniProt entries and doesn't provide an explicit description of the structure that it corresponds to!
So
  We have a clear need to refer to distinct biochemical entities, based at least on their structure.
  We also need to refer to arbitrary structural parts.
  Should we generate all the combinations a priori? -> NO!!
  Should we be able to automatically generate the identifier from the structural attributes? -> YES!!!
  Should we semantically annotate (manually or otherwise) those forms known to be involved in specific processes? -> YES!!!
  What identifiers are unique for a given structure?
InChI
  IUPAC International Chemical Identifier (InChI)
  A data string that provides the structure of a chemical compound and the convention for drawing the structure.
  Different compounds must have different identifiers.
  Several attributes can be used to distinguish one compound from another:
    chemical graph (connection table)
    formula
    atom type (only some atoms explicit)
    bond type
    stereochemistry
    mobile/fixed H atoms (tautomers)
    isotopic composition
    atomic charge
(S)-Glutamic Acid
  InChI=
  {version}1
  /{formula}C5H9NO4
  /c{connections}6-3(5(9)10)1-2-4(7)8
  /h{H_atoms}3H,1-2,6H2,(H,7,8)(H,9,10)
  /p{protons}+1
  /t{stereo:sp3}3-
  /m{stereo:sp3:inverted}0
  /s{stereo:type (1=abs, 2=rel, 3=rac)}1
  /i{isotopic:atoms}4+1
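To make the layer structure concrete, here is a minimal Python sketch that splits the glutamic acid string from this slide back into its layers. The function name `split_inchi_layers` is hypothetical, and the sketch assumes each prefix letter occurs only once, which holds for this example but not for every InChI (some layers can repeat a prefix).

```python
def split_inchi_layers(inchi: str) -> dict:
    """Split an InChI string into version, formula, and one-letter-prefixed layers.

    Simplifying assumption: every layer prefix appears at most once.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    parts = inchi[len(prefix):].split("/")
    layers = {"version": parts[0], "formula": parts[1]}
    for part in parts[2:]:
        # The first character names the layer (c, h, p, t, m, s, i, ...).
        layers[part[0]] = part[1:]
    return layers


# The (S)-glutamic acid example from the slide, with the annotations removed.
glutamic_acid = (
    "InChI=1/C5H9NO4/c6-3(5(9)10)1-2-4(7)8"
    "/h3H,1-2,6H2,(H,7,8)(H,9,10)/p+1/t3-/m0/s1/i4+1"
)
layers = split_inchi_layers(glutamic_acid)
for name, value in layers.items():
    print(f"{name:>8}: {value}")
```

Reading the identifier is the easy direction; generating one still requires the normalization and canonicalization steps performed by the InChI software.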
More non-core info captured in the "AuxInfo" string...
  AuxInfo=
  {version}1
  /{normalization_type}1
  /N:{original_atom_numbers}5,6,2,7,1,4,8,9,10,11
  /E:{atom_equivalence}(7,8)(9,10)
  /it:{abs_stereo_inverted:sp3}im
  /I:{isotopic:original_atom_numbers}
  /E:{isotopic:atom_equivalence}m
  /rA:{reversibility:atoms}11nCCHN+CCC.i13OOOO
  /rB:{reversibility:bonds}s1;N2;P2;s2;s5;s6;s7;d7;d1;s1;
  /rC:{reversibility:xyz}6.1671,-19.3365,0;7.0125,-18.4864,0;6.4113,-17.4485,0;7.6089,-17.4485,0;7.8578,-19.3318,0;8.891,-18.7306,0;9.7363,-19.576,0;9.7316,-20.7735,0;10.8916,-19.266,0;5.0071,-19.0265,0;6.1624,-20.534,0;
  Assembled:
  AuxInfo=1/1/N:5,6,2,7,1,4,8,9,10,11/E:(7,8)(9,10)/it:im/I:/E:m/rA:11nCCHN+CCC.i13OOOO/rB:s1;N2;P2;s2;s5;s6;s7;d7;d1;s1;/rC:6.1671,-19.3365,0;7.0125,-18.4864,0;6.4113,-17.4485,0;7.6089,-17.4485,0;7.8578,-19.3318,0;8.891,-18.7306,0;9.7363,-19.576,0;9.7316,-20.7735,0;10.8916,-19.266,0;5.0071,-19.0265,0;6.1624,-20.534,0;
So...
  InChI is really just a cryptic data identifier.
  Clever software is required to build the chemical identifier in a series of well-defined steps: normalization, canonicalization, then serialization.
  Humans can't (easily) generate them, nor can they easily understand them. But that's OK.
  It's not (user) extensible. But that's OK.
InChI for Proteins???
  Possible... but a 1000-residue protein would contain ~15,000 atoms on average.
  OpenBabel seemed to struggle with anything over 100 residues. Maybe it needs some performance tweaking?
  The size of the string will be enormous.
  We can use InChIKeys (a fixed-length SHA-256-based hash), but then we need to provide a you-submit-InChI, we-store-both and they-look-it-up service.
  Modularize InChI construction for (linear) polymers?
    Make InChI strings for each residue and concatenate them, renaming the atoms according to the residue position.
    We still need to translate the InChI string ...
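The submit/store/look-up service mentioned above can be sketched in a few lines of Python. This is an illustration only: the class `InchiRegistry` and its methods are hypothetical, and the key is a truncated SHA-256 hex digest, which is not the official InChIKey encoding (real InChIKeys are produced by the InChI software).

```python
import hashlib


class InchiRegistry:
    """In-memory stand-in for a 'you submit, we store both, they look it up' service."""

    KEY_LENGTH = 27  # arbitrary fixed length for this sketch

    def __init__(self):
        self._inchi_by_key = {}

    def submit(self, inchi: str) -> str:
        """Store the full InChI and return a short fixed-length key for it."""
        digest = hashlib.sha256(inchi.encode("utf-8")).hexdigest()
        key = digest[: self.KEY_LENGTH]
        self._inchi_by_key[key] = inchi
        return key

    def lookup(self, key: str) -> str:
        """Recover the full InChI; a hash cannot be reversed without the store."""
        return self._inchi_by_key[key]


registry = InchiRegistry()
glucose = (
    "InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2"
    "/h2-11H,1H2/t2-,3-,4+,5-,6+/m1/s1"
)
key = registry.submit(glucose)
print(key, "->", registry.lookup(key))
```

The key stays the same length however large the molecule is, which is the attraction for proteins, but the structure can only be recovered through the registry.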
α-D-Glucose (79025)
  OpenBabel: CML, SDF
  SMILES: O1[C@@H]([C@@H](O)([C@H](O)([C@@H](O)([C@@H]1(O)))))(CO)
  InChI: InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2/t2-,3-,4+,5-,6+/m1/s1
  IUPAC: 6-(hydroxymethyl)oxane-2,3,4,5-tetrol OR (2R,3R,4S,5R,6R)-6-(hydroxymethyl)tetrahydro-2H-pyran-2,3,4,5-tetraol
OWL Has Explicit Semantics
  Can therefore be used to capture knowledge in a machine-understandable way
Chemical Ontology
  Chemical Knowledge for the Semantic Web. Mykola Konyk, Alexander De Leon, and Michel Dumontier. LNBI. 2008. 5109:169-176. Data Integration in the Life Sciences (DILS2008). Evry, France.
  http://code.google.com/p/semanticwebopenbabel/
Describing chemical functional groups in OWL-DL for the classification of chemical compounds
  Examples: hydroxyl group, methyl group (Ethanol)
  Knowledge of functional groups is important in chemical synthesis, pharmaceutical design and lead optimization.
  Functional groups describe chemical reactivity in terms of atoms and their connectivity, and exhibit characteristic chemical behavior when present in a compound.
  N Villanueva-Rosales, M Dumontier. 2007. OWLED, Innsbruck, Austria.
Describing Functional Groups in DL
  HydroxylGroup: CarbonGroup that (hasSingleBondWith some (OxygenAtom that hasSingleBondWith some HydrogenAtom))
  Diagram: R-O-H, where R is the R group
Fully Classified Ontology 35 FG
And, we define certain compounds
  Alcohol: OrganicCompound that (hasPart some HydroxylGroup)
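The two class definitions (HydroxylGroup and Alcohol) can be mimicked outside a reasoner with a small graph search. The Python sketch below is not the OWL-DL classification described in the talk; it only mirrors the pattern "carbon, single bond, oxygen, single bond, hydrogen". The molecule encoding (a dict of atom elements plus a list of `(atom, atom, bond_order)` tuples) and all function names are assumptions made for illustration.

```python
from collections import defaultdict


def build_adjacency(bonds):
    """Map each atom id to its (neighbour id, bond order) pairs."""
    adjacency = defaultdict(list)
    for a, b, order in bonds:
        adjacency[a].append((b, order))
        adjacency[b].append((a, order))
    return adjacency


def hydroxyl_groups(atoms, bonds):
    """Return (carbon, oxygen, hydrogen) triples matching C-O-H with single bonds."""
    adjacency = build_adjacency(bonds)
    matches = []
    for carbon, element in atoms.items():
        if element != "C":
            continue
        for oxygen, order_co in adjacency[carbon]:
            if atoms[oxygen] != "O" or order_co != 1:
                continue
            for hydrogen, order_oh in adjacency[oxygen]:
                if atoms[hydrogen] == "H" and order_oh == 1:
                    matches.append((carbon, oxygen, hydrogen))
    return matches


def is_alcohol(atoms, bonds):
    """Alcohol: a compound that has some hydroxyl group as a part."""
    return bool(hydroxyl_groups(atoms, bonds))


# Ethanol (CH3-CH2-OH); only the hydroxyl hydrogen is written explicitly.
ethanol_atoms = {1: "C", 2: "C", 3: "O", 4: "H"}
ethanol_bonds = [(1, 2, 1), (2, 3, 1), (3, 4, 1)]

# Dimethyl ether (CH3-O-CH3): it has an oxygen but no O-H bond.
ether_atoms = {1: "C", 2: "O", 3: "C"}
ether_bonds = [(1, 2, 1), (2, 3, 1)]

print(hydroxyl_groups(ethanol_atoms, ethanol_bonds))
print(is_alcohol(ether_atoms, ether_bonds))
```

Like the class expression on the slide, this pattern is deliberately loose: any carbon bearing an O-H matches, so finer distinctions need additional definitions.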
Organic Compound Ontology 28 OC
Question Answering
  Query all annotations
  Query PubChem, DrugBank and DBpedia*
  * Requires import of relevant URIs
But...
  Molecules are represented as individuals because OWL-DL only allows tree-like class descriptions.
  No variable binding (e.g. ?x) ... so no cyclic molecule/functional group descriptions at the class level.
  Boris Motik et al. have a proposal for Description Graphs; Robert Stevens & Duncan Hull are trying it out for chemical representation....
Identifiers for Atoms
  Atom identifiers can be consistently retrieved from the OpenBabel model.
  Canonical numbering means we can reliably refer to a specific region rather than a (possibly degenerate) sub-graph match.
  In our plugin, URI component naming was based on the assigned molecule identifier, e.g. pubchemid#aN, where N is the atom number.
  Use InChIKey as base? e.g. InChIKey#aN
What about identifiers for collections of atoms?
  Potentially useful in describing residues, PTMs, binding sites, etc.
  Is the lack of connectivity sufficient?
  Contiguous: ranges (aN-aN), enumerations (aN,aN,aN)
  Non-contiguous: combination of ranges, enumerations?
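One way the range and enumeration notations could be combined is sketched below in Python: consecutive atom numbers collapse into `aN-aM` ranges and everything else is enumerated. The function names and the exact fragment syntax are assumptions for illustration, not a settled convention.

```python
def atom_set_fragment(atom_numbers):
    """Encode atom numbers as a fragment such as 'a3,a50-a52,a60'."""
    numbers = sorted(set(atom_numbers))
    if not numbers:
        raise ValueError("an atom collection needs at least one atom")
    runs = []
    start = previous = numbers[0]
    for n in numbers[1:]:
        if n == previous + 1:
            previous = n  # still inside a consecutive run
            continue
        runs.append((start, previous))
        start = previous = n
    runs.append((start, previous))
    return ",".join(
        f"a{first}" if first == last else f"a{first}-a{last}"
        for first, last in runs
    )


def atom_set_uri(base, atom_numbers):
    """Attach the fragment to a molecule identifier (for example an InChIKey)."""
    return f"{base}#{atom_set_fragment(atom_numbers)}"


print(atom_set_uri("InChIKey", range(50, 66)))
print(atom_set_uri("InChIKey", [60, 3, 50, 51, 52]))
```

Sorting and de-duplicating first means the same set of atoms always yields the same fragment, which is the reproducibility property the identifier needs.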
Can we reuse our positional nomenclature for residues?
  Residues are generally referred to by their absolute position in the biopolymer sequence, e.g. Pro @ X on Protein Y.
    InChIKey#a50-a65 owl:sameAs InChIKey#r5
    InChIKey#r5_a1-r5_a15 owl:sameAs InChIKey#r5
  A collection of residues might follow the same rules as a collection of atoms.
  Useful for defining domains, motifs, etc.
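The equivalence between a residue identifier and its atom range can be generated mechanically if the atom count of each residue is known. The Python sketch below assumes, purely for illustration, that atoms are numbered consecutively along the chain; canonical numbering from real software need not behave this way. The function names and the toy atom counts are hypothetical.

```python
def residue_atom_range(atoms_per_residue, position):
    """Return (first, last) absolute atom numbers for the residue at a 1-based position."""
    if not 1 <= position <= len(atoms_per_residue):
        raise IndexError("residue position out of range")
    first = sum(atoms_per_residue[: position - 1]) + 1
    last = first + atoms_per_residue[position - 1] - 1
    return first, last


def same_as_statements(base, atoms_per_residue, position):
    """Build the two owl:sameAs statements linking atom ranges to a residue id."""
    first, last = residue_atom_range(atoms_per_residue, position)
    count = atoms_per_residue[position - 1]
    residue = f"{base}#r{position}"
    return [
        f"{base}#a{first}-a{last} owl:sameAs {residue}",
        f"{base}#r{position}_a1-r{position}_a{count} owl:sameAs {residue}",
    ]


# Toy three-residue chain with 10, 12 and 9 atoms per residue.
toy_chain = [10, 12, 9]
statements = same_as_statements("InChIKey", toy_chain, 3)
for statement in statements:
    print(statement)
```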
An Alternative Scheme
  We already have a simplified representation for biopolymers...
  The canonical sequence is represented by a string of single-letter characters:
    DNA: ACGT
    RNA: ACGU
    Proteins: 20 amino acids (not B, J, O, U, X, Z)
  Modifications can be referred to with the ChEBI/PSI-MOD ontologies (e.g. prolyl hydroxylated residue @ 402).
  Each (modified) residue must have its InChI description so as to capture explicit structural deviations (de-protonation, etc.).
PSI-MOD contains modified residues with links to structural descriptions
But what if we have a modification that isn't contained in the ontology?
  No problem... define your own term, with the corresponding structural description (InChI, SMILES), and add it to an ontology document...
  If you're using OWL, you can add the import statement and publish it.
  And, of course, you should submit it to the appropriate ontology development teams (and later make your term equivalent to theirs).
While we're at it, we could extend our expressive capability to match that of OWL:
  Specification: exactly mod1@posX; only mod1@posX
  Minimum: at least [email_address]
  Combination: mod1@posX AND mod2@posY, X != Y
  Possibilities/uncertainty: (mod1 OR mod2)@posX
  Exclusion: not mod1@posX
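These constructs compose naturally as predicates over a description of a protein form. The Python sketch below is one possible reading, not the OWL semantics: a form is a dict from residue position to the set of modifications present there, and `has`, `only`, `all_of`, `any_of` and `negate` are hypothetical helper names. In particular, "only" is interpreted here as "this modification and no other at that position".

```python
from typing import Callable, Dict, Set

Form = Dict[int, Set[str]]          # residue position -> modifications present
Constraint = Callable[[Form], bool]


def has(mod: str, pos: int) -> Constraint:
    """mod@pos: the modification is present at the position."""
    return lambda form: mod in form.get(pos, set())


def only(mod: str, pos: int) -> Constraint:
    """Only mod@pos: the position carries this modification and nothing else."""
    return lambda form: form.get(pos, set()) == {mod}


def all_of(*constraints: Constraint) -> Constraint:
    """Combination (AND)."""
    return lambda form: all(c(form) for c in constraints)


def any_of(*constraints: Constraint) -> Constraint:
    """Possibilities/uncertainty (OR)."""
    return lambda form: any(c(form) for c in constraints)


def negate(constraint: Constraint) -> Constraint:
    """Exclusion (NOT)."""
    return lambda form: not constraint(form)


# Forms of HIF1α from the case study, described by their modifications only.
unmodified: Form = {}
hyd_402: Form = {402: {"hydroxylation"}}
hyd_both: Form = {402: {"hydroxylation"}, 564: {"hydroxylation"}}

hydroxylated = any_of(has("hydroxylation", 402), has("hydroxylation", 564))
hyd_402_not_564 = all_of(has("hydroxylation", 402), negate(has("hydroxylation", 564)))

for name, form in [("unmodified", unmodified), ("402", hyd_402), ("402+564", hyd_both)]:
    print(name, hydroxylated(form), hyd_402_not_564(form))
```

A specification written this way can select a whole family of forms, which is useful when the experimental evidence does not pin down a single structure.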
So what if...
  we describe the structural features of the molecule with OWL (sequence + PTMs), and generate an identifier from one of its serializations (RDF/XML?).
  That way we have the explicit description as the identifier, in a form that is compatible with the semantic web.
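The idea only works if the same structure always serializes to the same bytes. The Python sketch below illustrates that requirement with a stand-in: instead of RDF/XML it uses sorted, compact JSON as the canonical serialization and a SHA-256 digest as the identifier. The function name, the JSON layout and the toy sequence are all assumptions, not part of the proposal.

```python
import hashlib
import json


def structure_identifier(sequence, modifications):
    """Derive an identifier from a canonical serialization of sequence + PTMs.

    `modifications` is an iterable of (position, modification_name) pairs.
    """
    description = {
        "sequence": sequence,
        # Sort so that the order in which PTMs were listed does not matter.
        "modifications": sorted([pos, mod] for pos, mod in modifications),
    }
    canonical = json.dumps(description, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


toy_sequence = "MPLAP"  # toy peptide, not a real protein sequence
id_a = structure_identifier(toy_sequence, [(2, "hydroxylation"), (5, "hydroxylation")])
id_b = structure_identifier(toy_sequence, [(5, "hydroxylation"), (2, "hydroxylation")])
id_c = structure_identifier(toy_sequence, [(2, "hydroxylation")])
print(id_a == id_b, id_a == id_c)
```

Hashing gives a compact identifier but loses the "description as identifier" property, so in practice the serialized description would need to be published alongside the hash.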
UniProt example revisited
  Under normoxic conditions, HIF1α is hydroxylated on Pro-402 and Pro-564 in the oxygen-dependent degradation domain (ODD) by EGLN1/PHD2 and EGLN2/PHD1. The hydroxylated prolines promote interaction with VHL, initiating rapid ubiquitination and subsequent proteasomal degradation.
    :A rdfs:subClassOf :Hydroxylation
    :A hasParticipant (:0#r402 and :Substrate)
    :A hasParticipant (:1#r402 and :Product)
    :A hasParticipant (:5 and :Enzyme)
    :B rdfs:subClassOf :Interaction
    :B :hasParticipant (:2#r402 or :3#r564 or :4#r402,r564)
    :B :hasParticipant (:6)
  Legend:
    :1 (HIF1α)
    :2 (HIF1α + P402hyd)
    :3 (HIF1α + P564hyd)
    :4 (HIF1α + P402hyd + P564hyd)
    :5 (EGLN1)
    :6 (VHL)
  Please ignore the made-up short-hand syntax!
Inferring Protein Participation
  OWL role chain: hasParticipant o isPartOf -> hasParticipant
  If a process has the part as a participant, then the whole is also a participant.
  Asserted:
    :0#r402 :isPartOf :0
    :1#r402 :isPartOf :1
    :A rdfs:subClassOf :Hydroxylation
    :A hasParticipant (:0#r402 and :Substrate)
    :A hasParticipant (:1#r402 and :Product)
  Inferred:
    :A hasParticipant :0
    :A hasParticipant :1
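The effect of the role chain can be reproduced with a simple fixed-point loop over pairs. This Python sketch is not an OWL reasoner; it only applies the single rule "hasParticipant o isPartOf -> hasParticipant" to plain tuples, and the function name `infer_participants` is hypothetical.

```python
def infer_participants(has_participant, is_part_of):
    """Close hasParticipant under the chain hasParticipant o isPartOf.

    Both arguments are sets of (subject, object) pairs.
    """
    inferred = set(has_participant)
    changed = True
    while changed:  # repeat until no new pair is added (handles nested parts)
        changed = False
        for process, participant in list(inferred):
            for part, whole in is_part_of:
                if part == participant and (process, whole) not in inferred:
                    inferred.add((process, whole))
                    changed = True
    return inferred


# Facts from the slide, written as plain pairs.
asserted = {(":A", ":0#r402"), (":A", ":1#r402")}
parts = {(":0#r402", ":0"), (":1#r402", ":1")}

closure = infer_participants(asserted, parts)
for pair in sorted(closure - asserted):
    print(pair)
```

Because the loop runs to a fixed point, a residue that is part of a domain that is part of a protein would propagate participation to both the domain and the protein.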
Contextual, but non-structural considerations in identifier generation?
  Chemical? pH? Temperature? Environment (in vitro, in vivo, in silico)?
  Biological? Species? mRNA/gene from which it was transcribed/encoded?
  Indirect relationships? Point & multiple mutations? Alternative splice variants? Sequence similarity?
Summary
  We need a precise method to generate identifiers for biopolymers and arbitrary sets of their parts.
  Consistent identifier generation will allow anybody to report findings against the biopolymer form in which they were observed, whether or not that form exists in a database, and will allow us to link biochemical knowledge at finer levels of granularity.
  (At least) two identifier schemes were put forward to initiate discussion, with the goal of setting a standard naming convention.
dumontierlab.com
  [email_address]
  semanticscience.org
  Special thanks to PhD student Leonid Chepelev for insightful discussions

Accurate biochemical knowledge starting with precise structure-based criteria for molecular identity

  • 1. Accurate biochemical knowledge starting with precise structure-based criteria for molecular identity Michel Dumontier , Ph.D. Assistant Professor of Bioinformatics Department of Biology, School of Computer Science Institute of Biochemistry, Ottawa Institute of Systems Biology Carleton University 01/04/2009 NCBO Seminar Series::Michel Dumontier
  • 2. Problem Statement (I) Although biochemical events can be described with reference to specific chemical substances, we may want to describe them at finer/grainier levels of (mereological) granularity. residue : post translational modification collection of residues : motif/domain/interaction site atom : atomic interactions, catalytic mechanism collection of atoms : binding/catalytic site, interaction This requires identifiers for parts, regions (contiguous and non-contiguous), aggregates/complexes. However, we do not (AFAIK) have a precise ( reproducible ) methodology to automatically generate these! 01/04/2009 NCBO Seminar Series::Michel Dumontier
  • 3. Bio2RDF: 2.3B triples of SPARQL-accessible linked biological data! Chemical Parts!
  • 4. Case Study: HIF1 α Hypoxia-Inducible Factor 1, alpha chain (uniprot:Q16665) Master transcriptional regulator of the adaptive response to hypoxia Under normoxic conditions , HIF1 α is hydroxylated on Pro-402 and Pro-564 in the oxygen-dependent degradation domain (ODD) by EGLN1/PHD1 and EGLN2/PHD2. EGLN3/PHD3 has also been shown to hydroxylate Pro-564. The hydroxylated prolines promote interaction with VHL, initiating rapid ubiquitination and subsequent proteasomal degradation. Context Dependent Behavior Normoxic Conditions Hypoxic Conditions Multiple hydroxylations Part of a domain The part is the agent in the process Selective interaction with parts 01/04/2009 NCBO Seminar Series::Michel Dumontier
  • 5. Are these the same? HIF1 α – au naturel HIF1 α hydroxylated @P402 HIF1 α hydroxylated @P564 HIF1 α hydroxylated @P402 & @P564 HIF1 α hydroxylated @P402 & (@P564) ubiquitinated @Lys-532 HIF1 α L400A & L397A 01/04/2009 NCBO Seminar Series::Michel Dumontier
  • 6. NO!!!! These are structurally different Each exhibits distinct functionality! Yet most databases ( Uniprot / Genbank ) don’t have separate identifiers for them Reactome has an internal identifier for referring to different forms, but links to Uniprot entries and doesn’t provide an explicit description of the structure that it corresponds to! 01/04/2009 NCBO Seminar Series::Michel Dumontier
  • 7. So We have a clear need for being able to refer to distinct biochemical entities, based at least on their structure. We also need to refer to arbitrary structural parts. Should we generate all the combinations a priori???  NO!! Should we be able to automatically generate the identifier from the structural attributes? -> YES!!! Should we semantically annotate (manually or otherwise) those forms known to be involved in specific processes??? -> YES!!! What identifiers are unique for a given structure? 01/04/2009 NCBO Seminar Series::Michel Dumontier
  • 8. InChI IUPAC International Chemical Identifier (InChI) A data string that provides the structure of a chemical compound the convention for drawing the structure Different compounds must have different identifiers. Several attributes can be used to distinguish one compound from another. chemical graph (connection table) Formula Atom type (only some atoms explicit) Bond type Stereochemistry Mobile/fixed H-bonds (tautomers) Isotopic composition Atomic charge 01/04/2009 NCBO Seminar Series::Michel Dumontier
  • 9. (S)-Glutamic Acid InChI= {version}1 /{formula}C5H9NO4 /c{connections}6-3(5(9)10)1-2-4(7)8 /h{H_atoms}3H,1-2,6H2,(H,7,8)(H,9,10) /p{protons}+1 /t{stereo:sp3}3- /m{stereo:sp3:inverted}0 /s{stereo:type (1=abs, 2=rel, 3=rac)}1 /i{isotopic:atoms}4+1 01/04/2009 NCBO Seminar Series::Michel Dumontier
  • 10. More non-core info captured in “AuxInfo” string... AuxInfo= {version}1 /{normalization_type}1 /N:{original_atom_numbers}5,6,2,7,1,4,8,9,10,11 /E:{atom_equivalence}(7,8)(9,10) /it:{abs_stereo_inverted:sp3}im /I:{isotopic:original_atom_numbers} /E:{isotopic:atom_equivalence}m /rA:{reversibility:atoms}11nCCHN+CCC.i13OOOO /rB:{reversibility:bonds}s1;N2;P2;s2;s5;s6;s7;d7;d1;s1; /rC:{reversibility:xyz}6.1671,-19.3365,0;7.0125,-18.4864,0;6.4113,-17.4485,0;7.6089,-17.4485,0;7.8578,-19.3318,0;8.891,-18.7306,0;9.7363,-19.576,0;9.7316,-20.7735,0;10.8916,-19.266,0;5.0071,-19.0265,0;6.1624,-20.534,0; AuxInfo=1/1/N:5,6,2,7,1,4,8,9,10,11/E:(7,8)(9,10)/it:im/I:/E:m/rA:11nCCHN+CCC.i13OOOO/rB:s1;N2;P2;s2;s5;s6;s7;d7;d1;s1;/rC:6.1671,-19.3365,0;7.0125,-18.4864,0;6.4113,-17.4485,0;7.6089,-17.4485,0;7.8578,-19.3318,0;8.891,-18.7306,0;9.7363,-19.576,0;9.7316,-20.7735,0;10.8916,-19.266,0;5.0071,-19.0265,0;6.1624,-20.534,0; 01/04/2009 NCBO Seminar Series::Michel Dumontier
  • 11. So... InChi a really just a cryptic data identifier Clever software required to gradually build the chemical identifiers in a series of well-defined steps – normalization, canonicalization then serialization Humans can’t (easily) generate them nor can they easily understand them. But that’s OK. It’s not (user) extensible. But that’s OK. 01/04/2009 NCBO Seminar Series::Michel Dumontier
  • 12. Possible... but a 1000 residue protein would contain ~15,000 atoms on average.... OpenBabel seemed to struggle with anything over 100 residues Maybe needs some performance tweaking? Size of the string will be enormous We can use InChiKeys (SHA1 hash), but then we need to provide a you-submit-InChI , we-store-both and they-look-it-up service. Modularize InChI construction for (linear) polymers? Make InChi strings for each residue, and concatenate – rename the atoms according to the residue position We still need to translate the InChi string ... InCHI for Proteins??? 01/04/2009 NCBO Seminar Series::Michel Dumontier
  • 13. OpenBabel CML SDF O1[C@@H]([C@@H](O)([C@H](O)([C@@H](O)([C@@H]1(O)))))(CO) 79025 IUPAC InChI=1/C6H12O6/c7-1-2-3(8)4(9)5(10)6(11)12-2/h2-11H,1H2/t2-,3-,4+,5-,6+/m1/s1 InCHI α -D-Glucose 6-(hydroxymethyl)oxane-2,3,4,5-tetrol OR (2R,3R,4S,5R,6R)-6 -(hydroxymethyl)tetrahydro -2H-pyran-2,3,4,5-tetraol SMILES
  • 14. OWL Has Explicit Semantics Can therefore be used to capture knowledge in a machine understandable way 01/04/2009 NCBO Seminar Series::Michel Dumontier
  • 15. Chemical Ontology Chemical Knowledge for the Semantic Web. Mykola Konyk ,  Alexander De Leon , and  Michel Dumontier . LNBI . 2008. 5109:169-176.  Data Integration in the Life Sciences (DILS2008) . Evry. France. 
  • 17. Describing chemical functional groups in OWL-DL for the classification of chemical compounds hydroxyl group methyl group Knowledge of functional groups is important in chemical synthesis, pharmaceutical design and lead optimization. Functional groups describe chemical reactivity in terms of atoms and their connectivity, and exhibits characteristic chemical behavior when present in a compound. N Villanueva-Rosales, MDumontier. 2007. OWLED, Innsbruck, Austria. Ethanol 01/04/2009 NCBO Seminar Series::Michel Dumontier
  • 18. Describing Functional Groups in DL HydroxylGroup: CarbonGroup that (hasSingleBondWith some (OxygenAtom that hasSingleBondWith some HydrogenAtom) O H R R group 01/04/2009 NCBO Seminar Series::Michel Dumontier
  • 19. Fully Classified Ontology 35 FG 01/04/2009 NCBO Seminar Series::Michel Dumontier
  • 20. And, we define certain compounds Alcohol: OrganicCompound that (hasPart some HydroxylGroup) 01/04/2009 NCBO Seminar Series::Michel Dumontier
  • 21. Organic Compound Ontology 28 OC 01/04/2009 NCBO Seminar Series::Michel Dumontier
  • 22. Question Answering Query all annotations Query PubChem, DrugBank and dbPedia* * Requires import of relevant URIs 01/04/2009 NCBO Seminar Series::Michel Dumontier
  • 23. But... Molecules represented as individuals because OWL-DL only allows tree-like class descriptions No variable binding (e.g. ?x) ... no cyclic molecule/functional group descriptions at the class level  Boris Motik et al has a proposal for Description Graphs , Robert Stevens & Duncan Hull trying it out for chemical representation.... 01/04/2009 NCBO Seminar Series::Michel Dumontier
  • 24. Identifiers for Atoms Atom identifiers can be consistently retrieved from the OpenBabel model. Canonical numbering means we can reliably refer to a specific region rather than a (possibly degenerate) sub-graph match. In our plugin, URI component naming was based on the assigned molecule identifier e.g. pubchemid#aN, where N is the number Use InChiKey as base? e.g. InChiKey#aN 01/04/2009 NCBO Seminar Series::Michel Dumontier
  • 25. What about identifiers for collection of atoms? Potentially useful in describing residues, PTMs, binding sites, etc. Is the lack of connectivity sufficient? Contiguous: ranges (aN-aN) enumerations (aN,aN,aN) Non-contiguous: Combination of ranges, enumerations? 01/04/2009 NCBO Seminar Series::Michel Dumontier
  • 26. Can we reuse our positional nomenclature for residues? Residues are generally referred to by their absolute position in the biopolymer sequence. e.g. Pro @ X on Protein Y InChiKey#a50-a65 owl:sameAs InChiKey#r5 InChiKey#r5_a1-r5_a15 owl:sameAs InChiKey#r5 Collection of Residues might follow the same rules as a Collection of Atoms. Useful for defining domains, motifs, etc 01/04/2009 NCBO Seminar Series::Michel Dumontier
  • 27. We already have a simplified representation for biopolymers... Canonical sequence is represented by a string of single letter characters DNA: ACGT RNA: ACGU Proteins: 20 amino acids (not B,J,O,U,X,Z) Modifications can be referred to with ChEBI/PSI-MOD ontology (e.g. Prolyl hydroxylated residue @ 402) Each (modified) residue must have its InChi description so as to capture explicit structural deviations (de-protonation, etc) An Alternative Scheme 01/04/2009 NCBO Seminar Series::Michel Dumontier
  • 28. PSI-MOD contains modified residues with links to structural descriptions
  • 29. But what if we have a modification that isn’t contained in the ontology? No problem... define your own term, with the corresponding structural description (InChi, SMILES), and add it to an ontology document. If you’re using OWL, you can add the import statement and publish it. And, of course, you should submit it to the appropriate ontology development teams (and later make your term equivalent to theirs).
  • 30. While we’re at it, we could extend our expressive capability to match that of OWL. Specification: Exactly mod1@posX; Only mod1@posX. Minimum: At least mod1@posX. Combination: mod1@posX AND mod2@posY, X != Y. Possibilities/Uncertainty: (mod1 OR mod2) @posX. Exclusion: not mod1@posX
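One way to read these constructs is as predicates over a molecule's modification state. The Python sketch below is an interpretation, not the slide's semantics: the state is a dict from sequence position to the set of modifications at that position, "at least" is read as "this modification is present there", and "only" as "this is the sole modification at that position". "Exactly" is left out because its intended meaning (for instance a cardinality) is not spelled out. All names are hypothetical.

```python
def at_least(mod, pos):
    """Minimum: mod is present at pos; other modifications are allowed."""
    return lambda state: mod in state.get(pos, frozenset())


def only(mod, pos):
    """Only: mod is present at pos and nothing else is at that position."""
    return lambda state: set(state.get(pos, frozenset())) == {mod}


def all_of(*predicates):
    """Combination: every sub-constraint holds (AND)."""
    return lambda state: all(p(state) for p in predicates)


def any_of(*predicates):
    """Possibilities/uncertainty: at least one sub-constraint holds (OR)."""
    return lambda state: any(p(state) for p in predicates)


def negate(predicate):
    """Exclusion: the sub-constraint does not hold (NOT)."""
    return lambda state: not predicate(state)


if __name__ == "__main__":
    # Modification state taken from the HIF1 alpha example: both prolines hydroxylated.
    state = {402: {"hydroxylation"}, 564: {"hydroxylation"}}
    both = all_of(at_least("hydroxylation", 402), at_least("hydroxylation", 564))
    print(both(state))
```

Note that evaluating `negate` against a single known state is closed-world; OWL itself reasons under the open-world assumption, so an OWL class expression using negation would behave differently when information is merely absent.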
  • 31. So what if... we describe the structural features of the molecule with OWL (sequence + PTMs), and generate an identifier from one of its serializations (RDF/XML?). That way we have the explicit description as the identifier, in a form that is compatible with the semantic web.
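A rough Python sketch of deriving an identifier from a serialized description follows. It makes two substitutions that should be read as assumptions: canonical JSON stands in for an OWL serialization such as RDF/XML, and a SHA-256 digest of that text is used as the identifier, whereas the slide may intend the serialization itself to serve as the identifier. The function name and record layout are hypothetical.

```python
import hashlib
import json


def structure_identifier(sequence, modifications):
    """Derive a reproducible identifier from a sequence plus its modifications.

    modifications is an iterable of (position, modification_name) pairs.
    Sorting the pairs and the JSON keys makes the serialization independent
    of the order in which the modifications were listed.
    """
    description = {
        "sequence": sequence,
        "modifications": sorted([pos, mod] for pos, mod in modifications),
    }
    canonical = json.dumps(description, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # "MPK" is a placeholder sequence, not a real protein.
    a = structure_identifier("MPK", [(2, "hydroxylation"), (3, "ubiquitination")])
    b = structure_identifier("MPK", [(3, "ubiquitination"), (2, "hydroxylation")])
    print(a == b)
```

The design point is that the serialization must be canonical: two descriptions of the same structure have to serialize identically, otherwise the same molecule receives different identifiers. A digest also loses the "explicit description as the identifier" property unless the description is stored alongside it.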
  • 33. Uniprot example revisited Under normoxic conditions, HIF1 α is hydroxylated on Pro-402 and Pro-564 in the oxygen-dependent degradation domain (ODD) by EGLN1/PHD1 and EGLN2/PHD2. The hydroxylated prolines promote interaction with VHL, initiating rapid ubiquitination and subsequent proteasomal degradation. :A rdfs:subClassOf :Hydroxylation; :A hasParticipant (:0#r402 and :Substrate); :A hasParticipant (:1#r402 and :Product); :A hasParticipant (:5 and :Enzyme); :B rdfs:subClassOf :Interaction; :B :hasParticipant (:2#r402 or :3#r564 or :4#r402,r564); :B :hasParticipant (:6). :1 (HIF1 α); :2 (HIF1 α + P402hyd); :3 (HIF1 α + P564hyd); :4 (HIF1 α + P402hyd + P564hyd); :5 (EGLN1); :6 (VHL). Please ignore the made up short-hand syntax!
  • 34. Inferring Protein Participation OWL Role Chain: hasParticipant o isPartOf -> hasParticipant. If a process has the part as a participant, then the whole is also a participant. :0#r402 :isPartOf :0; :1#r402 :isPartOf :1; :A rdfs:subClassOf :Hydroxylation; :A hasParticipant (:0#r402 and :Substrate); :A hasParticipant (:1#r402 and :Product). Inferred: :A hasParticipant :0; :A hasParticipant :1
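The effect of the role chain can be imitated with a small fixed-point loop in plain Python. This is a toy over pairs of strings, not an OWL reasoner, and the function name is hypothetical; it only shows which extra `hasParticipant` statements the chain hasParticipant o isPartOf -> hasParticipant licenses.

```python
def infer_participants(has_participant, is_part_of):
    """Apply hasParticipant o isPartOf -> hasParticipant until nothing new appears.

    has_participant: set of (process, participant) pairs.
    is_part_of: set of (part, whole) pairs.
    Returns the asserted pairs plus every pair the chain entails.
    """
    inferred = set(has_participant)
    changed = True
    while changed:
        changed = False
        for process, thing in list(inferred):
            for part, whole in is_part_of:
                if part == thing and (process, whole) not in inferred:
                    inferred.add((process, whole))
                    changed = True
    return inferred


if __name__ == "__main__":
    # Statements from the slide, written in its shorthand.
    asserted = {("A", "0#r402"), ("A", "1#r402")}
    parts = {("0#r402", "0"), ("1#r402", "1")}
    for pair in sorted(infer_participants(asserted, parts)):
        print(pair)
```

Looping until no change means nested parthood (a residue in a domain in a protein) is also followed, which matches repeated application of the chain.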
  • 35. Contextual, but non-structural considerations in identifier generation? Chemical? pH? Temperature? Environment ( in vitro, in vivo, in silico )? Biological? Species? mRNA/Gene from which it was transcribed/encoded? Indirect Relationships? Point & Multiple Mutations? Alternative Splice Variants? Sequence Similarity?
  • 36. Summary We need a precise method to generate identifiers for biopolymers and arbitrary sets of their parts. Consistent identifier generation will allow anybody to specify findings according to the biopolymers for which they were observed, whether or not those biopolymers exist in a database, and will allow us to link biochemical knowledge at finer levels of granularity. (At least) two identifier schemes were put forward to initiate discussion, with the goal of setting a standard naming convention.
  • 37. dumontierlab.com [email_address] Special thanks to PhD Student Leonid Chepelev for insightful discussions  semanticscience.org