How am I supposed to organize a
 protein database when I can't
even organize my address book?
                         Jeremy Yang
                          UNM & IU



    CINF Flash session - ACS National Meeting, March 25, 2012 – San Diego, CA
Alternate title
 (and take home message):

Cheminformatics is so great!

  But is it too good to be
   (transferably) true?

                               2 / 17
How great is cheminformatics?

Example: Are these the same or different molecules?




                                                      3 / 17
How great is cheminformatics?

Example: Are these the same or different molecules?




 Answer: Same. That’s easy: just use a canonical graph algorithm
 via canonical SMILES:
     CNC1C(O)C(O)C(CO)OC1OC2C(OC(C)C2(O)C=O)OC7C(O)C(O)C(NC(=N)NCNC(=O)C4=C(O)C(C3CC6C(=C(O)C3(O)C4=O)C(=O)c5c(O)cccc5C6(C)O)N(C)C)C(O)C7NC(N)=N

                                  (TETRACYCLINOMETHYLSTREPTOMYCIN)
                                                                                                                                                   4 / 17
Thanks to…




?                ?

                     5 / 17
Thanks to…




Harry Morgan        Dave Weininger
Actor, “MASH”          Daylight
 (Hmmm…?)              (SMILES)
                                     6 / 17
Thanks to…




  Harry Morgan                         Dave Weininger
     ACS CAS                              Daylight
(Morgan Algorithm)                        (SMILES)
                     Et al., et al….                    7 / 17
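The idea behind the Morgan algorithm credited above can be sketched in a few lines. This is only a minimal illustration of the extended-connectivity ranking step, assuming an unlabeled graph (element types and bond orders are ignored); the function names `morgan_values` and `graph_invariant` are hypothetical, and the sketch is not the full canonical numbering used by CAS or by any SMILES toolkit. Equal invariants are necessary for two graphs to be the same, but not sufficient.

```python
from collections import defaultdict


def morgan_values(num_atoms, bonds):
    """Extended-connectivity values per atom (ranking step only).

    Start from each atom's degree, then repeatedly replace each value
    with the sum of its neighbors' values. Stop as soon as an iteration
    no longer increases the number of distinct values.
    """
    neighbors = defaultdict(list)
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    values = [len(neighbors[i]) for i in range(num_atoms)]
    n_classes = len(set(values))
    while True:
        new_values = [sum(values[j] for j in neighbors[i])
                      for i in range(num_atoms)]
        new_n_classes = len(set(new_values))
        if new_n_classes <= n_classes:
            return values
        values, n_classes = new_values, new_n_classes


def graph_invariant(num_atoms, bonds):
    """Numbering-independent summary: the sorted multiset of values."""
    return tuple(sorted(morgan_values(num_atoms, bonds)))


# The same branched five-atom skeleton, numbered two different ways.
branched_a = [(0, 1), (1, 2), (2, 3), (1, 4)]
branched_b = [(3, 2), (2, 0), (0, 4), (2, 1)]
# A straight five-atom chain, for contrast.
chain = [(0, 1), (1, 2), (2, 3), (3, 4)]

print(graph_invariant(5, branched_a) == graph_invariant(5, branched_b))
print(graph_invariant(5, branched_a) == graph_invariant(5, chain))
```

The two numberings of the branched skeleton give the same invariant, while the straight chain gives a different one. A real canonicalizer goes further and uses the ranks to derive a unique atom ordering.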
Now about those proteins…
  • Example: Are these the same or different proteins?

1YIN:
ALSLTADQMVSALLDAEPPILYSEYDPTRPFSEASMMGLLTNLADRELVHMINWAKRVPGFVDLTLHDQVHLLECAWLEI
LMIGLVWRSMEHPGKLLFAPNLLLDRNQGKCVEGMVEIFDMLLATSSRFRMMNLQGEEFVCLKSIILLNSGVYTFLSSTL
KSLEEKDHIHRVLDKITDTLIHLMAKAGLTLQQQHQRLAQLLLILSHIRHMSNKGMEHLYSMKCKNVVPLYDLLLEMLDA
HRLHAPTS




3OS8:
SNAKRSKKNSLALSLTADQMVSALLDAEPPILYSEYDPTRPFSEASMMGLLTNLADRELVHMINWAKRVPGFVDLTRHDQ
VHLLECAWLEILMIGLVWRSMEHPGKLLFAPNLLLDRNQGKCVEGMVEIFDMLLATSSRFRMMNLQGEEFVCLKSIILLN
SGVYTFLSSTLKSLEEKDHIHRVLDKITDTLIHLMAKAGLTLQQQHQRLAQLLLILSHIRHMSNKGMEHLYSMKCKNVVP
SYDLLLEMLDAHRLHAPT




  PAM250 alignment score:
  (gap: -3; extend: -10)

         1156 (1156/1260 = 92%)
                                                                             Ergo, um… Maybe.
                                                                                                8 / 17
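To make the "Maybe" concrete, here is a minimal sketch of a global alignment score. It swaps in simple match/mismatch scoring with a linear gap penalty in place of the PAM250 matrix and gap settings on the slide, so it will not reproduce the 1156 figure. Dividing by a self-alignment score is an assumption about how a ratio like 1156/1260 might be formed, and the function names are hypothetical.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,   # align the two residues
                            prev[j] + gap,        # gap in b
                            curr[j - 1] + gap))   # gap in a
        prev = curr
    return prev[-1]


def normalized_score(a, b, **scoring):
    """Score divided by the self-score of the shorter sequence (assumed).

    Both sequences must be non-empty, otherwise the self-score is zero.
    """
    shorter = min(a, b, key=len)
    return (global_alignment_score(a, b, **scoring)
            / global_alignment_score(shorter, shorter, **scoring))


# Twelve-residue fragments from the two sequences above; they differ
# at a single position (L in 1YIN, R in 3OS8).
frag_1yin = "VDLTLHDQVHLL"
frag_3os8 = "VDLTRHDQVHLL"

print(global_alignment_score(frag_1yin, frag_3os8))
print(round(normalized_score(frag_1yin, frag_3os8), 2))
```

Whatever scoring scheme is used, the result is a number on a continuum. Deciding where "same protein" ends still needs a threshold, which is exactly the judgment call the slide hesitates over.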
Now about those proteins…
• Example: Are these the same or different proteins?




Answer: Same… but what does that even mean?



                                                       9 / 17
Why protein identification is hard
• Proteins are large, complex, dynamic
• PDB is a database of crystallography
  experiments, not molecules
• Ligands, co-crystals, waters
• Protein crystallography & NMR are hard
• History, culture…



                                          10 / 17
How about human identification?
    (Should be easier, may shed light…)




                                          11 / 17
Human identification hard too,
        apparently…
                               Credit card fraud
Homeland security




                     http://forms.cybersource.com/forms/NAFRDQ12012whitepaperFraudReport2012CYBSwww2012
                                                               12 / 17
(Which brings us to…)
My address book problems




       How many Rob Yangs?




                             13 / 17
(Philosophical tangent:)
Are human entities actually identifiable?




   One Harry Morgan or two?
      How can we know?




                                              14 / 17
(Philosophical tangent:)
Are human entities actually identifiable?



                                     Individuality may be contextual.
   One Harry Morgan or two?
      How can we know?




                                                                        15 / 17
Could I organize my address book
    using cheminformatics?

   What would the algorithm look like?




                                         16 / 17
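One way to picture the question is a canonical-key approach borrowed from canonical SMILES: reduce each record to a normalized key and group records that share it. The sketch below is purely illustrative and not something proposed in the talk. The field names, the normalization rules and the choice of name plus email as the key are all assumptions.

```python
import re
from collections import defaultdict


def canonical_key(name, email=""):
    """Normalized key: sorted lowercase name tokens plus lowercase email."""
    tokens = sorted(re.findall(r"[a-z]+", name.lower()))
    return (" ".join(tokens), email.strip().lower())


def group_contacts(contacts):
    """Group contact records (dicts with 'name' and optional 'email')."""
    groups = defaultdict(list)
    for contact in contacts:
        key = canonical_key(contact["name"], contact.get("email", ""))
        groups[key].append(contact)
    return dict(groups)


# Hypothetical address-book entries.
contacts = [
    {"name": "Rob Yang", "email": "rob@example.org"},
    {"name": "Yang, Rob", "email": "ROB@example.org"},
    {"name": "Rob Yang", "email": "ryang@example.com"},
]

for key, records in group_contacts(contacts).items():
    print(key, len(records))
```

The first two entries collapse into one group, but the third stays separate: the key cannot say whether a second email address means a second Rob Yang. A molecular graph has an agreed definition of identity to canonicalize against; an address book does not.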
Conclusions
• “CINF” (cheminformatics) is awesome.
• But some CINF-awesomeness is not readily
  transferable to other domains.
• Cannot automate logic if not logical (How
  many Harry Morgans?).
• Perhaps CINF-awesomeness can be used as an
  indexing approach for chem-related domains.


              “Chester”
                                            17 / 17
