x-omics Data
                            Integration Challenges
                                    Dr. Michael Lappe, Ph.D.
                                Senior Bioinformatics Scientist -
                            Functional Genomics and Systems Biology

                                       CLCbio, Denmark
Thursday, February 14, 13
Michael’s Social Network (partial)




No more
                      cargo-cult




  http://en.wikipedia.org/wiki/Cargo_cult_science
  http://en.wikipedia.org/wiki/Cargo_cult
Form follows function


             http://www.youtube.com/watch?v=pQHX-SjgQvQ




                  Do not follow empty ancient rituals that do not serve a useful purpose anymore!
              Do NOT confuse the container with its content. Database systems are NOT the DATA!
Data Integration
               • involves combining data
               • residing in different sources and
               • providing users with a unified view [...]

               (combining research
               results from different
               bioinformatics repositories,
               for example)

               http://en.wikipedia.org/wiki/Data_integration
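
The definition above can be sketched in a few lines of Python. This is only an illustration, not something shown in the talk: the source names, field names, and values are hypothetical, and a real integration layer would also have to reconcile identifiers and formats.

```python
from collections import defaultdict


def unified_view(sources):
    """Combine records from several sources into one view keyed by identifier.

    `sources` maps a source name to {identifier: {field: value}}.
    Every field is prefixed with its source name so provenance is kept.
    """
    view = defaultdict(dict)
    for source_name, records in sources.items():
        for identifier, fields in records.items():
            for field_name, value in fields.items():
                view[identifier][f"{source_name}.{field_name}"] = value
    return dict(view)


# Hypothetical repositories and values, for illustration only.
sources = {
    "structure_repo": {"ERBB2": {"has_model": True}},
    "expression_repo": {
        "ERBB2": {"assay": "RNA-seq"},
        "MLH1": {"assay": "RNA-seq"},
    },
}
view = unified_view(sources)
```

Prefixing each field with its source is one possible design choice; it keeps the unified view from hiding where a value came from.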

      Different Levels of Resolution

                                       •   Ecosystem

                                       •   Population

                                       •   Organism

                                       •   Organ

                                       •   Tissue

                                       •   Cell

                                       •   Organelle

                                       •   Complexes

                                       •   Assemblies

                                       •   Molecule

                                       •   Atoms
                                       www.sciencephoto.com
Different experimental sources




  Kühner et al. “Proteome organization in a genome-reduced bacterium.”
  Science (2009) vol. 326 (5957) pp. 1235
www.abcam.com/cancer




               What are the typical mechanisms at the structural level
                  that cause the de/activation of cancer genes?




 Henning Stehr*, Seon-Hi J. Jang*, Jose M. Duarte, Christoph Wierling, Hans Lehrach, Michael Lappe, Bodo M.H. Lange
(2011) "The structural impact of cancer-associated mutations in oncogenes and tumor suppressors" Molecular Cancer
Mapping mutations to
                            (modelled) structures
                             ERBB2            MLH1




Structural Analysis
                     surface vs. core - binding site - stability - clustering ...




ERBB2                                                                               MLH1

A simple yet robust classification
                            IN-




• Oncogenes (e.g. ERBB2): activating gain-of-function mutations
  (surface, near functional/binding sites)
• Tumor-suppressor genes (e.g. MLH1): de-activating loss-of-function mutations
  (in the core, destabilising the structure)
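
The two categories can be written down as a toy rule of thumb. This is a simplified sketch for illustration and not the classifier from Stehr et al.; the function name, the boolean inputs, and the returned labels are hypothetical.

```python
def classify_mutation_site(is_buried, near_functional_site):
    """Toy rule following the two categories on the slide.

    is_buried: True if the mutated residue sits in the protein core.
    near_functional_site: True if it lies close to a functional/binding site.
    Both inputs are assumed to come from a prior structural analysis.
    """
    if is_buried:
        # Core mutations tend to destabilise the structure.
        return "tumor-suppressor-like (loss-of-function)"
    if near_functional_site:
        # Surface mutations near functional sites can alter activity.
        return "oncogene-like (gain-of-function)"
    return "unclassified"
```

A surface mutation near a binding site falls into the first category, a core mutation into the second, and everything else is left unclassified rather than forced into either class.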

biological Networks -

         getting to grips
         with COMPLEXITY


                             Complex (biological) Systems as
                            Networks of Interacting Elements.


                             Life is a graph! G=(V, E)

         [Diagram: a Graph records Nodes (Vertices) and Relationships (Edges);
          Nodes are organized by Relationships; both Nodes and Relationships have Properties.]
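
A minimal in-memory sketch of such a property graph, assuming nothing beyond the slide's G=(V, E) picture: nodes and relationships are records, and both carry properties. The class names, the relationship type, and the example property values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    source: str
    target: str
    rel_type: str
    properties: dict = field(default_factory=dict)


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)          # V, keyed by node id
    relationships: list = field(default_factory=list)  # E

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, properties)

    def add_relationship(self, source, target, rel_type, **properties):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must already be nodes of the graph")
        self.relationships.append(Relationship(source, target, rel_type, properties))

    def neighbours(self, node_id):
        """Ids of nodes reachable over one outgoing relationship."""
        return [r.target for r in self.relationships if r.source == node_id]


# Illustrative content only.
g = Graph()
g.add_node("ERBB2", kind="gene")
g.add_node("disease_1", kind="disease")
g.add_relationship("ERBB2", "disease_1", "ASSOCIATED_WITH", evidence="hypothetical")
```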

The human disease network.
Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabási AL.
Proc Natl Acad Sci U S A.
2007 May 22;104(21):8685-90.




Graph Databases
           Think of Graphs not as a visualization
           but as a DATA STRUCTURE




           http://en.wikipedia.org/wiki/Graph_database
           http://nosql-database.org/
           http://www.neo4j.org/learn#graphs
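
Treating a graph as a data structure means questions become traversals rather than pictures. As a hedged sketch in plain Python (not the query API of any graph database), a breadth-first search over an adjacency mapping finds a shortest path between two nodes. The node names are placeholders.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns a list of nodes or None if unreachable."""
    if start == goal:
        return [start]
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        for neighbour in adjacency.get(path[-1], ()):
            if neighbour in visited:
                continue
            if neighbour == goal:
                return path + [neighbour]
            visited.add(neighbour)
            queue.append(path + [neighbour])
    return None


# Placeholder network: A-B, A-C, B-D; E is isolated.
adjacency = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"], "E": []}
path = shortest_path(adjacency, "A", "D")
```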


Proteins as ResidueInteractionGraphs - capturing dynamics

1a1m - (Cα 8 Å)
Anisotropic Network Model, eigen-mode 3

1a1m (X-ray)
1jnj (NMR, 20 models)

"oGNM: A protein dynamics online calculation engine using the Gaussian Network Model" Yang, L.-W.,
Rader, A.J., Liu, X., Jursa, C.J., Chen S.C., Karimi, H, Bahar, I. Nucleic Acids Res, 34, W24-31, 2006
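
A residue interaction graph of this kind can be sketched as follows, assuming the 8 on the slide is a Cα-Cα distance cutoff in ångström. The coordinates are made up (four points on a line, 3.8 Å apart), and the function name is hypothetical. Elastic network models such as the one cited start from a contact graph like this, but the sketch stops at the graph and does not compute any modes.

```python
import math
from itertools import combinations


def residue_interaction_graph(ca_coords, cutoff=8.0):
    """Edges (i, j), i < j, between residues whose C-alpha atoms lie within `cutoff`."""
    edges = set()
    for (i, a), (j, b) in combinations(enumerate(ca_coords), 2):
        if math.dist(a, b) <= cutoff:
            edges.add((i, j))
    return edges


# Made-up coordinates: residues spaced 3.8 units apart along the x axis.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
edges = residue_interaction_graph(coords)
```

With these coordinates every pair is in contact except residues 0 and 3, which are 11.4 units apart.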
Geometry & Structure




PDB: 1KX5                                   http://vimeo.com/24047115
   S.Daujat, T. Weiss, F.Mohn, U.C.Lange, C.Ziegler-Birling, U.Zeissler, M.Lappe, D.Schubeler, M.E.Torres-Padilla, R.Schneider (2009). "H3K64 trimethylation
             marks heterochromatin and is dynamically remodeled during developmental reprogramming" Nature Structural and Molecular Biology
x-omics =
                Proteomics
                Metabolomics
                Regulation
                [...] +
                x-Seq Data (x = ChIP, RNA, BS, ...)
some challenges ...
                            different experiments, protocols, samples, coverage ...




                            isolated information silos
                            different data formats
                            mapping & identifier chaos
                            error propagation / annotation bottleneck
                            statistical criteria for (dis-)similarity
                            knowledge lock-up, literature access
                            redundancy / implicit co-ordination
                            TMI & essential info ?
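
The "mapping & identifier chaos" item can be made concrete with a small sketch. It assumes a cross-reference table is already available as a dictionary; the identifiers and the function name are hypothetical. The point is that ambiguous and unmapped identifiers are reported instead of being silently dropped, since silent loss is one way errors propagate.

```python
def map_identifiers(ids, xref):
    """Map identifiers through a cross-reference table.

    xref: {source_id: set of target ids}.
    Returns (mapped, ambiguous, unmapped):
      mapped    - {source_id: target_id} for one-to-one hits
      ambiguous - {source_id: sorted list of targets} for one-to-many hits
      unmapped  - list of source ids with no hit
    """
    mapped, ambiguous, unmapped = {}, {}, []
    for identifier in ids:
        targets = xref.get(identifier, set())
        if len(targets) == 1:
            mapped[identifier] = next(iter(targets))
        elif targets:
            ambiguous[identifier] = sorted(targets)
        else:
            unmapped.append(identifier)
    return mapped, ambiguous, unmapped


# Hypothetical identifiers in two made-up namespaces.
xref = {"srcA:1": {"srcB:10"}, "srcA:2": {"srcB:20", "srcB:21"}}
mapped, ambiguous, unmapped = map_identifiers(["srcA:1", "srcA:2", "srcA:3"], xref)
```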
"Blind monks examining an elephant" by Itcho Hanabusa
    Titled "Blind Men Appraising an Elephant" (衆瞽探象之圖); a work by Hanabusa Itchō (英一蝶, 1652 – 1724).

Let’s move on ...
http://5stardata.info/

                                   5★ Open Data

                   Tim Berners-Lee, the inventor of the Web and Linked Data initiator,
                         suggested a 5 star deployment scheme for Open Data.

                  ★ make your stuff available on the Web (whatever format) under an Open License
                  ★ ★ make it available as structured data (machine REadable, e.g. Excel*)
                  ★ ★ ★ use non-proprietary Open Formats (e.g. CSV instead of Excel)
                  ★ ★ ★ ★ use URIs to denote things, so that people can point at your stuff
                  ★ ★ ★ ★ ★ Link your Data to other data to provide (networked)

                  * http://dontuseexcel.wordpress.com/2013/02/07/dont-use-excel-for-biological-data/
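
The fourth and fifth stars (URIs and links) can be illustrated with plain subject-predicate-object triples. This is a sketch under stated assumptions: the `example.org` URIs, the data, and the helper names are hypothetical, and only the `owl:sameAs` URI is a real vocabulary term. A production setup would more likely use an RDF library and a triple store.

```python
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def objects(triples, subject, predicate):
    """All objects o for which (subject, predicate, o) is in the triples."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def linked_values(triples, subject, predicate):
    """Look up `predicate` on `subject` and on anything it is declared sameAs (one hop)."""
    candidates = [subject] + objects(triples, subject, SAME_AS)
    values = []
    for candidate in candidates:
        values.extend(objects(triples, candidate, predicate))
    return values


# Hypothetical namespaces: our own data and somebody else's repository.
MINE = "http://example.org/my-lab/"
THEIRS = "http://example.org/other-repository/"
triples = [
    (MINE + "sample/1", MINE + "measures", MINE + "gene/X"),
    (MINE + "gene/X", SAME_AS, THEIRS + "gene/0001"),
    (THEIRS + "gene/0001", THEIRS + "label", "hypothetical gene X"),
]
labels = linked_values(triples, MINE + "gene/X", THEIRS + "label")
```

The single `sameAs` triple is the link: it lets a query about the local gene pick up the label that only the other repository provides.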
Giant Global Graph
     An important related concept that overlaps with the GGG is the
     "Semantic Web", which relates to decentralized information (≄ Web 3.0).




The next Web of open, linked data:
                       Tim Berners-Lee on TED.com
                            http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html

         http://www.ted.com/talks/tim_berners_lee_the_year_open_data_went_worldwide.html




Web of biological Data




                            linked open scientific data grass-roots movement

Protein Interaction Networks: scale-free, small-world

 Park, J., M. Lappe, et al. (2001). "Mapping protein family interactions: intramolecular and intermolecular
 protein family interaction repertoires in the PDB and yeast." Journal of Molecular Biology 307(3): 929-38
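
"Scale-free" refers to a heavy-tailed degree distribution (a few hubs, many low-degree nodes), and "small-world" to short path lengths combined with high local clustering. The sketch below computes the two basic ingredients, a degree distribution and a local clustering coefficient, on a made-up toy network. It is not the analysis from the cited paper, and a five-node example cannot demonstrate either property.

```python
from collections import Counter
from itertools import combinations


def degree_distribution(adjacency):
    """Counter mapping degree -> number of nodes with that degree."""
    return Counter(len(neighbours) for neighbours in adjacency.values())


def clustering_coefficient(adjacency, node):
    """Fraction of pairs of `node`'s neighbours that are themselves connected."""
    neighbours = adjacency[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adjacency[a])
    return 2.0 * links / (k * (k - 1))


# Toy undirected network: hub H with four partners, plus one A-B interaction.
adjacency = {
    "H": {"A", "B", "C", "D"},
    "A": {"H", "B"},
    "B": {"H", "A"},
    "C": {"H"},
    "D": {"H"},
}
degrees = degree_distribution(adjacency)
hub_clustering = clustering_coefficient(adjacency, "H")
```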
modelling information gain:
     Tandem-Affinity
  Purifications in-silico




                              Michael Lappe and Liisa Holm
                              "Unraveling protein interaction networks
                              with near-optimal efficiency." (2004)
                              Nature Biotechnology 22(1): 98-103
Toward interoperable bioscience data
  Susanna-Assunta Sansone et al., Nature Genetics, Feb 2012

                                             “to make full use of research data, the
                                             bioscience community needs to adopt
                                             technologies and reward mechanisms that
                                             support interoperability and promote the
                                             growth of an open ‘data commoning’ culture.”

                                             The open source ISA metadata tracking tools
                                             facilitate standards-compliant collection,
                                             curation, local management and reuse of
                                             datasets in an increasingly diverse set of life
                                             science domains.
                                             http://www.isa-tools.org/

                                             http://www.nature.com/ng/journal/v44/n2/pdf/ng.1054.pdf
Free your data ...
                            Biology and Bioinformatics are data-driven sciences

                            think beyond your own hard drive and the current paper
                            evaluate and embrace new technologies (LOD, GraphDBs)
                            rethink current incentive systems: no more cargo-cult

                            make it useful, reusable
                            and sustainable

                            Open Access, Open Source
                            Open Linked Data Mash-Ups

                            focus on your science
Thank you!




                                wood engraving by an unknown artist, in “L'atmosphère:
                                 météorologie populaire” (1888) Camille Flammarion




Hubble Space Telescope / NASA