Integrating Public and Private Data:
Lessons Learned from Unison
http://unison-db.org/
Online access, download, documentation, references.

Reece Hart
Genentech, Inc.

Molecular Medicine Tri-Conference
February 26, 2009
San Francisco, CA
Updates available at http://harts.net/reece/pubs/
A Bestiary of Life Sciences Data Types
➢   Genomics: assemblies, transcripts, probes, transcription factors, expression, SNPs, haplotypes
➢   Proteomics: sequences, domains, PTMs, localization, structure, orthology, predictions
➢   Chemistry: compounds, HCS, HTS, properties
➢   Networks: interactions, pathways
➢   LIMS: animal records, protocols, request systems, personnel, samples
➢   Communications: literature, patents, and presentations
➢   Clinical: assays, protocols, patient records, samples
➢   Annotation: GO, taxonomy, SCOP, disease, OMIM
Types of Integration

➢   Semantic Integration
     ●   Integrates fundamentally distinct data types.
     ●   Improves contextual understanding of data.

➢   Source Aggregation
     ●   Aggregates data of the same type from multiple
         sources. e.g., in-house sequences with external
         ones.
     ●   Ensures completeness of data.
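
A minimal sketch of source aggregation, assuming sequences are keyed by a digest of their normalized text so that identical sequences from different origins collapse to one record. The function name, the origin labels, and the use of MD5 are illustrative assumptions, not a description of how Unison does it.

```python
import hashlib
from collections import defaultdict


def aggregate(sources):
    """Merge {origin: {accession: sequence}} into a non-redundant set.

    Returns (seqs, aliases): seqs maps digest -> sequence, and aliases maps
    digest -> sorted list of (origin, accession) pairs that supplied it.
    """
    seqs = {}
    aliases = defaultdict(set)
    for origin, records in sources.items():
        for accession, seq in records.items():
            # Normalize so that case and whitespace do not create false duplicates.
            norm = "".join(seq.split()).upper()
            key = hashlib.md5(norm.encode("ascii")).hexdigest()
            seqs[key] = norm
            aliases[key].add((origin, accession))
    return seqs, {k: sorted(v) for k, v in aliases.items()}


# Hypothetical origins and accessions, for illustration only.
sources = {
    "public_db": {"ACC1": "MSTESMIR", "ACC2": "VRSSSRTP"},
    "in_house": {"LAB7": "mstesmir"},  # same sequence as ACC1, different origin
}
seqs, aliases = aggregate(sources)
print(len(seqs), "unique sequences")
```

Keeping every (origin, accession) pair alongside the unique sequence is what lets an in-house record and an external record point at the same entry without losing either name.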




A Survey of Integration Methods

➢   Presentation
     ●   Link Integration: hypertext links between sites.
     ●   Mashups: AJAX, iframe.

➢   Middle Tier
     ●   Server Mashups.

➢   Database Integration (Federation / Warehouse)
     ●   Federation (F).
     ●   Warehouse (W).

➢   Source Databases or Files
     ●   A, B, C.

For review, see:
Goble C, Stevens R. J Biomed Inform. 2008 Oct;41(5):687-93.
Why is Integration Difficult?

➢   Establishing semantic equivalences and
    relationships is difficult.

➢   Source databases are updated often.
     ●   Volume and frequency of updates are challenging.

➢   Source databases have dynamic structure.




Benefit Lessons

➢   Integrate to enable reasoning based on a
    corpus of data of multiple types and/or
    from multiple origins.
     ●   To analyze biological data in broad context.
     ●   To generate hypotheses by data mining.
     ●   To enable business decisions based on a holistic
         view of decision criteria.

➢   Ancillary benefits:
     ●   Data preparation is hard. Centralization means
         that questions get asked and asked efficiently.
     ●   Integrated data provides a consistent foundation
         on which others can build.
     ●   Integration improves currency.

Unison in a Nutshell




➢   Core: Protein Sequences and Annotations
➢   Linked data:
     ●   Domain, Structure & Homology Predictions
     ●   Structures & Ligands
     ●   Genomes, Gene Mapping & Structure, Probes
     ●   Auxiliary Annotations: GO, RIF, SCOP, etc.

➢   Sequences and Annotations
     ●   UniProt, IPI, Ensembl, RefSeq, PDB, PHANTOM, HUGE, ROUGE, MGC,
         Derwent, pataa, nr, etc.
     ●   >13M seqs, >17k species, 69 origins
➢   Auxiliary Data
     ●   HomoloGene, Gene Ontology, taxonomy, PDB, HUGO, SCOP, etc.
➢   Precomputed predictions
     ●   Domains, homology, structure, TMs, localization, signals, disorder, etc.
     ●   >200M predictions, 23 types, ~6 CPU-years
Analysis and Data Mining Have Distinct Needs.
➢   sequences (Source Integration)
     ●   Non-redundant superset of all sequences.
➢   feature types/models (Semantic Integration)
     ●   HMM, TM, signal, etc.
➢   Prediction results
     ●   Method-specific data such as score, e-value, p-value, kinase
         probability, etc.
➢   parameters
     ●   Execution arguments/options for every prediction type and result.

➢   Sequence Analysis
     ●   i.e., show predictions for a given sequence.
     ●   Typically involves minutes to hours of computing per sequence.
➢   Feature-Based Mining
     ●   i.e., show sequences that contain specified features.
     ●   Typically entails days to months of computing results.
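
The two access patterns above can be served by one store of precomputed results. The sketch below uses an in-memory SQLite database for illustration; the table names, column names, and rows are invented and are not Unison's actual schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE pseq   (pseq_id INTEGER PRIMARY KEY, seq TEXT NOT NULL UNIQUE);
CREATE TABLE params (params_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE, args TEXT);
CREATE TABLE feature (
    pseq_id   INTEGER NOT NULL REFERENCES pseq(pseq_id),
    params_id INTEGER NOT NULL REFERENCES params(params_id),
    ftype     TEXT    NOT NULL,
    start     INTEGER NOT NULL,
    stop      INTEGER NOT NULL,
    score     REAL
);
-- One index per access pattern.
CREATE INDEX feature_by_seq  ON feature(pseq_id);
CREATE INDEX feature_by_type ON feature(ftype);
""")

# Invented toy rows.
con.executemany("INSERT INTO pseq VALUES (?, ?)",
                [(1, "MKTAYIAK"), (2, "MSTESMIR")])
con.executemany("INSERT INTO params VALUES (?, ?, ?)",
                [(1, "toy-hmm", "--cutoff 1e-2"), (2, "toy-tm", "")])
con.executemany("INSERT INTO feature VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1, "HMM", 2, 6, 31.0),
    (1, 2, "TM", 3, 8, None),
    (2, 2, "TM", 1, 5, None),
])


def features_of(pseq_id):
    """Sequence analysis: every stored prediction for one sequence."""
    return con.execute(
        "SELECT ftype, start, stop FROM feature WHERE pseq_id = ? ORDER BY start",
        (pseq_id,)).fetchall()


def sequences_with(ftype):
    """Feature-based mining: every sequence carrying a given feature type."""
    return [row[0] for row in con.execute(
        "SELECT DISTINCT pseq_id FROM feature WHERE ftype = ? ORDER BY pseq_id",
        (ftype,))]


print(features_of(1))
print(sequences_with("TM"))
```

Because results are stored once with a reference to the parameters that produced them, both questions become lookups rather than fresh computation.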
Unison has many applications.
➢   Applications
     ●   Unison Web Tools
     ●   Other In-House Tools
     ●   Ad Hoc Mining: mining and analysis projects

➢   Core: Protein Sequences and Annotations
➢   Linked data:
     ●   Domain, Structure & Homology Predictions
     ●   Structures & Ligands
     ●   Genomes, Gene Mapping & Structure, Probes
     ●   Auxiliary Annotations: GO, RIF, SCOP, etc.

➢   Sequences and Annotations
     ●   UniProt, IPI, Ensembl, RefSeq, PDB, STRING, PHANTOM, HUGE, ROUGE,
         MGC, Derwent, pataa, nr, etc.
     ●   >13M seqs, >17k species, 69 origins
➢   Auxiliary Data
     ●   HomoloGene, Gene Ontology, taxonomy, PDB, HUGO, SCOP, etc.
➢   Precomputed predictions
     ●   Domains, homology, structure, TMs, localization, signals, disorder, etc.
     ●   >200M predictions, 23 types, ~6 CPU-years
Mining for ITIMs the old way.

                        Ig            TM         ITIM



 ➢   Collect sequences.
 ➢   Prune redundant sequences. (How?!)
 ➢   For each unique sequence, predict
      ●   Immunoglobulin domains.
      ●   Transmembrane domains.
      ●   ITIM domains.
 ➢   Write a program that filters predictions.
 ➢   Summarize hits with external data.
 ➢   Do it again when source data are updated.



For review, see: Daëron M. Immunol Rev. 2008 Aug;224:11-43.
Mining for ITIMs the Unison way.

                             Ig                   TM             ITIM
SELECT IG.pseq_id,
       IG.start AS ig_start, IG.stop AS ig_stop, IG.score, IG.eval,
       TM.start AS tm_start, TM.stop AS tm_stop,
       ITIM.start AS itim_start, ITIM.stop AS itim_stop
  FROM pahmm_current_pfam_v IG
  -- transmembrane segment that starts after the Ig domain ends
  JOIN pftmhmm_tms_v TM ON IG.pseq_id = TM.pseq_id AND IG.stop < TM.start
  -- ITIM motif that starts after the transmembrane segment ends
  JOIN pfregexp_v ITIM ON TM.pseq_id = ITIM.pseq_id AND TM.stop < ITIM.start
 WHERE IG.name = 'ig' AND IG.eval < 1e-2
   AND ITIM.acc = 'MOD_TYR_ITIM';

 pseq_id  ig_start  ig_stop  score      eval  tm_start  tm_stop  itim_start  itim_stop  best_annotation
     234       262      316     30  7.40E-06       440      462         518        523  UniProtKB/Swiss-Prot:SIGL5_HUMAN (RecName: Fu
     254       158      213     36  1.90E-07       284      306         386        391  UniProtKB/Swiss-Prot:VSIG4_HUMAN (RecName: F
     544       157      215     24  6.60E-04       348      370         431        436  UniProtKB/Swiss-Prot:SIGL9_HUMAN (RecName: Fu
     797       254      312     40  7.60E-09      1099     1121        1361       1366  UniProtKB/Swiss-Prot:DCC_HUMAN (RecName: Ful
    1113        42      102     30  1.20E-05       243      265         300        305  UniProtKB/Swiss-Prot:KI2L2_HUMAN (RecName: Fu
    1114        42      102     30  6.50E-06       243      265         330        335  UniProtKB/Swiss-Prot:KI2L1_HUMAN (RecName: Fu
    1115        42      102     31  4.20E-06       243      265         301        306  UniProtKB/Swiss-Prot:KI2L3_HUMAN (RecName: Fu
    1116        42       97     30  1.10E-05       339      361         396        401  UniProtKB/TrEMBL:Q95368_HUMAN (SubName: Fu
    1134       340      388     26  1.40E-04       603      625         688        693  UniProtKB/Swiss-Prot:PECA1_HUMAN (RecName: F
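
The ordering logic of the query (Ig domain, then TM segment, then ITIM) can be tried on toy data. The sketch below runs the same joins against an in-memory SQLite database rather than the PostgreSQL backend. The three tables stand in for Unison's views and carry only the columns the query touches, which is an assumption about their shape. The row for pseq_id 234 is taken from the result table above; the rows for 9001 and 9002 are invented decoys.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE pahmm_current_pfam_v (pseq_id INTEGER, name TEXT,
                                   start INTEGER, stop INTEGER,
                                   score REAL, eval REAL);
CREATE TABLE pftmhmm_tms_v (pseq_id INTEGER, start INTEGER, stop INTEGER);
CREATE TABLE pfregexp_v    (pseq_id INTEGER, acc TEXT, start INTEGER, stop INTEGER);
""")

con.executemany("INSERT INTO pahmm_current_pfam_v VALUES (?, ?, ?, ?, ?, ?)", [
    (234,  "ig", 262, 316, 30, 7.4e-06),  # values from the result table above
    (9001, "ig", 10,  60,  25, 1e-05),    # invented: its ITIM precedes its TM
    (9002, "ig", 10,  60,  5,  0.5),      # invented: e-value fails the cutoff
])
con.executemany("INSERT INTO pftmhmm_tms_v VALUES (?, ?, ?)", [
    (234, 440, 462), (9001, 100, 122), (9002, 100, 122),
])
con.executemany("INSERT INTO pfregexp_v VALUES (?, ?, ?, ?)", [
    (234,  "MOD_TYR_ITIM", 518, 523),
    (9001, "MOD_TYR_ITIM", 70,  75),
    (9002, "MOD_TYR_ITIM", 150, 155),
])

QUERY = """
SELECT IG.pseq_id,
       IG.start AS ig_start, IG.stop AS ig_stop, IG.score, IG.eval,
       TM.start AS tm_start, TM.stop AS tm_stop,
       ITIM.start AS itim_start, ITIM.stop AS itim_stop
  FROM pahmm_current_pfam_v IG
  JOIN pftmhmm_tms_v TM ON IG.pseq_id = TM.pseq_id AND IG.stop < TM.start
  JOIN pfregexp_v ITIM ON TM.pseq_id = ITIM.pseq_id AND TM.stop < ITIM.start
 WHERE IG.name = 'ig' AND IG.eval < 1e-2
   AND ITIM.acc = 'MOD_TYR_ITIM'
"""
rows = con.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Only the sequence whose features appear in the required N-to-C order and whose Ig hit passes the e-value cutoff survives the joins.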
Data Integration Led to Bcl-2 Discoveries.



Custom model building + sequences, models, HMM alignments, automation

⇒

Z'fish     Source Database           Human
Protein    and Accession             Protein   E-value    Score   % Ide   % Coverage
Bax        RefSeq:NP_571637          BAX       2.00E-47   189     51      98
Bax2       E35:ENSDARP00000040899    BAX       1.00E-14   81      33      51
Bik        UP:Q5RGV6_BRARE           BIK       1.41E+04   20      47      12
Bmf        RefSeq:NP_001038689       BMF       1.00E-05   50      32      91
Bmf2       E35:FGENESH00000082230    BMF       1.10E-02   42      41      42
BBC3       E35:FGENESH00000078270    PUMA      2.10E+01   30      25      49

⇒ 4 novel Bcl-2 proteins in zebrafish

Kratz et al., Cell Death Differ. (2006).
Unison Web Tools




Unison is a platform for diverse tools.




                                    Matt Brauer
                                    Guy Cavet
                                    Josh Kaminker
                                    Scott Lohr
                                    Kathryn Woods
                                    Jean Yuan
                                    Peng Yue
Design Lessons

➢   Know what data to integrate, how they'll
    be used, and the converse.

➢   Integrate on simple, intuitively meaningful
    abstract concepts.
     ●   Precise definitions are critical.
     ●   Represent proprietary data elsewhere, if needed.

➢   Design for Integrity.
     ●   Reliability is everything.

➢   Aggregate on data types.
     ●   Corollary: Partitioning on content makes data
         silos.
Unison Contents
➢   patents
     ●   Geneseq:AAP60074, 1991-10-29, SUNTORY
     ●   EP205038-A; New tumour...
➢   HUGO
     ●   TNFSF9, TNFSF10, TNFSF11
➢   homologs
     ●   NP_000585.2 NP_036807.1 | RAT
     ●   NP_000585.2 NP_038721.1 | MOUSE
     ●   NP_000585.2 XP_858423.1 | CANFA
➢   GO
     ●   Function: transcription (initiation, elongation)
➢   SNPs
     ●   P84L, A94T
➢   aliases
     ●   TNFA_HUMAN, Q1XHZ6, IPI00001671.1, INCY:1109711.FL1p, CCDS4702.1,
         gi:25952111
➢   Entrez
     ●   gene_id, symbol, locus
➢   sequences
     ●   >Unison:98 MSTESMIRDVE...FGIIAL
     ●   >Unison:23782 VRSSSRTPSD...FGIIAL
➢   protein features
     ●     1 |  23 |         | SS
     ●   108 | 143 | 1.8e-06 | EGF
     ●   162 | 184 |         | TM
     ●   133 | 138 |         | ITIM
➢   taxonomy
     ●   9606 Homo sapiens
     ●   10090 Mus musculus
     ●   10028 Rattus rattus
➢   alignments
     ●   TNFA 1tnfA
     ●   TNFA 1tnfB
     ●   ...
     ●   TNFA 5tswF
➢   aa-to-resid
     ●   MSTESMIR / DVEFGIIA
     ●   TESMIRDV / IIAMDAC
➢   loci
     ●   1 233 6+:31651498-31653288
➢   structures
     ●   1tnf, 1a8m, 2tun, 4tsv, 5tsw
➢   SCOP
     ●   all alpha
     ●   all beta: Ig, TNF-like
     ●   alpha+beta
➢   genomes
     ●   Hs35, Hs36, RAT
➢   probes
     ●   HGU133P, WHG
Ex1: Mine for sequences w/conserved features.
Ex2: Locate SNPs and Domains on Structure
Unison Form Follows Function.
            Params/Models




Sequences   Results




Process Lessons

➢   Explicitly track the provenance of data.
     ●   All data in Unison are tied to an origin –
         predictions, annotations, sequences, models.

➢   Plan for updates.
     ●   Updates are completely automated and
         idempotent.

              idempotent
              i⋅dem⋅po⋅tent (/ˈaɪdəmˈpoʊtnt, ˈɪdəm-/)
              adj. [from mathematical techspeak] Acting as if
              used only once, even if used multiple times.

              idempotent. Dictionary.com. Jargon File 4.2.0.
              http://dictionary.reference.com/browse/idempotent
              (accessed: February 25, 2009).
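
A minimal sketch of an idempotent load step, assuming a uniqueness constraint plus an insert-if-absent statement. The table layout and function names are hypothetical, and SQLite's `INSERT OR IGNORE` stands in for whatever mechanism the real loaders use.

```python
import sqlite3


def load_aliases(con, origin, records):
    """Load (accession, sequence) records for one origin.

    Re-running with the same input leaves the table unchanged, because the
    UNIQUE constraint makes repeated inserts no-ops.
    """
    with con:  # commit on success, roll back on error
        con.execute("""
            CREATE TABLE IF NOT EXISTS alias (
                origin    TEXT NOT NULL,
                accession TEXT NOT NULL,
                seq       TEXT NOT NULL,
                UNIQUE (origin, accession, seq)
            )""")
        con.executemany(
            "INSERT OR IGNORE INTO alias (origin, accession, seq) VALUES (?, ?, ?)",
            [(origin, acc, seq) for acc, seq in records])


def snapshot(con):
    """Return the full table contents in a stable order."""
    return con.execute(
        "SELECT origin, accession, seq FROM alias ORDER BY 1, 2, 3").fetchall()


con = sqlite3.connect(":memory:")
records = [("ACC1", "MSTESMIR"), ("ACC2", "VRSSSRTP")]  # invented toy data

load_aliases(con, "public_db", records)
once = snapshot(con)
load_aliases(con, "public_db", records)  # second run of the same update
twice = snapshot(con)
print(once == twice)
```

An interrupted or repeated update can then simply be run again without manual cleanup.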
Unison Build Process

➢   Phase 0: Download
➢   Phase 1: Load Aux Data
➢   Phase 2: Load Sequences
➢   Phase 3: Update Sets
➢   Phase 4: Update Predictions
➢   Phase 5: Update Analyses
➢   Phase 6: Housekeeping

➢   Makefile
     ●   downloads all data
➢   Makefile
     ●   loads auxiliary data
     ●   loads sequences and annotations
         (in-house is just another source)
     ●   updates sequence sets
     ●   updates precomputed predictions
         (incremental update!)
     ●   updates precomputed analyses and materialized views

 ➢   Runs in a cron job
 ➢   Requires ~10% of one person's time
 ➢   Consistent, reliable builds
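
The "incremental update!" step can be sketched as: record which (sequence, parameter set) pairs have already been run, and compute only the missing ones. Everything below is illustrative. The function names, the `done` bookkeeping set, and the toy predictor are assumptions and do not describe Unison's actual update code.

```python
import re


def update_predictions(sequences, params_id, predict, done, results):
    """Run `predict` only for sequences not yet processed with `params_id`.

    `done` is a set of (pseq_id, params_id) pairs already computed and
    `results` maps the same pairs to stored output. Returns the number of
    new runs performed.
    """
    new_runs = 0
    for pseq_id, seq in sorted(sequences.items()):
        key = (pseq_id, params_id)
        if key in done:
            continue  # already computed with these parameters; skip
        results[key] = predict(seq)
        done.add(key)
        new_runs += 1
    return new_runs


def toy_predictor(seq):
    """Stand-in for a real method: 1-based positions of tyrosine residues."""
    return [m.start() + 1 for m in re.finditer("Y", seq)]


sequences = {1: "MKTAYIAK", 2: "MSTESMIR"}  # invented toy data
done, results = set(), {}

first = update_predictions(sequences, 7, toy_predictor, done, results)
second = update_predictions(sequences, 7, toy_predictor, done, results)
sequences[3] = "AYYA"  # a new sequence arrives in a later build
third = update_predictions(sequences, 7, toy_predictor, done, results)
print(first, second, third)
```

Tracking completed runs per parameter set means a new release of a source database costs only the computation for sequences that are actually new.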
Other Lessons

➢   Design security from the start.
     ●   The internal version of Unison uses Kerberos.
     ●   Especially important in a world of distributed
         services and data.

➢   Include web services early in the design.




Kiran Mukhyala

Fernando Bazan, Matt Brauer, David
Cavanaugh, Jason Hackney, Pete
Haverty, Ken Jung, Josh Kaminker,
Nandini Krishnamurthy, Li Li, Yun Li,
Scott Lohr, Shiuh-ming Loh, Jinfeng
Liu, Peng Yue, Jianjun Zhang, Yan
Zhang

Simran Hansrai, Marc Lambert,
Dave Windgassen

http://unison-db.org/
Open access web site, downloads,
documentation, references, credits.

unison-db.org:5432
PostgreSQL & odbc/jdbc/sdbc access

“Are you sure about this Stan? It seems odd that a
pointy head and a long beak is what makes them fly.”
J. Workman, Science 245:1399 (1989)
Unison facilitates complex mining.




                             Jason Hackney
                             Nandini Krishnamurthy
                             Li Li
                             Yun Li
                             Jinfeng Liu
                             Shiu-ming Loh
                             Kiran Mukhyala
