A search engine for phylogenetic tree databases. David Fernández-Baca. Joint work with Mukul Bansal, Duhong Chen (Computer Science, ISU), and J. Gordon Burleigh (NESCent).
PhyloFinder: http://pilin.cs.iastate.edu/phylofinder/
Outline Introduction PhyloFinder queries Implementation Future directions Acknowledgements
Issues in Phylogenetic Databases: taxonomic consistency (species may appear in multiple trees under different but synonymous names; homonyms; misspellings); querying capability; storage/representation; exploiting classification trees (e.g., NCBI); clustering capabilities (distance measures); aggregation (synthesis) capabilities (supertrees); visualization.
Classification trees and phylogenies
Exploiting taxonomic classifications: the leaves in phylogenetic trees may represent different taxonomic levels. A classification tree allows us to locate trees that contain a taxon, as well as its descendants or ancestors. E.g., a “Pinaceae” query would identify trees that contain “Pinus thunbergii” or “Abies alba”.
TreeBASE (Piel, Donoghue, & Sanderson, 1996)
TreeBASE capabilities: search by taxon, author, citation, study accession number, matrix accession number, or structure (topology); tree surfing.
TreeBASE limitations: taxonomic name consistency; querying (few options; does not exploit classification; can’t identify ancestors/descendants); visualization; clustering and aggregation (supertrees).
PhyloFinder: a search engine for tree databases, not a database. Allows powerful phylogenetic queries; handles synonymous taxonomic names (via TBMap); handles misspellings; exploits taxonomic classification; offers precise options for identifying different types of subtrees, and metrics for identifying similar trees; provides a visualization tool with links to GenBank and TBMap; fast (efficient storage and filtering).
PhyloFinder Design: uses simple but powerful techniques (inverted index for filtering; nested-set representation of trees; least common ancestor queries directly on the database; off-the-shelf spell-checking technology). Can be used with any phylogenetic database (e.g., the PhyLoTA browser); however, set-up is not (yet) automatic.
Outline Introduction Queries Storage and querying Acknowledgements
PhyloFinder Queries: taxonomic queries involve a single taxon or a set of taxa. Phylogenetic queries take a phylogenetic tree as input and locate trees that match it in some specified way.
Taxonomic Queries. Contains: given a list of taxa, return all trees that contain all or any of these names (similar to Boolean “AND” and “OR” searches); automatically searches for synonymous taxa. Related: given a taxon, find all trees involving it or any of its descendants in the NCBI taxonomy; e.g., if the query taxon is “birds”, identify all trees that contain bird taxa. Pathlength: given a pair of taxa, return all trees containing them, along with the distance between them in each tree.
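The Pathlength query can be illustrated with a minimal sketch. The slides do not say how distance is measured, so this assumes it is the number of edges on the path between the two leaves; the nested-tuple tree encoding and the names `leaf_paths` and `path_length` are hypothetical, chosen only for illustration.

```python
def leaf_paths(tree, prefix=()):
    """Yield (label, path), where path is the tuple of child indexes from the root."""
    if isinstance(tree, str):
        yield tree, prefix
    else:
        for i, child in enumerate(tree):
            yield from leaf_paths(child, prefix + (i,))


def path_length(tree, x, y):
    """Number of edges between leaves x and y, or None if either is absent.

    Assumption: distance means edge count; branch lengths are ignored.
    """
    paths = dict(leaf_paths(tree))
    if x not in paths or y not in paths:
        return None
    px, py = paths[x], paths[y]
    shared = 0  # length of the common prefix, i.e. depth of the LCA
    for a, b in zip(px, py):
        if a != b:
            break
        shared += 1
    return (len(px) - shared) + (len(py) - shared)


# Hypothetical example tree with leaves a..g.
T = ((("a", "b"), "c"), (("d", "e"), ("f", "g")))
print(path_length(T, "a", "b"), path_length(T, "a", "c"), path_length(T, "a", "d"))
```

A full Pathlength query would run this over every tree that contains both taxa and report one distance per tree.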
Taxonomic Queries: Contains
Phylogenetic Queries. Tree mining: given a query tree Q, find the database trees that exhibit Q in some way. Options: return the trees that have Q as an embedded subtree, or return the trees that refine Q. Similarity: given a query tree Q and a specified similarity measure, return the trees in the database ranked by decreasing similarity to Q; requires an overlap of at least 3 taxa.
Phylogenetic queries: Notation 1. T(A) is the minimal subtree of T that contains the leaves in A. [Figure: example tree with leaves a, b, c, d, e, f, g.]
Phylogenetic queries: Notation 2. T|A is obtained from T(A) by suppressing all internal nodes that have only one child. [Figure: example tree with leaves a, b, c, d, e, f, g.]
Phylogenetic queries: let Q be a query tree with leaf set A. Q is an embedded subtree of T if and only if it is identical to T|A. Q is refined by T (T refines Q) if T|A is a refinement of Q.
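The definitions of T|A, embedded subtree, and refinement can be made concrete with a small sketch. It assumes rooted trees encoded as nested tuples with string leaves, and it compares trees by their clusters (the leaf sets below each internal node); this encoding and every function name are illustrative assumptions, not PhyloFinder's implementation.

```python
def leaf_set(tree):
    """Set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def restrict(tree, keep):
    """Return T|A: drop leaves outside `keep`, suppress one-child internal nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def clusters(tree):
    """Leaf sets found below each internal node (child order is irrelevant)."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        found.add(below)
        return below

    walk(tree)
    return found


def is_embedded(q, t):
    """Q is an embedded subtree of T iff Q is identical to T|A."""
    a = leaf_set(q)
    return a <= leaf_set(t) and clusters(restrict(t, a)) == clusters(q)


def is_refined_by(q, t):
    """T refines Q iff every cluster of Q is also a cluster of T|A."""
    a = leaf_set(q)
    return a <= leaf_set(t) and clusters(q) <= clusters(restrict(t, a))


# Hypothetical example tree with leaves a..g.
T = ((("a", "b"), "c"), (("d", "e"), ("f", "g")))
print(restrict(T, {"a", "b", "d", "e"}))
```

Under these definitions a query that is embedded in T is always refined by T as well, while an unresolved query such as ("a", "b", "d", "e") is refined by T without being embedded in it.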
Phylogenetic queries
Phylogenetic queries: Embedded
Phylogenetic queries: Refined by
Phylogenetic queries: Refined by. Q embedded in T ⇒ Q refined by T.
Phylogenetic queries: Embedded
Similarity queries: return trees ranked by a similarity score. The score is a percentage between 0 and 100% reflecting how similar the query tree is to the candidate tree. PhyloFinder’s similarity measures: Robinson-Foulds (RF) similarity and least common ancestor (LCA) similarity. The score takes the degree of taxon overlap into account.
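The slides do not give the formula behind the RF similarity score, so the following is only one plausible sketch: both trees are restricted to their shared taxa, and the score is 100% minus the normalized Robinson-Foulds distance over proper clusters. The normalization, the nested-tuple tree encoding, and all function names are assumptions; the overlap weighting mentioned on the slide is not modeled.

```python
def leaf_set(tree):
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def restrict(tree, keep):
    """Keep only leaves in `keep`, suppressing one-child internal nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    kids = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kids:
        return None
    return kids[0] if len(kids) == 1 else tuple(kids)


def proper_clusters(tree):
    """Leaf sets below internal nodes, excluding the root's full leaf set."""
    everything = leaf_set(tree)
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        if below != everything:
            found.add(below)
        return below

    walk(tree)
    return found


def rf_similarity(q, t):
    """Percentage in [0, 100], or None when fewer than 3 taxa are shared."""
    shared = leaf_set(q) & leaf_set(t)
    if len(shared) < 3:
        return None
    cq = proper_clusters(restrict(q, shared))
    ct = proper_clusters(restrict(t, shared))
    total = len(cq) + len(ct)
    if total == 0:  # both restricted trees are unresolved stars
        return 100.0
    return 100.0 * (1 - len(cq ^ ct) / total)


T1 = (("a", "b"), ("c", "d"))
T2 = (("a", "c"), ("b", "d"))
T3 = ((("a", "b"), "c"), "d")
print(rf_similarity(T1, T1), rf_similarity(T1, T2), rf_similarity(T1, T3))
```

Ranking candidates then amounts to sorting them by this score in decreasing order.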
Outline Introduction PhyloFinder queries Implementation Future directions Acknowledgements
System architecture
Least Common Ancestors (LCAs). [Figure: example tree with leaves a, b, c, d, e, f, g, highlighting LCA(b, e).]
Storage: Nested intervals. Each node stores the pair (Node_ID, RMD_ID). The ancestor/descendant relationship is easy to determine; the between predicate defines subtrees; LCAs are easily computed (find the common ancestor with the largest Node_ID). [Figure: example tree with leaves a, b, d, c, e, f and node pairs (1,10), (2,9), (3,5), (10,10), (4,4), (5,5), (6,6), (7,9), (8,8), (9,9).]
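The nested-interval idea can be tried out directly in SQL. The sketch below loads the (Node_ID, RMD_ID) pairs from the figure into an in-memory SQLite table and runs the three operations listed on the slide. It assumes that Node_ID is a preorder number and that RMD_ID is the Node_ID of the node's rightmost descendant; the table layout and function names are hypothetical, and leaf labels are omitted because the figure's label-to-ID mapping is not recoverable from the text.

```python
import sqlite3

# (node_id, rmd_id) pairs as listed on the slide.
NODES = [(1, 10), (2, 9), (3, 5), (4, 4), (5, 5),
         (6, 6), (7, 9), (8, 8), (9, 9), (10, 10)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (node_id INTEGER PRIMARY KEY, rmd_id INTEGER NOT NULL)")
conn.executemany("INSERT INTO nodes VALUES (?, ?)", NODES)


def is_ancestor(u, v):
    """True if u is v or an ancestor of v: v lies in u's interval."""
    row = conn.execute(
        "SELECT ? BETWEEN node_id AND rmd_id FROM nodes WHERE node_id = ?", (v, u)
    ).fetchone()
    return row is not None and bool(row[0])


def subtree(u):
    """All node IDs in the subtree rooted at u (the 'between' predicate)."""
    rows = conn.execute(
        "SELECT n.node_id FROM nodes AS n JOIN nodes AS r "
        "ON n.node_id BETWEEN r.node_id AND r.rmd_id "
        "WHERE r.node_id = ? ORDER BY n.node_id",
        (u,),
    ).fetchall()
    return [row[0] for row in rows]


def lca(x, y):
    """Common ancestor of x and y with the largest node_id."""
    row = conn.execute(
        "SELECT MAX(node_id) FROM nodes "
        "WHERE ? BETWEEN node_id AND rmd_id AND ? BETWEEN node_id AND rmd_id",
        (x, y),
    ).fetchone()
    return row[0]


print(lca(4, 5), lca(4, 8), lca(5, 10), subtree(7))
```

Because each operation is a single range comparison on two integer columns, the database can answer it without loading the tree into application memory.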
Storage: Inverted index. For each taxon, store a list of all trees that contain it. This makes it easy to find trees containing any or all elements in a list of taxa; used as a filter. [Figure: taxa Cornus, Spigelia, Hedera with tree-ID lists 1 2 3 5 8 13 21 34; 2 4 8 16 32 64 128; 13 16.]
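A minimal sketch of how an inverted index answers "all" and "any" queries with set operations. The posting lists are the ones from the figure; pairing them with the three taxa in the order shown is an assumption, and the function names are hypothetical.

```python
# Assumed pairing of the figure's taxa and tree-ID lists, in the order shown.
POSTINGS = {
    "Cornus": [1, 2, 3, 5, 8, 13, 21, 34],
    "Spigelia": [2, 4, 8, 16, 32, 64, 128],
    "Hedera": [13, 16],
}


def contains_all(postings, taxa):
    """Tree IDs containing every taxon in `taxa` (Boolean AND)."""
    lists = [set(postings.get(taxon, ())) for taxon in taxa]
    return sorted(set.intersection(*lists)) if lists else []


def contains_any(postings, taxa):
    """Tree IDs containing at least one taxon in `taxa` (Boolean OR)."""
    ids = set()
    for taxon in taxa:
        ids.update(postings.get(taxon, ()))
    return sorted(ids)


print(contains_all(POSTINGS, ["Cornus", "Spigelia"]))
print(contains_any(POSTINGS, ["Spigelia", "Hedera"]))
```

Only the trees that survive this filter need to be examined by the more expensive structural comparison.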
Building the inverted index. Input trees: 1: (((man,pan),gorilla),pongo); 2: (((human,coprinus),cryptomonas),zea_mays); . . . ; N: (((dogs,homo_sapiens),pig),lambs). Convert trees into lists of taxa: ⟨man pan gorilla pongo⟩, ⟨human coprinus . . .⟩, . . . Synonymy preprocessing: replace names by TBMap name clusters: ⟨tc1 tc2 tc3 tc4⟩, ⟨tc1 tc5 . . .⟩, . . . Build an index consisting of (i) a dictionary (mapping of taxon names to name clusters) and (ii) postings (lists of tree IDs), e.g., tc1: 1 2 3 4; tc2: 1.
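The indexing steps above can be sketched as follows. The three Newick strings come from the slide, with tree N given the ID 3 for illustration. The name-to-cluster table is a hand-written stand-in for TBMap: the entries for tree 1 and for human and coprinus follow the slide, while mapping homo_sapiens to tc1 is an assumption. Names without an entry are treated as their own cluster, which is also an assumption.

```python
import re
from collections import defaultdict

TREES = {
    1: "(((man,pan),gorilla),pongo);",
    2: "(((human,coprinus),cryptomonas),zea_mays);",
    3: "(((dogs,homo_sapiens),pig),lambs);",  # tree N on the slide
}

# Hypothetical stand-in for TBMap name clusters.
NAME_CLUSTERS = {
    "man": "tc1", "human": "tc1", "homo_sapiens": "tc1",
    "pan": "tc2", "gorilla": "tc3", "pongo": "tc4", "coprinus": "tc5",
}


def leaf_names(newick):
    """Taxon names in a simple Newick string (no branch lengths or inner labels)."""
    return re.findall(r"[^(),;\s]+", newick)


def build_index(trees, name_clusters):
    """Return (dictionary, postings) for the given trees."""
    dictionary = {}               # taxon name -> name cluster
    postings = defaultdict(list)  # name cluster -> sorted list of tree IDs
    for tree_id in sorted(trees):
        seen = set()
        for name in leaf_names(trees[tree_id]):
            cluster = name_clusters.get(name, name)
            dictionary[name] = cluster
            if cluster not in seen:  # record each tree once per cluster
                seen.add(cluster)
                postings[cluster].append(tree_id)
    return dictionary, dict(postings)


dictionary, postings = build_index(TREES, NAME_CLUSTERS)
print(postings["tc1"], postings["tc2"])
```

Because man, human, and homo_sapiens share one cluster, a query for any of the three names reaches all three trees through the same posting list.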
Schema
Query Processing: Outline. Query Q → consult inverted index → candidate trees → compare against Q using LCA queries → results.
Implementing Phylogenetic Queries. Idea: use LCA queries to compare ancestor-descendant relationships in Q with those in T. M(x) and M(y) have the same relationship in T as x and y have in Q ⇒ Q can be embedded in T. Advantage: database trees need not be read into main memory.
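The slide leaves M undefined and states the test only informally, so the sketch below is a reconstruction under stated assumptions rather than the authors' algorithm. It assumes M maps each leaf of Q to the leaf of T with the same label and each internal node of Q to the LCA in T of its children's images. Preserving every ancestor relationship under M is used as the "refined by" test, and the embedded test additionally requires that the images of any two children of a Q node have that node's image as their LCA. Trees are nested tuples numbered in preorder with rightmost-descendant IDs; all names are hypothetical.

```python
from itertools import combinations


def number(tree):
    """Preorder-number a nested-tuple tree; return (rmd, kids, leaf) dicts."""
    rmd, kids, leaf = {}, {}, {}
    counter = 0

    def walk(node):
        nonlocal counter
        counter += 1
        me = counter
        kids[me] = []
        if isinstance(node, str):
            leaf[node] = me
        else:
            for child in node:
                kids[me].append(walk(child))
        rmd[me] = counter  # ID of the rightmost descendant
        return me

    walk(tree)
    return rmd, kids, leaf


def covers(rmd, u, v):
    """True if u is v or an ancestor of v."""
    return u <= v <= rmd[u]


def lca(rmd, x, y):
    """Common ancestor with the largest ID."""
    return max(u for u in rmd if covers(rmd, u, x) and covers(rmd, u, y))


def map_nodes(q, t):
    """Assumed mapping M from Q nodes to T nodes, or None if a taxon is missing."""
    (_, q_kids, q_leaf), (t_rmd, _, t_leaf) = q, t
    if not set(q_leaf) <= set(t_leaf):
        return None
    m = {qid: t_leaf[label] for label, qid in q_leaf.items()}
    for v in sorted(q_kids, reverse=True):  # children have larger IDs than parents
        if q_kids[v]:
            image = m[q_kids[v][0]]
            for child in q_kids[v][1:]:
                image = lca(t_rmd, image, m[child])
            m[v] = image
    return m


def is_refined_by(q_tree, t_tree):
    q, t = number(q_tree), number(t_tree)
    m = map_nodes(q, t)
    if m is None:
        return False
    return all(covers(q[0], x, y) == covers(t[0], m[x], m[y])
               for x in q[0] for y in q[0] if x != y)


def is_embedded(q_tree, t_tree):
    if not is_refined_by(q_tree, t_tree):
        return False
    q, t = number(q_tree), number(t_tree)
    m = map_nodes(q, t)
    return all(lca(t[0], m[a], m[b]) == m[v]
               for v, children in q[1].items()
               for a, b in combinations(children, 2))


T = ((("a", "b"), "c"), (("d", "e"), ("f", "g")))
print(is_embedded((("a", "b"), ("d", "e")), T), is_embedded(("a", "b", "d", "e"), T))
```

In a database setting `covers` and `lca` would be range queries against the stored intervals, so only the small query tree has to be held in memory.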
Implementing Taxonomic Queries: use Boolean (union/intersection) operations on the inverted index. Example: querying for “birds”. Find all bird species in the database trees using the NCBI taxonomy tree; use the inverted index to retrieve the tree ID lists for each bird species; return the union of these lists.
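The “birds” example can be sketched in a few lines. The taxonomy fragment and the posting lists below are toy data invented for illustration (they are not taken from NCBI or from PhyloFinder), and the function names are hypothetical.

```python
# Toy classification fragment: parent -> children (intermediate ranks omitted).
TAXONOMY = {
    "Aves": ["Galliformes", "Passeriformes"],
    "Galliformes": ["Gallus gallus"],
    "Passeriformes": ["Passer domesticus"],
}

# Toy inverted index: taxon -> IDs of trees that contain it.
POSTINGS = {
    "Gallus gallus": [3, 7],
    "Passer domesticus": [7, 12],
    "Homo sapiens": [1, 3],
}


def descendants(taxonomy, taxon):
    """The taxon itself plus everything below it in the classification."""
    stack, found = [taxon], []
    while stack:
        name = stack.pop()
        found.append(name)
        stack.extend(taxonomy.get(name, []))
    return found


def related(taxonomy, postings, taxon):
    """Union of the posting lists of the taxon and all of its descendants."""
    ids = set()
    for name in descendants(taxonomy, taxon):
        ids.update(postings.get(name, []))
    return sorted(ids)


print(related(TAXONOMY, POSTINGS, "Aves"))
```

Taxa that appear in the classification but in no database tree simply contribute an empty list to the union.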
Tree visualization
Tree visualization. Other tree visualization tools are available: Hillis, Heath, & St. John 2005, Syst. Biol. 54: 471-482; Sanderson 2006, Bioinformatics 22: 1004-1006; Zmasek & Eddy 2001, Bioinformatics 17: 383-384. We developed our own to avoid plug-ins, to highlight query results easily, and to provide outlinks to GenBank and TBMap.
Spelling: suggestions come from TreeBASE and NCBI. Uses GNU Aspell, modified to handle special characters ('-', '&', '.') and compound words.
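To give a feel for name suggestion, here is a small sketch that uses Python's standard `difflib` module as a stand-in. It is not GNU Aspell and does not reproduce PhyloFinder's modifications; the list of known names, the cutoff value, and the function name `suggest` are assumptions.

```python
import difflib

# Hypothetical vocabulary; in PhyloFinder the names come from TreeBASE and NCBI.
KNOWN_NAMES = ["Pinus thunbergii", "Abies alba", "Cornus", "Hedera"]


def suggest(name, known_names, limit=3):
    """Return up to `limit` known names that closely match a possibly misspelled name."""
    lowered = {known.lower(): known for known in known_names}
    hits = difflib.get_close_matches(name.lower(), list(lowered), n=limit, cutoff=0.8)
    return [lowered[hit] for hit in hits]


print(suggest("pinus thunbergi", KNOWN_NAMES))
```

Matching whole strings means spaces, hyphens, and periods are compared like any other character, which sidesteps the special-character handling a word-based checker needs.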
Outline Introduction PhyloFinder queries Implementation Future directions Acknowledgements
Under construction: unrooted trees; supertree methods (MRP, MRF, MMC); desktop version; automatic update. Suggestions?
Outline Introduction PhyloFinder queries Implementation Future directions Acknowledgements
Thanks to: Rod Page for TBMap; Bill Piel for TreeBASE data; Mike Sanderson; Oliver Eulenstein; National Science Foundation (grant EF-0334832).


Editor's Notes

  • #2: NESCent = The National Evolutionary Synthesis Center