Using BOLD Data in Bioinformatics Workflows
Dr. Justin Schonfeld
Biodiversity Institute of Ontario

DNA Barcodes
  • 166 full eukaryotic genomes
  • 2,471 metazoan mitochondrial genomes
  • 1,444,076 barcodes (~118,000 species)
  • DNA barcodes represent an enormous resource for researchers of all types.

Applications
  • Species identification
  • Taxonomy
  • Building the reference library
  • Ecology
  • Proteomics
  • Comparative genomics
  • Teaching
  • Music

High-level data flow
  [Diagram labels: Museums, Private collections, Regulatory Agencies, Researchers; CCDB; BOLD; GenBank; Mirrors; Educators, Researchers, Regulatory Agencies. Also labelled: Australian Museum]

Typical Informatics Workflow
  [Diagram: BOLD → Extract Data → Local Copy → Filter Data → Filtered Data → Align Data → Aligned Data → Identify Problematic Sequences → Cleaned Data → Analyze Data]

Extracting Data: BOLD Public
  • Easy to use
  • Flexible search tool
  • Search by taxonomic name, geographic region, collector, etc.
  • Example searches: “Hymenoptera”, “Lepidoptera Canada”
  • Provides data in TSV, FASTA, and XML formats
  • Can select sequence data, trace files, specimen data, or combined data

Extracting Data: web services
  • Provides data in TSV (tab-separated values) and XML formats
  • Sequence data or full records
  • Can be used to provide a complete dump of all public BOLD data
  • http://services.boldsystems.org/

Extracting Data: web services (continued)
  • Working with the raw data allows for custom queries
  • Not all fields are available as search terms in BOLD Public
  • Requires scripting knowledge, or a lot of patience with Excel
  • Example: all plants above 2000 ft, etc.

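The "all plants above 2000 ft" example could be scripted roughly as follows. This is a minimal sketch, not the BOLD export format: the column names (processid, kingdom, species_name, elevation_m) and the sample rows are hypothetical, and it assumes elevation is stored in metres.

```python
import csv
import io

FEET_PER_METRE = 3.28084

# Hypothetical sample data; a real export has many more columns,
# and its column names may differ from the ones assumed here.
SAMPLE_TSV = (
    "processid\tkingdom\tspecies_name\televation_m\n"
    "PLANT001\tPlantae\tPinus contorta\t850\n"
    "PLANT002\tPlantae\tAcer rubrum\t120\n"
    "PLANT003\tPlantae\tSalix arctica\t\n"
    "BUG001\tAnimalia\tApis mellifera\t900\n"
)


def records_above(tsv_text, kingdom, min_feet):
    """Return rows of one kingdom whose elevation exceeds min_feet."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    selected = []
    for row in reader:
        if row["kingdom"] != kingdom:
            continue
        raw = row["elevation_m"].strip()
        if not raw:
            continue  # no elevation recorded, so the record cannot be tested
        if float(raw) * FEET_PER_METRE > min_feet:
            selected.append(row)
    return selected


high_plants = records_above(SAMPLE_TSV, "Plantae", 2000)
print([row["processid"] for row in high_plants])
```

Records with no elevation are skipped rather than treated as zero, so missing data does not silently count as "low".
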
Filter Data
  • The barcode data is collected from a wide variety of independent investigations
  • High degree of taxonomic bias
  • Tentative names
  • Variable sequence quality

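One way to screen out tentative names and low-quality sequences is sketched below. The list of tentative-name markers, the record layout, and the thresholds (500 bp minimum length, at most 1% ambiguous bases) are assumptions chosen for illustration, not values taken from the slides.

```python
import re

# Assumed markers of a tentative or interim name: "sp.", "cf.", "aff.", "nr.",
# or any digit (as in "sp1" or an interim code).
TENTATIVE = re.compile(r"\b(sp|cf|aff|nr)\.|\d")


def is_full_binomial(name):
    """True if the name looks like a plain 'Genus species' binomial."""
    parts = name.split()
    return len(parts) == 2 and not TENTATIVE.search(name)


def ambiguity_fraction(seq):
    """Fraction of non-gap characters that are not A, C, G or T."""
    seq = seq.upper().replace("-", "")
    if not seq:
        return 1.0
    return sum(base not in "ACGT" for base in seq) / len(seq)


def passes_filter(record, min_length=500, max_ambiguous=0.01):
    """Keep records with a full name, enough bases, and few ambiguities."""
    seq = record["sequence"].replace("-", "")
    return (
        is_full_binomial(record["species"])
        and len(seq) >= min_length
        and ambiguity_fraction(seq) <= max_ambiguous
    )


# Hypothetical records for demonstration.
records = [
    {"species": "Apis mellifera", "sequence": "ACGT" * 150},
    {"species": "Apis sp.", "sequence": "ACGT" * 150},
    {"species": "Bombus terrestris", "sequence": "ACGT" * 50},
    {"species": "Vespa crabro", "sequence": "ACGN" * 150},
]
kept = [r["species"] for r in records if passes_filter(r)]
print(kept)
```
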
Impact of Alignment
  • Alignment feeds into:
  • Building phylogenetic trees
  • Nearest neighbor analysis
  • Clustering
  • Distance matrices
  [Figure panels: Pairwise Sequence Alignment; MUSCLE Multiple Sequence Alignment]

Aligning Animal Barcode Data
  • Even a gene as straightforward as CO1 can provide alignment challenges.
  [Figure labels: CO1 Barcode; Short CO1; Full CO1 sequence; Barcode; 5’ and 3’ ends]

Aligning Barcode Data
  • Multiple sequence alignment: accurate; slow (a thousand sequences can take hours); trouble with variable sequences
  • Pairwise sequence alignment: fast (thousands of sequences in minutes); inconsistent placement of indels; highly dependent on choosing the right reference
  • Parameters: amino acid vs nucleotide; gap penalty

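To make the pairwise option concrete, here is a small sketch of global alignment against a reference using the textbook Needleman-Wunsch algorithm with a linear gap penalty. It is an illustration only, not the aligner used in the talk, and the scoring values are arbitrary assumptions.

```python
def global_align(ref, query, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (aligned_ref, aligned_query, score).
    """
    n, m = len(ref), len(query)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if ref[i - 1] == query[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align two bases
                score[i - 1][j] + gap,      # gap in the query
                score[i][j - 1] + gap,      # gap in the reference
            )

    # Trace back from the bottom-right corner, preferring the diagonal.
    a_ref, a_query = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if ref[i - 1] == query[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                a_ref.append(ref[i - 1])
                a_query.append(query[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            a_ref.append(ref[i - 1])
            a_query.append("-")
            i -= 1
        else:
            a_ref.append("-")
            a_query.append(query[j - 1])
            j -= 1
    return "".join(reversed(a_ref)), "".join(reversed(a_query)), score[n][m]


aligned_ref, aligned_query, total = global_align("ACGTACGT", "ACGACGT")
print(aligned_ref)
print(aligned_query)
print(total)
```

Because every query is aligned to the same reference independently, runtime grows linearly with the number of sequences, but the result depends on the reference chosen and on how ties between equally good gap positions are broken.
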
Uploading your alignment to BOLD
  • Upload in FASTA format
  • Edit sequence permission on the records

Identifying Problems
  • Stop codons: automatically annotated for coding regions
  • Even stop codons can be tricky
  • Frame shifts
  • Ambiguous characters
  • Chimeric sequences

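A basic in-frame stop codon scan might look like the sketch below. One plausible reason stop codons are tricky, offered here as an assumption rather than something the slide states, is that the set of stop codons depends on the genetic code: AGA and AGG are stops in the vertebrate mitochondrial code (NCBI translation table 2) but not in the invertebrate mitochondrial code (table 5).

```python
# Stop codons for two NCBI mitochondrial translation tables.
STOP_CODONS = {
    2: {"TAA", "TAG", "AGA", "AGG"},  # vertebrate mitochondrial
    5: {"TAA", "TAG"},                # invertebrate mitochondrial
}


def find_stop_codons(seq, frame=0, table=5):
    """Return 0-based start positions of in-frame stop codons."""
    seq = seq.upper()
    stops = STOP_CODONS[table]
    return [
        pos
        for pos in range(frame, len(seq) - 2, 3)
        if seq[pos:pos + 3] in stops
    ]


example = "ATGAGATTTTAA"  # codons: ATG AGA TTT TAA
print(find_stop_codons(example, table=5))
print(find_stop_codons(example, table=2))
```

The same sequence yields a different answer under each table, so the translation table has to match the organism before a stop codon is treated as an error.
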
Identifying Problems: Frame Shifts
  • Frame shifts in the middle of the sequence are disruptive and easy to spot
  • Frame shifts at the ends of the sequence are more challenging

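For a sequence already aligned to a reference, one simple heuristic (an assumption of this sketch, not a method described in the talk) is to flag internal gap runs whose length is not a multiple of three. Leading and trailing gaps are ignored because they usually mean missing data, which is part of why shifts near the ends are harder to judge.

```python
import re


def frameshift_gaps(aligned_seq):
    """Return (start, length) for internal gap runs not divisible by 3.

    Positions are 0-based indices into the aligned sequence.
    """
    offset = len(aligned_seq) - len(aligned_seq.lstrip("-"))
    core = aligned_seq.strip("-")
    return [
        (match.start() + offset, len(match.group()))
        for match in re.finditer(r"-+", core)
        if len(match.group()) % 3 != 0
    ]


print(frameshift_gaps("--ATGAAA-CCCGGG---TTT--"))
```

This only catches deletions in the query. Insertions show up as gaps in the reference row, so the same function would need to be run on the aligned reference as well.
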
Identifying Problems: Chimeric Sequences
  • Identify change points
  • Split the sequence at the point of discontinuity
  • BLAST each part
  [Figure labels: Hymenoptera; Hymenoptera; Lepidoptera; Chimera; Lepidoptera]

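The change-point step can be illustrated with a deliberately simplified sketch. It assumes two candidate parent sequences are already known and aligned to the query at equal length, which is a stronger assumption than the slide makes, and it does not perform the BLAST step. It tries every split position and keeps the one where the left part best matches one parent and the right part best matches the other.

```python
def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_change_point(query, parent_a, parent_b):
    """Return (split, cost) minimising mismatches of query[:split]
    against parent_a plus query[split:] against parent_b."""
    if not (len(query) == len(parent_a) == len(parent_b)):
        raise ValueError("sequences must be aligned to the same length")
    best = None
    for split in range(len(query) + 1):
        cost = (
            mismatches(query[:split], parent_a[:split])
            + mismatches(query[split:], parent_b[split:])
        )
        if best is None or cost < best[1]:
            best = (split, cost)
    return best


# Toy sequences standing in for aligned barcodes.
parent_a = "AAAAAAAAAA"
parent_b = "CCCCCCCCCC"
chimera = "AAAACCCCCC"

split, cost = best_change_point(chimera, parent_a, parent_b)
left_part, right_part = chimera[:split], chimera[split:]
print(split, cost, left_part, right_part)
```

The two parts returned here are what would then be searched separately, for example with BLAST, to check whether they come from different taxa.
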
Cleaning Data: Updating BOLD
  • BOLD is curated by the community
  • Re-upload sequences
  • Delete sequences
  • Annotate sequences
  • Flag sequences
  [Diagram labels: BOLD; GenBank; Mirrors; Educators; Researchers; Regulatory Agencies]

Example Workflow: Occurrence of Indels
  • Download public BOLD Hymenoptera records using web services
  • Select sequences with full taxonomy
  • Align sequences using MAFFT, MUSCLE, Transalign
  • Select one representative per species
  • Remove problematic sequences
  • Tree
  • Map sequences onto phylogeny

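The "one representative per species" step could be sketched as below. The selection rule (keep the longest ungapped sequence, first seen wins ties) and the record layout are assumptions for illustration; the slides do not say how representatives were chosen.

```python
def ungapped_length(record):
    """Number of non-gap characters in a record's sequence."""
    return len(record["sequence"].replace("-", ""))


def one_per_species(records):
    """Keep the longest sequence for each species name."""
    best = {}
    for record in records:
        current = best.get(record["species"])
        if current is None or ungapped_length(record) > ungapped_length(current):
            best[record["species"]] = record
    return list(best.values())


# Hypothetical records for demonstration.
records = [
    {"id": "r1", "species": "Apis mellifera", "sequence": "ACGTAC"},
    {"id": "r2", "species": "Apis mellifera", "sequence": "ACGTACGTAC"},
    {"id": "r3", "species": "Bombus terrestris", "sequence": "AC--GT"},
    {"id": "r4", "species": "Bombus terrestris", "sequence": "ACGT"},
]
representatives = one_per_species(records)
print([r["id"] for r in representatives])
```
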
Example Workflow: Code Shifts
  • Download public BOLD Hymenoptera records using web services
  • 80,000 sequences: align pairwise
  • Scan sequences for code shifts
  • Remove problematic sequences
  • Analyze results

Acknowledgements
  • Paul Hebert
  • Sujeevan Ratnasingham
  • The BOLD Team


Dr Justin Schonfeld - Bioinformatics Applications


Editor's Notes

  • #3: 94 fungi, 55 plant, 83 other
  • #17: Does BOLD filter stop codons?