Basic bioinformatics concepts,
                 databases and tools

                        Introduction to the training
                          and Sequence databases

                                        Joachim Jacob
                                    http://www.bits.vib.be

Updated 22 February 2012
http://dl.dropbox.com/u/18352887/BITS_training_material/Link%20to%20mod1-intro_H1_2012_SeqDBs.pdf
Scope
        Introductory training to Bioinformatics


        Exploring and understanding
        databases and software
        for everyday bioinformatics use



        If there is any term which is unclear,
             please stop me and ask me!
Bioinformatics ...

         Bio
         all data is derived from living samples

         Informatics
         that data is stored and analyzed in and with computers to obtain
           understanding



         Extremely broad description, for which however we
           will extract common principles during the course
Bioinformatics is present in every aspect
of life sciences research
Bioinformatics ...

       Bio
               - different types of living samples
       Informatics
               - storing and categorizing the information
         and          making it easily accessible
               - interpreting that information reliably
Bioinformatics … and its companion

      Bio
              - different types of living samples
      Informatics
              - storing and categorizing the information
        and          making it easily accessible
              - interpreting that information reliably
      Statistics
              - large numbers, observational data
The siblings of Bioinformatics
       Based on the biological component extracted from life, the
         measured properties and the ultimate goal of the
         analysis, different sub-disciplines of bioinformatics exist.


DNA           RNA           proteins metabolites
Genomics
              Transcriptomics
                          Proteomics
                                                  Metabolomics

Epigenomics          Structural bioinformatics
Systems biology      Microbiomics       Interactomics
Metagenomics         Functional genomics Comparative genomics
Mere data is worth nothing

Data = symbols
   e.g. CGCTACGCATATCGCT

Information = data that are processed to be useful; provides answers to
   "who", "what", "where", and "when" questions. Also called metadata.
   e.g. - Dasypus novemcinctus
        - found in my garden
        - part of genome
        - sequenced in June 2010

Knowledge: application of data and information; answers "how" questions
   e.g. This species seems to be related to my neighbor's pet,
        because it also has this sequence

Understanding: appreciation of "why"
   e.g. Has the same mother

Wisdom

                                              http://www.systems-thinking.org/dikw/dikw.htm
          ?  data               knowledge  !
          Tools and approaches
          Biology     Computer     Statistics

Life sciences research as major 'end user' of the
bioinformatics tools and conclusions: 'tool user'

Bioinformatics research, as a specific branch on the boundary of
life science, mathematics and computer science: 'tool manufacturer'
This course is organised in several modules

Module 1: Sequence databases: what, where, how
Module 2: Sequence comparisons: searching, aligning
Module 3: Sequence analysis – domains in protein sequences and
 predicting functionality, standardisation and useful links
Module 4: Beyond sequences - additional important data sources
Module 5: Genome Browsers - integrating biological data and performing
 reproducible bioinformatics research in the Galaxy
Overview of the crash course
One tip for the future

          Be prepared for change...
            Information is fluid
            So are bioinfo tools


          Learn how to accommodate change
            Major resources are more stable
            Important concepts do not change often
Module 1

           Sequence databases
Module 1: Sequence databases

        Sequence databases store DNA and RNA sequences. In
        bioinformatics, they are still by far the largest
        collections of biological data, and are used by many
        subdisciplines of bioinformatics.




                            http://www.ebi.ac.uk/embl/Services/DBStats/
... and growing




                  http://www.ebi.ac.uk/embl/Services/DBStats/
Three major nucleotide databanks host primary
sequence data
      European Nucleotide Archive (ENA) at EBI - http://www.ebi.ac.uk/
       Division EMBL-bank (European Molecular Biology Laboratory) (single)
       Trace Archive
       SRA Archive



      GenBank at NCBI - http://www.ncbi.nlm.nih.gov/
       maintained at NCBI (National Center for Biotechnology Information,
       USA)



      DDBJ (DNA Data Bank of Japan) - http://www.ddbj.nig.ac.jp/
       maintained at NIG/CIB (National Institute of Genetics, Center for
       Information Biology, Mishima, Japan)
These databases are filled with NA sequence
   information by scientists and consortia
   Submitters:   large-scale sequencing projects, individual scientists, patent offices
                       ↓
   Primary sequence data, e.g.
                 ACTGCTGCTA
                 GCTAGCTGAT
                 CTATGCTAGC
                 TGTAGCTGAG
                       ↓
   Primary sequence database

   Each primary sequence = one experiment
   Basically, all 'source' nucleotide material

Jennifer McDowall - http://www.biotnet.org/training-materials/nucleotide-sequence-databases-ena
Primary NA sequence can be produced by
   Sanger-based technologies or NGS technologies

   sample → DNA
   sample → RNA → (RT) → cDNA

   Sanger
   Low output in number of seqs, high quality, 400-850 bp.
   Read profiles in .abi format. Stored in the Trace Archive.

   NGS
   Different technologies. Extremely high output rate, low
   quality, 30 bp – 600 bp. Reads in .fastq format, stored in
   the SRA.

   These techniques can only read DNA strands, so RNA first needs
   to be converted to cDNA with reverse transcriptases prior to
   loading onto the machines.


Sanger overview: http://www.bio.davidson.edu/Courses/Molbio/MolStudents/spring2003/Obenrader
NGS overview: http://seqanswers.com/forums/showthread.php?t=3561
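
To make the .fastq format mentioned above a bit more concrete, here is a minimal sketch of how a read and its base-call qualities could be pulled out of a FASTQ record. It assumes the common four-line layout and a Phred+33 quality encoding (older Illumina files used a different offset); the function name `parse_fastq` and the example read are made up for illustration.

```python
from typing import Iterator, List, Tuple


def parse_fastq(text: str) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, quality_scores) from 4-line FASTQ records."""
    lines = [line.strip() for line in text.strip().splitlines()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        # Assumption: Phred+33 encoding, so score = ASCII code - 33.
        scores = [ord(ch) - 33 for ch in qual]
        yield header[1:], seq, scores


# Illustrative record (not real data): one 4-base read.
example = """@read1
ACGT
+
II5!
"""

records = list(parse_fastq(example))
print(records)
```

A higher score means a more reliable base call, which is why quality is stored per base next to the sequence.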
Overview major DNA reading technologies




            Dennis Wall, NGS Data Analysis and Computation I course, Wall Lab
In the primary sequence dbs a major distinction
can be made in two major categories
             Source material: DNA, or RNA (via cDNA)

             High quality single submission (Sanger)
               - gene sequence (genomic – 'STD' data class)
               - mRNA sequence (via cDNA – 'STD')
               - BAC/YAC/cosmid sequences
               - genome sequencing projects (contigs,
               assemblies, WGS)
               - genome markers, STS (sequence tagged
               sites, unique short sequences from a
               genome)

             Low quality batch submissions
               - Expressed Sequence Tags (EST)
               - Genome Survey Sequences (GSS)
               - high-throughput sequence data (e.g. NGS)
                                  http://www.ebi.ac.uk/ena/about/formats
The batch submissions originate mostly from
sequencing centers
         Large-scale sequencing projects:

         chromosome → fragment → sequencing library → sequence reads
              submission (e.g. whole genome shotgun)
         → assemble sequence
              submission
         → annotation (e.g. cyp30, cyp309, cg343, insv)
              submission
Each primary database stores its sequences
and batch submissions in their own way...
           - NCBI: ESTs are stored in dbEST (separate database)
           - ENA: ESTs are part of EMBL-bank in 'EST' data class

           Similar for GSS (see dbGSS at NCBI)


           ESTs: expressed sequence tags, often partial sequences
             derived from RNA in batch (sample → RNA → RNA-seq). Example:
                 >est1
                 ATCGACTAGCATCA
                 >est2
                 TCGACTAGCGACTA
                 >est3
                 CAGCATCATCGAC
http://www.biotnet.org/sites/biotnet.org/files/documents/17/2010_ena_v2.0.ppt

Batch submissions are marked and/or stored
differently than single submissions
                                                                   Data class ESTs are
ENA-Annotation:                                                   also batch submissions
Feature annotation


                                                    1) EMBL-Bank

ENA-Assembly:
Assembly information
                                                                           Batch submissions


ENA-Reads:                                      2) Trace Archive
Sequencing and                                     - Raw data (capillary sequencing)
sampling information
                                                3) Sequence Read Archive
                                                  - Raw data (Next Gen sequencing)


  TIER                    CLASS                        TYPE                ENA structure
The 'normal' submissions are a minority in
primary sequence databases




             http://www.ebi.ac.uk/ena/about/statistics#embl_bases_per_dataclass
Primary sequence dbs are synchronised and
every sequence receives a unique identifier
      All database maintainers assign and share a unique accession number (AC) to each
      sequence – besides their own ID number – (info at NCBI). Sequences can get updated,
      and the accession number is extended with a version number, e.g. .1 (see SVA)
       Example of acc number: BC010109.2
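
As a small illustration of the accession.version convention described above, the sketch below splits an identifier such as BC010109.2 into its accession number and version. The helper name `split_accession` is hypothetical, and the sketch assumes the version is always the part after the first dot.

```python
from typing import Optional, Tuple


def split_accession(identifier: str) -> Tuple[str, Optional[int]]:
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is None when the identifier carries no '.N' suffix.
    """
    accession, dot, version = identifier.strip().partition(".")
    if not dot:
        return accession, None
    if not version.isdigit():
        raise ValueError(f"unexpected version suffix in {identifier!r}")
    return accession, int(version)


print(split_accession("BC010109.2"))  # accession with version
print(split_accession("BC010109"))    # accession without version
```

Keeping the version separate makes it easy to check whether two records refer to the same sequence entry but to different updates of it.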



http://www.insdc.org/
International Nucleotide Sequence Database Collaboration:
GenBank (+ SRA), DDBJ and ENA

   - Synchronized daily
   - Collaboration on features, taxonomy, ...
   - All use the same
       - Accession IDs
       - Project IDs
       - Feature tables (see later)




                               http://en.wikipedia.org/wiki/Accession_number_(bioinformatics)
One sequence entry contains three categories
of different types of information

   1. Info about sequence, submitters and literature (metadata)
   2. Annotations of the sequence (metadata related to the seq)
   3. Stretch of ATGC / AUGC sequence (the 'data', at the bottom)
   • A sequence record is called 'annotated' when biological information is
     added and linked to a position in the sequence
   • Annotations, also called 'features', are abbreviated as codes, which
     can be found in the Feature Tables




                                        http://www.ebi.ac.uk/embl/Documentation/FT_d
This sequence information can be written in
  different formats
   (plain) Text format, e.g. GenBank
         1. General info

                                                     Official shared accession


                                                     GenBank-specific identifier
                                                     (a counter that increases with each new record)

                                                     A lot of different identifiers!
                                                     (roughly one kind per database)
                                                     → conversion tools that translate
                                                     identifiers are needed (see exercises)

*In humans: the HUGO Gene Nomenclature Committee determines the right gene
name
                             http://mobyle.pasteur.fr/cgi-bin/portal.py#tutorials::seqfmt
2. Annotation
                                  db_xref = cross references,
                                  i.e. links to records of other
                                  databases which are related
                                  to this record (see later). The
                                  format is dbname:identifier




Feature name     Qualifier name
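
The dbname:identifier layout of a db_xref can be taken apart with a few lines of code. The sketch below is only an illustration: the helper name `parse_db_xref` is hypothetical, the example value is illustrative, and it assumes the database name never contains a colon (so only the first colon is treated as the separator).

```python
from typing import Tuple


def parse_db_xref(qualifier: str) -> Tuple[str, str]:
    """Return (dbname, identifier) from a db_xref qualifier or bare value."""
    value = qualifier.strip()
    prefix = "/db_xref="
    if value.startswith(prefix):
        value = value[len(prefix):]
    value = value.strip('"')
    # Split on the first colon only: identifiers may contain colons themselves.
    dbname, colon, identifier = value.partition(":")
    if not colon or not dbname or not identifier:
        raise ValueError(f"not in dbname:identifier form: {qualifier!r}")
    return dbname, identifier


# Illustrative qualifier as it could appear in a feature table.
print(parse_db_xref('/db_xref="taxon:9606"'))
```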
3. Sequence




        Each protein sequence also receives an
        accession number
Other sequence formats
                 Fasta (minimal metadata, basically only sequence)
                      >genename And a description
                      ATCGATGCAGCTATATCCTCGCGATCAGC
                      CGGACAGCTCTCGAGCGCATCGACGACGAC
                 ASN.1: Abstract Syntax Notation One


                 EMBL: all info as in GenBank format; online referred to as 'plain text'
                 XML
                 Fastq: sequence info and base 'call' quality
Important
'Format' has nothing to do with the program you use to save your file! You don't
have a choice: the file needs to be in 'plain text format' (.txt, not a file which is
opened with MS Word such as .doc or .rtf files). Wordpad is a good choice for
this. 'Format' in bioinfo is all about how the information is structured and written
down in the plain text file.
                         http://emboss.sourceforge.net/docs/themes/SequenceFormats.html
http://www.biotnet.org/sites/biotnet.org/files/documents/17/2010_ena_v2.0.ppt
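
Because Fasta is just structured plain text, reading it takes very little code. The following is a minimal sketch, not a replacement for an established parser: the function name `parse_fasta` is made up, and it assumes that the first word after '>' is the name and the rest of the header line is the description.

```python
from typing import List, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str, str]]:
    """Return a list of (name, description, sequence) tuples."""
    records = []
    name, description, chunks = None, "", []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, description, "".join(chunks)))
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            # Sequence lines are concatenated; line breaks carry no meaning.
            chunks.append(line)
    if name is not None:
        records.append((name, description, "".join(chunks)))
    return records


# The Fasta example from the slide.
example = """>genename And a description
ATCGATGCAGCTATATCCTCGCGATCAGC
CGGACAGCTCTCGAGCGCATCGACGACGAC
"""

fasta_records = parse_fasta(example)
print(fasta_records)
```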


Degree of annotation differs between entries
   Batch submitted sequences are annotated poorly, single
   submissions are annotated better.

ENA-Annotation:
Feature annotation
                                                1) EMBL-Bank
                                                  - Good seq annotations
ENA-Assembly:
Assembly information

ENA-Reads:
Sequencing and
sampling information
                                                2) Trace Archive
                                                  - Raw data (capillary sequencing)
                                                3) Sequence Read Archive
                                                  - Raw data (Next Gen sequencing)

                                                Experiment information is of most
                                                importance in batch submissions
                                                (e.g. which species, which technique, ...)

  TIER                    CLASS                        TYPE              ENA structure
SRA contains batch submitted records of which
experiment information is of most importance




    Since the sequences are barely (or not at all) annotated, the
    experiment description is important: which machine, which
    organism, which tissue, which developmental stage,
    disease, treatment, …
How to get sequences into the db, and back out

 Submit
 Always submit your sequence data (mostly obliged by journals) and
 include your ACC number in articles (not any other number).
   - Sequin (GenBank stand-alone tool)
   - Bankit (GenBank web submission tool)
   - Webin (EMBL online submission)

 Retrieve
 One or a few sequences
   → use one of the numerous web-based tools
     GenBank: Entrez
     EMBL: EB-eye
     MRS: developed for easy retrieval
 Many sequences (batch retrieval)
   → use ftp (file transfer protocol)
   → use perl (flexible programming language)
   → BioMart
     http://www.biomart.org/
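
Retrieval by accession number can also be scripted. The sketch below is one possible approach, not something shown on the slides: it assumes NCBI's E-utilities `efetch` endpoint with the `nuccore` database and Fasta output, and the helper names are hypothetical. Only the URL is built when the block runs; the actual download function is defined but not called.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: NCBI E-utilities efetch endpoint (not named in the slides).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession: str, db: str = "nuccore", rettype: str = "fasta") -> str:
    """Build a request URL for one record, returned as plain text."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_sequence(accession: str) -> str:
    """Download the record (needs network access; not executed here)."""
    with urlopen(build_efetch_url(accession), timeout=30) as response:
        return response.read().decode("utf-8")


# The accession number used as an example earlier in this module.
url = build_efetch_url("BC010109.2")
print(url)
```

For many sequences, looping over `fetch_sequence` is a poor fit; the ftp route listed above is the better choice for batch retrieval.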
Example of a primary NA sequence record (ENA)




                           http://www.ebi.ac.uk/ena/about/formats
Example of a primary NA sequence record (ENA)
                                           Text format




 Code usable for   Data linked to that
   searching             code




                                         http://www.ebi.ac.uk/ena/about/formats
Primary sequence data contains a lot of
redundancy!

                                                             Chromosome sequence

                                                             Several gene sequences
                                                             from different labs

                                                             EST sequences
                                                             from transcripts

                                                             cDNA sequence



       All match to the same gene. Often you end up in your
       database search with all these sequences...
       A lot of redundancy!
The primary sequences are the basis for
analyses that generate derived sequence data
       Scientists/Consortia → primary databases
             –   Source for further analyses. Which?
                   •   Create protein sequences
                   •   Curate the sequence database
                   •   Assemble genomes
                   •   Searching similarities
                   •   Aggregate information about one gene
                   •   …


                           Results stored in derived databases
Protein databases come in two kinds
The most important protein db is UniProt and
contains 'automatic' and manual entries
    UniProt Knowledge Base - 'the best annotated protein
      database of the world'
      http://www.uniprot.org/
Refseq - The NCBI way to reduce redundancy in
primary sequence data
   RefSeq is NCBI 'Reference Sequences' (prot and nuc)
      Redundancy from primary sequence data is reduced both
       automatically and by manual annotation of NA and protein
       sequences. 'one natural biological molecule = one entry'. Links
       back to the original primary sequences. Hugely popular and a
       basis for a lot of analyses.




                                                               Click to apply
                                                               refseq filter in
                                                               entrez search


                                           http://www.ncbi.nlm.nih.gov/RefSeq/
RefSeq has its own identifiers, not to be mixed
up with accession numbers
    RefSeq entry codes look similar to ACC numbers (but are not ACC numbers: note the
      underscore!), and RefSeq records are also in GenBank format. Note: in the 'Features'
      section one can find the raw sequences from which the entry was derived. (Typical
      mistake: searching UniProt with a RefSeq code.)
    NC_*   (curated) complete genomic element (chromosome, plasmid,...)
    NT_*   (automated) intermediate assembly from BAC
    NZ_*   (automated) incomplete genomic sequence from WGS
    NW_*   (automated) intermediate assembly from WGS
    NG_*   (curated) incomplete genomic element corresponding to gene
    NM_*   (curated) mRNA
    NR_*   (curated) non-coding RNA or predicted transcript of pseudogene
    NP_*   (curated) protein
    ZP_*   (automated) protein predicted from WGS sequence (NZ_*)
    YP_*   (curated) other predicted protein sequences from NCBI Genome Annotation Pipeline
    XM_*   (automated) mRNA
    XR_*   (automated) non-coding RNA or predicted transcript of pseudogene
    XP_*   (automated) protein

                                              http://www.ncbi.nlm.nih.gov/RefSeq/key.html
                                                       http://www.ncbi.nlm.nih.gov/RefSeq/
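
The prefix table above lends itself to a simple lookup. This is a sketch under the assumption that the two letters before the underscore fully determine the record type; the dictionary is copied from the table, the function name `describe_refseq` is hypothetical, and the RefSeq-style identifier in the example is a made-up placeholder.

```python
from typing import Optional, Tuple

# (curation status, molecule type), copied from the table above.
REFSEQ_PREFIXES = {
    "NC": ("curated", "complete genomic element (chromosome, plasmid, ...)"),
    "NT": ("automated", "intermediate assembly from BAC"),
    "NZ": ("automated", "incomplete genomic sequence from WGS"),
    "NW": ("automated", "intermediate assembly from WGS"),
    "NG": ("curated", "incomplete genomic element corresponding to gene"),
    "NM": ("curated", "mRNA"),
    "NR": ("curated", "non-coding RNA or predicted transcript of pseudogene"),
    "NP": ("curated", "protein"),
    "ZP": ("automated", "protein predicted from WGS sequence (NZ_*)"),
    "YP": ("curated", "other predicted protein sequences"),
    "XM": ("automated", "mRNA"),
    "XR": ("automated", "non-coding RNA or predicted transcript of pseudogene"),
    "XP": ("automated", "protein"),
}


def describe_refseq(identifier: str) -> Optional[Tuple[str, str]]:
    """Return (status, type) for a RefSeq code, or None if it is not one."""
    prefix, underscore, _rest = identifier.strip().partition("_")
    if not underscore:
        # No underscore: this looks like an ordinary accession number.
        return None
    return REFSEQ_PREFIXES.get(prefix)


print(describe_refseq("NM_123456.1"))  # placeholder RefSeq-style code
print(describe_refseq("BC010109.2"))   # ordinary accession number
```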
UniRef – UniProt's redundancy-reducing system for
protein sequences

      Non-redundant protein sequences from UniProt
        ~ RefSeq
        Redundant sequences are hidden by clustering them
        • UniRef100 = completely identical sequences
        • UniRef90 = sequences that are at least 90% identical
        • UniRef50 = sequences that are at least 50% identical
        See http://www.uniprot.org/help/uniref
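
The idea of hiding redundancy by clustering at an identity threshold can be shown with a toy example. This is deliberately not how UniRef is actually built: the sketch uses a naive position-by-position identity without any alignment and a simple greedy assignment to the first matching cluster representative. All names and the four short sequences are invented for illustration.

```python
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Fraction of identical positions, without alignment (toy measure)."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs: Dict[str, str], threshold: float) -> List[List[str]]:
    """Assign each sequence to the first cluster whose representative
    (its first member) is at least `threshold` identical to it."""
    clusters: List[List[str]] = []
    # Longest sequences first, so they become the representatives.
    for name, seq in sorted(seqs.items(), key=lambda item: -len(item[1])):
        for members in clusters:
            if identity(seqs[members[0]], seq) >= threshold:
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Invented toy sequences: p2 equals p1, p3 differs from p1 in one position.
toy = {
    "p1": "MKTAYIAKQR",
    "p2": "MKTAYIAKQR",
    "p3": "MKTAYIAKQK",
    "p4": "GSHMLEDPVA",
}

clusters_100 = greedy_cluster(toy, 1.0)
clusters_90 = greedy_cluster(toy, 0.9)
print(clusters_100)
print(clusters_90)
```

Lowering the threshold merges more sequences into one cluster, which is the same trade-off as choosing between UniRef100, UniRef90 and UniRef50.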
NCBI's Gene – summarizes gene information
including sequence information from primary dbs
      Example of the gene NPR1 from A. thaliana
UniGene – summarizes transcriptomic
information around genes
And a lot more derived databases with
sequence information exist
         Repbase :
         repeats (Alu, …), maintained by Jerzy Jurka at the Genetic
           Information Research Institute (Mountain View CA, USA).
           The CENSOR server allows you to "clean" sequences.
           http://www.girinst.org/repbase
         miRBase → published miRNA sequences
         http://www.mirbase.org/
         Eukaryotic promoter database
         http://www.epd.isb-sib.ch/
         UniVec
         GenBank subset + some sequences from commercial sources -
           ftp://ftp.ncbi.nih.gov/pub/UniVec/
The most important sequence databases
overview

      Prim seq data    Derived    Curated      Integrated search portals
      GB               GenPept    RefSeq       Entrez
      ENA              trEMBL                  ENA search, EB-eye
      DDBJ
                       UNIPROT    SwissProt    UniProt
Common gene annotations on sequences

 Genome sequence: e.g. Chr6

           Enhancers/promoters                                        terminator

                                         Intron
 Gene sequence                    exon




 mRNA                                                 AAAAAAAAAAAAA

                                 5'UTR     CDS    3'UTR    poly(A) tail


 protein                                             Genetic code tables
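
The last step in the figure, from the CDS to the protein via a genetic code table, can be sketched as follows. The sketch assumes the standard genetic code only (other code tables exist), reads the CDS from its first base in frame, and stops at the first stop codon. The function name `translate_cds` and the short example sequences are made up.

```python
from itertools import product

# Standard genetic code, written compactly: codons are enumerated with the
# bases in the order T, C, A, G and '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate a coding sequence (DNA or RNA letters) into protein."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        # Codons with unknown letters (e.g. N) are translated as 'X'.
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate_cds("ATGGCCTAA"))
print(translate_cds("AUGAAAUGGUGA"))
```

Note that only the CDS is translated: the 5'UTR, 3'UTR and poly(A) tail of the mRNA are not part of the protein.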
Searching the database for your gene of interest

          First you have to determine for yourself
            which information you want

            - NA sequences vs. protein sequences
            - If NA, genomic sequences, or RNA derived
            - All possible sequences that exist, or curated ones
            - Protein sequences of which quality
            - ...
Entrez is a starting point for searches at NCBI
                           http://www.ncbi.nlm.nih.gov/sites/gquery
Visualising the db_xrefs in records at NCBI
ENA has its text-search portal
                                 http://www.ebi.ac.uk/ena/
Results from an ENA search are organised
following the ENA database structure
UniProt has a simple search box leading to a
sophisticated search results page
Complex searches can be achieved by using the
index codes in the database
                             e.g.

                              “oc=Primates and
                              de=complete and
                              de=cds and
                              de=MHC”

 Code usable for             Could answer: give me
   searching                   all coding sequence
                               of MHC available in
                               primates.
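
A query like the one above is just a list of code=value conditions joined by a logical operator, so it can be assembled programmatically. The sketch below only mirrors the syntax of the slide's example; the exact syntax accepted by a given search portal may differ, and the function name `build_query` is hypothetical.

```python
from typing import Iterable, Tuple


def build_query(terms: Iterable[Tuple[str, str]], operator: str = "and") -> str:
    """Join (index_code, value) pairs into one query string."""
    return f" {operator} ".join(f"{code}={value}" for code, value in terms)


# The conditions from the example above.
query = build_query([
    ("oc", "Primates"),
    ("de", "complete"),
    ("de", "cds"),
    ("de", "MHC"),
])
print(query)
```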
Meta-search tools can search different
sequence databases at once.
    MRS
   Open Source, developed by Maarten Hekkelman at Radboud U.
   (Nijmegen, the Netherlands). Allows searching in different databases at
   once, and provides also statistics on the databases.




Alternatives: ACNUC, SRS
Logical operators
         Searching involves combining conditions.
         The difference between a logical AND, OR and NOT is explained
         here with Venn diagrams.




            Q1 AND Q2
                &


            Q1 NOT Q2
                !



              Q1 OR Q2
                  |
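
The three operators behave like set operations on the result lists of two queries. As a small sketch, with invented record names standing in for the hits of Q1 and Q2:

```python
# Invented result sets for two queries.
q1 = {"rec1", "rec2", "rec3"}
q2 = {"rec2", "rec3", "rec4"}

q1_and_q2 = q1 & q2   # AND: records found by both queries
q1_not_q2 = q1 - q2   # NOT: records found by Q1 but not by Q2
q1_or_q2 = q1 | q2    # OR: records found by at least one query

print(sorted(q1_and_q2))
print(sorted(q1_not_q2))
print(sorted(q1_or_q2))
```

AND narrows a search, OR widens it, and NOT removes unwanted hits from an existing result.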
Hands-on!

        Every module ends with an exercise
          session.

        We will now explore how data is stored in different
         sequence databases. You get …. minutes for this
         exercise.
            Afterwards, we summarize some of the difficulties
              some of you might have experienced.
Summary
          This course is organised in several modules
          Module 1: Sequence databases
          Three major nucleotide databanks host primary sequence data
          These databases are filled with NA sequence information by scientists and consortia
          The batch submissions originate mostly from sequencing centers
          Each primary database stores its sequences and batch submissions in its own way...
          Batch submissions are marked and/or stored differently than single submissions
          The 'normal' submissions are a minority in primary sequence databases
          Primary sequence dbs are synchronised and every sequence receives a unique identifier
          One sequence entry contains three categories of different types of information
          This sequence information can be written in different formats
          Degree of annotation differs between entries
          SRA contains batch submitted records of which experiment information is of most importance
          How to get sequences into the db, and back out
          Primary sequence data contains a lot of redundancy!
          The primary sequences are the basis for analyses that generate derived sequence data
          Protein databases come in two kinds
          The most important protein db is UniProt and contains 'automatic' and manual entries
          Refseq - The NCBI way to reduce redundancy in primary sequence data
          RefSeq has its own identifiers, not to be mixed up with accession numbers
          UniRef – UniProt's redundancy-reducing system for protein sequences
          NCBI's Gene – summarizes gene information including sequence information from primary dbs
          UniGene – summarizes transcriptomic information around genes
          And a lot more derived databases with sequence information exist
          Searching the database for your gene of interest
          Entrez is a starting point for searches at NCBI
          Visualising the db_xrefs in records at NCBI
          ENA has its text-search portal
          Results from an ENA search are organised following the ENA database structure
          UniProt has a simple search box leading to a sophisticated search results page
          Complex searches can be achieved by using the index codes in the database
          Meta-search tools can search different sequence databases at once.
          Hands-on!

BITS: Basics of sequence databases

  • 1. Basic bioinformatics concepts, databases and tools: Introduction to the training and Sequence databases. Joachim Jacob (http://www.bits.vib.be). Updated 22 February 2012. http://dl.dropbox.com/u/18352887/BITS_training_material/Link%20to%20mod1-intro_H1_2012_SeqDBs.pdf
  • 2. Scope Introductory training to Bioinformatics Exploring and understanding databases and software for everyday bioinformatics use If there is any term which is unclear, please stop me and ask me!
  • 3. Bioinformatics ... Bio all data is derived from living samples Informatics that data is stored and analyzed in and with computers to obtain understanding Extremely broad description, for which however we will extract common principles during the course
  • 4. Bioinformatics is present in every aspect of life sciences research
  • 5. Bioinformatics is present in every aspect of life sciences research
  • 6. Bioinformatics is present in every aspect of life sciences research, sequences
  • 7. Bioinformatics is present in every aspect of life sciences research
  • 8. Bioinformatics is present in every aspect of life sciences research
  • 9. Bioinformatics is present in every aspect of life sciences research
  • 10. Bioinformatics is present in every aspect of life sciences research
  • 11. Bioinformatics is present in every aspect of life sciences research
  • 12. Bioinformatics is present in every aspect of life sciences research
  • 13. Bioinformatics ... Bio - different types of living samples Informatics - storing and categorizing the information and making it easily accessible - interpreting that information reliably
  • 14. Bioinformatics … and its companion. Bio - different types of living samples. Informatics - storing and categorizing the information and making it easily accessible; interpreting that information reliably. Statistics - large numbers, observational data.
  • 15. The siblings of Bioinformatics Based on the biological component extracted from life, the measured properties and the ultimate goal of the analysis, different sub-disciplines of bioinformatics exist. DNA RNA proteins metabolites Genomics Transcriptomics Proteomics Metabolomics Epigenomics Structural bioinformatics Systems biology Microbiomics Interactomics Metagenomics Functional genomics Comparative gx
  • 16. Mere data is worth nothing. Data = symbols (example: CGCTACGCATATCGCT). Information = data that are processed to be useful; provides answers to "who", "what", "where", and "when" questions. Also called metadata (example: Dasypus novemcinctus; found in my garden; part of genome; sequenced in June 2010). Knowledge: application of data and information; answers "how" questions (example: this species seems to be related to my neighbor's pet, because it also has this sequence). Understanding: appreciation of "why" (example: has the same mother). Wisdom. http://www.systems-thinking.org/dikw/dikw.htm
  • 17. Life sciences research as major 'end user' of the bioinformatics tools and conclusions ('tool user'): from data (?) to knowledge (!). Tools and approaches come from bioinformatics research, a specific branch on the boundary of life science, mathematics and computer science ('tool manufacturer'): Biology, Computer, Statistics.
  • 18. This course is organised in several modules Module 1: Sequence databases: what, where, how Module 2: Sequence comparisons: searching, aligning Module 3: Sequence analysis – domains in protein sequences and predicting functionality, standardisation and useful links Module 4: Beyond sequences - additional important data sources Module 5: Genome Browsers - integrating biological data and performing reproducible bioinformatics research in the Galaxy
  • 19. Overview of the crash course
  • 20. One tip for the future Be prepared for change... Information is fluid So are bioinfo tools Learn how to accommodate for change Major resources are more stable Important concepts do not change often
  • 21. Module 1 Sequence databases
  • 22. Module 1: Sequence databases. Sequence databases store DNA and RNA sequences. In bioinformatics, they are (still) by far the largest collections of biological data, and are used by many subdisciplines of bioinformatics. http://www.ebi.ac.uk/embl/Services/DBStats/
  • 23. ... and growing http://www.ebi.ac.uk/embl/Services/DBStats/
  • 24. Three major nucleotide databanks host primary sequence data. (1) European Nucleotide Archive (ENA) at EBI - http://www.ebi.ac.uk/ - with the divisions EMBL-Bank (European Molecular Biology Laboratory) (single), Trace Archive and SRA Archive. (2) GenBank at NCBI - http://www.ncbi.nlm.nih.gov/ - maintained at NCBI (National Center for Biotechnology Information, USA). (3) DDBJ (DNA Data Bank of Japan) - http://www.ddbj.nig.ac.jp/ - maintained at NIG/CIB (National Institute of Genetics, Center for Information Biology, Mishima, Japan).
  • 25. These databases are filled with NA sequence information by scientists and consortia. Submitters: large-scale sequencing projects, individual scientists, patent offices. What they submit (e.g. ACTGCTGCTA, GCTAGCTGAT, CTATGCTAGC, TGTAGCTGAG) is primary sequence data: each primary sequence = one experiment. The primary sequence database holds basically all 'source' nucleotide material. Jennifer McDowall - http://www.biotnet.org/training-materials/nucleotide-sequence-databases-ena
  • 26. Primary NA sequence can be produced by Sanger-based technologies or NGS technologies. Sanger: low output in number of seqs, high quality, 400-850 bp. Read profiles in .abi format. Stored in the Trace Archive. NGS: different technologies. Extremely high output rate, low quality, 30 bp – 600 bp. Reads in .fastq format, stored in the SRA. These techniques can only read DNA strands, so RNA first needs to be converted to cDNA with reverse transcriptases (RT) before loading onto the machines (sample: DNA, or RNA converted by RT to cDNA). Sanger overview: http://www.bio.davidson.edu/Courses/Molbio/MolStudents/spring2003/Obenrader NGS overview: http://seqanswers.com/forums/showthread.php?t=3561
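The .fastq reads mentioned on this slide store each read as four lines: an identifier, the bases, a separator and a quality string. The following is a minimal Python sketch, not the method of any particular tool. It assumes the common Phred+33 quality encoding (the slide does not say which encoding is used), and the read `read1` and the helper name `parse_fastq_record` are made up for illustration.

```python
def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into (name, bases, scores)."""
    header, bases, separator, quality = (line.strip() for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(bases) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Assumption: Phred+33 encoding, i.e. score = ASCII code - 33.
    scores = [ord(char) - 33 for char in quality]
    return header[1:], bases, scores


# Hypothetical read, for illustration only.
record = ["@read1", "ACGT", "+", "II5!"]
name, bases, scores = parse_fastq_record(record)
print(name, bases, scores)
```

A low score at a position means the base call there is less reliable, which is why the quality line travels together with the sequence.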
  • 27. Overview major DNA reading technologies Dennis Wall, NGS Data Analysis and Computation I course, Wall Lab
  • 28. In the primary sequence dbs, entries fall into two major categories. High quality single submissions (Sanger): gene sequence (genomic, 'STD' data class); mRNA sequence (via cDNA, 'STD'); BAC/YAC/cosmid sequences; genome sequencing projects (contigs, assemblies, WGS); genome markers, STS (sequence tagged sites, unique short sequences from a genome). Low quality batch submissions: Expressed Sequence Tags (EST); Genome Survey Sequences (GSS); high-throughput sequence data (e.g. NGS). (Diagram labels: DNA, cDNA, RNA.) http://www.ebi.ac.uk/ena/about/formats
  • 29. The batch submissions originate mostly from sequencing centers Large-scale sequencing projects chromosome fragment sequencing library submission sequence reads e.g. whole genome shotgun submission assemble sequence submission annotation cyp30 cyp309 insv cg343
  • 30. Each primary database stores its sequences and batch submissions in its own way... NCBI: ESTs are stored in dbEST (separate database). ENA: ESTs are part of EMBL-Bank in the 'EST' data class. Similar for GSS (see dbGSS at NCBI). ESTs: expressed sequence tags, often partial sequences derived from RNA in batch (diagram labels: sample, RNA, RNA-seq). Example: >est1 ATCGACTAGCATCA >est2 TCGACTAGCGACTA >est3 CAGCATCATCGAC
  • 31. Batch submissions are marked and/or stored differently than single submissions. ENA structure (tier / class / type): ENA-Annotation (feature annotation): 1) EMBL-Bank, where the data class ESTs are also batch submissions. ENA-Assembly: assembly information. ENA-Reads (sequencing and sampling information; batch submissions): 2) Trace Archive - raw data (capillary sequencing); 3) Sequence Read Archive - raw data (Next Gen sequencing). http://www.biotnet.org/sites/biotnet.org/files/documents/17/2010_ena_v2.0.ppt
  • 32. The 'normal' submissions are a minority in primary sequence databases http://www.ebi.ac.uk/ena/about/statistics#embl_bases_per_dataclass
  • 33. Primary sequence dbs are synchronised and every sequence receives a unique identifier. All database maintainers assign and share a unique accession number (AC) for each sequence, besides their own ID number (info at NCBI). Sequences can get updated, and the accession number is then extended with a version number, e.g. .1 (see SVA). Example of an accession number: BC010109.2. International Nucleotide Sequence Database Collaboration (http://www.insdc.org/): GenBank, DDBJ and ENA (+ SRA) collaborate on features, taxonomy, ... and are synchronized daily. All use the same accession IDs, project IDs and feature tables (see later). http://en.wikipedia.org/wiki/Accession_number_(bioinformatics)
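The accession-plus-version convention from this slide (e.g. BC010109.2) can be taken apart with a few lines of Python. This is only a sketch: the helper name `split_accession` is hypothetical, and the pattern "letters followed by digits, optionally followed by a dot and a version" is a simplifying assumption that does not cover every identifier style in use.

```python
import re

# Simplifying assumption: an accession is uppercase letters followed by
# digits, optionally followed by ".<version>".
_ACCESSION_RE = re.compile(r"^([A-Z]+\d+)(?:\.(\d+))?$")


def split_accession(identifier):
    """Return (accession, version); version is None when absent."""
    match = _ACCESSION_RE.match(identifier.strip())
    if match is None:
        raise ValueError(f"not an accession-like identifier: {identifier!r}")
    accession, version = match.groups()
    if version is None:
        return accession, None
    return accession, int(version)


print(split_accession("BC010109.2"))
print(split_accession("BC010109"))
```

Keeping the version separate makes it easy to tell whether two records refer to the same sequence entry at different points in its update history.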
  • 34. One sequence entry contains three categories of different types of information: 1. Info about sequence, submitters and literature (metadata). 2. Annotations of the sequence (metadata related to the seq). 3. Stretch of ATGC / AUGC sequence (the 'data', at the bottom). • A sequence record is called 'annotated' when biological information is added and linked to a position in the sequence. • Annotations, also called 'features', are abbreviated as codes, which can be found in the Feature Tables. http://www.ebi.ac.uk/embl/Documentation/FT_d
  • 35. This sequence information can be written in different formats: (plain) text format, e.g. GenBank. 1. General info: the official shared accession and the GenBank-specific identifier (simply incremented with each new record). There are a lot of different identifiers (roughly as many as there are databases), so conversion tools that translate identifiers are needed (see exercises). *In humans, the HUGO Nomenclature Committee determines the right gene name. http://mobyle.pasteur.fr/cgi-bin/portal.py#tutorials::seqfmt
  • 36. 2. Annotation. db_xref = cross references = links to records of other databases which are related to this record (see later). The format is dbname:identifier. (Labels in the record: feature name, qualifier name.)
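Because a db_xref value follows the dbname:identifier format, it can be split into its two parts mechanically. The sketch below is illustrative only: `parse_db_xref` is a hypothetical helper, and the example values are chosen for illustration rather than taken from the slide.

```python
def parse_db_xref(value):
    """Split a 'dbname:identifier' cross reference into its two parts."""
    cleaned = value.strip().strip('"')
    # Split on the first colon only, so identifiers may contain colons.
    dbname, separator, identifier = cleaned.partition(":")
    if not separator or not dbname or not identifier:
        raise ValueError(f"not in dbname:identifier format: {value!r}")
    return dbname, identifier


# Illustrative values, not copied from a real record.
print(parse_db_xref("taxon:9606"))
print(parse_db_xref('"HGNC:HGNC:1100"'))
```

Splitting on the first colon only is a deliberate choice here, since some databases embed their own prefix inside the identifier.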
  • 37. 3. Sequence. Each protein sequence also receives an accession number.
  • 38. Other sequence formats. Fasta (minimal metadata, basically only sequence): >genename And a description ATCGATGCAGCTATATCCTCGCGATCAGC CGGACAGCTCTCGAGCGCATCGACGACGAC. ASN.1: Abstract Syntax Notation. EMBL: all info as in gb, online referred to as 'plain text'. XML. Fastq: sequence info and base 'call' quality. Important: 'format' has nothing to do with the program you use to save your file! You don't have a choice: it needs to be 'plain text format' (.txt, not a word-processor file such as the .doc or .rtf files opened with MS Word). Wordpad is a good choice for this. 'Format' in bioinfo is all about how the information is structured and written down in the plain text file. http://emboss.sourceforge.net/docs/themes/SequenceFormats.html
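Since 'format' is only about how the plain text is structured, a Fasta file can be read with a few lines of code. The following Python sketch (the function name `read_fasta` is hypothetical) treats every line starting with ">" as a header and joins the lines that follow into one sequence; it uses the example record from the slide.

```python
def read_fasta(text):
    """Return a list of (header, sequence) tuples from Fasta-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>genename And a description
ATCGATGCAGCTATATCCTCGCGATCAGC
CGGACAGCTCTCGAGCGCATCGACGACGAC
"""
for header, sequence in read_fasta(example):
    print(header, len(sequence))
```

The sequence may be wrapped over any number of lines; only the ">" marker separates one record from the next.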
  • 39. Degree of annotation differs between entries. Batch submitted sequences are annotated poorly, single submissions are annotated better. ENA structure (tier / class / type): ENA-Annotation (feature annotation): 1) EMBL-Bank, good seq annotations. ENA-Assembly: assembly information. ENA-Reads (sequencing and sampling information): 2) Trace Archive - raw data (capillary sequencing); 3) Sequence Read Archive - raw data (Next Gen sequencing). Experiment information is of most importance in batch submissions (e.g. which species, which technique, ...). http://www.biotnet.org/sites/biotnet.org/files/documents/17/2010_ena_v2.0.ppt
  • 40. SRA contains batch submitted records of which experiment information is of most importance. Since the sequences are barely (or not) annotated, the experiment description is important: which machine, which organism, which tissue, which developmental stage, disease, treatment, …
  • 41. How to get sequences into the db, and back out. Submit: always submit your sequence data (mostly obliged by journals) and include your ACC number in articles (not any other number). Submission tools: Sequin (GenBank, stand-alone), Bankit (GenBank, web submission tool), Webin (EMBL, online submission). Retrieve one or a few sequences → use one of the numerous web-based tools: GenBank: Entrez; EMBL: EB-eye; MRS: developed for easy retrieval. Retrieve many sequences (batch retrieval) → use ftp (file transfer protocol) → use perl (flexible programming language) → BioMart http://www.biomart.org/
  • 42. Example of a primary NA sequence record (ENA) http://www.ebi.ac.uk/ena/about/formats
  • 43. Example of a primary NA sequence record (ENA). Text format: a code usable for searching, followed by the data linked to that code. http://www.ebi.ac.uk/ena/about/formats
  • 44. Primary sequence data contains a lot of redundancy! A chromosome sequence, several gene sequences from different labs, EST sequences from transcripts and a cDNA sequence all match to the same gene. Often you end up in your database search with all these sequences... A lot of redundancy!
  • 45. The primary sequences are the basis for analyses that generate derived sequence data Scientists/Consortia → primary databases – Source for further analyses. Which? • Create protein sequences • Curate the sequence database • Assemble genomes • Searching similarities • Aggregate information about one gene • … Results stored in derived databases
  • 46. Protein databases come in two kinds
  • 47. The most important protein db is UniProt and contains 'automatic' and manual entries. UniProt Knowledge Base - 'the best annotated protein database of the world' http://www.uniprot.org/
  • 48. The most important protein db is UniProt and contains 'automatic' and manual entries
  • 49. RefSeq - the NCBI way to reduce redundancy in primary sequence data. RefSeq is NCBI's 'Reference Sequences' (prot and nuc). Redundancy from primary sequence data is reduced both automatically and by manual annotation of NA and protein sequences: 'one natural biological molecule = one entry'. Entries link back to the original primary sequences. Hugely popular and a basis for a lot of analyses. Click to apply the RefSeq filter in an Entrez search. http://www.ncbi.nlm.nih.gov/RefSeq/
  • 50. RefSeq has its own identifiers, not to be mixed up with accession numbers. RefSeq entry codes look similar to ACC numbers (but are not ACC numbers: note the underscore!), and RefSeq records are also in GenBank format. Note: in the 'Features' section one can find the raw sequences from which the entry was derived. (Typical mistake: searching with a RefSeq code in UniProt.)
      NC_* (curated) complete genomic element (chromosome, plasmid, ...)
      NT_* (automated) intermediate assembly from BAC
      NZ_* (automated) incomplete genomic sequence from WGS
      NW_* (automated) intermediate assembly from WGS
      NG_* (curated) incomplete genomic element corresponding to gene
      NM_* (curated) mRNA
      NR_* (curated) non-coding RNA or predicted transcript of pseudogene
      NP_* (curated) protein
      ZP_* (automated) protein predicted from WGS sequence (NZ_*)
      YP_* (curated) other predicted protein sequences from NCBI Genome Annotation Pipeline
      XM_* (automated) mRNA
      XR_* (automated) non-coding RNA or predicted transcript of pseudogene
      XP_* (automated) protein
      http://www.ncbi.nlm.nih.gov/RefSeq/key.html http://www.ncbi.nlm.nih.gov/RefSeq/
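The prefix table on this slide lends itself to a simple lookup. The Python sketch below copies the prefixes and descriptions from the slide; the function name `describe_refseq` and the identifier `NM_000000.1` are placeholders invented for illustration, not real API names or records.

```python
# Prefix -> (curation status, molecule type), as listed on the slide.
REFSEQ_PREFIXES = {
    "NC": ("curated", "complete genomic element"),
    "NT": ("automated", "intermediate assembly from BAC"),
    "NZ": ("automated", "incomplete genomic sequence from WGS"),
    "NW": ("automated", "intermediate assembly from WGS"),
    "NG": ("curated", "incomplete genomic element corresponding to gene"),
    "NM": ("curated", "mRNA"),
    "NR": ("curated", "non-coding RNA or predicted transcript of pseudogene"),
    "NP": ("curated", "protein"),
    "ZP": ("automated", "protein predicted from WGS sequence"),
    "YP": ("curated", "other predicted protein sequence"),
    "XM": ("automated", "mRNA"),
    "XR": ("automated", "non-coding RNA or predicted transcript of pseudogene"),
    "XP": ("automated", "protein"),
}


def describe_refseq(identifier):
    """Return (status, molecule type), or None if it is not a RefSeq code."""
    prefix, underscore, _rest = identifier.strip().partition("_")
    if not underscore:
        return None  # no underscore: looks like a plain accession number
    return REFSEQ_PREFIXES.get(prefix)


print(describe_refseq("NM_000000.1"))  # placeholder identifier
print(describe_refseq("BC010109.2"))   # accession number from the course
```

The underscore check mirrors the rule of thumb on the slide: a code without an underscore is not a RefSeq identifier.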
  • 51. UniRef – UniProt's redundancy-reducing system for protein sequences. Non-redundant protein sequences from UniProt (~ RefSeq). Redundant sequences are hidden by clustering them: • UniRef100 = completely identical sequences • UniRef90 = sequences at least 90% identical • UniRef50 = sequences at least 50% identical. See http://www.uniprot.org/help/uniref
  • 52. NCBI's Gene – summarizes gene information including sequence information from primary dbs Example of the gene NPR1 from A. thaliana
  • 53. UniGene – summarizes transcriptomic information around genes
  • 54. And a lot more derived databases with sequence information exist. Repbase: repeats (Alu, …), maintained by Jerzy Jurka at the Genetic Information Research Institute (Mountain View CA, USA). The CENSOR server allows you to "clean" sequences. http://www.girinst.org/repbase miRBase → published miRNA sequences http://www.mirbase.org/ Eukaryotic promoter database http://www.epd.isb-sib.ch/ UniVec: GenBank subset + some sequences from commercial sources - ftp://ftp.ncbi.nih.gov/pub/UniVec/
  • 55. The most important sequence databases: overview. Columns: primary seq data, derived, curated, integrated search portals. GB: GenPept (derived), RefSeq (curated), Entrez (portal). ENA: trEMBL (derived), ENA search and EB-eye (portals). DDBJ. UniProt: SwissProt (curated), UniProt (portal).
  • 56. Common gene annotations on sequences. Genome sequence (e.g. Chr6): enhancers/promoters, gene sequence (exons, introns), terminator. mRNA: 5'UTR, CDS, 3'UTR, poly(A) tail (AAAAAAAAAAAAA). Protein: obtained from the CDS via the genetic code tables.
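The step from CDS to protein via a genetic code table can be sketched in Python as follows. The table here is the standard genetic code written in the compact TCAG order; other organisms and organelles use different tables, so treating the standard table as the right one is an assumption. The function name `translate` and the nine-base example are made up for illustration.

```python
BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(cds):
    """Translate a coding sequence (DNA or RNA letters) codon by codon."""
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    # Codons with ambiguous letters (e.g. N) raise KeyError in this sketch.
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


# Made-up CDS: start codon, one alanine codon, stop codon.
print(translate("ATGGCCTAA"))
```

Only the CDS is translated; the 5'UTR, 3'UTR and poly(A) tail of the mRNA are left out before this step.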
  • 57. Searching the database for your gene of interest. First you have to determine for yourself which information you want: - NA sequences vs. protein sequences - If NA, genomic sequences or RNA-derived sequences - All possible sequences that exist, or curated ones - Protein sequences of which quality - ...
  • 58. Entrez is a starting point for searches at NCBI http://www.ncbi.nlm.nih.gov/sites/gquery
  • 59. Visualising the db_xrefs in records at NCBI
  • 60. ENA has its text-search portal http://www.ebi.ac.uk/ena/
  • 61. Results from an ENA search are organised following the ENA database structure
  • 62. UniProt has a simple search box leading to a sophisticated search results page
  • 63. Complex searches can be achieved by using the index codes in the database (the codes usable for searching), e.g. "oc=Primates and de=complete and de=cds and de=MHC". This could answer: give me all coding sequences of MHC available in primates.
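To make the meaning of such a query concrete, the sketch below evaluates the same four conditions against a few in-memory records. Everything here is illustrative: the three records are invented, the helper `matches` is hypothetical, and plain case-insensitive substring matching stands in for whatever indexing a real search portal uses.

```python
def matches(record, conditions):
    """True when every (code, term) condition is met (logical AND)."""
    return all(
        term.lower() in record.get(code, "").lower()
        for code, term in conditions
    )


# Invented records: 'oc' = organism classification, 'de' = description.
records = [
    {"id": "rec1", "oc": "Mammalia; Primates; Hominidae",
     "de": "MHC class I antigen, complete cds"},
    {"id": "rec2", "oc": "Mammalia; Rodentia; Muridae",
     "de": "MHC class II antigen, complete cds"},
    {"id": "rec3", "oc": "Mammalia; Primates; Hominidae",
     "de": "MHC class I antigen, partial cds"},
]

# oc=Primates and de=complete and de=cds and de=MHC
query = [("oc", "Primates"), ("de", "complete"), ("de", "cds"), ("de", "MHC")]
hits = [record["id"] for record in records if matches(record, query)]
print(hits)
```

Each extra condition joined with "and" can only narrow the result, which is why the combined query isolates complete coding sequences from primates.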
  • 64. Meta-search tools can search different sequence databases at once. MRS: open source, developed by Maarten Hekkelman at Radboud U. (Nijmegen, the Netherlands). Allows searching in different databases at once, and also provides statistics on the databases. Alternatives: ACNUC, SRS
  • 65. Logical operators. Searching involves making combinations of conditions. The difference between a logical AND, OR and NOT is explained here by Venn diagrams: Q1 AND Q2 (&), Q1 NOT Q2 (!), Q1 OR Q2 (|).
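The three operators behave exactly like set operations on the result lists of two queries. A minimal Python sketch, where `q1` and `q2` are invented sets of record identifiers standing in for the hits of two searches:

```python
# Invented result sets of two queries Q1 and Q2.
q1 = {"A1", "A2", "A3"}
q2 = {"A2", "A3", "A4"}

both = q1 & q2      # Q1 AND Q2: records found by both queries
only_q1 = q1 - q2   # Q1 NOT Q2: records found by Q1 but not by Q2
either = q1 | q2    # Q1 OR Q2: records found by at least one query

print(sorted(both), sorted(only_q1), sorted(either))
```

AND narrows a search, OR widens it, and NOT removes unwanted hits from an otherwise good result.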
  • 66. Hands-on! Every module ends with an exercise session. We will now explore how data is stored in different sequence databases. You get …. minutes for this exercise. Afterwards, we summarize some of the difficulties some of you might have experienced.
  • 67. Summary
      This course is organised in several modules
      Module 1: Sequence databases
      Three major nucleotide databanks host primary sequence data
      These databases are filled with NA sequence information by scientists and consortia
      The batch submissions originate mostly from sequencing centers
      Each primary database stores their sequences and batch submissions in their own way...
      Batch submissions are marked and/or stored differently than single submissions
      The 'normal' submissions are a minority in primary sequence databases
      Primary sequence dbs are synchronised and every sequence receives a unique identifier
      One sequence entry contains three categories of different types of information
      This sequence information can be written in different formats
      Degree of annotation differs between entries
      SRA contains batch submitted records of which experiment information is of most importance
      How to get sequences into the db, and back out
      Primary sequence data contains a lot of redundancy!
      The primary sequences are the basis for analyses that generate derived sequence data
      Protein databases come in two kinds
      The most important protein db is UniProt and contains 'automatic' and manual entries
      Refseq - The NCBI way to reduce redundancy in primary sequence data
      RefSeq has its own identifiers, not to be mixed up with accession numbers
      UniRef – UniProt redundancy reducing system for proteins sequences
      NCBI's Gene – summarizes gene information including sequence information from primary dbs
      UniGene – summarizes transcriptomic information around genes
      And a lot more derived databases with sequence information exist
      Searching the database for your gene of interest
      Entrez is a starting point for searches at NCBI
      Visualising the db_xrefs in records at NCBI
      ENA has its text-search portal
      Results from an ENA search are organised following the ENA database structure
      UniProt has a simple search box leading to a sophisticated search results page
      Complex searches can be achieved by using the index codes in the database
      Meta-search tools can search different sequence databases at once.
      Hands-on!