Introduction to Biological databases

BIOLOGICAL DATABASES
Dr. K. RameshKumar
Assistant Professor
PG & Research Dept. of Zoology
Vivekananda College
Tiruvedakam- West
MADURAI - 625234

WHAT IS A DATABASE?
 A database is any collection of related data.
 A computerized archive used to store and organize
data in such a way that information can be retrieved
easily.
 A database is a collection of interrelated data stored
together without harmful or unnecessary
redundancy (duplicate data) to serve multiple
applications.
 Retrieving data from a database is called firing a query.

 Structured collection of information.
 Consists of basic units called records or entries.
 Each record consists of fields, which hold predefined
data related to the record.
 For example, a protein database would have
protein entries as records and protein properties as
fields (e.g., name of protein, length, amino-acid
sequence)
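The record/field idea and "firing a query" can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table name, the column names, and the two toy entries are hypothetical and do not come from any real protein database.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Each row is a record (entry); each column is a field.
conn.execute(
    "CREATE TABLE protein (name TEXT PRIMARY KEY, length INTEGER, sequence TEXT)"
)

# Two made-up entries; the sequences are toy strings, not real proteins.
entries = [
    ("toy_protein_A", 8, "MSFVAGVI"),
    ("toy_protein_B", 5, "MKTAY"),
]
conn.executemany("INSERT INTO protein VALUES (?, ?, ?)", entries)


def proteins_longer_than(min_length):
    """Fire a query: names of records whose length field exceeds min_length."""
    rows = conn.execute(
        "SELECT name FROM protein WHERE length > ? ORDER BY name", (min_length,)
    )
    return [name for (name,) in rows]


print(proteins_longer_than(6))
```

Declaring the name as the primary key is one simple way to keep duplicate records out, in line with the "minimum redundancy" goal.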

THE ‘PERFECT’ DATABASE
 Comprehensive, but easy to search.
 Annotated, but not “too annotated”.
 A simple, easy to understand structure.
 Cross-referenced.
 Minimum redundancy.
 Easy retrieval of data.

 Data heterogeneity – diversity in the data types
 High volume of data – the data are voluminous as well as
heterogeneous, so that they can support comprehensive investigation in
various fields
 Uncertainty – biological phenomena are observed and
assumed to be true
 Data curation – inconsistencies arise from a lack of
knowledge in the relevant field
 Large-scale data integration – data are collected from labs
worldwide after years of research
 Data sharing – open to examination and inspection by the
scientific community
 Dynamic and subject to continual change – new data are added
every day by various labs

 The need for storing and communicating large
datasets has grown.
 To make biological data available to scientists.
 To make biological data available in computer-readable
form.

DIFFERENT CLASSIFICATIONS OF
DATABASES
 Type of data
 nucleotide sequences
 protein sequences
 protein sequence patterns or motifs
 macromolecular 3D structure
 gene expression data
 metabolic pathways

DIFFERENT CLASSIFICATIONS OF DATABASES….
 Primary or derived databases
 Primary databases: experimental results submitted directly to the
database
 Secondary databases: results of analysis of primary
databases
 Aggregate of many databases
 Links to other data items
 Combination of data
 Consolidation of data

DIFFERENT CLASSIFICATIONS OF DATABASES….
 Availability
 Publicly available, no restrictions
 Available, but with copyright
 Accessible, but not downloadable
 Academic, but not freely available
 Proprietary, commercial; possibly free for academics

THE CENTRAL DOGMA & BIOLOGICAL DATA
[Diagram: types of biological data along the central dogma]
Original DNA sequences (genomes)
Expressed DNA sequences (= mRNA sequences = cDNA sequences)
Expressed Sequence Tags (ESTs)
Protein sequences
- Inferred
- Direct sequencing
Protein structures
- Experiments
- Models (homologues)
Literature information

NATIONAL CENTER FOR
BIOTECHNOLOGY INFORMATION
Created in 1988 as a part of the
National Library of Medicine at NIH
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
Bethesda, MD

WEB ACCESS: WWW.NCBI.NLM.NIH.GOV

NCBI AND ENTREZ
 One of the largest and most comprehensive
database resources, belonging to the NIH (National
Institutes of Health, USA)
 Entrez is the search engine of NCBI
 Search for:
genes, proteins, genomes, structures, diseases,
publications and more.
 http://www.ncbi.nlm.nih.gov/

PRIMARY DATABASES
 These databases contain the raw sequence data
that are produced and submitted
by researchers worldwide.
• Nucleic acid
EMBL
GenBank
DDBJ (DNA Data Bank of Japan)
• Protein
PIR
MIPS
SWISS-PROT
TrEMBL
NRL-3D

NUCLEOTIDE SEQUENCE DATABASES
 EMBL, GenBank, and DDBJ are the three primary
nucleotide sequence databases
 EMBL www.ebi.ac.uk/embl/
 GenBank www.ncbi.nlm.nih.gov/Genbank/
 DDBJ www.ddbj.nig.ac.jp
 Together they constitute the International
Nucleotide Sequence Database Collaboration.

GENBANK
 An annotated collection of all publicly available
nucleotide and protein sequences
 Set up in 1979 at LANL (Los Alamos).
 Maintained since 1992 by NCBI (Bethesda).
 http://www.ncbi.nlm.nih.gov

EMBL NUCLEOTIDE SEQUENCE
DATABASE
 An annotated collection of all publicly available
nucleotide and protein sequences
 Created in 1980 at the European Molecular
Biology Laboratory in Heidelberg.
 Maintained since 1994 by EBI- Cambridge.
 http://www.ebi.ac.uk/embl.html

DDBJ–DNA DATA BANK OF JAPAN
 An annotated collection of all publicly
available nucleotide and protein sequences
 Started in 1984 at the National Institute of
Genetics (NIG) in Mishima.
 Still maintained at this institute by a team led
by Takashi Gojobori.
 http://www.ddbj.nig.ac.jp

NCBI DATABASES AND SERVICES
 GenBank: primary sequence database
 Free public access to biomedical literature
 PubMed: free Medline (3 million searches per day)
 PubMed Central: full-text online access
 Entrez: integrated molecular and literature databases

TYPES OF MOLECULAR DATABASES
 Primary Databases
 Original submissions by experimentalists
 Content controlled by the submitter
 Examples: GenBank, Trace, SRA, SNP, GEO
 Derivative Databases
 Derived from primary data
 Content controlled by third party (NCBI)
 Examples: NCBI Protein, RefSeq, TPA, RefSNP, GEO datasets, UniGene,
HomoloGene, Structure, Conserved Domain

PRIMARY VS. DERIVATIVE SEQUENCE
DATABASES
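[Diagram: flow of sequence data into primary and derivative databases]
GenBank (primary): sequences such as TATAGCCG are submitted by sequencing
centers and labs. Updated ONLY by submitters.
UniGene (derivative): produced from GenBank records by algorithms.
RefSeq (derivative): produced from GenBank records by curators and by
genome assembly.
Derivative databases are updated continually by NCBI.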

SEQUENCE DATABASES AT NCBI
 Primary
 GenBank: NCBI’s primary sequence database
 Trace Archive: reads from capillary sequencers
 Sequence Read Archive: next generation data
 Derivative
 GenPept (GenBank translations)
 Outside Protein (UniProt—Swiss-Prot, PDB)
 NCBI Reference Sequences (RefSeq)

GENBANK - PRIMARY SEQUENCE DB
 Nucleotide-only sequence database
 Archival in nature
 Historical
 Reflective of submitter point of view (subjective)
 Redundant
 Data
 Direct submissions (traditional records)
 Batch submissions
 FTP accounts (genome data)

GENBANK - PRIMARY SEQUENCE DB (2)
 Three collaborating databases
1. GenBank
2. DNA Data Bank of Japan (DDBJ)
3. European Molecular Biology Laboratory (EMBL) Database

TRADITIONAL GENBANK RECORD
ACCESSION U07418
VERSION U07418.1 GI:466461
Accession: stable, reportable, universal
Version: tracks changes in sequence
GI number: NCBI internal use
Well annotated; the sequence is the data.
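A minimal sketch of how the VERSION identifier above breaks down into a stable accession and a version number that tracks sequence changes. The function name is hypothetical and the check is deliberately simple; it is not NCBI's own validation logic.

```python
def split_accession_version(identifier):
    """Split an 'accession.version' identifier such as 'U07418.1'."""
    accession, sep, version = identifier.partition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    # The accession stays the same across updates; the version is a counter.
    return accession, int(version)


accession, version = split_accession_version("U07418.1")
print(accession, version)
```

The same split applies to the protein_id shown in the record's feature table, for example AAC50285.1.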

FEATURES             Location/Qualifiers
     source          1..2484
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="3"
                     /map="3p22-p23"
     gene            1..2484
                     /gene="MLH1"
     CDS             22..2292
                     /gene="MLH1"
                     /note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession
                     Number P14242), S. cerevisiae MLH1 (GenBank Accession
                     Number U07187), E. coli MUTL (Swiss-Prot Accession Number
                     P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession
                     Number P14161) and Streptococcus pneumoniae (Swiss-Prot
                     Accession Number P14160)"
                     /codon_start=1
                     /product="DNA mismatch repair protein homolog"
                     /protein_id="AAC50285.1"
                     /db_xref="GI:463989"
                     /translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS
                     TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE
                     ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA
                     TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS

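The CDS location in the feature table can be turned into a length with simple arithmetic. The sketch below handles only plain start..end locations (1-based, inclusive), as in the record above; it assumes no join(), complement() or partial-location syntax, and the function names are hypothetical.

```python
def parse_simple_location(location):
    """Parse a plain 'start..end' feature location (1-based, inclusive)."""
    start_text, sep, end_text = location.partition("..")
    if not sep:
        raise ValueError(f"unsupported location: {location!r}")
    return int(start_text), int(end_text)


def cds_summary(location):
    """Return (nucleotides, whole codons, leftover bases) for a CDS location."""
    start, end = parse_simple_location(location)
    nucleotides = end - start + 1  # both ends are included
    codons, remainder = divmod(nucleotides, 3)
    return nucleotides, codons, remainder


print(cds_summary("22..2292"))
```

For 22..2292 this gives 2271 nucleotides, i.e. 757 whole codons with nothing left over. If the last codon is the stop codon, that corresponds to a 756-residue protein.
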
GENPEPT: GENBANK CDS TRANSLATIONS
>gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...
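A small sketch of reading the definition line of the GenPept record shown above. It assumes exactly the gi|number|gb|accession| layout seen on this slide; other database tags and newer defline styles are not handled, and the function name is hypothetical.

```python
def parse_defline(defline):
    """Parse '>gi|463989|gb|AAC50285.1| description' into a dict."""
    if not defline.startswith(">"):
        raise ValueError("a FASTA definition line starts with '>'")
    identifier_part, _, description = defline[1:].partition(" ")
    # Tokens alternate between a tag and its value: gi, 463989, gb, AAC50285.1
    tokens = identifier_part.split("|")
    pairs = dict(zip(tokens[0::2], tokens[1::2]))
    return {
        "gi": pairs.get("gi"),
        "accession": pairs.get("gb"),
        "description": description,
    }


record = parse_defline(">gi|463989|gb|AAC50285.1| DNA mismatch repair prote...")
print(record["gi"], record["accession"])
```

The GI number and protein accession recovered here match the /db_xref and /protein_id qualifiers of the CDS feature in the GenBank record.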

REFSEQ: DERIVATIVE SEQUENCE DATABASE
 Curated transcripts and proteins
 Model transcripts and proteins
 Assembled Genomic Regions
 Chromosome records
 Human genome
 microbial
 organelle
ftp://ftp.ncbi.nih.gov/refseq/release/

SELECTED REFSEQ ACCESSION NUMBERS
mRNAs and Proteins
  NM_123456   Curated mRNA
  NP_123456   Curated Protein
  NR_123456   Curated non-coding RNA
  XM_123456   Predicted mRNA
  XP_123456   Predicted Protein
  XR_123456   Predicted non-coding RNA
Gene Records
  NG_123456   Reference Genomic Sequence
Chromosome
  NC_123455   Microbial replicons, organelle
  AC_123455   Alternate assemblies
Assemblies
  NT_123456   Contig
  NW_123456   WGS Supercontig
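The prefix table above lends itself to a simple lookup. The sketch below covers only the prefixes listed on this slide; the dictionary and function names are hypothetical, and an accession without a known RefSeq prefix simply returns None.

```python
# Prefix meanings copied from the slide; other RefSeq prefixes are not covered.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated protein",
    "NR": "Curated non-coding RNA",
    "XM": "Predicted mRNA",
    "XP": "Predicted protein",
    "XR": "Predicted non-coding RNA",
    "NG": "Reference genomic sequence",
    "NC": "Chromosome: microbial replicons, organelle",
    "AC": "Chromosome: alternate assemblies",
    "NT": "Assembly: contig",
    "NW": "Assembly: WGS supercontig",
}


def describe_refseq(accession):
    """Return the record type for a RefSeq-style accession, or None."""
    prefix, sep, _ = accession.partition("_")
    if not sep:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix.upper())


print(describe_refseq("NM_123456"))
print(describe_refseq("U07418"))
```

The underscore is what separates the distinct RefSeq accession series from GenBank accessions such as U07418.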

REFSEQS: ANNOTATION REAGENTS
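[Diagram: RefSeq records used as annotation reagents]
Genomic DNA (NC, NT, NW)
Model mRNA (XM) and model non-coding RNA (XR)
Curated mRNA (NM) and curated non-coding RNA (NR)
Model protein (XP)
Curated protein (NP)
Scanning: GenBank sequences = RefSeq?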

REFSEQ BENEFITS
 Non-redundancy
 Updates to reflect current sequence data and biology
 Data validation
 Format consistency
 Distinct accession series
 Stewardship by NCBI staff and collaborators

OTHER DERIVATIVE DATABASES
 Expressed Sequences
 dbSNP
 Structure
 Gene
 and more…

ENTREZ
FINDING RELEVANT INFORMATION
IN NCBI DATABASES

ENTREZ: A DISCOVERY SYSTEM
[Diagram: Entrez databases and the links between them]
Databases: Gene, Taxonomy, PubMed abstracts, Nucleotide sequences,
Protein sequences, 3-D Structure
Neighbors (related records within a database):
- PubMed abstracts: word weight
- Nucleotide and protein sequences: BLAST (related sequences)
- 3-D Structure: VAST (related structures)
Links between databases: hard links, phylogeny, BLink, domains
•Pre-computed and pre-compiled data.
•A potential “gold mine” of undiscovered
relationships.
•Used less than expected.

GLOBAL QUERY: ALL NCBI DATABASES
The Entrez system: 38 (and counting) integrated databases

TRADITIONAL METHOD: THE LINKS MENU
DNA Sequence
Nucleotide – Protein Link
Related Proteins
Protein – Structure Link
3-D Structure

THE PROBLEM
 Rapidly growing databases with complex and changing
relationships
 Rapidly changing interfaces to match the above
Result
 Many people don’t know:
 Where to begin
 Where to click on a Web page
 Why it might be useful to click there

GLOBAL NCBI (ENTREZ) SEARCH
colon cancer

ENTREZ TIP: START SEARCHES IN GENE
[Diagram: Gene as the starting point, linked to Entrez Protein, UniGene,
BLink, HomoloGene (gene neighbors) and other Entrez DBs]

PRECISE RESULTS
MLH1[Gene Name] AND Human[Organism]
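The fielded query above can also be assembled programmatically. The sketch below builds the same term and a search URL for NCBI's E-utilities ESearch service. E-utilities is not mentioned on these slides and is brought in here as an assumption about how one might script such a search. The helper names are hypothetical, and the code only constructs the URL without contacting NCBI.

```python
from urllib.parse import urlencode

# Base address of the ESearch service (part of NCBI E-utilities).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(clauses):
    """Join (value, field) pairs into 'value[Field] AND value[Field]'."""
    return " AND ".join(f"{value}[{field}]" for value, field in clauses)


def build_esearch_url(db, term):
    """Return an ESearch URL for the given Entrez database and query term."""
    return f"{ESEARCH_URL}?{urlencode({'db': db, 'term': term})}"


term = build_term([("MLH1", "Gene Name"), ("Human", "Organism")])
url = build_esearch_url("gene", term)
print(term)
print(url)
```

Restricting each word to a field is what makes the results precise: a bare search for MLH1 would also match records that merely mention the gene.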

GENEVIEW: HUMAN MLH1 VARIATIONS
ATPase domain

‘TAKE HOME MESSAGE’ ADVANTAGES OF
DATA INTEGRATION
 More relevant inter-related information in one place
 Makes it easier to find additional relevant
information related to your initial query
 Potentially find information indirectly linked, but
relevant to your subject of interest
 Uncover non-obvious genetic features that explain
phenotype or disease
 Easier to build a ‘story’ based on multiple pieces of
biological evidence
