Biological Databases
SMT. P.SANGEETHA
LECTURER IN BIOTECHNOLOGY
KVRGCW(A), KURNOOL
A biological database is a large, organized body of persistent data, usually
associated with computerized software designed to update, query, and retrieve
components of the data stored within the system.
The chief objective of the development of a database is to organize data in a set of
structured records to enable easy retrieval of information.
Examples: A few popular databases are GenBank from NCBI (National Center for
Biotechnology Information), Swiss-Prot from the Swiss Institute
of Bioinformatics, and PIR (the Protein Information Resource).
Importance of Databases
1. Databases act as a storehouse of information.
2. Databases store and organize data in such a way that
information can be retrieved easily via a variety of search criteria.
3. They facilitate the discovery of new biological insights from raw data.
4. Secondary databases have become the molecular biologist’s reference
library over the past decade or so, providing a wealth of information on just
about any gene or gene product that has been investigated by the research
community.
5. They handle cases where many users want to access the same entries of
data at the same time.
6. They allow the data to be indexed.
7. They help to remove redundancy from the data.
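Points 2, 6 and 7 above (retrieval by search criteria, indexing, and redundancy removal) can be illustrated with a minimal sketch. The records, accession strings, and function names below are hypothetical toy values chosen only for illustration; real databases use far more elaborate schemas and indexing engines.

```python
from collections import defaultdict

# Hypothetical toy records: (accession, organism, sequence).
records = [
    ("ACC001", "Homo sapiens", "ATGGCC"),
    ("ACC002", "Mus musculus", "ATGGCA"),
    ("ACC003", "Homo sapiens", "ATGGCC"),  # same sequence as ACC001
]


def build_index(records, field):
    """Map each value of one field to the accessions that carry it."""
    index = defaultdict(list)
    for record in records:
        index[record[field]].append(record[0])
    return dict(index)


def remove_redundancy(records):
    """Keep only the first record seen for each distinct sequence."""
    seen = set()
    unique = []
    for record in records:
        if record[2] not in seen:
            seen.add(record[2])
            unique.append(record)
    return unique


by_organism = build_index(records, 1)       # index on the organism field
non_redundant = remove_redundancy(records)  # ACC003 duplicates ACC001
```

Looking up `by_organism["Homo sapiens"]` then returns the matching accessions directly, without scanning every record.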
Types of Biological Databases
1. Based on content of biological data
2. Based on the nature of data.
1. Based on content of biological data
1. Primary databases
2. Secondary databases
1. Primary databases
• Primary databases are also called archival databases.
• They are populated with experimentally derived data such as nucleotide
sequence, protein sequence or macromolecular structure.
• Experimental results are submitted directly into the database by researchers, and
the data are essentially archival in nature.
• Once given a database accession number, the data in primary databases are never
changed: they form part of the scientific record.
Examples
GenBank and DDBJ (nucleotide sequence)
Protein Data Bank (PDB; coordinates of three-dimensional macromolecular
structures)
2. Secondary databases
Secondary databases comprise data derived from the results of analysing primary
data.
Secondary databases often draw upon information from numerous sources,
including other databases (primary and secondary), controlled vocabularies and
the scientific literature.
They are highly curated, often using a complex combination of computational
algorithms and manual analysis and interpretation to derive new knowledge from
the public record of science.
Examples
InterPro (protein families, motifs and domains)
UniProt Knowledgebase (sequence and functional information on proteins)
Ensembl (variation, function, regulation and more layered onto whole
genome sequences)
2. Based on the nature of data
1. Structural database
2. Sequence database
i. Protein sequence databases
ii. Nucleic acid sequence databases
1. Structural databases
Structural databases contain three-dimensional structural information for
macromolecules, derived from the analysis of experimental data such as X-ray diffraction.
Ex. PDB, CATH and SCOP
PDB (Protein Data Bank)
www.rcsb.org/pdb/
• The PDB was established in 1971 at Brookhaven National Laboratory on Long Island, New
York State, US.
• In 1999, the management was moved to the Research Collaboratory for
Structural Bioinformatics (RCSB), a joint organisation whose members include Rutgers University
and the San Diego Supercomputer Center.
• PDB entries contain the atomic coordinates and some structural parameters
connected with the atoms or computed from the structures (for example, secondary structure).
• PDB entries contain some annotation, but it is not as comprehensive
as in Swiss-Prot.
• There are no legal restrictions on the use of the data in PDB.
• The Protein Data Bank is an archive of experimentally determined
three-dimensional (3D) structures of biological macromolecules, serving a global
community of researchers, educators, and students.
• The archive contains atomic coordinates, bibliographic citations, primary
and secondary structure information, as well as crystallographic structure
factors and NMR (Nuclear Magnetic Resonance) experimental data.
• PDB is the main primary database for 3D structures of biological
macromolecules determined by X-ray crystallography and NMR.
• Structural biologists usually deposit their structures in the PDB on
publication, and some scientific journals require this before accepting a
paper.
• It also accepts the experimental data used to determine the structures
(X-ray crystallography and NMR). Homology models were accepted in the past, but
the archive is now limited to experimentally determined structures.
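To make the idea of "atomic coordinates" concrete, here is a minimal sketch of reading one ATOM record in the legacy fixed-width PDB text format. The column positions follow the commonly documented ATOM layout; the helper names and the sample atom values are hypothetical, and the formatter is simplified (it ignores alternate locations, insertion codes, occupancy and B-factors).

```python
def format_atom(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a simplified fixed-width ATOM line (columns 1-54 only)."""
    return (
        f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}"
    )


def parse_atom(line):
    """Slice an ATOM line by column position (0-based Python slices)."""
    return {
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain": line[21],                # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
    }


# Hypothetical atom: alpha carbon of a glycine, residue 12, chain A.
sample_line = format_atom(1, "CA", "GLY", "A", 12, 11.104, 6.134, -6.504)
atom = parse_atom(sample_line)
```

For real work, an established parser (for example in Biopython) is a safer choice than hand-slicing columns; the sketch only shows why the format is described as coordinate records.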
2. Sequence databases
A sequence database is a type of biological database composed of a
large collection of nucleic acid sequences, protein sequences or other polymer
sequences stored on a computer. These include
I. Nucleotide databases
II. Protein databases
NCBI (National Center for Biotechnology Information)
www.ncbi.nlm.nih.gov
• NCBI is a publicly available resource on the web. It was established in November
1988 at the National Library of Medicine (NLM) in the United States.
• The NLM was chosen because it had experience in creating and
maintaining biomedical databases and, as part of the National Institutes of
Health (NIH), it could establish a research program in computational
molecular biology.
• The mission of NCBI is to develop new information technologies to aid in understanding the
fundamental molecular and genetic processes that control health and disease.
• More specifically, NCBI has been charged with creating automated systems for storing
and analysing knowledge about molecular biology, biochemistry and genetics; facilitating
the use of such databases and software by the research and medical community;
coordinating efforts to gather biotechnology information both nationally and internationally;
and performing research into advanced methods of computer-based information processing
for analysing the structure and function of biologically important molecules.
NCBI maintains several databases and resources, including:
• Literature databases
• Entrez databases
• Nucleotide databases
• Genome-specific resources
• Tools for data mining
• Tools for sequence analysis
• Tools for 3D structure display and similarity searching
• Maps
• Resource statistics
• Collaborative cancer research
• FTP (File Transfer Protocol)
1. Nucleotide databases
The nucleotide database is a collection of sequences from several sources, including
GenBank, RefSeq, etc.
I. PRIMARY DATABASES OF NUCLEOTIDE SEQUENCES:
These are the chief databases that store raw nucleic acid sequences and make them available to
the public and researchers. They are referred to as primary nucleotide sequence databases
since they are the repositories of all nucleic acid sequences.
Ex. GenBank, DDBJ, EMBL
1. EMBL (European Molecular Biology Laboratory)
www.ebi.ac.uk
• EMBL is the nucleotide sequence database from the EBI (European Bioinformatics
Institute).
• The EBI manages databases of biological data, including nucleic acid sequences,
protein sequences and macromolecular structures.
• The EBI is a pioneer of novel and developmental bioinformatics research.
• The EBI is a centre for research and services in bioinformatics.
• The mission of the EBI is to ensure that the growing body of information from
molecular biology and genome research is placed in the public domain and is
freely accessible.
• The database is produced in collaboration with DDBJ and GenBank.
• Information can be retrieved from EMBL using the SRS (Sequence Retrieval
System), which links the principal DNA and protein sequence databases with
motif, structure, mapping and other specialist databases.
• SRS is one of the most powerful data browsing and retrieval tools available. It
provides rapid, user-friendly access to the large volumes of diverse and
heterogeneous life science data stored in more than 400 internal and public domain
databases.
• It can be used to browse the various biological sequence and literature databases.
• The EBI provides access to many tools for browsing and retrieving biological
sequence and literature data.
2. DDBJ (DNA Data Bank of Japan)
www.ddbj.nig.ac.jp
• DDBJ began in 1986 as a collaboration with EMBL and GenBank. The database
is produced, maintained and distributed at the National Institute of Genetics.
• Sequences may be submitted to it from anywhere in the world by means of a
web-based data submission tool.
• The web is also used to provide standard search tools such as FASTA and BLAST.
• DDBJ is the sole DNA data bank in Japan that is officially certified to collect DNA
sequences from researchers and to issue internationally recognised accession numbers to
data submitters.
• DDBJ is one of the international DNA databases, alongside the EBI (responsible for the EMBL
database) and NCBI (responsible for the GenBank database).
• Consequently, DDBJ collaborates with the other two data banks by exchanging
data and information over the Internet, and by holding two meetings: the International DNA
Data Banks Advisory Meeting (IAM) and the International DNA Data Banks Collaborative
Meeting (ICM).
3. GenBank
• GenBank, the DNA database from NCBI, incorporates sequences from publicly
available sources.
• Information can be retrieved from GenBank using the Entrez integrated
retrieval system, which combines data from the principal DNA and protein sequence
databases with information from genome maps and protein structures.
• Additional information on sequences can be accessed via the MEDLINE facility,
which provides abstracts from the original published articles.
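Entrez retrieval can also be done programmatically through NCBI's E-utilities web service. The sketch below only builds an `efetch` request URL for a GenBank flat-file record and does not contact the network; the function name is hypothetical, and the accession is used purely as an example value.

```python
from urllib.parse import urlencode

# Base address of the E-utilities efetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_fetch_url(accession):
    """Return an efetch URL asking for one nucleotide record as GenBank text."""
    params = {
        "db": "nuccore",    # Entrez nucleotide database
        "id": accession,    # accession number or other identifier
        "rettype": "gb",    # GenBank flat-file format
        "retmode": "text",  # plain text rather than XML
    }
    return EFETCH_URL + "?" + urlencode(params)


url = genbank_fetch_url("U49845")  # example accession, for illustration only
```

The resulting URL could be opened with any HTTP client; NCBI's usage guidelines (request rate limits, API keys) apply to real queries and are not handled here.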
• GenBank may be searched with a user's query sequence by means of
NCBI’s web interface to the BLAST suite of programs.
• A GenBank release includes the sequence files, indices created on various database
fields, and information derived from the database (e.g. GenPept, a database of
translated coding sequences in FASTA format). The most commonly used file is the
sequence entry file, which contains the sequence itself and descriptive
information relating to it.
• A GenBank entry consists of keywords, relevant associated sub-keywords,
and an optional Feature Table.
• The entry continues with the BASE COUNT record, which details the
frequency of occurrence of the different base types in the sequence. The end
of the entry is indicated by a // terminator.
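The two structural points above, the `//` terminator and the base-frequency summary, can be sketched in a few lines. The flat-file text below is a hypothetical, heavily abbreviated stand-in for real GenBank entries, and the function names are illustrative only.

```python
from collections import Counter

# Hypothetical, abbreviated flat-file text holding two entries.
sample_text = (
    "LOCUS       TOY0001\n"
    "ORIGIN\n"
    "        1 gatcctccat\n"
    "//\n"
    "LOCUS       TOY0002\n"
    "ORIGIN\n"
    "        1 aaccggtt\n"
    "//\n"
)


def split_entries(flat_file_text):
    """Split flat-file text into entries, using '//' lines as terminators."""
    entries, current = [], []
    for line in flat_file_text.splitlines():
        if line.strip() == "//":
            if current:
                entries.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return entries


def base_count(sequence):
    """Count occurrences of each base, as a BASE COUNT style summary."""
    counts = Counter(sequence.lower())
    return {base: counts[base] for base in "acgt"}


entries = split_entries(sample_text)
counts = base_count("gatcctccat")
```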
2. Secondary databases of nucleotide sequences
Many of the secondary databases are simply sub-collections of sequences culled from one
or other of the primary databases, such as GenBank or EMBL.
1. Omniome database:
• A comprehensive microbial resource maintained by TIGR (The Institute for
Genomic Research).
• It holds not only the sequence and annotation of each of the completed genomes,
but also associated information about the organisms (such as taxon and Gram
stain pattern), the structure and composition of their DNA molecules, and many
other attributes of the protein sequences predicted from the DNA sequences.
2. FlyBase:
A database for the fruit fly D. melanogaster, whose entire genome was sequenced
by a consortium to a high degree of completeness and quality.
3. ACeDB:
A repository of not only the sequence but also the genetic map and
phenotypic information for the nematode worm C. elegans.
II. PROTEIN DATABASES:
A protein database is one or more datasets about proteins' amino acid
sequences, conformation, structure and features such as active sites.
1. Primary databases of proteins:
The primary databases hold protein sequences that are either experimentally
determined or inferred from the conceptual translation of nucleotide sequences.
1. PIR (Protein Information Resource)
www.pir.georgetown.edu
• The Protein Sequence Database was developed at the National Biomedical
Research Foundation (NBRF) in the US.
• It is maintained in collaboration with the Martinsried Institute for Protein Sequences
(MIPS) and the Japan International Protein Information Database (JIPID).
• PIR was developed by Margaret Dayhoff as a collection of sequences for
investigating evolutionary relationships among proteins.
The PIR database is split into four distinct sections, PIR1 to PIR4, which
differ in the quality of the data and the level of annotation provided.
PIR1: contains fully classified and annotated entries
PIR2: includes preliminary entries which have not been thoroughly
reviewed and may contain redundancy
PIR3: contains unverified entries, which have not been reviewed
PIR4: entries fall into four categories:
1. Conceptual translations of artefactual sequences.
2. Conceptual translations of sequences that are not transcribed or translated.
3. Protein sequences or conceptual translations that are genetically engineered.
4. Sequences that are not genetically encoded and not produced on ribosomes.
One can search for entries or run sequence similarity searches at the PIR site. The database
can also be downloaded as a set of files.
2. Swiss-Prot
www.expasy.ch/sprot/
• Swiss-Prot is a protein sequence database established in 1986. It was
produced collaboratively by the Department of Medical Biochemistry at the
University of Geneva and the EMBL; after 1994, the collaboration moved to
EMBL’s UK outstation, the EBI.
• In 1998, the collaboration moved to the Swiss Institute of
Bioinformatics (SIB). Hence, the database is now maintained collaboratively
by SIB and EBI/EMBL.
• Swiss-Prot strives to provide a high level of annotation (such as the
description of the function of a protein, its domain structure,
post-translational modifications and variants), a minimal
level of redundancy, and a high level of integration with other databases.
• In 1996, a computer-annotated supplement to Swiss-Prot, termed TrEMBL,
was created.
In Swiss-Prot, as in many sequence databases, two classes of data can be
distinguished:
1. Core data:
   1. Sequence data
   2. Citation information (bibliographic references)
   3. Taxonomic data (description of the biological source of the protein)
2. Annotation:
   1. Function of the protein
   2. Post-translational modifications
   3. Domains and sites
   4. Secondary structure
   5. Quaternary structure
   6. Similarities to other proteins
   7. Diseases associated with any number of deficiencies in the protein
   8. Sequence conflicts, variants
Sequence Entry File
• Each line is flagged with a two-letter code, which helps to present the
information in a structured way.
• Entries begin with the identification (ID) line and end with a // terminator.
• ID codes can sometimes change, so an additional identifier, the accession
number (AC), is also provided, which ought to remain static between
database releases.
• Next, the DT lines give the date on which the sequence entered the
database and details of when it was last modified.
• The following lines give the gene name (GN), the organism species (OS),
and the organism classification (OC) within the biological kingdoms.
• Comment (CC) lines describe the function of the protein, post-translational
modifications, similarity and tissue specificity.
• Database cross-reference (DR) lines follow the comment field. These provide links
to other biomolecular databases.
• Following the DR lines are the keyword (KW) lines and then a number of FT lines.
• FT lines form the Feature Table, which highlights regions of interest in the
sequence, including secondary structure, ligand binding sites and post-translational
modifications.
• The final section of the entry holds the sequence (SQ) itself.
Swiss-Prot has become one of the most widely used protein sequence databases in the world.
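Because every line carries a two-letter code, an entry can be read with very little machinery. The following is a minimal sketch that groups lines by code and joins the sequence lines; the entry text, identifiers, and function name are hypothetical and heavily abbreviated, and real entries contain many more line types and finer formatting rules.

```python
from collections import defaultdict

# Hypothetical, heavily abbreviated entry in the two-letter-code layout.
SAMPLE_ENTRY = """\
ID   TOY_HUMAN
AC   X00001;
GN   TOY1
OS   Homo sapiens (Human).
KW   Hypothetical keyword.
SQ   SEQUENCE   12 AA;
     MKTAYIAKQR QI
//
"""


def parse_entry(text):
    """Group line contents by two-letter code and collect the sequence."""
    fields = defaultdict(list)
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):  # terminator: end of the entry
            break
        code, content = line[:2], line[5:].strip()
        if code == "SQ":
            in_sequence = True     # sequence lines follow the SQ header
            fields[code].append(content)
        elif in_sequence and code == "  ":
            sequence_parts.append(content.replace(" ", ""))
        else:
            fields[code].append(content)
    result = dict(fields)
    result["sequence"] = "".join(sequence_parts)
    return result


entry = parse_entry(SAMPLE_ENTRY)
```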
3. PubMed
PubMed is a free resource supporting the search and retrieval of biomedical and
life sciences literature with the aim of improving health, both globally and
personally.
1. The PubMed database contains more than 33 million citations and abstracts of
biomedical literature.
2. It does not include full-text journal articles; however, links to the full text are
often present when available from other sources, such as the publisher's website
or PubMed Central (PMC).
3. It has been available to the public online since 1996.
4. PubMed was developed and is maintained by the National Center for
Biotechnology Information (NCBI) at the U.S. National Library of Medicine
(NLM), located at the National Institutes of Health (NIH).
5. Citations in PubMed primarily stem from the biomedicine and health fields, and
related disciplines such as life sciences, behavioural sciences, chemical sciences, and
bioengineering.
PubMed facilitates searching across several NLM literature resources:
MEDLINE, PubMed Central (PMC) and Bookshelf.
1. MEDLINE
MEDLINE is the largest component of PubMed and consists primarily of
citations from journals selected for MEDLINE; the articles are indexed with MeSH
(Medical Subject Headings) and curated with funding, genetic, chemical and
other metadata.
2. PubMed Central (PMC)
Citations for PubMed Central (PMC) articles make up the second largest
component of PubMed.
PMC is a full-text archive that includes articles from journals reviewed and
selected by NLM for archiving (current and historical), as well as individual
articles collected for archiving in compliance with funder policies.
3. Bookshelf
The final component of PubMed is citations for books and some individual
chapters available on Bookshelf.
Bookshelf is a full-text archive of books, reports, databases, and other
documents related to biomedical, health, and life sciences.
2. Secondary databases of proteins
The secondary databases are so termed because they contain the results of analysis of the sequences held in
primary databases.
1. PROSITE:
• Some databases collect the patterns found in protein sequences rather than the complete
sequences.
• PROSITE is one such pattern database.
• Protein motifs and patterns are encoded as regular expressions.
• Each PROSITE entry holds two kinds of information: the patterns and the related
descriptive text.
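Since PROSITE patterns are essentially regular expressions in their own notation, they can be translated into the syntax of a general regex engine. The sketch below covers only the basic elements of the notation (`x` for any residue, `[..]` for allowed residues, `{..}` for forbidden residues, `(n)` or `(n,m)` for repetition, `<` and `>` for sequence ends); the function name and test sequences are hypothetical, and less common constructs are not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        repeat = ""
        if element.endswith(")"):            # repetition such as x(2,4)
            element, _, count = element[:-1].partition("(")
            repeat = "{" + count + "}"
        if element == "x":                   # any residue
            core = "."
        elif element.startswith("{"):        # forbidden residues
            core = "[^" + element[1:-1] + "]"
        else:                                # single residue or [ABC] set
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# N-glycosylation site pattern, written in PROSITE notation.
glyco_regex = prosite_to_regex("N-{P}-[ST]-{P}")
hit = re.search(glyco_regex, "MKNGSAL")      # hypothetical test sequence
miss = re.search(glyco_regex, "MKNPSAL")     # proline after N blocks the match
```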
2. PRINTS:
In the PRINTS database, protein sequence patterns are stored as “fingerprints”. Each entry
includes:
1. A first section with cross-links to other databases that hold more information about the
characterised family.
2. A second section with a table showing how many of the motifs that make up the fingerprint occur
in how many of the sequences of that family.
3. A last section with the actual fingerprints, which are stored as multiple aligned sets of
sequences; the alignments are made without gaps.
3. Pfam:
Pfam contains profiles built using Hidden Markov Models (HMMs).
An HMM models the pattern as a series of match,
insert or delete states, with scores assigned for the alignment to go from one state
to another.
4. TrEMBL:
• TrEMBL (Translated EMBL) was created in 1996 as a computer-annotated
supplement to Swiss-Prot.
• It contains translations of all the coding sequences (CDS) in EMBL that are
not yet integrated into Swiss-Prot.
• TrEMBL was designed to address the need for a well-structured, Swiss-Prot-like
resource that would allow very rapid access to sequence data from the genome
projects.
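The "translation of coding sequences" behind TrEMBL is a conceptual translation: each three-base codon is mapped to an amino acid. Below is a minimal sketch using the standard genetic code; it assumes the input is already in the correct reading frame, and the function name and the example sequence are hypothetical. Alternative genetic codes and ambiguity codes are not handled.

```python
from itertools import product

# Standard genetic code, with codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a coding sequence codon by codon, stopping at a stop codon."""
    cds = cds.upper().replace("U", "T")   # accept RNA input as well
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")  # X for unknown codons
        if amino_acid == "*":             # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


protein = translate_cds("ATGGCCAAATAA")   # hypothetical toy coding sequence
```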

  • 1. Biological Databases SMT. P.SANGEETHA LECTURER IN BIOTECHNOLOGY KVRGCW(A), KURNOOL
  • 2. Biological Databases A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system. The chief objective of the development of a database is to organize data in a set of structured records to enable easy retrieval of information. Example. A few popular databases are GenBank from NCBI (National Center for Biotechnology Information), SwissProt from the Swiss Institute of Bioinformatics and PIR from the Protein Information Resource.
  • 3. Importance of Databases 1. Databases act as a store house of information. 2. Databases are used to store and organize data in such a way that information can be retrieved easily via a variety of search criteria. 3. It facilitates the discovery of new biological insights from raw data.
  • 4. Importance of Databases 4. Secondary databases have become the molecular biologist’s reference library over the past decade or so, providing a wealth of information on just about any gene or gene product that has been investigated by the research community. 5. It helps to solve cases where many users want to access the same entries of data. 6. Allows the indexing of data. 7. It helps to remove redundancy of data.
  • 5. Types of Biological Databases 1. Based on content of biological data 2. Based on the nature of data.
  • 6. 1. Based on content of biological data 1. Primary databases 2. Secondary databases
  • 7. 1. Primary databases  Primary databases are also called as Archieval Database.  They are populated with experimentally derived data such as nucleotide sequence, protein sequence or macromolecular structure.  Experimental results are submitted directly into the database by researchers, and the data are essentially archival in nature.  Once given a database accession number, the data in primary databases are never changed: they form part of the scientific record.
  • 8. 1. Primary databases Examples GenBank and DDBJ (nucleotide sequence) Protein Data Bank (PDB; coordinates of three-dimensional macromolecular structures)
  • 9. 2. Secondary databases Secondary databases comprise data derived from the results of analysing primary data. Secondary databases often draw upon information from numerous sources, including other databases (primary and secondary), controlled vocabularies and the scientific literature. They are highly curated, often using a complex combination of computational algorithms and manual analysis and interpretation to derive new knowledge from the public record of science.
  • 10. 2. Secondary databases Examples InterPro (protein families, motifs and domains) UniProt Knowledgebase (sequence and functional information on proteins) Ensembl (variation, function, regulation and more layered onto whole genome sequences)
  • 11. 2.Based on the nature of data 1. Structural database 2. Sequence database i. Protein sequence databases ii. Nucleic Acid sequence databases
  • 12. 1.Structural databases The structural databases contain structural information for each material derived from analysis of diffraction data. EX. PDB, CATH and SCOP
  • 13. PDB(Protein Data Bank) www.rcsb.org/pdb/  The PDB was established in1970’s at the Brookehaven Lab on Long island, New York State, US.  In 1999, the management was moved to the Research Collaboratory for Structural Bioinformatics(RCSB – a joint organisation between Rutgers University, San Diego Super Computer Centre). The PDB entries contain the atomic coordinates, and some structural parameters connected with the atoms or computed from the structures(secondary structure).
  • 14. PDB(Protein Data Bank)  The PDB entries contain some annotations, but it is not as comprehensive as in SWISS PROT.  There are no legal restrictions on the use of the data in PDB.  The Protein Data Bank is an archive of experimentally determined three dimensional structures (3D) of biological macromolecules, serving a global community of researchers, educators, and students.
  • 15. PDB(Protein Data Bank)  The archives contain atomic coordinates, bibliographic citations, primary and secondary structure information as well as crystallographic structure factors and NMR(Nuclear Magnetic Resonance) experimental data.  PDB is the main primary database for 3D structures of biological macromolecules determined by X-Ray Crystallography and NMR.
  • 16. PDB(Protein Data Bank) Structural biologists usually deposit their structures in the PDB on publication and some scientific journals require this before accepting a paper.  It also accepts the experimental data used to determine the structures(X- Ray Crystallography and NMR) and homology models.
  • 17. 2. Sequence databases A sequence database is a type of biological database that is composed of a large collection of computerised nucleic acid sequences or other polymer sequences stored on a computer. These include I. Nucleotide databases II. Protein databases
  • 18. NCBI(National Centre for Biotechnological Information) www.ncbi.nlm.nih.gov  NCBI is a public available tool on web. NCBI was established in November 1988 at the National Library of Medicine in the United States.  The NLM was chosen because it had experience in creating and maintaining biomedical databases and as part of the National Institute of Health(NIH) , it could establish a research program in computational molecular biology.
  • 19. NCBI(National Centre for Biotechnological Information)  The mission of NCBI is to develop new information technologies to aid in understanding of fundamental molecular and genetic process that control health and disease.  More specifically, NCBI has been charged with creating automated systems for storing and analysing knowledge about molecular biology, biochemistry and genetics; facilitating the use of such databases and software by the research and medical community, coordinating efforts to gather biotechnology information both nationally and internationally and performing research into advanced methods of computer based information processing for analysing the structure and function of biologically important molecules.
  • 20. NCBI maintains several databases. They are as follows  Literature databases  Entrez databases  Nucleotide databases  Genome specific resources  Tools for data mining
  • 21. NCBI maintains several databases. They are as follows  Tools for Sequence Analysis  Tools for 3D structure display and Similarity Searching  Maps  Resource Statistics  Collaborative Cancer Research  FTP (File Transfer Protocol)
  • 22. 1.Nucleotide databases The nucleotide database is a collection of sequences from several sources including GenBank, RefSeq, etc. I.PRIMARY DATABASES OF NUCLEOTIDE SEQUENCES: These are the chief databases that store and make available raw nucleic acid sequences to the public and researchers. They are referred to as primary nucleotide sequence databases since they are the repository of all the nucleic acid sequences. Ex. GenBank, DDBJ, EMBL
  • 23. 1.EMBL (European Molecular Biology Laboratory) www.ebi.ac.uk  EMBL is the nucleotide sequence database from EBI (European Bioinformatics Institute).  The EBI manages databases of biological data including nucleic acid sequences, protein sequences and macromolecular structures.  The EBI is a pioneer of novel and developmental bioinformatics research.  The EBI is a centre for research and services in bioinformatics.
  • 24. 1.EMBL (European Molecular Biology Laboratory)  The mission of EBI is to ensure that the growing body of information from molecular biology and genome research is placed in the public domain and is freely accessible.  The database is produced in collaboration with DDBJ and GenBank.  Information can be retrieved from EMBL using the SRS (Sequence Retrieval System); this links the principal DNA and protein sequence databases with motif, structure, mapping and other specialist databases.
  • 25. 1.EMBL (European Molecular Biology Laboratory)  SRS is one of the most powerful data browsing and retrieval tools available. SRS provides rapid, user-friendly access to the large volumes of diverse and heterogeneous life science data stored in more than 400 internal and public domain databases.  It can be used to browse the various biological sequence and literature databases.  The EBI provides access to many tools for browsing and retrieving biological sequence and literature data.
  • 26. 2.DDBJ (DNA Data Bank of Japan) www.ddbj.nig.ac.jp  DDBJ began in 1986 as a collaboration with EMBL and GenBank. The database is produced, maintained and distributed at the National Institute of Genetics.  Sequences may be submitted to it from all corners of the world by means of a web-based data submission tool.  The Web is also used to provide standard search tools such as FASTA and BLAST.
  • 27. 2.DDBJ (DNA Data Bank of Japan)  DDBJ is the sole DNA data bank in Japan that is officially certified to collect DNA sequences from researchers and to issue internationally recognised accession numbers to data submitters.  DDBJ is one of the international DNA databases, alongside EBI, which is responsible for the EMBL database, and NCBI, which is responsible for the GenBank database.  Consequently, DDBJ has been collaborating with the two data banks by exchanging data and information over the Internet, and by holding two meetings, the International DNA Data Banks Advisory Meeting (IAM) and the International DNA Data Banks Collaborative Meeting (ICM).
  • 28. 3. GenBank  GenBank, the DNA database from NCBI, incorporates sequences from publicly available sources.  Information can be retrieved from GenBank using the Entrez integrated retrieval system; this combines data from the principal DNA and protein sequence databases with information from genome maps and protein structures.  Additional information on sequences can be accessed via the MEDLINE facility, which provides abstracts from the original published articles.
  • 29. 3. GenBank  GenBank may be searched with the user's query sequence by means of NCBI’s web interface to the BLAST suite of programs. A GenBank release includes the sequence files, indices created on various database fields and information derived from the database (e.g. GenPept, a database of translated coding sequences in FASTA format). The most commonly used file is the sequence entry file, which contains the sequence itself and descriptive information relating to it.
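FASTA format, mentioned above, stores each record as a header line beginning with `>` followed by one or more lines of sequence. A minimal sketch of reading it in Python could look like the following. The function name `read_fasta` and both records are hypothetical and made up for illustration.

```python
from typing import Dict, Iterable, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without the leading '>') to its full sequence."""
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]             # start a new record
            records[header] = ""
        elif header is not None:
            records[header] += line       # sequence may span several lines
    return records


# Made-up records for illustration only.
EXAMPLE_FASTA = """>hypothetical_protein_1 example record
MKTAYIAKQR
QISFVKSHFS
>hypothetical_protein_2
MADEEK
"""

for name, seq in read_fasta(EXAMPLE_FASTA.splitlines()).items():
    print(name, len(seq))
```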
  • 30. 3. GenBank  A GenBank entry consists of keywords, relevant associated sub-keywords, and an optional Feature Table; its end is indicated by a // terminator.  The entry continues with the BASE COUNT record, which details the frequency of occurrence of the different base types in the sequence.
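As a rough illustration of the two points above, the sketch below splits a flat file into entries at each `//` terminator and recomputes the base frequencies from the sequence lines. It assumes the sequence follows an `ORIGIN` keyword line, as in GenBank flat files. The function names and both records are hypothetical and heavily abbreviated.

```python
from collections import Counter
from typing import Iterator, List


def split_entries(text: str) -> Iterator[List[str]]:
    """Yield the lines of each entry; an entry ends at a '//' line."""
    entry: List[str] = []
    for line in text.splitlines():
        if line.strip() == "//":
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)
    # Lines after the last terminator are ignored in this sketch.


def base_count(entry: List[str]) -> Counter:
    """Count each base in the sequence lines that follow ORIGIN."""
    counts: Counter = Counter()
    in_sequence = False
    for line in entry:
        if line.startswith("ORIGIN"):
            in_sequence = True
            continue
        if in_sequence:
            # Sequence lines mix position numbers, spaces and bases.
            counts.update(ch for ch in line.lower() if ch.isalpha())
    return counts


# Made-up, abbreviated records for illustration only.
EXAMPLE = """LOCUS       EXAMPLE1   20 bp    DNA
DEFINITION  Made-up record for illustration.
ORIGIN
        1 gatcctccat atacaacggt
//
LOCUS       EXAMPLE2   8 bp    DNA
ORIGIN
        1 aaaacccc
//
"""

for entry in split_entries(EXAMPLE):
    print(entry[0].split()[1], dict(base_count(entry)))
```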
  • 31. 2.Secondary databases of nucleotide sequences Many of the secondary databases are simply sub-collections of sequences culled from one or other of the primary databases such as GenBank or EMBL. 1. Omniome database 2. FlyBase database 3. ACeDB
  • 32. 2.Secondary databases of nucleotide sequences 1.Omniome database:  It is a comprehensive microbial resource maintained by TIGR (The Institute for Genomic Research).  It has not only the sequence and annotation of each of the completed genomes, but also associated information about the organisms (such as taxon and Gram stain pattern), the structure and composition of their DNA molecules and many other attributes of protein sequences predicted from the DNA sequences.
  • 33. 2.Secondary databases of nucleotide sequences 2.FlyBase database : A consortium sequenced the entire genome of the fruit fly D. melanogaster to a high degree of completeness and quality. 3.ACeDB : It is a repository of not only the sequence but also the genetic map as well as phenotypic information about the C. elegans nematode worm.
  • 34. II. PROTEIN DATABASES: A protein database is one or more datasets about a protein’s amino acid sequence, conformation, structure and features such as active sites. 1.Primary databases of proteins : The primary databases hold protein sequences that were determined experimentally or inferred from the conceptual translation of nucleotide sequences.
  • 35. 1.PIR (Protein Information Resource) www.pir.georgetown.edu  The Protein Sequence Database was developed at the National Biomedical Research Foundation (NBRF) in the US.  It is maintained in collaboration with the Martinsried Institute for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID).  PIR was developed by Margaret Dayhoff as a collection of sequences for investigating evolutionary relationships among proteins.
  • 36. 1.PIR (Protein Information Resource) The PIR database is split into four distinct sections – PIR1 to PIR4 which differ in terms of the quality of data, and level of annotation provided. PIR 1 – contains fully classified and annotated entries PIR 2 – includes preliminary entries which have not been thoroughly reviewed and may contain redundancy PIR 3 – contains unverified entries, which have not been reviewed
  • 37. 1.PIR (Protein Information Resource) PIR 4 entries fall into 4 categories : 1. Conceptual translations of artefactual sequences. 2. Conceptual translations of sequences that are not transcribed or translated. 3. Protein sequences or conceptual translations that are genetically engineered. 4. Sequences that are not genetically encoded and not produced on ribosomes. One can search for entries or do sequence similarity searches at the PIR site. The database can be downloaded as a set of files.
  • 38. 2. SWISS PROT www.expasy.ch/sprot/  Swiss Prot is a protein sequence database established in 1986. It was produced collaboratively by the Department of Medical Biochemistry at the University of Geneva and the EMBL; after 1994, the collaboration moved to EMBL’s UK outstation, the EBI.  In 1998, the collaboration moved to the Swiss Institute of Bioinformatics (SIB). Hence, the database is now maintained collaboratively by SIB and EBI/EMBL.
  • 39. 2. SWISS PROT  Swiss Prot is a protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy, and a high level of integration with other databases.  In 1996, a computer-annotated supplement to SWISS PROT was created, termed TrEMBL.
  • 40. 2. SWISS PROT In SWISS PROT , as in many sequence databases, two classes of data can be distinguished : 1. Core data : Core data consists of : 1. Sequence data 2. Citation information(bibliographic references) 3. Taxonomic data(description of the biological source of the protein)
  • 41. 2. SWISS PROT 2. Annotation : 1. Function of protein 2. Post translational modifications 3. Domains and sites 4. Secondary structure
  • 42. 2. SWISS PROT 2. Annotation : 5. Quaternary structure 6. Similarities to other proteins 7. Diseases associated with any number of deficiencies in the protein 8. Sequence conflicts, variants
  • 43. 2. SWISS PROT Sequence Entry File  Each line is flagged with a two-letter code, which helps to present the information in a structured way.  Entries begin with the identification (ID) line and end with a // terminator.  ID codes can sometimes change, so an additional identifier, an accession number (AC), is also provided, which ought to remain static between database releases.
  • 44. 2. SWISS PROT Sequence Entry File  Next, the DT lines provide information about the date of entry of the sequence into the database and details of when it was last modified.  The following lines give the gene name (GN), the Organism Species (OS), and the Organism Classification (OC) within the biological kingdoms.
  • 45. 2. SWISS PROT Sequence Entry File  Comment (CC) lines denote the function of the protein, post-translational modifications, similarity and tissue specificity.  Database cross-reference (DR) lines follow the comment field. These provide links to other biomolecular databases.  Following the DR lines, keyword (KW) lines and then a number of FT lines are present.
  • 46. 2. SWISS PROT Sequence Entry File  The FT line is the Feature Table line, which highlights the regions of interest in the sequence including secondary structure, ligand binding sites and post-translational modifications.  The final section of the database entry includes the sequence (SQ) itself. The entry ends with a // terminator. SWISS PROT has become the most widely used protein sequence database in the world.
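The line structure described in the slides above (a two-letter code on each line, sequence lines after SQ, and a `//` terminator) can be read with a few lines of Python. This is a minimal sketch, not a full parser. The function name `parse_entry`, the entry name and the accession number are made up for illustration, and sequence lines are assumed to begin with blank space instead of a line code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def parse_entry(text: str) -> Tuple[Dict[str, List[str]], str]:
    """Group lines by their two-letter code and collect the sequence."""
    fields: Dict[str, List[str]] = defaultdict(list)
    residues: List[str] = []
    for line in text.splitlines():
        if line.startswith("//"):
            break                                   # end of entry
        code = line[:2]
        if code.strip():
            fields[code].append(line[5:].strip())   # content follows the code
        else:
            residues.append("".join(line.split()))  # sequence line, spaces removed
    return dict(fields), "".join(residues)


# Made-up entry for illustration only.
EXAMPLE_ENTRY = """ID   EXAMPLE_HUMAN   Reviewed;   10 AA.
AC   X00000;
DE   Made-up protein for illustration.
OS   Homo sapiens (Human).
KW   Example; Illustration.
SQ   SEQUENCE   10 AA;
     MKTAYIAKQR
//
"""

fields, sequence = parse_entry(EXAMPLE_ENTRY)
print(fields["AC"], fields["OS"], sequence)
```

Keeping every line code as a list preserves fields that span several lines, such as CC, DR and FT.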
  • 47. 3. PubMed PubMed is a free resource supporting the search and retrieval of biomedical and life sciences literature with the aim of improving health–both globally and personally. 1.The PubMed database contains more than 33 million citations and abstracts of biomedical literature. 2.It does not include full text journal articles; however, links to the full text are often present when available from other sources, such as the publisher's website or PubMed Central (PMC).
  • 48. 3. PubMed 3. It has been available to the public online since 1996. 4. PubMed was developed and is maintained by the National Center for Biotechnology Information (NCBI), at the U.S. National Library of Medicine (NLM), located at the National Institutes of Health (NIH). 5. Citations in PubMed primarily stem from the biomedicine and health fields, and related disciplines such as life sciences, behavioural sciences, chemical sciences, and bioengineering.
  • 49. 3. PubMed PubMed facilitates searching across several NLM literature resources: 1.Medline 2. PubMed Central (PMC) 3. Bookshelf 1. MEDLINE MEDLINE is the largest component of PubMed and consists primarily of citations from journals selected for MEDLINE; articles indexed with MeSH (Medical Subject Headings) and curated with funding, genetic, chemical and other metadata.
  • 50. 3. PubMed 2. PubMed Central (PMC) Citations for PubMed Central (PMC) articles make up the second largest component of PubMed. PMC is a full text archive that includes articles from journals reviewed and selected by NLM for archiving (current and historical), as well as individual articles collected for archiving in compliance with funder policies.
  • 51. 3. PubMed 3. Bookshelf The final component of PubMed is citations for books and some individual chapters available on Bookshelf. Bookshelf is a full text archive of books, reports, databases, and other documents related to biomedical, health, and life sciences.
  • 52. 2. Secondary databases of proteins The secondary databases are so termed because they contain the results of analysis of the sequences held in primary databases. 1. PROSITE:  One group of databases collects together patterns found in protein sequences rather than the complete sequences.  PROSITE is one such pattern database.  The protein motifs and patterns are encoded as regular expressions. The information corresponding to each entry in PROSITE is of two forms: the patterns and the related descriptive text.
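To make the idea of patterns encoded as regular expressions concrete, the sketch below translates the basic PROSITE pattern syntax into a Python regular expression. In that syntax, elements are separated by `-`, `x` means any residue, `[ST]` means S or T, `{P}` means any residue except P, and `(n)` or `(n,m)` gives a repeat count. The function name `prosite_to_regex` is hypothetical, and rarer syntax (for example an anchor written inside square brackets) is not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        at_start = element.startswith("<")   # '<' anchors to the N-terminus
        at_end = element.endswith(">")       # '>' anchors to the C-terminus
        element = element.strip("<>")
        repeat = ""
        if "(" in element:                   # e.g. x(2,4) -> .{2,4}
            element, count = element.rstrip(")").split("(")
            repeat = "{" + count + "}"
        if element == "x":
            core = "."                       # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"  # any residue except those listed
        else:
            core = element                   # single residue or [ABC] set
        parts.append(("^" if at_start else "") + core + repeat + ("$" if at_end else ""))
    return "".join(parts)


# Pattern of the form commonly cited for the N-glycosylation site motif.
regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex)
print(bool(re.search(regex, "AANGSAA")), bool(re.search(regex, "AANPSAA")))
```

The second test sequence does not match because a proline directly follows the asparagine, which the `{P}` element excludes.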
  • 53. 2. Secondary databases of proteins 2. PRINTS: In the PRINTS database, the protein sequence patterns are stored as “fingerprints”. Each entry includes : 1. The first section contains cross-links to other databases that have more information about the characterised family. 2. The second section provides a table showing how many of the motifs that make up the fingerprint occur in how many of the sequences of that family. 3. The last section of the entry contains the actual fingerprints, which are stored as multiple aligned sets of sequences; the alignment is made without gaps.
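The summary table in the second section can be illustrated with a deliberately simplified sketch. Here a fingerprint is a list of motifs, each motif is a list of ungapped aligned segments, and a motif is counted as present when any of its segments occurs exactly in the sequence. That exact-match rule is an assumption made for brevity and is not how PRINTS itself scores motifs. All names, motifs and sequences are made up.

```python
from collections import Counter
from typing import Dict, List

Fingerprint = List[List[str]]  # each motif is a list of ungapped aligned segments


def motifs_matched(sequence: str, fingerprint: Fingerprint) -> int:
    """Count the motifs with at least one segment found exactly in the sequence."""
    return sum(any(segment in sequence for segment in motif) for motif in fingerprint)


def summary_table(sequences: Dict[str, str], fingerprint: Fingerprint) -> Counter:
    """Map 'number of motifs matched' to 'number of sequences with that count'."""
    return Counter(motifs_matched(seq, fingerprint) for seq in sequences.values())


# Made-up fingerprint of three motifs and three made-up sequences.
FINGERPRINT = [["GKST", "GKTT"], ["DEAD", "DEAH"], ["SAT", "TAT"]]
SEQUENCES = {
    "seq1": "MMGKSTAADEADLLSATQ",
    "seq2": "MMGKTTAADEAHLL",
    "seq3": "AAAA",
}

table = summary_table(SEQUENCES, FINGERPRINT)
for n_motifs in sorted(table, reverse=True):
    print(n_motifs, "motifs:", table[n_motifs], "sequence(s)")
```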
  • 54. 2. Secondary databases of proteins 3.Pfam : Pfam contains profiles built using hidden Markov models (HMMs). An HMM models the pattern as a series of match, substitute, insert or delete states, with scores assigned for the alignment to go from one state to another.
  • 55. 2. Secondary databases of proteins 4.TrEMBL :  TrEMBL (Translated EMBL) was created in 1996 as a computer-annotated supplement to SWISS-PROT.  It contains translations of all the coding sequences (CDS) in EMBL.  TrEMBL was designed to address the need for a well-structured SWISS-PROT-like resource that would allow very rapid access to sequence data from the genome projects.