understanding the Human DNA components database design

HME 2228: BIOINFORMATICS
BIOLOGICAL DATABASES
• These are databases that store biological data
• The data is collected from scientific experiments and published literature on various
biological topics
• Biological data refers to information derived from study of biological systems,
organisms, and their components
• This can include a wide range of things, from DNA, RNA and protein
sequence data to medicinal compounds made from living organisms, such as
antibiotics and vaccines

Examples of categories of biological data include;
Genetic data:
• This includes information about the DNA sequences of organisms, such as their genes and
chromosomes
• Genetic data can be used to study a wide range of topics, such as human health, disease, evolution,
and biodiversity

Genomic data:
• This is a type of genetic data that focuses on the entire genome of an organism
• A genome is an organism's complete set of DNA
• Genomic data can be used to identify genes that are associated with disease, to develop new drugs,
and to understand how organisms evolve

Transcriptomic data:
• This type of data measures genome-wide levels of gene expression as RNA
transcripts (including mRNA) in a cell or tissue
• Basically, the transcriptome is the complete set of RNA transcripts produced by the
genome under specific circumstances or in a specific cell
• Transcriptomic data can be used to study how genes are regulated and to identify
genes that are involved in diseases such as cancer
Proteomic data:
• This type of data measures the levels and activities of proteins in a cell or tissue;
• A proteome refers to a complete set of proteins produced in an organism, system, or
biological context
• This data includes information on protein sequences, structures, functions, and post-translational
modifications. Proteomic data is crucial for understanding disease mechanisms, drug targets, and
biomarker discovery

Metabolomic Data
• Metabolomics is the large-scale study of small molecules (< 1500 Da), commonly
known as metabolites, within cells, biofluids, tissues or organisms
• Examples; peptides, sugars, amino acids, nucleic acids, organic acids, lipids, fatty
acids
• Collectively, these small molecules and their interactions within a biological system
are known as the metabolome
• Metabolomic data measures the levels of these metabolites in a cell or tissue
• Metabolomic data provides insights into the metabolic processes within a cell, tissue,
or organism, and is used in clinical diagnostics, nutrition, and understanding
metabolic diseases

Microbiome data:
• The microbiome is the collection of microorganisms (such as fungi, bacteria and
viruses), together with their genomes, that exists in a particular environment, such as
humans, animals, plants, soil, oceans, and the atmosphere
• This type of data measures the composition of microbial communities and how it
changes in response to the environment they live in
• Microbiome data can be used to study how the microbiome affects our health and to
develop new treatments for diseases

Phenotypic Data;
• Phenotypic data refer to the observable characteristics or traits of an organism,
tissue, organ or a cell
• These characteristics are determined by both genetic makeup and environmental
influences
• Phenotypic data is critical in genetics, breeding programs, and understanding the
genotype-phenotype relationship

Structural Biology Data:
• Includes data on the molecular structure of biological macromolecules, such as
proteins and nucleic acids, providing insights into how they work and interact with
each other

Ecological and Environmental Data:
• It encompasses a wide range of information about the relationships between
organisms and their environment
• It could include species distribution, population dynamics, environmental
conditions like climate change, and biodiversity information
• Such data is vital for conservation biology, distribution of species, climate change
studies, and ecosystem management
In summary, biological databases store information on DNA, RNA and protein
sequences, structural information, gene expression data, molecular interaction data,
mutation data, phenotypic data, metabolic pathway information, and taxonomic
information on organisms, among others

How do you find biological databases?
a) Database Journals;
• These are journals that focus on research and developments in the field of
databases
• These journals cover a wide range of topics related to databases; management
systems, database design, development, implementation, and use
• Such journals include;
• Database: The Journal of Biological Databases and Curation;
https://academic.oup.com/database?login=false
• The Nucleic Acids Research journal: https://academic.oup.com/nar

b) Database portals;
• A database portal is a web-based platform that provides users with a single point of
access to a variety of databases;
• Omics Discovery Index: https://www.omicsdi.org/
• Several biological databases can be browsed and searched through this site
• Database of Biological Databases (DBD): https://www.biodbs.info/
• The primary aim of DBD is to provide easy access to a number of databases
• National Center for Biotechnology Information (NCBI): https://www.ncbi.nlm.nih.gov/
• It advances science and health by providing access to biomedical and genomic
information
• It provides search and retrieval operations for data from 35 distinct databases

There are two broad categories of biological databases;
1. Primary Database
2. Secondary Databases
1) Primary Database
• It is a collection of original, raw/unprocessed biological data
• The data is derived experimentally
• The content is controlled by the submitter

• There are various types of primary databases, classified based on the data they store;
i. Sequence database:
• They store information on nucleotide and protein sequences
• The various categories of sequence databases include;
a) Nucleotide sequence databases
• Nucleotide sequence refers to the order of the building blocks of DNA and RNA
• Most nucleotide sequence databases maintain their own set of submission and
retrieval tools, but they exchange data daily so that all the databases contain
the same set of sequences

• Some important examples of nucleotide databases include;
• The databases under the International Nucleotide Sequence Database Collaboration
(INSDC); https://www.insdc.org/
• It is a collaboration of three organizations: the DNA Data Bank of Japan (DDBJ), the
European Molecular Biology Laboratory's European Bioinformatics
Institute (EMBL-EBI) and NCBI-GenBank
• The site provides a single gateway to the GenBank, EMBL and DDBJ databases
• The data in these databases is synchronized such that an entry submitted to GenBank
is shared with EMBL and DDBJ

DNA Data Bank of Japan (DDBJ)
• DDBJ began data bank activities in 1986 at National Institute of Genetics (NIG),
Japan
• DDBJ collects sequence data mainly from Japanese researchers; however, it also
receives data from researchers in other countries
Main activities of DDBJ
i) As a member of INSDC, DDBJ collects nucleotide sequence data from researchers and
exchanges the collected data with EMBL-EBI and GenBank on a daily basis
ii) DDBJ manages bioinformatics tools for data submission and retrieval
iii) DDBJ develops tools for analysis of biological data
iv) DDBJ organizes a Bioinformatics Training Course to teach how to analyze biological data
• DDBJ can be accessed through; https://www.ddbj.nig.ac.jp/index-e.html

European Molecular Biology Laboratory- European Bioinformatics Institute
(EMBL-EBI)
• The EMBL nucleotide sequence database, also known as EMBL-Bank, was established in
1980 at the EMBL in Heidelberg, Germany
• It was the world's first nucleotide sequence database
• The European Bioinformatics Institute (EBI) is one of the six campuses of EMBL and is
located in the UK

• EBI provides freely available data from life science experiments, performs basic
research in computational biology and offers an extensive user training programme
for the researchers
• EBI does not only store data on DNA and RNA but also on gene expression
(RNA, protein and metabolite expression), protein (sequence, families and
motifs), structure (molecular and cellular structures), systems (reaction,
interaction, pathways), chemical biology (chemogenomics and metabolomics),
and literature (scientific publications and patents)
• The component of EBI that stores nucleotide sequence data is known as the
European Nucleotide Archive (ENA)
• EBI can be accessed through; https://www.ebi.ac.uk/

GenBank
• GenBank is a comprehensive repository for DNA and RNA sequences submitted by
researchers from around the world
• It is maintained by the National Center for Biotechnology Information (NCBI)
• The database provides nucleotide sequences and their protein translations, including
mRNA sequences with coding regions, segments of genomic DNA with a single
gene or multiple genes, and ribosomal RNA gene clusters
• GenBank can be accessed through; http://www.ncbi.nlm.nih.gov/genbank/

How are sequences identified in these databases?
• Every sequence record in these databases is identified using an accession number
• Accession numbers are unique identifiers assigned to each entry in a database
to ensure that each record can be easily found, cited, and referenced
• An accession number is a combination of letters and numbers that uniquely identifies a
sequence record; e.g., U12345, AF123456
• The purpose of accession numbers is to provide a stable, permanent way to
uniquely identify a particular entry within a collection

• In biological databases, accession numbers do not change, even if information in the
record is changed at the author's request
• Sometimes, however, an original accession number might become secondary to a
newer accession number, if the authors make a new submission that combines
previous sequences, or if for some reason a new submission supersedes an earlier
record
Let's practice searching GenBank with these accession numbers:
• ACCESSION: U49845
• ACCESSION: U75746
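The accession format described above can be checked in code before a search is run. The sketch below is only an illustration: the pattern covers the two classic nucleotide accession shapes used as examples here (one letter plus five digits, or two letters plus six digits) and ignores newer, longer formats. The function names are made up for this example, and the URL builder assumes the NCBI E-utilities `efetch` endpoint with the `nuccore` database; it only builds the address and does not contact NCBI.

```python
import re
from urllib.parse import urlencode

# Classic nucleotide accession shapes only: U12345 or AF123456.
# Newer, longer accession formats are deliberately not covered here.
ACCESSION_RE = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")

# Assumed endpoint: NCBI E-utilities efetch.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def is_classic_accession(text):
    """Return True if text looks like a classic nucleotide accession."""
    return ACCESSION_RE.match(text.strip().upper()) is not None


def genbank_record_url(accession):
    """Build (but do not request) a URL for the GenBank flat-file record."""
    if not is_classic_accession(accession):
        raise ValueError(f"not a classic accession number: {accession!r}")
    params = {
        "db": "nuccore",      # nucleotide database
        "id": accession.strip().upper(),
        "rettype": "gb",      # GenBank flat-file format
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


for acc in ("U49845", "U75746"):
    print(acc, is_classic_accession(acc), genbank_record_url(acc))
```

Pasting either printed URL into a browser should return the same record as searching the accession number on the GenBank web page.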

Sequence Read Archive (SRA) Database;
• It was previously known as the Short Read Archive
• It is also part of INSDC
• SRA stores raw sequencing data generated by "next-generation" sequencing
technologies
• Sequencing involves determining the order of nucleotides in DNA or RNA molecules
• "Next-generation" sequencing (NGS) is also known as high-throughput sequencing
• NGS methods can sequence millions or even billions of molecules simultaneously,
making sequencing much faster and more affordable than earlier methods
• Examples of NGS platforms; 454, IonTorrent, Illumina, SOLiD and Complete Genomics

• The SRA database is managed by and accessible through NCBI: https://www.ncbi.nlm.nih.gov/sra
• It is the largest publicly available repository of high throughput sequencing data
• There are multiple ways to retrieve data from SRA;
• You can browse the database by:
• Studies
• Samples
• Analyses
• Provisional SRA
• Provisional SRA
• In some cases the NCBI SRA publishes datasets on a "provisional" basis
• This may be done in order to accommodate incomplete or preliminary datasets
• Over time, provisional datasets will become fully archived
• When the dataset has been fully loaded into the Archive, then it will no longer appear in the
provisional SRA browser

• READING ASSIGNMENT; Other examples of nucleotide databases: Trace Archive, Ensembl

b) Protein databases
• Primary protein sequence databases store information on protein sequences
• These databases house vast collections of amino acid sequences and related
information
• Some primary protein sequence databases;
Universal Protein Resource (UniProt)
• UniProt is the world's leading database of protein sequences and functional
information
• It was created in 2003 by merging the Swiss-Prot, TrEMBL, and Protein
Information Resource Protein Sequence Database (PIR-PSD) databases
• It is accessed through; https://www.uniprot.org/

• UniProt is maintained by a consortium of organizations; European Bioinformatics
Institute (EMBL-EBI), the Swiss Institute of Bioinformatics (SIB), and the Protein
Information Resource (PIR)
UniProt contains three main databases:
i. UniProt Knowledgebase (UniProtKB)
• This is the central hub of UniProt, containing high-quality, curated protein sequences
and information; the amino acid sequence, protein name or description, taxonomic
data and citation information

It contains two sections:
• UniProtKB/Swiss-Prot:
• This section is manually annotated and reviewed, providing reliable and precise
information about protein sequences, functions, domains, sites of activity, and
the biological significance of these proteins
• The information is curated by experts and includes extensive cross-references and
literature citations
• UniProtKB/TrEMBL:
• TrEMBL stands for Translated EMBL Nucleotide Sequence Database
• TrEMBL is a computer-annotated protein sequence database

• The protein sequences are translations of all coding sequences present in the
EMBL-EBI nucleotide database that are not yet integrated into Swiss-Prot
• The translations are generated automatically: computer algorithms derive an
amino acid sequence from each nucleotide sequence in the EBI
database
• Thus, the annotations provide useful predictions about the protein but have not
been verified manually or experimentally
• Essentially, the TrEMBL database supplements Swiss-Prot
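The automatic translation step described above can be illustrated with a few lines of code. This is a minimal sketch, not the actual TrEMBL pipeline: it assumes the standard genetic code, reads a coding sequence codon by codon from its first base, and stops at the first stop codon. The function name `translate` and the test sequences are made up for illustration.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(coding_sequence):
    """Translate a DNA or RNA coding sequence into one-letter amino acids."""
    seq = coding_sequence.upper().replace("U", "T")  # accept RNA input too
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTAA"))  # ATG -> M, GCC -> A, TAA -> stop
```

A sequence produced this way is only a prediction; it is the manual review in Swiss-Prot that confirms the protein and its function.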

ii. UniProt Reference Clusters (UniRef):
• It is the part of the UniProt database that groups proteins based on their level of sequence
similarity
• There are three levels of clustering;
• UniRef100: Combines identical sequences and fragments across all organisms into
one entry
• UniRef90: Groups together sequences with at least 90% identity to a representative
sequence in UniRef100
• UniRef50: Clusters together sequences with at least 50% identity to a
representative sequence in UniRef90
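The idea of clustering by identity threshold can be sketched with a toy example. This is an illustration under strong assumptions and not the UniRef procedure itself: identity is computed by comparing sequences position by position without any alignment, and clusters are formed greedily against the first (representative) member. The function names and the three short sequences are invented for this example.

```python
def identity(seq_a, seq_b):
    """Fraction of matching positions, relative to the longer sequence.

    Toy measure: no alignment is performed, positions are compared directly.
    """
    if not seq_a or not seq_b:
        return 0.0
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / max(len(seq_a), len(seq_b))


def greedy_cluster(sequences, threshold):
    """Group sequences whose identity to a cluster representative >= threshold."""
    clusters = []  # each cluster is a list; element 0 is the representative
    for seq in sorted(sequences, key=len, reverse=True):  # longest first
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])  # no match found: start a new cluster
    return clusters


# Made-up sequences: the first two differ at one position out of ten.
toy = ["MKTAYIAKQR", "MKTAYIAKQK", "GGGGGGGGGG"]
print(greedy_cluster(toy, 1.0))  # like UniRef100: only identical sequences merge
print(greedy_cluster(toy, 0.9))  # like UniRef90: the first two merge
```

Lowering the threshold gives fewer, larger clusters, which is why UniRef50 is much smaller than UniRef100.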

iii. UniProt Archive (UniParc)
• This section contains all publicly available protein sequences from different
databases, regardless of their quality or curation status
• It is much larger than the UniProtKB, but the information it contains is not as
reliable
• UniParc contains protein sequences only without any additional information about
the proteins, such as their function or structure

UniProt is a valuable resource for researchers who study proteins. It can be used to:
• Find information about a specific protein
• Compare proteins from different organisms
• Identify proteins with similar amino acid sequence
• Analyze the function of a protein
• Develop new drugs and therapies

Let's practice searching UniProt: https://www.uniprot.org/
Accession numbers; A2BC19, P12345, A0A023GPI8
Reading assignment; check out:
Protein Data Bank Japan: https://pdbj.org/
Protein-NCBI: https://www.ncbi.nlm.nih.gov/protein/
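UniProt accession numbers, like the three used in the practice search above, follow a fixed pattern of 6 or 10 characters. The sketch below checks that pattern with the regular expression given in the UniProt help pages; the function name is made up for this example, and the check only confirms the shape of the identifier, not that an entry with that accession exists.

```python
import re

# Accession pattern as documented by UniProt (6 or 10 characters).
UNIPROT_ACCESSION_RE = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)


def is_uniprot_accession(text):
    """Return True if text has the shape of a UniProt accession number."""
    return UNIPROT_ACCESSION_RE.match(text.strip().upper()) is not None


for acc in ("A2BC19", "P12345", "A0A023GPI8", "U49845"):
    print(acc, is_uniprot_accession(acc))
```

The last identifier is a GenBank nucleotide accession, so it does not fit the UniProt pattern.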

ii. Structure database;
• Structure databases are primary databases created to store information about the
structure of biological macromolecules, e.g. DNA, RNA and proteins
• The data stored in these databases is generated through structural biology, a branch
of biological sciences that deals with the study of the molecular structure of biological
macromolecules
• These databases provide insights into the spatial positions of the atoms in a molecule
and the cavities, channels, pores, and clefts found in the macromolecular structure
• For example, in proteins: active sites and secondary, tertiary and quaternary
structures
• For DNA: the double-helical structure with its major and minor grooves
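To make "spatial positions of the atoms" concrete: in the legacy PDB text format, each atom is one fixed-width `ATOM` (or `HETATM`) line, with the x, y and z coordinates (in ångströms) in columns 31-38, 39-46 and 47-54. The sketch below reads those columns. The function name is an assumption for this example, and the record is made up and generated with a format string so that the columns line up; it is not taken from a real entry.

```python
def parse_atom_coordinates(line):
    """Return (atom_name, residue_name, (x, y, z)) from one ATOM/HETATM line."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    atom_name = line[12:16].strip()     # columns 13-16
    residue_name = line[17:20].strip()  # columns 18-20
    x = float(line[30:38])              # columns 31-38
    y = float(line[38:46])              # columns 39-46
    z = float(line[46:54])              # columns 47-54
    return atom_name, residue_name, (x, y, z)


# Made-up record: serial 1, atom CA, residue GLY, chain A, residue number 1.
example_line = "ATOM  {:>5d} {:<4s} {:>3s} {:1s}{:>4d}    {:8.3f}{:8.3f}{:8.3f}".format(
    1, "CA", "GLY", "A", 1, 12.5, -3.25, 7.0
)

print(parse_atom_coordinates(example_line))
```

Structure viewers read thousands of such lines to draw the 3D shape of a molecule.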

Examples of important structural databases;
a) Worldwide Protein Data Bank (wwPDB)
• wwPDB is the central organization that maintains and archives three-dimensional (3D)
structural information on proteins
• It is accessible through; www.wwpdb.org
The wwPDB is composed of four partners:
Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB)
• It is the US partner of wwPDB

Protein Data Bank in Europe (PDBe)
• It is the European partner of wwPDB
Protein Data Bank Japan (PDBj)
• It is the Asian and Middle Eastern partner of wwPDB
Biological Magnetic Resonance Data Bank (BMRB)
• It is the central repository for experimental nuclear magnetic resonance (NMR) spectral
data for macromolecules; it aims to archive and annotate NMR data obtained from
peptides, proteins, and nucleic acids
• It is managed and sponsored by US organisations; the University of Wisconsin-Madison,
the National Library of Medicine, and the National Institutes of Health

b) NCBI Structure Resources
• The NCBI devotes one of its databases to structural information
• It contains information about the three-dimensional structures of biological
macromolecules such as proteins, nucleic acids, and complex assemblies
• Accessible through; https://www.ncbi.nlm.nih.gov/structure/
c) Nucleic Acid Knowledgebase (NAKB)
• It archives 3D structures of nucleic acids, such as DNA and RNA
• NAKB provides search, report, statistics and visualization pages for all nucleic acids
with experimentally determined 3D structures
• It was established in 2023 to succeed the retired Nucleic Acid Database (NDB)
• Accessible through; https://www.nakb.org/

Practice querying the databases with the accession number 2BH9 at www.wwpdb.org
Other structural databases;
• PDBsum: Structural Summaries of PDB Entries;
http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/
• sc-PDB: A 3D Database of Ligandable Binding Sites;
https://ngdc.cncb.ac.cn/databasecommons/database/id/57
• PDBTM: Protein Data Bank of Transmembrane Proteins; https://pdbtm.unitmp.org/
• CATH Database; http://www.cathdb.info/
• SCOP (Structural Classification of Proteins) Database; http://scop.mrc-lmb.cam.ac.uk/

NEXT: Secondary biological databases
