HME 2228: BIOINFORMATICS
BIOLOGICAL DATABASES
• These are databases that store biological data
• The data is collected from scientific experiments and published literature on various
biological topics
• Biological data refers to information derived from the study of biological systems,
organisms, and their components
• This covers a wide range of things, from DNA, RNA and protein sequence data to
medicinal compounds made from living organisms, such as antibiotics and vaccines
Examples of categories of biological data include:
Genetic data:
• This includes information about the DNA sequences of organisms, such as their genes and
chromosomes
• Genetic data can be used to study a wide range of topics, such as human health, disease, evolution,
and biodiversity
Genomic data:
• This is a type of genetic data that focuses on the entire genome of an organism
• A genome is an organism's complete set of DNA
• Genomic data can be used to identify genes that are associated with disease, to develop new drugs,
and to understand how organisms evolve
Transcriptomic data:
• This type of data measures genome-wide levels of gene expression as RNA
transcripts (including mRNA) in a cell or tissue
• The transcriptome is the complete set of RNA transcripts produced by the
genome under specific circumstances or in a specific cell
• Transcriptomic data can be used to study how genes are regulated and to identify
genes that are involved in diseases such as cancer
Proteomic data:
• This type of data measures the levels and activities of proteins in a cell or tissue
• A proteome is the complete set of proteins produced in an organism, system, or
biological context
• This data includes information on protein sequences, structures, functions, and post-translational
modifications. Proteomic data is crucial for understanding disease mechanisms, drug targets, and
biomarker discovery
Metabolomic data:
• Metabolomics is the large-scale study of small molecules (< 1500 Da), commonly
known as metabolites, within cells, biofluids, tissues or organisms
• Examples: peptides, sugars, amino acids, nucleotides, organic acids, lipids, fatty
acids
• Collectively, these small molecules and their interactions within a biological system
are known as the metabolome
• Metabolomic data measures the levels of these metabolites in a cell or tissue
• Metabolomic data provides insights into the metabolic processes within a cell, tissue,
or organism, and is used in clinical diagnostics, nutrition, and understanding
metabolic diseases
Microbiome data:
• The microbiome is the community of microorganisms (such as fungi, bacteria and
viruses), together with their collective genomes, that exists in a particular
environment, e.g. humans, animals, plants, soil, oceans, and the atmosphere
• This type of data measures the composition of microbial communities and how it
changes in response to the environment they live in
• Microbiome data can be used to study how the microbiome affects our health and to
develop new treatments for diseases
Phenotypic data:
• Phenotypic data refer to the observable characteristics or traits of an organism,
tissue, organ or a cell
• These characteristics are determined by both genetic makeup and environmental
influences
• Phenotypic data is critical in genetics, breeding programs, and understanding the
genotype-phenotype relationship
Structural Biology Data:
• Includes data on the molecular structure of biological macromolecules, such as
proteins and nucleic acids, providing insights into how they work and interact with
each other
Ecological and Environmental Data:
• It encompasses a wide range of information about the relationships between
organisms and their environment
• It can include species distribution, population dynamics, environmental
conditions such as climate, and biodiversity information
• Such data is vital for conservation biology, species distribution studies, climate change
studies, and ecosystem management
In summary, biological databases store information on DNA, RNA and protein
sequences, structures, gene expression, molecular interactions, mutations,
phenotypes, metabolic pathways, and the taxonomy of organisms, among others
How do you find biological databases?
a) Database journals:
• These are journals that focus on research and developments in the field of
databases
• They cover a wide range of database topics: management systems, database
design, development, implementation, and use
• Such journals include:
• Database: The Journal of Biological Databases and Curation:
https://academic.oup.com/database?login=false
• Nucleic Acids Research: https://academic.oup.com/nar
b) Database portals:
• These are web-based platforms that provide users with a single point of access to a
variety of databases:
• Omics Discovery Index: https://www.omicsdi.org/
• Several biological databases can be browsed and searched through this site
• Database of Biological Databases (DBD): https://www.biodbs.info/
• The primary aim of DBD is to provide easy access to a number of databases
• National Center for Biotechnology Information (NCBI): https://www.ncbi.nlm.nih.gov/
• It advances science and health by providing access to biomedical and genomic
information
• It provides search and retrieval operations for data from 35 distinct databases
There are two broad categories of biological databases:
1. Primary databases
2. Secondary databases
1) Primary databases
• A primary database is a collection of original, raw/unprocessed biological data
• The data is derived experimentally
• The content is controlled by the submitter
• There are various types of primary databases, classified based on the data they store:
i. Sequence databases:
• They store information on nucleotide and protein sequences
• The various categories of sequence databases include:
a) Nucleotide sequence databases
• A nucleotide sequence is the order of the building blocks (nucleotides) of DNA and RNA
• Most nucleotide sequence databases maintain their own set of submission and
retrieval tools, but they exchange data daily so that all the databases contain
the same set of sequences
• Some important examples of nucleotide databases include:
• The databases under the International Nucleotide Sequence Database Collaboration
(INSDC): https://www.insdc.org/
• It is a collaboration of three organizations: DNA Data Bank of Japan (DDBJ),
European Molecular Biology Laboratory - European Bioinformatics
Institute (EMBL-EBI) and NCBI-GenBank
• The site provides a single gateway to the GenBank, EMBL and DDBJ databases
• The data in these databases is synchronized, so that an entry submitted to GenBank
is also shared with EMBL and DDBJ
DNA Data Bank of Japan (DDBJ)
• DDBJ began data bank activities in 1986 at the National Institute of Genetics (NIG),
Japan
• DDBJ collects sequence data mainly from Japanese researchers, but it also
accepts data from researchers in any other country
Main activities of DDBJ
i) As a member of INSDC, DDBJ collects nucleotide sequence data from researchers and
exchanges the collected data with EMBL-EBI and GenBank on a daily basis
ii) DDBJ manages bioinformatics tools for data submission and retrieval
iii) DDBJ develops tools for the analysis of biological data
iv) DDBJ organizes a Bioinformatics Training Course to teach how to analyze biological data
• DDBJ can be accessed through: https://www.ddbj.nig.ac.jp/index-e.html
European Molecular Biology Laboratory - European Bioinformatics Institute
(EMBL-EBI)
• The EMBL nucleotide sequence database, also known as EMBL-Bank, was established
in 1980 at EMBL in Heidelberg, Germany
• It was the world's first nucleotide sequence database
• The European Bioinformatics Institute (EBI) is one of the six sites of EMBL and is
located in the UK
• EBI provides freely available data from life science experiments, performs basic
research in computational biology and offers an extensive user training programme
for researchers
• EBI stores data not only on DNA and RNA but also on gene expression
(RNA, protein and metabolite expression), proteins (sequences, families and
motifs), structures (molecular and cellular structures), systems (reactions,
interactions, pathways), chemical biology (chemogenomics and metabolomics),
and literature (scientific publications and patents)
• The component of EBI that stores nucleotide sequence data is known as the
European Nucleotide Archive (ENA)
• EBI can be accessed through: https://www.ebi.ac.uk/
GenBank
• GenBank is a comprehensive repository for DNA and RNA sequences submitted by
researchers from around the world
• It is maintained by the National Center for Biotechnology Information (NCBI)
• The database provides nucleotide sequences and their protein translations, including
mRNA sequences with coding regions, segments of genomic DNA with a single
gene or multiple genes, and ribosomal RNA gene clusters
• GenBank can be accessed through: http://www.ncbi.nlm.nih.gov/genbank/
How are sequences identified in these databases?
• Every sequence record in these databases is identified by an accession number
• Accession numbers are unique identifiers assigned to each entry in a database
to ensure that each record can be easily found, cited, and referenced
• An accession number is a combination of letters and numbers that uniquely
identifies a sequence record, e.g., U12345, AF123456
• The purpose of accession numbers is to provide a stable, permanent way to
uniquely identify a particular entry within a collection
• In biological databases, accession numbers do not change, even if information in the
record is changed at the author's request
• Sometimes, however, an original accession number might become secondary to a
newer accession number, if the authors make a new submission that combines
previous sequences, or if for some reason a new submission supersedes an earlier
record
Let's practice searching:
• ACCESSION: U49845
• ACCESSION: U75746
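The practice searches above can also be scripted. The following is a minimal sketch, not an official client: it checks that a string looks like one of the two accession formats shown in the examples (one letter plus five digits, or two letters plus six digits) and builds an NCBI E-utilities `efetch` URL that returns the GenBank flat file. The function names are hypothetical, and the pattern deliberately ignores newer, longer accession formats.

```python
import re
from urllib.parse import urlencode

# Covers only the two formats used in the slide examples:
# 1 letter + 5 digits (U12345) or 2 letters + 6 digits (AF123456),
# optionally followed by a ".version" suffix. Other formats exist.
ACCESSION_RE = re.compile(r"(?:[A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.\d+)?")

# Base address of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def is_nucleotide_accession(text: str) -> bool:
    """Return True if text looks like a classic nucleotide accession."""
    return ACCESSION_RE.fullmatch(text.strip().upper()) is not None


def genbank_url(accession: str) -> str:
    """Build a URL that fetches the record as a GenBank flat file."""
    if not is_nucleotide_accession(accession):
        raise ValueError(f"not a recognised accession: {accession!r}")
    params = {
        "db": "nuccore",      # NCBI nucleotide database
        "id": accession.strip().upper(),
        "rettype": "gb",      # GenBank flat-file format
        "retmode": "text",
    }
    return f"{EFETCH}?{urlencode(params)}"


if __name__ == "__main__":
    for acc in ("U49845", "U75746"):
        print(acc, "->", genbank_url(acc))
```

Opening a printed URL in a browser, or passing it to `urllib.request.urlopen`, should return the same record that the GenBank web search shows for that accession.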
Sequence Read Archive (SRA) database:
• It was previously known as the Short Read Archive
• It is also part of INSDC
• SRA stores raw sequencing data generated by "next-generation" sequencing
technologies
• Sequencing involves determining the sequence of DNA or RNA molecules
• These technologies are known as "next-generation" sequencing (NGS)
or high-throughput sequencing
• NGS methods can sequence millions or even billions of molecules simultaneously,
making them much faster and more affordable than earlier methods
• Example platforms: 454, IonTorrent, Illumina, SOLiD and Complete Genomics
• The SRA database is managed by and accessible through NCBI: https://www.ncbi.nlm.nih.gov/sra
• It is the largest publicly available repository of high-throughput sequencing data
• There are multiple ways to retrieve data from SRA
• You can browse the database by:
• Studies
• Samples
• Analyses
• Provisional SRA
Provisional SRA:
• In some cases the NCBI SRA publishes datasets on a "provisional" basis
• This may be done in order to accommodate incomplete or preliminary datasets
• Over time, provisional datasets become fully archived
• Once a dataset has been fully loaded into the Archive, it no longer appears in the
provisional SRA browser
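SRA records are organised by type (studies, samples, analyses and so on), and the accession prefix tells you which type a record is and which INSDC partner issued it. A small sketch, assuming the usual three-letter prefix convention (first letter S, E or D for NCBI, EMBL-EBI or DDBJ; third letter P, S, X, R or Z for study, sample, experiment, run or analysis). The function name and the example accessions in the usage lines are illustrative only.

```python
import re

# Three-letter prefix followed by digits, e.g. SRR000001.
SRA_RE = re.compile(r"([SED])R([PSXRZ])(\d+)")

# First letter: which INSDC partner issued the accession.
ISSUERS = {"S": "NCBI", "E": "EMBL-EBI", "D": "DDBJ"}

# Third letter: what kind of record the accession points to.
RECORD_TYPES = {
    "P": "study",
    "S": "sample",
    "X": "experiment",
    "R": "run",
    "Z": "analysis",
}


def classify_sra_accession(accession: str):
    """Return (issuer, record type), or None if the prefix is not recognised."""
    match = SRA_RE.fullmatch(accession.strip().upper())
    if match is None:
        return None
    return ISSUERS[match.group(1)], RECORD_TYPES[match.group(2)]


if __name__ == "__main__":
    for acc in ("SRR000001", "ERP000001", "DRS000001", "U49845"):
        print(acc, "->", classify_sra_accession(acc))
```

A GenBank accession such as U49845 is not an SRA accession, so the function returns `None` for it.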
• Reading assignment: other examples of nucleotide databases include the Trace
Archive and Ensembl
b) Protein sequence databases
• Primary protein sequence databases store information on protein sequences
• These databases house vast collections of amino acid sequences and related
information
• Some primary protein sequence databases:
Universal Protein Resource (UniProt)
• UniProt is the world's leading database of protein sequences and functional
information
• It was created in 2003 by merging the Swiss-Prot, TrEMBL, and Protein
Information Resource (PIR) databases
• It is accessed through: https://www.uniprot.org/
• UniProt is maintained by a consortium of organizations: the European Bioinformatics
Institute (EMBL-EBI), the Swiss Institute of Bioinformatics (SIB), and the Protein
Information Resource (PIR)
UniProt contains three main databases:
i. UniProt Knowledgebase (UniProtKB)
• This is the central hub of UniProt, containing high-quality, curated protein sequences
and information: the amino acid sequence, protein name or description, taxonomic
data and citation information
UniProtKB contains two sections:
• UniProtKB/Swiss-Prot:
• This section is manually annotated and reviewed, providing reliable and precise
information about protein sequences, functions, domains, sites of activity, and
the biological significance of these proteins
• The information is curated by experts and includes extensive cross-references and
literature citations
• UniProtKB/TrEMBL:
• TrEMBL stands for Translated EMBL Nucleotide Sequence Database
• TrEMBL is a computer-annotated protein sequence database
• The protein sequences are translations of all coding sequences present in the
EMBL-EBI nucleotide database that are not yet integrated into Swiss-Prot
• The translations are generated automatically by computer algorithms, which
derive an amino acid sequence from the nucleotide sequences in the EBI
database
• Thus, the annotations provide useful predictions about the protein but have not
been verified manually or experimentally
• Essentially, the TrEMBL database supplements Swiss-Prot
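To make the idea of automatic translation concrete, here is a minimal sketch of turning a coding nucleotide sequence into an amino acid sequence with the standard genetic code. It is an illustration only and not the pipeline used to build TrEMBL; the function name `translate` and the example sequences are made up for this sketch.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence (DNA or RNA) up to the first stop codon.

    Codons containing unknown characters are reported as "X".
    Trailing bases that do not fill a whole codon are ignored.
    """
    seq = cds.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))      # Met-Ala, then stop
    print(translate("AUGAAAUGGUGA"))   # RNA input: Met-Lys-Trp, then stop
```

A production annotation pipeline also has to choose the reading frame, handle alternative genetic codes and predict function, which is why TrEMBL entries are treated as unreviewed until a curator checks them.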
ii. UniProt Reference Clusters (UniRef):
• It is the part of the UniProt database that groups proteins based on their level of
sequence similarity
• There are three levels of clustering;
• UniRef100: Combines identical sequences and fragments across all organisms into
one entry
• UniRef90: Groups together sequences with at least 90% identity to a representative
sequence in UniRef100
• UniRef50: Clusters together sequences with at least 50% identity to a
representative sequence in UniRef90
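The clustering idea can be sketched with a toy example: pick a representative, then add to its cluster every sequence whose identity to the representative meets the threshold. This is a deliberately crude illustration under stated assumptions (ungapped, position-by-position identity and a simple greedy pass). It is not the algorithm or the overlap criteria that UniRef actually uses, and the function names and sequences are invented.

```python
def percent_identity(a: str, b: str) -> float:
    """Crude ungapped identity: matching positions / longer length, in percent."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / longest


def greedy_cluster(sequences, min_identity):
    """Group sequences around representatives, longest sequences first.

    Returns a dict mapping each representative to the members of its cluster.
    """
    clusters = {}
    for seq in sorted(sequences, key=len, reverse=True):
        for representative in clusters:
            if percent_identity(seq, representative) >= min_identity:
                clusters[representative].append(seq)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters[seq] = [seq]
    return clusters


if __name__ == "__main__":
    toy = ["MKTAYIAKQR", "MKTAYIAKQK", "GGGGGGGGGG"]
    for threshold in (100, 90, 50):
        print(threshold, greedy_cluster(toy, threshold))
```

With these toy sequences, the first two differ at one position out of ten, so they fall into separate clusters at a 100% threshold and into the same cluster at 90%.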
iii. UniProt Archive (UniParc)
• This section contains all publicly available protein sequences from different
databases, regardless of their quality or curation status
• It is much larger than the UniProtKB, but the information it contains is not as
reliable
• UniParc contains protein sequences only without any additional information about
the proteins, such as their function or structure
UniProt is a valuable resource for researchers who study proteins. It can be used to:
• Find information about a specific protein
• Compare proteins from different organisms
• Identify proteins with similar amino acid sequences
• Analyze the function of a protein
• Develop new drugs and therapies
Let's practice searching UniProt: https://www.uniprot.org/
Accession numbers: A2BC19, P12345, A0A023GPI8
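The same lookups can be done programmatically. The sketch below validates an accession against the pattern documented in the UniProt help pages (6 or 10 alphanumeric characters) and builds a UniProt REST URL for the FASTA record. The function names are hypothetical and no network request is made.

```python
import re

# UniProtKB accession pattern as documented in the UniProt help pages.
UNIPROT_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def is_uniprot_accession(text: str) -> bool:
    """Return True if text matches the UniProtKB accession format."""
    return UNIPROT_RE.fullmatch(text.strip().upper()) is not None


def uniprot_fasta_url(accession: str) -> str:
    """Build the REST URL that returns the entry's sequence in FASTA format."""
    acc = accession.strip().upper()
    if not is_uniprot_accession(acc):
        raise ValueError(f"not a UniProtKB accession: {accession!r}")
    return f"https://rest.uniprot.org/uniprotkb/{acc}.fasta"


if __name__ == "__main__":
    for acc in ("A2BC19", "P12345", "A0A023GPI8"):
        print(acc, "->", uniprot_fasta_url(acc))
```

The format check only says that a string is well formed; whether an entry with that accession exists is decided by the server when the URL is requested.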
Reading assignment: check out
Protein Data Bank Japan: https://pdbj.org/
Protein-NCBI: https://www.ncbi.nlm.nih.gov/protein/
ii. Structure databases:
• Structure databases are primary databases created to store information about the
structure of biological macromolecules, e.g. DNA, RNA and proteins
• The data stored in these databases is generated through structural biology, a branch
of the biological sciences that studies the molecular structure of biological
macromolecules
• The databases provide insights into the spatial positions of the atoms in a molecule
and the cavities, channels, pores, and clefts found in the macromolecular structure
For proteins, for example: active sites and secondary, tertiary and quaternary
structures
For DNA: the double-helical structure with its major and minor grooves
Examples of important structure databases:
a) Worldwide Protein Data Bank (wwPDB)
• wwPDB is the central organization that maintains and archives three-dimensional (3D)
structural information on proteins and other biological macromolecules
• It is accessible through: www.wwpdb.org
The wwPDB is composed of four partners:
Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB)
• It is the US partner of wwPDB
Protein Data Bank in Europe (PDBe)
• It is the European partner of wwPDB
Protein Data Bank Japan (PDBj)
• It is the Asian and Middle Eastern partner of wwPDB
Biological Magnetic Resonance Data Bank (BMRB)
• It is the central repository for experimental nuclear magnetic resonance (NMR)
spectral data for macromolecules; it aims to archive and annotate NMR data
obtained from peptides, proteins, and nucleic acids
• It is managed and sponsored by US organisations: the University of Wisconsin-Madison,
the National Library of Medicine, and the National Institutes of Health
b) NCBI Structure Resources
• NCBI devotes one of its databases to structure information
• It contains information about the three-dimensional structures of biological
macromolecules such as proteins, nucleic acids, and complex assemblies
• Accessible through: https://www.ncbi.nlm.nih.gov/structure/
c) Nucleic Acid Knowledgebase (NAKB)
• It archives 3D structures of nucleic acids, such as DNA and RNA
• NAKB provides search, report, statistics and visualization pages for all nucleic
acid-containing entries with experimentally determined 3D structures
• It was established in 2023 to succeed the retired Nucleic Acid Database (NDB)
• Accessible through: https://www.nakb.org/
Practice querying the databases with the accession number (PDB ID) 2BH9
at www.wwpdb.org
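Structure files for an entry can also be fetched by ID rather than through the search page. A minimal sketch, assuming the classic four-character PDB ID format (a digit from 1 to 9 followed by three letters or digits) and the RCSB file download address. The function name and the `fmt` parameter are hypothetical conveniences for this example.

```python
import re

# Classic PDB ID: a digit 1-9 followed by three alphanumeric characters.
PDB_ID_RE = re.compile(r"[1-9][A-Za-z0-9]{3}")


def pdb_download_url(pdb_id: str, fmt: str = "cif") -> str:
    """Build an RCSB PDB download URL for an entry.

    fmt is "cif" (PDBx/mmCIF) or "pdb" (legacy PDB format).
    """
    entry = pdb_id.strip()
    if PDB_ID_RE.fullmatch(entry) is None:
        raise ValueError(f"not a four-character PDB ID: {pdb_id!r}")
    if fmt not in ("cif", "pdb"):
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"https://files.rcsb.org/download/{entry.upper()}.{fmt}"


if __name__ == "__main__":
    print(pdb_download_url("2BH9"))
    print(pdb_download_url("2bh9", fmt="pdb"))
```

The downloaded file holds the atomic coordinates for the entry and can be opened in a molecular viewer.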
Other structure databases:
• PDBsum: Structural Summaries of PDB Entries:
http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/
• sc-PDB: A 3D Database of Ligandable Binding Sites:
https://ngdc.cncb.ac.cn/databasecommons/database/id/57
• PDBTM: Protein Data Bank of Transmembrane Proteins: https://pdbtm.unitmp.org/
• CATH Database: http://www.cathdb.info/
• SCOP (Structural Classification of Proteins) Database: http://scop.mrc-lmb.cam.ac.uk/
NEXT: Secondary biological databases

understanding the Human DNA components database design

  • 1. HME 2228: BIOINFORMATICS BIOLOGICAL DATABASES • These are database that store biological data • The data is collected from scientific experiments and published literature on various biological topics • Biological data refers to information derived from study of biological systems, organisms, and their components • This can include a wide range of things, from the DNA, RNA and protein sequence data, medicinal compound made from living organisms, such as antibiotics, vaccines
  • 2. HME 2228: BIOINFORMATICS Example of categories of biological data include; Genetic data: • This includes information about the DNA sequences of organisms, such as their genes and chromosomes • Genetic data can be used to study a wide range of topics, such as human health, disease, evolution, and biodiversity
  • 3. HME 2228: BIOINFORMATICS Genomic data: • This is a type of genetic data that focuses on the entire genome of an organism • A genome is an organism's complete set of DNA • Genomic data can be used to identify genes that are associated with disease, to develop new drugs, and to understand how organisms evolve
  • 4. HME 2228: BIOINFORMATICS Transcriptomic data: • This type of data measures the levels of universal gene expression as RNA transcripts or mRNA in a cell or tissue • Basically, this the complete set of RNA transcripts or mRNA produced by the genome under specific circumstances or in a specific cell • Transcriptomic data can be used to study how genes are regulated and to identify genes that are involved in diseases such as cancer Proteomic data: • This type of data measures the levels and activities of proteins in a cell or tissue; • A proteome refers to a complete set of proteins produced in an organism, system, or biological context • This data includes information on protein sequences, structures, functions, and post-translational modifications. Proteomic data is crucial for understanding disease mechanisms, drug targets, and biomarker discovery
  • 5. HME 2228: BIOINFORMATICS Metabolomic Data • Metabolomics is the large-scale study of small molecules (< 1500 Da), commonly known as metabolites, within cells, biofluids, tissues or organisms • Examples; peptides, sugars, amino acids, nucleic acids, organic acids, lipids, fatty acids • Collectively, these small molecules and their interactions within a biological system are known as the metabolome • Metabolomic data measures the levels of these metabolites in a cell or tissue • Metabolomic data provides insights into the metabolic processes within a cell, tissue, or organism, and is used in clinical diagnostics, nutrition, and understanding metabolic diseases
  • 6. HME 2228: BIOINFORMATICS Microbiome data: • Microbiome is the entire genome of a collection of microorganisms (such as fungi, bacteria and viruses) that exists in a particular environment; humans, animals, plants, soil, oceans, and the atmosphere • This type of data measures the composition of microbial communities and how it changes in response to the environment they live in • Microbiome data can be used to study how the microbiome affects our health and to develop new treatments for diseases
  • 7. HME 2228: BIOINFORMATICS Phenotypic Data; • Phenotypic data refer to the observable characteristics or traits of an organism, tissue, organ or a cell • These characteristics are determined by both genetic makeup and environmental influences • Phenotypic data is critical in genetics, breeding programs, and understanding the genotype-phenotype relationship
  • 8. HME 2228: BIOINFORMATICS Structural Biology Data: • Includes data on the molecular structure of biological macromolecules, such as proteins and nucleic acids, providing insights into how they work and interact with each other
  • 9. HME 2228: BIOINFORMATICS Ecological and Environmental Data: • It encompasses a wide range of information about the relationships between organisms and their environment • It could include species distribution, population dynamics, environmental conditions like climate change, and biodiversity information • Such data is vital for conservation biology, distribution of species, climate change studies, and ecosystem management Soo......with this regard, biological database stored information regrading DNA, RNA and protein sequence data, structural information, gene expression data, molecular interaction data, mutation data, phenotypic data, metabolic pathways information, taxonomic information of biological organism, among others
  • 10. HME 2228: BIOINFORMATICS How do you find biological databases; a) Database Journals; • These are journals that focus on research and developments in the field of databases • These journals cover a wide range of topics related to databases; management systems, database design, development, implementation, and use • Such journals include; • The Journal of Biological Databases and Curation; https://academic.oup.com/database?login=false • The Nucleic Acids Research journal: https://academic.oup.com/nar
  • 11. HME 2228: BIOINFORMATICS b) Database portals; • This is a web-based platform that provides users with a single point of access to a variety of databases; • Omics Discovery Index: https://www.omicsdi.org/ • Several biological databases can be browsed and searched through this site • Database of Biological Databases: https://www.biodbs.info/ • The primary aim of DBD is to provide an easy access to a number of databases • National Center for Biotechnology Information: https://www.ncbi.nlm.nih.gov/ • It advances science and health by providing access to biomedical and genomic information • It provides search and retrieval operations for data from 35 distinct databases
  • 12. HME 2228: BIOINFORMATICS There are two broad categories of biological databases; 1. Primary Database 2. Secondary Databases 1) Primary Database • It is a collection of original, raw/unprocessed biological data • The data is derived experimentally derived • The content controlled by the submitter
  • 13. HME 2228: BIOINFORMATICS • The are various types of primary databases classified based on the data they store; i. Sequence database: • They stores information on sequence of nucleotide and protein • The various categories of sequence databases include; a) Nucleotide sequence databases • Nucleotide sequence refers to the order of the building blocks of DNA and RNA • Most nucleotide sequence databases maintains an own set of submission and retrieval tools, but they exchange data daily so that all the databases should contain the same set of sequences
  • 14. HME 2228: BIOINFORMATICS • Some important examples of nucleotide databases include; • The databases under the International Nucleotide Sequence Database Collaboration (INSDC); https://www.insdc.org/ • It is collaboration of three organization DNA Data Bank of Japan (DDBJ), European Molecular Biology Laboratory- European Bioinformatics Institute (EMBL-EBI) and NCBI-GenBank • The site provide a one gateway to access; GenBank, EMBL and DDBJ databases • The data in these databases is synchronized such that one entry in GenBank is shared with EMBL and DDBJ
  • 15. HME 2228: BIOINFORMATICS DNA Data Bank of Japan (DDBJ) • DDBJ began data bank activities in 1986 at National Institute of Genetics (NIG), Japan • DDBJ collects sequence data mainly from Japanese researchers, however, they also receive data from researchers of any other countries Main activities of DDBJ i) As a member INSDC, DDBJ collects nucleotide sequence data from researcher and exchanges the collected data with EMBL-EBI and GenBank on a daily basis ii) DDBJ manage bioinformatics tools for data submission and retrieval iii)DDBJ develops tools for analysis of biological data iv) Organizes Bioinformatics Training Course to teach how to analyze biological data • DDBJ can be accessed through; https://www.ddbj.nig.ac.jp/index-e.html
  • 16. HME 2228: BIOINFORMATICS European Molecular Biology Laboratory- European Bioinformatics Institute (EMBL-EBI) • EMBL also known as EMBL-Bank and was established in 1980 at the EMBL in Heidelberg, Germany • It was the world's first nucleotide sequence database • European Bioinformatics Institute, is one of the six campuses of EMBL and its located in UK
  • 17. HME 2228: BIOINFORMATICS • EBI provides freely available data from life science experiments, performs basic research in computational biology and offers an extensive user training programme for the researchers • EBI does not only store data on DNA and RNA but also on gene expression (RNA, protein and metabolite expression), protein (sequence, families and motifs), structure (molecular and cellular structures), systems (reaction, interaction, pathways), chemical biology (chemogenomics and metabolomics), and literature (scientific publications and patents) • The component of EBI that store nucleotide sequence data is known as the European Nucleotide Archive • EBI can be accessed through; https://www.ebi.ac.uk/
  • 18. HME 2228: BIOINFORMATICS GenBank • GenBank is a comprehensive repository for DNA and RNA sequences submitted by researchers from around the world • It is maintained by the National Center for Biotechnology Information (NCBI) • The database provides nucleotide sequences and their protein translations, including mRNA sequences with coding regions, segments of genomic DNA with a single gene or multiple genes, and ribosomal RNA gene clusters • GenBank can be accessed through; https://www.ncbi.nlm.nih.gov/genbank/
  • 19. HME 2228: BIOINFORMATICS How are sequences identified in these databases? • Every sequence record in these databases is identified by an accession number • Accession numbers are unique identifiers assigned to each entry in a database so that each record can be easily found, cited, and referenced • An accession number is a combination of letters and numbers that uniquely identifies a sequence record, e.g., U12345, AF123456 • The purpose of accession numbers is to provide a stable, permanent way to uniquely identify a particular entry within a collection
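The letter-and-number pattern described above can be checked mechanically. The following is a minimal sketch, assuming the classic GenBank nucleotide layouts (one letter plus five digits, or two letters plus six or eight digits, with an optional ".version" suffix). The function name `looks_like_nucleotide_accession` is made up for this illustration, and newer accession families such as RefSeq are deliberately not covered.

```python
import re

# Assumed layouts: 1 letter + 5 digits (U12345), 2 letters + 6 digits (AF123456),
# or 2 letters + 8 digits, optionally followed by a version suffix such as ".1".
ACCESSION_RE = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}(?:\d{6}|\d{8}))(?:\.\d+)?$")


def looks_like_nucleotide_accession(text: str) -> bool:
    """Return True if text matches one of the assumed accession layouts."""
    return ACCESSION_RE.match(text.strip().upper()) is not None
```

A check like this only tests the shape of the identifier; it does not confirm that a record with that accession actually exists in GenBank, EMBL or DDBJ.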
  • 20. HME 2228: BIOINFORMATICS • In biological databases, accession numbers do not change, even if information in the record is changed at the author's request • Sometimes, however, an original accession number might become secondary to a newer accession number, if the authors make a new submission that combines previous sequences, or if for some reason a new submission supersedes an earlier record Let's do some practice searching: • ACCESSION: U49845 • ACCESSION: U75746
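Besides searching through the website, a record can also be requested programmatically by its accession number. The sketch below builds a request URL for the NCBI E-utilities `efetch` service, asking the `nuccore` database for a record in GenBank flat-file format. The helper names `genbank_record_url` and `fetch_genbank_record` are made up for this illustration, and the download step assumes network access and compliance with NCBI usage guidelines.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_record_url(accession: str) -> str:
    """Build an efetch URL for one nucleotide record in GenBank text format."""
    params = {"db": "nuccore", "id": accession, "rettype": "gb", "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"


def fetch_genbank_record(accession: str) -> str:
    """Download the record as text (needs network access; not run here)."""
    with urlopen(genbank_record_url(accession)) as response:
        return response.read().decode("utf-8")
```

Calling `genbank_record_url("U49845")` gives a link that can also be pasted into a browser, which is a quick way to compare the raw record with what the GenBank web page shows.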
  • 21. HME 2228: BIOINFORMATICS Sequence Read Archive (SRA) database • It was previously known as the Short Read Archive • It is also part of INSDC • SRA stores raw sequencing data generated by "next-generation" sequencing technologies • Sequencing involves determining the order of nucleotides in DNA or RNA molecules • The technologies used are known as next-generation sequencing (NGS) or high-throughput sequencing • NGS methods can sequence millions or even billions of molecules simultaneously, making them much faster and more affordable than earlier methods • Examples of NGS platforms: 454, Ion Torrent, Illumina, SOLiD and Complete Genomics
  • 22. HME 2228: BIOINFORMATICS • The SRA database is managed by and accessible through NCBI: https://www.ncbi.nlm.nih.gov/sra • It is the largest publicly available repository of high-throughput sequencing data • There are multiple ways to retrieve data from SRA • You can browse the database by: • Studies • Samples • Analyses • Provisional SRA • Provisional SRA: • In some cases the NCBI SRA publishes datasets on a "provisional" basis • This may be done in order to accommodate incomplete or preliminary datasets • Over time, provisional datasets become fully archived • Once a dataset has been fully loaded into the Archive, it no longer appears in the provisional SRA browser
  • 23. HME 2228: BIOINFORMATICS • READING ASSIGNMENT: Other examples of nucleotide databases: Trace Archive, Ensembl
  • 24. HME 2228: BIOINFORMATICS b) Protein databases • Primary protein sequence databases store information on protein sequences • These databases house vast collections of amino acid sequences and related information • Some primary protein sequence databases: Universal Protein Resource (UniProt) • UniProt is the world's leading database of protein sequences and functional information • It was created in 2003 by merging the Swiss-Prot, TrEMBL, and Protein Information Resource (PIR) databases • It is accessed through; https://www.uniprot.org/
  • 25. HME 2228: BIOINFORMATICS • UniProt is maintained by a consortium of organizations: the European Bioinformatics Institute (EMBL-EBI), the Swiss Institute of Bioinformatics (SIB), and the Protein Information Resource (PIR) UniProt contains three main databases: i. UniProt Knowledgebase (UniProtKB) • This is the central hub of UniProt, containing protein sequences together with annotation such as the amino acid sequence, protein name or description, taxonomic data and citation information
  • 26. HME 2228: BIOINFORMATICS It contains two sections: • UniProtKB/Swiss-Prot: • This section is manually annotated and reviewed, providing reliable and precise information about protein sequences, functions, domains, sites of activity, and the biological significance of these proteins • The information is curated by experts and includes extensive cross-references and literature citations • UniProtKB/TrEMBL: • TrEMBL stands for Translated EMBL Nucleotide Sequence Database • TrEMBL is a computer-annotated protein sequence database
  • 27. HME 2228: BIOINFORMATICS • The protein sequences are translations of all coding sequences present in the EMBL-EBI nucleotide database that are not yet integrated into Swiss-Prot • The translations are generated automatically by computer algorithms that derive an amino acid sequence from the nucleotide sequences in the EBI database • Thus, the annotations provide useful predictions about the protein but have not been verified manually or experimentally • Essentially, TrEMBL supplements Swiss-Prot
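The idea of deriving an amino acid sequence from a coding nucleotide sequence can be illustrated with a toy translator. This is a minimal sketch, assuming the standard genetic code and a DNA coding sequence that is already in frame; it is not the pipeline TrEMBL actually uses, and the function name `translate` is made up for this illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for the 1st, 2nd and 3rd codon base.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate an in-frame DNA coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")  # also accept RNA input
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an unrecognised codon
        if aa == "*":  # stop codon ends translation
            break
        protein.append(aa)
    return "".join(protein)
```

For example, `translate("ATGGCCTAA")` reads the codons ATG, GCC and TAA, giving methionine, alanine and then a stop. A sequence produced this way is only a prediction until it is reviewed, which is the distinction between TrEMBL and Swiss-Prot entries.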
  • 28. HME 2228: BIOINFORMATICS ii. UniProt Reference Clusters (UniRef): • It is the part of UniProt that groups proteins based on their level of sequence similarity • There are three levels of clustering: • UniRef100: Combines identical sequences and fragments across all organisms into one entry • UniRef90: Clusters UniRef100 sequences that share at least 90% identity with the cluster's representative sequence • UniRef50: Clusters UniRef90 representative sequences that share at least 50% identity with the cluster's representative sequence
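The clustering idea behind these levels can be sketched with a toy example. The code below assumes the sequences are already aligned and of equal length, and it uses a simple greedy scheme in which the first sequence of each cluster acts as its representative. The real UniRef pipeline uses dedicated clustering software and additional criteria, so the names `percent_identity` and `greedy_cluster` and the whole procedure are illustrative assumptions only.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of matching positions between two equal-length sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def greedy_cluster(sequences, threshold):
    """Assign each sequence to the first cluster whose representative it matches."""
    clusters = []  # list of (representative, members) pairs
    for seq in sequences:
        for representative, members in clusters:
            if (len(representative) == len(seq)
                    and percent_identity(representative, seq) >= threshold):
                members.append(seq)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters.append((seq, [seq]))
    return clusters
```

Running the same input at a threshold of 90 and then at 50 shows why a UniRef50-style clustering has fewer, larger clusters than a UniRef90-style one: lowering the threshold lets more distant sequences join an existing representative.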
  • 29. HME 2228: BIOINFORMATICS iii. UniProt Archive (UniParc) • This section contains all publicly available protein sequences from different databases, regardless of their quality or curation status • It is much larger than UniProtKB, but the information it contains is not as reliable • UniParc contains protein sequences only, without any additional information about the proteins, such as their function or structure
  • 30. HME 2228: BIOINFORMATICS UniProt is a valuable resource for researchers who study proteins. It can be used to: • Find information about a specific protein • Compare proteins from different organisms • Identify proteins with similar amino acid sequences • Analyze the function of a protein • Support the development of new drugs and therapies
  • 31. HME 2228: BIOINFORMATICS Let's practice searching UniProt: https://www.uniprot.org/ Accession numbers: A2BC19, P12345, A0A023GPI8 Reading assignment, check out: Protein Data Bank Japan: https://pdbj.org/ Protein-NCBI: https://www.ncbi.nlm.nih.gov/protein/
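When a UniProtKB entry is downloaded in FASTA format, the header line indicates which section it comes from: headers start with `sp` for Swiss-Prot (reviewed) entries and `tr` for TrEMBL (unreviewed) entries, followed by the accession and the entry name, separated by `|`. The sketch below pulls those fields apart. The function name `parse_uniprot_header` is made up for this illustration, and the header in the usage note is a hypothetical example, not a real record.

```python
def parse_uniprot_header(header: str) -> dict:
    """Split a UniProtKB FASTA header into section, accession, entry name, description."""
    if not header.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    identifiers, _, description = header[1:].strip().partition(" ")
    parts = identifiers.split("|")
    if len(parts) != 3 or parts[0] not in ("sp", "tr"):
        raise ValueError("expected a header of the form >sp|ACCESSION|ENTRY_NAME ...")
    database, accession, entry_name = parts
    section = "Swiss-Prot (reviewed)" if database == "sp" else "TrEMBL (unreviewed)"
    return {
        "section": section,
        "accession": accession,
        "entry_name": entry_name,
        "description": description,
    }
```

For a hypothetical header such as `>tr|A0A000XYZ1|A0A000XYZ1_EXAMP Uncharacterized protein`, the function reports the entry as TrEMBL (unreviewed), which signals that its annotation is computer-generated.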
  • 32. HME 2228: BIOINFORMATICS ii. Structure databases • Structure databases are primary databases created to store information about the structure of biological macromolecules, e.g. DNA, RNA and proteins • The data stored in these databases are generated through structural biology, a branch of the biological sciences that studies the molecular structure of biological macromolecules • The databases provide insights into the spatial positions of the atoms and the cavities, channels, pores, and clefts found in the macromolecular structure For example, in proteins: active sites and secondary, tertiary and quaternary structures For DNA: the double-helical structure with its major and minor grooves
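The spatial positions of atoms are stored as x, y, z coordinates. In the legacy PDB text format, each atom is one fixed-width `ATOM` (or `HETATM`) line, and the fields sit at fixed column positions. The sketch below reads a few of those fields from one line. The function name `parse_atom_line` is made up for this illustration, and it assumes the line follows the legacy column layout (newer entries are also distributed in the PDBx/mmCIF format, which this does not handle).

```python
def parse_atom_line(line: str) -> dict:
    """Extract selected fixed-width fields from one PDB-format ATOM/HETATM record."""
    if line[:6].strip() not in ("ATOM", "HETATM"):
        raise ValueError("not an ATOM/HETATM record")
    return {
        "serial": int(line[6:11]),        # columns 7-11: atom serial number
        "atom": line[12:16].strip(),      # columns 13-16: atom name
        "residue": line[17:20].strip(),   # columns 18-20: residue name
        "chain": line[21],                # column 22: chain identifier
        "res_seq": int(line[22:26]),      # columns 23-26: residue sequence number
        "xyz": (                          # columns 31-54: coordinates in angstroms
            float(line[30:38]),
            float(line[38:46]),
            float(line[46:54]),
        ),
    }
```

Applying this to every `ATOM` line of a downloaded structure file gives the list of atom coordinates from which features such as clefts, pores and active sites are analysed.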
  • 33. HME 2228: BIOINFORMATICS Examples of important structural databases: a) Worldwide Protein Data Bank (wwPDB) • wwPDB is the central organization that maintains and archives three-dimensional structural information on proteins and other biological macromolecules • It is accessible through; www.wwpdb.org The wwPDB is composed of four partners: Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB) • It is the US partner of wwPDB
  • 34. HME 2228: BIOINFORMATICS Protein Data Bank in Europe (PDBe) • It is the European partner of wwPDB Protein Data Bank Japan (PDBj) • It is the Asian and Middle Eastern partner of wwPDB Biological Magnetic Resonance Data Bank (BMRB) • It is the central repository for experimental nuclear magnetic resonance (NMR) spectral data; it aims to archive and annotate NMR data obtained from macromolecules such as peptides, proteins, and nucleic acids • It is managed and sponsored by US organisations: the University of Wisconsin-Madison, the National Library of Medicine, and the National Institutes of Health
  • 35. HME 2228: BIOINFORMATICS b) NCBI Structure Resources • The NCBI devotes one of its databases to structure information • It contains information about the three-dimensional structures of biological macromolecules such as proteins, nucleic acids, and complex assemblies • Accessible through; https://www.ncbi.nlm.nih.gov/structure/ c) Nucleic Acid Knowledgebase (NAKB) • It archives 3D structures of nucleic acids, such as DNA and RNA • NAKB provides search, report, statistics and visualization pages for all nucleic acids with experimentally determined 3D structures • It was established in 2023 to succeed the retired Nucleic Acid Database (NDB) • Accessible through; https://www.nakb.org/
  • 36. HME 2228: BIOINFORMATICS Practice querying the databases with the accession number 2BH9 in www.wwpdb.org Other structural databases: • PDBsum: Structural Summaries of PDB Entries; http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/ • sc-PDB: A 3D Database of Ligandable Binding Sites; https://ngdc.cncb.ac.cn/databasecommons/database/id/57 • PDBTM: Protein Data Bank of Transmembrane Proteins; https://pdbtm.unitmp.org/ • CATH Database; http://www.cathdb.info/ • SCOP (Structural Classification of Proteins) Database; http://scop.mrc-lmb.cam.ac.uk/
  • 37. HME 2228: BIOINFORMATICS NEXT: Secondary biological databases