BIOINFORMATICS
TOPIC: DATABASE SEARCHING
PRESENTED BY:
YASH SOGANI
MCA/25025/18
BIOLOGICAL DATABASE
A BIOLOGICAL DATABASE is a large, organized body
of persistent data, usually associated with
computerized software designed to update, query,
and retrieve components of the data stored within the
system. A simple database might be a single file
containing many records, each of which includes the
same set of information.
FOR EXAMPLE
A few popular databases are GenBank from NCBI (National
Center for Biotechnology Information), SwissProt from the Swiss
Institute of Bioinformatics and PIR from the Protein Information
Resource.
GenBank: GenBank (Genetic Sequence Databank) is one of the
fastest growing repositories of known genetic sequences.
EMBL: The EMBL Nucleotide Sequence Database is a
comprehensive database of DNA and RNA sequences collected
from the scientific literature and patent applications, and
submitted directly by researchers and sequencing groups.
DATABASE SEARCHING
Searching is done to find the relatedness between the query
and the entries in the database.
For nucleic acids and proteins, relatedness is usually discussed in
terms of “homology”, that is, descent from a common ancestor. A
‘query’ sequence is searched against each entry, a ‘subject’, in
the database.
Homology cannot be measured directly; two sequences are inferred
to be homologous when their similarity exceeds a chosen threshold.
Thresholds can be defined by alignment length, percentage identity,
E-value, bit score, etc., or a combination of these,
depending on the objective of the search.
Basic Elements in Searching Biological Databases
Sensitivity versus Specificity/selectivity
Scoring Scheme, Gap penalties
Distance/Substitution Matrices (PAM, BLOSUM Series)
Search Parameters (E-value, Bit score)
Handling Data Quality Issues (Filtering, Clustering)
Type of Algorithm (Smith-Waterman, Needleman-Wunsch)
Sensitivity vs. Specificity
Sensitivity:
The ability to report ALL true positives (real homologs)
Sensitivity = True Positives / (True Positives + False Negatives)
= TP/(TP+FN)
(1 - sensitivity) gives the false negative rate
Specificity:
The ability to reject ALL true negatives (unrelated sequences)
Specificity = True Negatives / (True Negatives + False Positives)
= TN/(TN+FP)
(1 - specificity) gives the false positive rate
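The two formulas above can be checked with a few lines of Python. This is a minimal sketch; the counts in the example are hypothetical and are not taken from any real search.

```python
def sensitivity(tp, fn):
    """Fraction of real homologs that the search reported: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of unrelated sequences that the search rejected: TN / (TN + FP)."""
    return tn / (tn + fp)


# Hypothetical outcome of one database search (illustrative numbers only).
tp, fn, tn, fp = 45, 5, 900, 50

print("sensitivity        :", sensitivity(tp, fn))
print("false negative rate:", 1 - sensitivity(tp, fn))
print("specificity        :", specificity(tn, fp))
print("false positive rate:", 1 - specificity(tn, fp))
```

Loosening a search threshold usually raises sensitivity and lowers specificity, which is why the two are reported together.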
Scoring Scheme
Match: alignment of identical letters or letters of the same
group
Mismatch: alignment of letters from different groups
Gap: alignment of a letter with a gap character
The alignment score is the sum of the match scores, mismatch
penalties and gap penalties
Say you are aligning two sequences, A and B:
Sequence A: PQVNNTVNRT
Sequence B: PVNRT
Three possible alignments:
1) PQVNNTVNRT
   P-VNRT----
2) PQVNNTVNRT
   P-V-N---RT
3) PQVNNTVNRT
   P-----VNRT
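To see how a scoring scheme ranks the three alignments, the sketch below sums match, mismatch and gap scores column by column. The numeric values (+1, -1, gap open -3, gap extend -1) are assumptions chosen for illustration, and the open/extend split (an affine gap penalty) goes slightly beyond the simple sum described on the slide.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Score two already-aligned strings of equal length.

    Simplification: a run of gap columns is charged one gap_open plus
    gap_extend for each further column, regardless of which sequence
    carries the gap.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    in_gap = False
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += match if x == y else mismatch
            in_gap = False
    return total


seq_a = "PQVNNTVNRT"
candidates = ["P-VNRT----", "P-V-N---RT", "P-----VNRT"]
for number, aligned_b in enumerate(candidates, start=1):
    print(number, aligned_b, score_alignment(seq_a, aligned_b))
```

Under these assumed penalties the third alignment scores highest, because it pays for a single long gap instead of several short ones.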
Distance/Substitution Matrices
• Unitary matrices/minimum distance matrices
• PAM (Percent Accepted Mutations)
• BLOSUM (BLOcks SUbstitution Matrix)
PAM (Point Accepted Mutation, also expanded as Percent Accepted Mutations)
Developed by Dayhoff and co-workers
Common matrices: PAM 30, 60, 100, 200, 250
Built from global alignments of closely related sequences (at
least 85% identical)
Based on a database of 1572 changes in 71 groups of closely
related proteins
The PAM 1 matrix incorporates the amino acid replacements that
would be expected if one accepted mutation had occurred per 100
amino acids of sequence, i.e., it corresponds to roughly one
percent divergence in a protein
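In Dayhoff's model the higher-numbered matrices are obtained by multiplying the PAM 1 mutation probability matrix by itself, so PAM 250 corresponds to PAM 1 raised to the 250th power. The sketch below shows only that extrapolation step. The 3-letter alphabet and every number in TOY_PAM1 are invented for illustration; the real matrix is 20 x 20 and is estimated from data.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy "PAM 1"-like matrix: each row sums to 1 and about 99% of residues
# stay unchanged after one step (hypothetical values, not Dayhoff's data).
TOY_PAM1 = [
    [0.990, 0.007, 0.003],
    [0.007, 0.990, 0.003],
    [0.003, 0.003, 0.994],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

After many steps the probability of a residue staying unchanged drops well below 0.99, which is why high-numbered PAM matrices suit distantly related sequences.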
BLOSUM (BLOcks SUbstitution Matrix)
Developed by Henikoff and Henikoff (1992)
Common matrices: BLOSUM 30, 62, 80
Built from the BLOCKS database
Derived from the most conserved, ungapped regions of aligned sequences
2000 blocks from 500 families
BLOSUM 62 is the most popular. Here, 62 means that sequences
62% or more identical were clustered and counted as a single
sequence when building the matrix, so it reflects substitutions
between more distantly related sequences
E-value (Expectation value)
The number of alignments with an equal or higher score expected
to occur by chance for a given High-scoring Segment Pair (HSP)
An E-value of 10 for a match means that, in a database of the
current size, one might expect to see 10 matches with a similar or
better score simply by chance
The E-value is the most commonly used threshold in database
searches. Only matches with E-values smaller than the set
threshold are reported in the output
E-values range from 0 upward; the lower the E-value, the more
reliable the match.
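The dependence of the E-value on database size can be made concrete with the Karlin-Altschul relation E = m x n x 2^(-S'), where S' is the bit score, m the query length and n the total database length. This is a minimal sketch that ignores the edge-effect ("effective length") corrections real tools apply; the lengths in the example are hypothetical.

```python
def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least bit_score.

    Uses E = m * n * 2**(-S'), with m = query_len and n = db_len.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Same hit (bit score 50, query of 300 residues) against two database sizes.
small_db = expect_value(50, 300, 10**8)
large_db = expect_value(50, 300, 10**9)
print(small_db, large_db)
```

A database ten times larger gives an E-value ten times larger for the same alignment, which is why an E-value should be read together with the database it was computed against.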
Bit Score
Raw scores have no meaning without knowledge of the
scoring scheme used.
Raw scores are normalized to bit scores by incorporating
the statistical parameters of the scoring scheme (substitution
matrix and gap penalties)
Because the bit score is normalized in this way, it is independent
of the size of the database, while E-values are very sensitive to the
database size.
As a rule of thumb, bit scores of 40 or higher are considered reliable
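The normalization can be written as S' = (lambda x S - ln K) / ln 2, where S is the raw score and lambda and K are statistical parameters that depend on the scoring scheme. The sketch below applies that formula; the values lambda = 0.267 and K = 0.041 are used only as illustrative assumptions, since the correct values depend on the matrix and gap penalties in use.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a bit score.

    lam and k are the Karlin-Altschul parameters of the scoring scheme.
    """
    return (lam * raw_score - math.log(k)) / math.log(2)


# Illustrative parameters (assumed, not taken from the slides).
LAMBDA = 0.267
K = 0.041

for raw in (50, 100, 200):
    print(raw, round(bit_score(raw, LAMBDA, K), 1))
```

Because lambda and K absorb the details of the scoring scheme, bit scores from searches run with different matrices can be compared with each other, which raw scores cannot.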
Filtering low complexity sequences
Filters out short repeats and low complexity regions
from the query sequences before searching the
database.
Filtering helps to obtain statistically significant results
and reduce the background noise resulting from
matches with repeats and low complexity regions.
The output shows which regions of the query
sequence were masked.
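One simple way to picture low-complexity masking is a sliding window that measures Shannon entropy and replaces low-entropy stretches with X. This is only a sketch of the idea: the window size and threshold are arbitrary assumptions, and production filters such as SEG (proteins) and DUST (nucleotides) use more refined measures.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the letter composition of a string."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace every residue inside a low-entropy window with 'X'."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


# Hypothetical query with a serine repeat in the middle.
query = "MKVLAEDGHT" + "SSSSSSSSSSSS" + "PWYRCNQIFM"
print(query)
print(mask_low_complexity(query))
```

Windows that only partly overlap the repeat can also fall below the threshold, so the masked region may extend a little into the flanking sequence.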
Choice of the Searching Algorithm
An ideal algorithm should
• Have good specificity and sensitivity
• Run fast
• Not use too much memory
Rigorous (exhaustive) algorithms are very sensitive, but very slow.
Heuristic algorithms are relatively fast, but lose some sensitivity.
It's always a challenge for a programmer to develop algorithms
that fulfill both of these requirements
Needleman-Wunsch algorithm (JMB 48:443-53, 1970)
• Exhaustive algorithm that considers every possible alignment, so very sensitive
• Implements dynamic programming
• Provides a global alignment between the two sequences
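A compact sketch of the global alignment recurrence is given below. It assumes a simple linear gap penalty and illustrative scores (+1 match, -1 mismatch, -2 gap) rather than a substitution matrix, so it demonstrates the dynamic programming idea and is not a reference implementation.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1], b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to the top-left corner.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = needleman_wunsch("PQVNNTVNRT", "PVNRT")
print(total)
print(top)
print(bottom)
```

Every cell of the matrix is filled, so time and memory grow with the product of the two sequence lengths; this is the cost that heuristic search tools try to avoid.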
Smith-Waterman algorithm (JMB 147:195-97, 1981)
• A modification of the above algorithm in which cell scores are
never allowed to fall below zero, so the best-matching subsequences are found
• Also exhaustive: it guarantees the optimal local alignment
• Implements dynamic programming
• Provides a local alignment between the two sequences
• Both BLAST and FASTA approximate this algorithm, each applying
different heuristics to gain speed
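The local alignment recurrence differs from the global one in two places: cell scores are floored at zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The sketch below assumes a linear gap penalty and illustrative scores (+2 match, -1 mismatch, -2 gap).

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for an optimal local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                          # start a new alignment
                              score[i - 1][j - 1] + sub,  # extend diagonally
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero-scoring cell is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("PQVNNTVNRT", "PVNRT"))
```

With these assumed scores the best local alignment for the example pair is the shared VNRT segment; the leading P is left out because reaching it would cost more in gaps than it gains.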
FASTA (FAST-All)
The first step applies heuristics and the second step uses
dynamic programming
• First, the query sequence and the database sequence are
cut into words of a defined length (ktup), and word matching is
performed in all-to-all combinations
• The word size is typically 2 for proteins and 6 for nucleic acids
• If the initial score is above a threshold, a second score is
computed by joining fragments, allowing gaps of less than
some maximum length
• If this second score is above some threshold, a Smith-Waterman
alignment is performed within a band around the regions of
high identity
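The word-matching step can be sketched with a lookup table: index every word of the query, scan the database sequence, and count hits per diagonal (subject position minus query position). Diagonals with many hits mark the regions worth rescoring. The function name and the example sequences are illustrative only; the later scoring and joining stages are not shown.

```python
from collections import defaultdict


def diagonal_hits(query, subject, ktup=2):
    """Count exact word matches on each diagonal (subject pos - query pos)."""
    # Lookup table: word -> list of start positions in the query.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)

    hits = defaultdict(int)
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            hits[j - i] += 1  # matches on one diagonal share the same offset
    return dict(hits)


print(diagonal_hits("PVNRT", "PQVNNTVNRT", ktup=2))
```

In this example three word hits fall on diagonal 5, where the query lines up with the end of the longer sequence, and a single stray hit falls on diagonal 1.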
BLAST (Basic Local Alignment Search Tool)
The first step applies heuristics and the second step uses dynamic
programming
• First, the query sequence and the database sequence are cut into
words of a defined length, and word matching is performed in all
combinations.
• Neighboring words that score above a threshold against a query word are added to the word list.
• Word hits are extended into several high-scoring segment pairs (HSPs), with the maximum-scoring
segment used to define a band in the path graph
• The Smith-Waterman algorithm is performed on several possible segments to
obtain the optimal alignment
• The default word size is 3 for proteins and 11 for nucleic acids.
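The seed-and-extend idea can be sketched as follows: find exact word hits, then extend each hit in both directions without gaps until the running score drops too far below the best score seen. This is a deliberately simplified illustration. It uses exact matches only (no neighborhood words scored against a threshold), skips the gapped alignment stage, and the scores, drop-off value and function names are all assumptions.

```python
def find_seeds(query, subject, word_size):
    """Return (query_pos, subject_pos) pairs where a word matches exactly."""
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    return [(i, j)
            for j in range(len(subject) - word_size + 1)
            for i in index.get(subject[j:j + word_size], [])]


def extend_seed(query, subject, q_pos, s_pos, word_size,
                match=1, mismatch=-2, x_drop=3):
    """Ungapped extension of one seed; returns (score, query_span, subject_span)."""
    best = word_size * match
    q_end, s_end = q_pos + word_size, s_pos + word_size

    # Extend to the right, remembering the best-scoring end point.
    run, step, best_right = best, 0, 0
    while q_end + step < len(query) and s_end + step < len(subject):
        run += match if query[q_end + step] == subject[s_end + step] else mismatch
        step += 1
        if run > best:
            best, best_right = run, step
        elif best - run > x_drop:
            break

    # Extend to the left, starting from the best right-extended score.
    run, step, best_left = best, 0, 0
    while q_pos - step - 1 >= 0 and s_pos - step - 1 >= 0:
        run += match if query[q_pos - step - 1] == subject[s_pos - step - 1] else mismatch
        step += 1
        if run > best:
            best, best_left = run, step
        elif best - run > x_drop:
            break

    return (best,
            (q_pos - best_left, q_end + best_right),
            (s_pos - best_left, s_end + best_right))


query, subject = "AACGTGG", "TACGTGC"
for q_pos, s_pos in find_seeds(query, subject, word_size=3):
    print(extend_seed(query, subject, q_pos, s_pos, word_size=3))
```

Overlapping seeds on the same diagonal extend to the same segment, so a real implementation would merge them before moving on to the more expensive gapped alignment.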
