Bioinformatics 1 -- lecture 10
Sequence weights
log-odds
profiles
Logos
Modeling a sequence family
In statistical modeling, we build a representative
model for the sequence family. To choose the representative,
we take a poll over all observed sequences. Are they a
representative sample?
[Figure labels: family, superfamily, sequence space]
A typical poll of the database
If we submit one sequence (for example, citrate synthase
from human) to the GenBank database (using BLAST for
example), and take 100 results, and we build a cladogram
from this, we might get something like this...
[Figure: cladogram of the results; leaf labels: primates, rabbit, rat, E. coli, lawyer]
What is our representative going to look like if we use the rule
"one sequence, one vote"?
Sequence weighting corrects for poor sampling
To build a representative model we can...
(1) throw out all redundant sequences and keep
representatives of each clade only, or
(2) apply a weight to each sequence reflecting how non-redundant
that sequence is.
One measure of non-redundancy is sequence-distance, or
evolutionary distance.
Crude weights from a cladogram
Simplest weighting scheme: Start with weight = 1.0 at the
common ancestor of the tree. Split the weight evenly at each
node.
[Figure: the same cladogram (primates, rabbit, rat, E. coli, lawyer) with a weight at every node: 1.000 at the root, halving at each split through 0.500, 0.250, 0.125, 0.0625, 0.031 and 0.016, down to 0.008 at the deepest leaves]
Human sequences are 10/18 of the tree, but only 0.125 of the weights.
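The even-split rule can be sketched in a few lines. This is only an illustration: the nested-tuple tree encoding, the function name `cladogram_weights`, and the toy tree are assumptions, not the tree shown on the slide.

```python
def cladogram_weights(tree, weight=1.0, out=None):
    """Split `weight` evenly among the children of each node.

    `tree` is either a leaf name (str) or a tuple of subtrees.
    Returns a dict mapping each leaf name to its weight.
    """
    if out is None:
        out = {}
    if isinstance(tree, str):
        # A leaf keeps whatever weight reached it.
        out[tree] = weight
        return out
    share = weight / len(tree)
    for child in tree:
        cladogram_weights(child, share, out)
    return out


# Hypothetical toy tree: two human sequences, one rabbit, one E. coli.
toy_tree = ((("human1", "human2"), "rabbit"), "E. coli")
weights = cladogram_weights(toy_tree)
for leaf, w in weights.items():
    print(f"{leaf}: {w:.3f}")
```

In this toy tree the two human sequences are half of the leaves but together hold only a quarter of the weight, which is the same effect the slide describes.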
Better weights from a phylogram
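[Figure: phylogram with branch lengths A = 0.2 and B = 0.1 to their common ancestor, 0.3 from that ancestor to the root, and C = 0.5 to the root]
The sequence weight is calculated starting from the distance from the
taxon to the first ancestor node, adding half of the distance from the first
ancestor to the second ancestor, one quarter of the distance from the second
to the third ancestor, and so on.
wA = 0.2 + 0.3/2 = 0.35
wB = 0.1 + 0.3/2 = 0.25
wC = 0.5
Finally, the weights are normalized: dividing by the sum (1.10) gives
approximately 0.32, 0.23 and 0.45.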
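As a small sketch of the same calculation, assuming each taxon is described by its list of branch lengths from the leaf toward the root (the `paths` encoding and the function name are illustrative choices):

```python
def phylogram_weights(paths):
    """Weight = first branch + 1/2 of the next + 1/4 of the next, and so on.

    `paths` maps each taxon to its branch lengths ordered leaf -> root.
    Returns (raw weights, normalized weights).
    """
    raw = {
        taxon: sum(length / 2**k for k, length in enumerate(branches))
        for taxon, branches in paths.items()
    }
    total = sum(raw.values())
    normalized = {taxon: w / total for taxon, w in raw.items()}
    return raw, normalized


# Branch lengths from the three-taxon example: A and B share an ancestor
# that sits 0.3 below the root; C attaches directly to the root.
paths = {"A": [0.2, 0.3], "B": [0.1, 0.3], "C": [0.5]}
raw, normalized = phylogram_weights(paths)
print(raw)
print(normalized)
```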
Making a phylogram in Geneious
• Align
• Make tree
• turn off “transform branches”
– Resulting branches are proportional to p-distance
(Easy) Distance-based weights
Self-consistent Weights Method of Sander & Schneider, 1994
[Figure: distance matrix for sequences A, B, C: D(A,B) = 0.3, D(A,C) = 1.0, D(B,C) = 0.9]
(1) Sum the weighted distances to get new weights.
(2) Normalize the new weights.
(3) Repeat (1) and (2) until no change.
Pseudocode:
    all wi initialized to 1.
    while (wi ≠ w'i) do
      for i from A to C do
        w'i = Σj wj Dij
      end do
      for i from A to C do
        wi = w'i / Σj w'j
      end do
    end do
Distance-based weights
[Figure: the same distance matrix: D(A,B) = 0.3, D(A,C) = 1.0, D(B,C) = 0.9]
Running the pseudocode:
(1) Sum the weighted distances to get new weights.
    w'A = 0.3 + 1.0 = 1.3
    w'B = 0.3 + 0.9 = 1.2
    w'C = 1.0 + 0.9 = 1.9
(2) Normalize the new weights.
    wA = 1.3/(1.3+1.2+1.9) = 0.30
    wB = 1.2/4.4 = 0.27
    wC = 1.9/4.4 = 0.43
(1) Sum the weighted distances to get new weights.
    w'A = 0.3*0.27 + 1.0*0.43 = 0.51
    w'B = 0.3*0.30 + 0.9*0.43 = 0.48
    w'C = 1.0*0.30 + 0.9*0.27 = 0.54
    ...
(3) Repeat (1) and (2) until no change.
    wABC = 0.33 0.31 0.35
    wABC = 0.30 0.28 0.42
    wABC = 0.31 0.29 0.40
    wABC = 0.30 0.28 0.41
    wABC = 0.30 0.28 0.41 converged.
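The loop above can be sketched as follows. It is a minimal illustration, not the published implementation: the function name, the tolerance, the iteration cap, and the choice to test convergence by comparing successive normalized weights are all assumptions. Carried to full precision on this matrix, the iteration appears to settle near 0.32, 0.30, 0.39, so the two-decimal values in the hand calculation may reflect rounding along the way.

```python
def self_consistent_weights(dist, tol=1e-9, max_iter=1000):
    """Iterate w'_i = sum_j w_j * D_ij, normalize, repeat until stable.

    `dist` is a symmetric distance matrix (list of lists) with zeros
    on the diagonal. `tol` and `max_iter` are illustrative defaults.
    """
    n = len(dist)
    w = [1.0] * n  # all weights start at 1
    for _ in range(max_iter):
        # (1) sum the weighted distances
        new = [sum(w[j] * dist[i][j] for j in range(n)) for i in range(n)]
        # (2) normalize so the weights sum to 1
        total = sum(new)
        new = [x / total for x in new]
        # (3) stop when the normalized weights no longer change
        if max(abs(a - b) for a, b in zip(new, w)) < tol:
            return new
        w = new
    return w


# Distances from the three-sequence example (order: A, B, C).
D = [
    [0.0, 0.3, 1.0],
    [0.3, 0.0, 0.9],
    [1.0, 0.9, 0.0],
]
weights = self_consistent_weights(D)
print([round(x, 3) for x in weights])
```

The most distant sequence (C) ends up with the largest weight and the most redundant one (B) with the smallest, which is the behavior the method is meant to produce.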
Amino acid probability profiles
An amino acid profile is defined as a set of probability distributions
over the 20 amino acids, one distribution for each position in the alignment.
Gap probabilities may or may not be included when talking about a
profile.
[Figure: chart over the 20 amino acids: A C D E F G H I K L M N P Q R S T V W Y]
Amino acids are not equally likely in Nature. Leucine (L) is the most
common in typical protein databases; tryptophan (W) and cysteine (C) are
among the rarest.
[Figure: Profile]
Usually, log-likelihood ratios
LLR(a) = log( P(a|i) / P(a) )
where P(a|i) is the probability of a in one column (column i), and
P(a) is the likelihood of a overall (the whole database).
Pseudocounts, because you never know...
LLR(a) = log( P(a|i) / P(a) )
If P(a|i) = 0, you can't take the log.
The probability of seeing a in column i of a sequence alignment
is never really zero. So we add a small number of 'pseudocounts', ε.
LLR(a) = log( (P(a|i) + ε) / P(a) )
This LLR does not go to negative infinity as P(a|i) --> 0.
Instead it goes to log(ε/P(a)).
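A short sketch of the effect of the pseudocount, using natural logs. The function name `llr` and the example numbers (background 0.05, ε = 0.02) are chosen for illustration only.

```python
import math


def llr(p_column, p_background, eps=0.0):
    """Log-likelihood ratio log((P(a|i) + eps) / P(a)), natural log."""
    return math.log((p_column + eps) / p_background)


# Without a pseudocount, an unobserved amino acid has no finite score:
try:
    llr(0.0, 0.05)
except ValueError as err:
    print("no pseudocount:", err)

# With a pseudocount the score bottoms out at log(eps / P(a)).
floor = llr(0.0, 0.05, eps=0.02)
print("with pseudocount:", floor)
```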
One way to visualize profiles: color matrix
Color = LLR.
Blue = high negative values. Green = zero. Red = high positive values.
Another way: Logos
Height of letter is the LLR.
Example Logos for DNA alignments
Alignments of transcription factor footprint sites
Scoring a sequence versus a profile
Sequence = KEMGFDHIIIHP
score = Σi LLR(ai)
The score is the sum of the log-likelihood ratios of the amino acids
in the sequence.
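As a sketch of the scoring rule, assume the profile is stored as one dictionary per position mapping each amino acid to its LLR. The three-column profile below is made up for illustration and covers only three residues.

```python
def score_sequence(sequence, llr_profile):
    """Sum the LLR of the observed amino acid at each profile position."""
    if len(sequence) != len(llr_profile):
        raise ValueError("sequence and profile must have the same length")
    return sum(column[aa] for aa, column in zip(sequence, llr_profile))


# Hypothetical LLR values for a three-position profile.
toy_profile = [
    {"K": 1.2, "E": -0.5, "M": -1.0},
    {"K": -0.8, "E": 0.9, "M": -0.3},
    {"K": -1.1, "E": -0.2, "M": 1.5},
]
score = score_sequence("KEM", toy_profile)
print(round(score, 2))
```

Because the terms are logs of ratios, adding them corresponds to multiplying the per-position likelihood ratios, treating positions as independent.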
In class exercise: build a profile
Copy “tree of 5” from Collaboration-->Bioinformatics1@RPI
Display as a phylogram
On paper...
(1) Calculate sequence weights based on the distances, using one iteration of
distance-based weights: wA = Σi DiA. Then normalize (divide by Σi wi).
(2) Sum the probabilities of each AA in the nth column, i.e. P(A) = sum of wi
over sequences that contain an A in that column.
(3) Convert each P() to a LLR using equal probability AAs (0.05) as the expected
value. Use a pseudocount of 0.02:
LLR = log((P(n)+0.02)/(0.05))
(4) Divide by log(2) to convert to 'bits'.
(5) Stack letters, Logo style. Height of letter = bits.
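Steps (2) through (4) for a single column might look like the sketch below. The column string and the weights are invented stand-ins (the real ones come from the "tree of 5" alignment), and only residues that actually occur in the column receive a height.

```python
import math
from collections import defaultdict


def column_bits(column, weights, pseudocount=0.02, expected=0.05):
    """Weighted amino acid probabilities for one column, as LLRs in bits."""
    probs = defaultdict(float)
    for aa, w in zip(column, weights):
        probs[aa] += w  # step (2): P(aa) = sum of weights of sequences with aa
    return {
        # steps (3) and (4): LLR with pseudocount, divided by log(2)
        aa: math.log((p + pseudocount) / expected) / math.log(2)
        for aa, p in probs.items()
    }


# Hypothetical column from five aligned sequences, with normalized weights.
column = "AAGAV"
weights = [0.1, 0.1, 0.3, 0.2, 0.3]
bits = column_bits(column, weights)
for aa, height in sorted(bits.items(), key=lambda kv: -kv[1]):
    print(aa, round(height, 2))
```

Sorting by height gives the stacking order for step (5), tallest letter on top.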
Aligning sequence to profile
S(i,j) = 0
do aa=1,20
  S(i,j) = S(i,j) + P(aa,i)*B(aa,s(j))
enddo
Here P(aa,i) = P(aa|i) is the probability of amino acid aa at position i of
profile 1, s(j) is the residue at position j of sequence 2, and B is the
BLOSUM score.
Aligning profile to profile
S(i,j) = 0
do aai=1,20
  do aaj=1,20
    S(i,j) = S(i,j) + P(aai,i)*P(aaj,j)*B(aai,aaj)
  enddo
enddo
No need to normalize, since Σaai Σaaj P(aai|i)*P(aaj|j) = 1.
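The two scoring loops can be sketched together. To keep the example small it uses a three-letter alphabet and a made-up substitution table `B`; these are placeholders, not real BLOSUM values, and the function names are illustrative.

```python
def seq_profile_score(profile_col, residue, B):
    """S(i,j) = sum over aa of P(aa|i) * B(aa, s(j))."""
    return sum(p * B[aa][residue] for aa, p in profile_col.items())


def profile_profile_score(col_i, col_j, B):
    """S(i,j) = sum over aai, aaj of P(aai|i) * P(aaj|j) * B(aai, aaj)."""
    return sum(
        p_i * p_j * B[a_i][a_j]
        for a_i, p_i in col_i.items()
        for a_j, p_j in col_j.items()
    )


# Hypothetical substitution scores over a toy alphabet (NOT real BLOSUM).
B = {
    "A": {"A": 4, "G": 0, "V": 0},
    "G": {"A": 0, "G": 6, "V": -3},
    "V": {"A": 0, "G": -3, "V": 4},
}
col_i = {"A": 0.5, "G": 0.25, "V": 0.25}  # a profile column
col_j = {"A": 0.0, "G": 1.0, "V": 0.0}    # a column that is always G

s_seq = seq_profile_score(col_i, "G", B)
s_prof = profile_profile_score(col_i, col_j, B)
print(s_seq, s_prof)
```

A column that is certain to be G behaves like the single residue G, so the profile-to-profile score reduces to the sequence-to-profile score in that case.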
Psi-BLAST: Blast with profiles
Psi-BLAST searches the database iteratively.
(Cycle 1) Normal BLAST (with gaps)
(Cycle 2) (a) Construct a profile from the results of Cycle 1.
(b) Search the database using the profile.
(Cycle 3) (a) Construct a profile from the results of Cycle 2.
(b) Search the database using the profile.
And so on... (the user sets the number of cycles)
Psi-BLAST is much more sensitive than BLAST.
It is also more vulnerable to low-complexity sequence.
Other forms of BLAST
BLAST               query               database
blastn              nucleotide          nucleotide
blastp              protein             protein
tblastn             protein             translated DNA
blastx              translated DNA      protein
tblastx             translated DNA      translated DNA
psi-blast           protein, profile    protein
phi-blast           pattern             protein
transitive blast*   any                 any
*not really a blast. Just a way of using blast.
PHI-BLAST -- Pattern-Hit Initiated BLAST
