SlideShare a Scribd company logo
Distance based
phylogenetics
Usman Roshan
Phylogenetics
• Study of how species relate to each other
• “Nothing in biology makes sense, except
in the light of evolution”, Theodosius
Dobzhansky, Am. Biol. Teacher (1973)
• Rich in computational problems
• Fundamental tool in comparative
bioinformatics
Why phylogenetics?
• Study of evolution
– Origin and migration of humans
– Origin and spead of disease
• Many applications in comparative bioinformatics
– Sequence alignment
– Motif detection (phylogenetic motifs, evolutionary trace,
phylogenetic footprinting)
– Correlated mutation (useful for structural contact prediction)
– Protein interaction
– Gene networks
– Vaccine devlopment
– And many more…
Phylogeny Problem
TAGCCCA TAGACTT TGCACAA TGCGCTT
AGGGCAT
U V W X Y
U
V W
X
Y
Bipartitions
• Phylogenies are equivalent to bipartitions
Topological differences
Phylogeny Problem
• Two main methodologies:
– Alignment first and phylogeny second
• Construct alignment using one of the MANY alignment programs in
the literature
• Do manual (eye) adjustments if necessary
• Apply a phylogeny reconstruction method
• Fast but biologically not realistic
• Phylogeny is highly dependent on accuracy of alignment (but so is
the alignment on the phylogeny!)
– Simultaneously alignment and phylogeny reconstruction
• Output both an alignment and phylogeny
• Computationally much harder
• Biologically more realistic as insertions, deletions, and mutations
occur during the evolutionary process
First methodology
• Compute alignment (for now we assume we are given an
alignment)
• Construct a phylogeny (two approaches)
• Distance-based methods
– Input: Distance matrix containing pairwise statistical estimation
of aligned sequences
– Output: Phylogenetic tree
– Fast but less accurate
• Character-based methods
– Input: Sequence alignment
– Output: Phylogenetic tree
– Accurate but computationally very hard
Distance-based methods
Evolution on a single edge
• Poisson process
– Number of changes in a fixed time interval t is
independent of changes in any other non-overlapping
time interval u
– Number of changes in time interval t is proportional to
the length of the interval
– No changes in time interval of length 0
• Let X be the number of nucleotide changes on a
single edge. We assume X is a Poisson process
• Probability dictates that
Evolution on a single edge
• We want to compute (the probability of a
nucleotide change on edge e)
• The probability of observing a change is just the
sum of probabilities of observing k changes over
all possible values of k (excluding even ones
because those changes cannot be seen)
• Expected number of nucleotide changes on a
given edge is given by
• Key: is additive
Evolution on a single edge
Additivity
• Assume we have a path of k edges and that p1,
p2,…, pk are the probabilities of change on each
edge of the path
• Using induction we can show that
• Multiplicative term is hard to deal with and does
not easily decompose into a product or sum of p
i’s
Additivity
• But the expected number of nucleotide
changes on the path p is elegant
Evolutionary models
• Simple 0,1 alphabet
evolutionary model
– i.i.d. model
– uniformly random root
sequence
• Jukes-Cantor:
– Uniformly random root
sequence
– i.i.d. model
• General Markov
Model
– Uniformly random root
sequence
– i.i.d. model
– For time reversible
models
Evolutionary models
Variation across sites
• Standard assumption of how sites can
vary is that each site has a multiplicative
scaling factor
• Typically these scaling factors are drawn
from a Gamma distribution (or Gamma
plus invariant)
Special issues
• Molecular clock: the expected number of
changes for a site is proportional to time
• No-common-mechanism model: there is a
random variable for every combination of
edge and site
Evolutionary distance estimation
Estimating evolutionary distances
• For sequences A and B what is the evolutionary
distance under the Jukes-Cantor model?
– ACCTGTGGGTAACCACCC
– ACCTGAGGGATAGGTCCG
• But we don’t know what is
Estimating evolutionary distances
• Assume nucleotide changes are Bernoulli trials
(i.i.d. trials of success or failure)
• is probability of head in n Bernoulli trials (n is
sequence length)
• Compute a maximum likelihood estimate for
ACCTGTGGGTAACCACCC
ACCTGAGGGATAGGTCCG
0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 1
Estimating evolutionary distance
• We want to find the value of p that maximizes
the probability:
• Set dP/dp to 0 and solve for p to get
Estimating evolutionary distances
• = 5/18
• Continuing in this manner we estimate for
all pairs of sequences in the alignment
• We now have a distance matrix under a
biologically sound evolutionary model
ACCTGTGGGTAACCACCC
ACCTGAGGGATAGGTCCG
0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 1
Distance methods
Distance methods
• UPGMA: similar to hierarchical clustering
but not additive
• Neighbor-joining: more sophisticated and
additive
• What is additivity?
Additivity
UPGMA
UPGMA is not additive but works for
ultrametric trees. Takes O(n^3) time
A D
C
B
3 3
3
10
10
A
B
C
D
A B C D
6 26 26
26 26
6
3
1. Initialize n clusters where each cluster i contains the
sequence i
2. Find closest pair of clusters i, j, using distances in
matrix D
3. Make them neighbors in the tree by adding new node
(ij), and set distance from (ij) to i and j as Dij/2
4. Update distance matrix D: for all clusters k do the
following (ni and nj are size of clusters i and j
respectively)
5. Delete columns and rows for i and j in D and add new
ones corresponding to cluster (ij) with distances as
computed above
6. Goto step 2 until only one cluster is left
UPGMA
UPGMA
A
B
C
D
A B C D
6 26 26
26 26
6
A D
C
B
3 3
3
10
10
3
UPGMA
Doesn’t work (in general) for non-ultrametric
trees
A D
C
B
10
3
3 3
3
10
A
B
C
D
A B C D
13 19 26
12 19
13
UPGMA
UPGMA constructs incorrect tree here
A
B
C
D
A B C D
13 19 26
12 19
13
B D
A
C
UPGMA
Bipartition (BC,AD) is not in true tree
B D
A
C
A D
C
B
10
3
3 3
3
10
True tree UPGMA tree
Neighbor joining
1. Additive and O(n^3) time
2. Initialization: same as UPGMA
3. For each species compute
4. Select i and j for which is
minimum
5. Make them neighbors in the tree by adding
new node (ij), and set distance from (ij) to i
and j as
Neighbor joining
6. Update distance matrix D: for all clusters k do
the following
7. Delete columns and rows for i and j in D and
add new ones corresponding to cluster (ij) with
distances as computed above
8. Go to 3 until two nodes/clusters are left
NJ
NJ constructs the correct tree for additive
matrices
A D
C
B
10
3
3 3
3
10
A
B
C
D
A B C D
13 19 26
12 19
13
Simulation studies
Simulation studies
• The true evolutionary tree is never known in
practice. Simulation allows us to study accuracy
of methods under biologically realistic scenarios
• Mathematics behind the phylogenetics is often
complex and challenging. Simulation allows us
to study algorithms when not possible
theoretically and also examine algorithm
performance under various conditions such as
different evolutionary rates, sequence lengths,
or numbers of taxa
Statistical consistency
• As sequence lengths tend to infinity the distance
estimation improves and eventually leads to the true
additive matrix
• If a method like NJ is then applied we get the true tree.
• In practice, however, we have limited sequence length.
Therefore we want to know how much sequence length a
method requires to achieve low error
Convergence rates
• Can be studied
experimentally or
theoretically
• Theoretical results
offer loose bounds
• Experiments (under
simulation) provide
more realistic bounds
on sequence lengths
Sequence length requirements
Sequence length requirements
Typical performance study
Sequence lengths for NJ
Sequence lengths required to obtain 90% accuracy
Error rate of NJ

More Related Content

PDF
EVE161: Microbial Phylogenomics - Class 4 - Phylogeny
PPT
Phylogenetics1
PPTX
human phylogetic contrution of evolution tree.pptx
PPTX
Phylogenetic tree construction
PPT
6238578.ppt
PDF
A COMPARATIVE ANALYSIS OF PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT APPROACHES
PDF
A COMPARATIVE ANALYSIS OF PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT APPROACHES ...
PDF
A COMPARATIVE ANALYSIS OF PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT APPROACHES ...
EVE161: Microbial Phylogenomics - Class 4 - Phylogeny
Phylogenetics1
human phylogetic contrution of evolution tree.pptx
Phylogenetic tree construction
6238578.ppt
A COMPARATIVE ANALYSIS OF PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT APPROACHES
A COMPARATIVE ANALYSIS OF PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT APPROACHES ...
A COMPARATIVE ANALYSIS OF PROGRESSIVE MULTIPLE SEQUENCE ALIGNMENT APPROACHES ...

Similar to phylogenetics (1)...............................ppt (20)

PPTX
Distance based method
PPTX
BTC 506 Phylogenetic Analysis.pptx
PPTX
PPTX
Tools in phylogeny
PPTX
BioINfo.pptx
PPT
Phylogenetic trees
PDF
Phylogenetic analysis
PPTX
Algorithm research project neighbor joining
PPTX
PPT
Interpreting ‘tree space’ in the context of very large empirical datasets
DOCX
Humans, it would seem, have a great love of categorizing, organi
PDF
Molecular Evolution and Phylogenetics (2009)
PPTX
Perl for Phyloinformatics
PPT
Multiple Sequence Alignment-just glims of viewes on bioinformatics.
PPT
Phylogenetic analysis
PPTX
Upgma
PPTX
Phylogenetic tree
PPT
Data mining maximumlikelihood
PPT
Data mining maximumlikelihood
PPT
Data mining maximumlikelihood
Distance based method
BTC 506 Phylogenetic Analysis.pptx
Tools in phylogeny
BioINfo.pptx
Phylogenetic trees
Phylogenetic analysis
Algorithm research project neighbor joining
Interpreting ‘tree space’ in the context of very large empirical datasets
Humans, it would seem, have a great love of categorizing, organi
Molecular Evolution and Phylogenetics (2009)
Perl for Phyloinformatics
Multiple Sequence Alignment-just glims of viewes on bioinformatics.
Phylogenetic analysis
Upgma
Phylogenetic tree
Data mining maximumlikelihood
Data mining maximumlikelihood
Data mining maximumlikelihood
Ad

More from iqbalzubaria69 (20)

PPTX
evidance of evolution....................................pptx
PPTX
Evolution_of_elephant........................................pptx
PPTX
Vertebrtes (fishes)-1......................pptx
PPTX
AVES (1).......................................pptx
PPTX
Endangered species of Pakistan (3).pptx
PPTX
Harmones and behavior...............................pptx
PPTX
DOC-20250106-WA0018..............................pptx
PPTX
Harmones and behavior.pptxmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
PPTX
cladistics.pptx/////////////////////////////////////
PPT
cooprative breeding................................ppt
PPTX
Animal competition............................pptx
PPTX
Altruism........................................pptx
PPTX
mosquito isba...........................pptx
PPTX
seminar-1........................................pptx
PPTX
how elite exploits lower class.........................................pptx
PPTX
seminar-1.pptx..............................
PPTX
SILVERFISH .AZKA.pptx....................................
DOCX
Endoplasmic Reticulum.........................docx
DOCX
Lysosomes.docx......................................
PPTX
Tehmeena_Tanveer[2][1]...........................pptx
evidance of evolution....................................pptx
Evolution_of_elephant........................................pptx
Vertebrtes (fishes)-1......................pptx
AVES (1).......................................pptx
Endangered species of Pakistan (3).pptx
Harmones and behavior...............................pptx
DOC-20250106-WA0018..............................pptx
Harmones and behavior.pptxmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
cladistics.pptx/////////////////////////////////////
cooprative breeding................................ppt
Animal competition............................pptx
Altruism........................................pptx
mosquito isba...........................pptx
seminar-1........................................pptx
how elite exploits lower class.........................................pptx
seminar-1.pptx..............................
SILVERFISH .AZKA.pptx....................................
Endoplasmic Reticulum.........................docx
Lysosomes.docx......................................
Tehmeena_Tanveer[2][1]...........................pptx
Ad

Recently uploaded (20)

PPTX
Pharmacology of Heart Failure /Pharmacotherapy of CHF
PDF
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
PPTX
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
PDF
Computing-Curriculum for Schools in Ghana
PPTX
Lesson notes of climatology university.
PPTX
Pharma ospi slides which help in ospi learning
PDF
Microbial disease of the cardiovascular and lymphatic systems
PDF
Module 4: Burden of Disease Tutorial Slides S2 2025
PDF
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
PDF
Pre independence Education in Inndia.pdf
PDF
01-Introduction-to-Information-Management.pdf
PPTX
human mycosis Human fungal infections are called human mycosis..pptx
PDF
Anesthesia in Laparoscopic Surgery in India
PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PPTX
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
PDF
O5-L3 Freight Transport Ops (International) V1.pdf
PDF
TR - Agricultural Crops Production NC III.pdf
PPTX
Renaissance Architecture: A Journey from Faith to Humanism
PDF
2.FourierTransform-ShortQuestionswithAnswers.pdf
PDF
102 student loan defaulters named and shamed – Is someone you know on the list?
Pharmacology of Heart Failure /Pharmacotherapy of CHF
The Lost Whites of Pakistan by Jahanzaib Mughal.pdf
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
Computing-Curriculum for Schools in Ghana
Lesson notes of climatology university.
Pharma ospi slides which help in ospi learning
Microbial disease of the cardiovascular and lymphatic systems
Module 4: Burden of Disease Tutorial Slides S2 2025
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
Pre independence Education in Inndia.pdf
01-Introduction-to-Information-Management.pdf
human mycosis Human fungal infections are called human mycosis..pptx
Anesthesia in Laparoscopic Surgery in India
STATICS OF THE RIGID BODIES Hibbelers.pdf
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
O5-L3 Freight Transport Ops (International) V1.pdf
TR - Agricultural Crops Production NC III.pdf
Renaissance Architecture: A Journey from Faith to Humanism
2.FourierTransform-ShortQuestionswithAnswers.pdf
102 student loan defaulters named and shamed – Is someone you know on the list?

phylogenetics (1)...............................ppt

  • 2. Phylogenetics • Study of how species relate to each other • “Nothing in biology makes sense, except in the light of evolution”, Theodosius Dobzhansky, Am. Biol. Teacher (1973) • Rich in computational problems • Fundamental tool in comparative bioinformatics
  • 3. Why phylogenetics? • Study of evolution – Origin and migration of humans – Origin and spead of disease • Many applications in comparative bioinformatics – Sequence alignment – Motif detection (phylogenetic motifs, evolutionary trace, phylogenetic footprinting) – Correlated mutation (useful for structural contact prediction) – Protein interaction – Gene networks – Vaccine devlopment – And many more…
  • 4. Phylogeny Problem TAGCCCA TAGACTT TGCACAA TGCGCTT AGGGCAT U V W X Y U V W X Y
  • 5. Bipartitions • Phylogenies are equivalent to bipartitions
  • 7. Phylogeny Problem • Two main methodologies: – Alignment first and phylogeny second • Construct alignment using one of the MANY alignment programs in the literature • Do manual (eye) adjustments if necessary • Apply a phylogeny reconstruction method • Fast but biologically not realistic • Phylogeny is highly dependent on accuracy of alignment (but so is the alignment on the phylogeny!) – Simultaneously alignment and phylogeny reconstruction • Output both an alignment and phylogeny • Computationally much harder • Biologically more realistic as insertions, deletions, and mutations occur during the evolutionary process
  • 8. First methodology • Compute alignment (for now we assume we are given an alignment) • Construct a phylogeny (two approaches) • Distance-based methods – Input: Distance matrix containing pairwise statistical estimation of aligned sequences – Output: Phylogenetic tree – Fast but less accurate • Character-based methods – Input: Sequence alignment – Output: Phylogenetic tree – Accurate but computationally very hard
  • 10. Evolution on a single edge • Poisson process – Number of changes in a fixed time interval t is independent of changes in any other non-overlapping time interval u – Number of changes in time interval t is proportional to the length of the interval – No changes in time interval of length 0 • Let X be the number of nucleotide changes on a single edge. We assume X is a Poisson process • Probability dictates that
  • 11. Evolution on a single edge • We want to compute (the probability of a nucleotide change on edge e) • The probability of observing a change is just the sum of probabilities of observing k changes over all possible values of k (excluding even ones because those changes cannot be seen)
  • 12. • Expected number of nucleotide changes on a given edge is given by • Key: is additive Evolution on a single edge
  • 13. Additivity • Assume we have a path of k edges and that p1, p2,…, pk are the probabilities of change on each edge of the path • Using induction we can show that • Multiplicative term is hard to deal with and does not easily decompose into a product or sum of p i’s
  • 14. Additivity • But the expected number of nucleotide changes on the path p is elegant
  • 15. Evolutionary models • Simple 0,1 alphabet evolutionary model – i.i.d. model – uniformly random root sequence • Jukes-Cantor: – Uniformly random root sequence – i.i.d. model
  • 16. • General Markov Model – Uniformly random root sequence – i.i.d. model – For time reversible models Evolutionary models
  • 17. Variation across sites • Standard assumption of how sites can vary is that each site has a multiplicative scaling factor • Typically these scaling factors are drawn from a Gamma distribution (or Gamma plus invariant)
  • 18. Special issues • Molecular clock: the expected number of changes for a site is proportional to time • No-common-mechanism model: there is a random variable for every combination of edge and site
  • 20. Estimating evolutionary distances • For sequences A and B what is the evolutionary distance under the Jukes-Cantor model? – ACCTGTGGGTAACCACCC – ACCTGAGGGATAGGTCCG • But we don’t know what is
  • 21. Estimating evolutionary distances • Assume nucleotide changes are Bernoulli trials (i.i.d. trials of success or failure) • is probability of head in n Bernoulli trials (n is sequence length) • Compute a maximum likelihood estimate for ACCTGTGGGTAACCACCC ACCTGAGGGATAGGTCCG 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 1
  • 22. Estimating evolutionary distance • We want to find the value of p that maximizes the probability: • Set dP/dp to 0 and solve for p to get
  • 23. Estimating evolutionary distances • = 5/18 • Continuing in this manner we estimate for all pairs of sequences in the alignment • We now have a distance matrix under a biologically sound evolutionary model ACCTGTGGGTAACCACCC ACCTGAGGGATAGGTCCG 0 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 1
  • 25. Distance methods • UPGMA: similar to hierarchical clustering but not additive • Neighbor-joining: more sophisticated and additive • What is additivity?
  • 27. UPGMA UPGMA is not additive but works for ultrametric trees. Takes O(n^3) time A D C B 3 3 3 10 10 A B C D A B C D 6 26 26 26 26 6 3
  • 28. 1. Initialize n clusters where each cluster i contains the sequence i 2. Find closest pair of clusters i, j, using distances in matrix D 3. Make them neighbors in the tree by adding new node (ij), and set distance from (ij) to i and j as Dij/2 4. Update distance matrix D: for all clusters k do the following (ni and nj are size of clusters i and j respectively) 5. Delete columns and rows for i and j in D and add new ones corresponding to cluster (ij) with distances as computed above 6. Goto step 2 until only one cluster is left UPGMA
  • 29. UPGMA A B C D A B C D 6 26 26 26 26 6 A D C B 3 3 3 10 10 3
  • 30. UPGMA Doesn’t work (in general) for non-ultrametric trees A D C B 10 3 3 3 3 10 A B C D A B C D 13 19 26 12 19 13
  • 31. UPGMA UPGMA constructs incorrect tree here A B C D A B C D 13 19 26 12 19 13 B D A C
  • 32. UPGMA Bipartition (BC,AD) is not in true tree B D A C A D C B 10 3 3 3 3 10 True tree UPGMA tree
  • 33. Neighbor joining 1. Additive and O(n^3) time 2. Initialization: same as UPGMA 3. For each species compute 4. Select i and j for which is minimum 5. Make them neighbors in the tree by adding new node (ij), and set distance from (ij) to i and j as
  • 34. Neighbor joining 6. Update distance matrix D: for all clusters k do the following 7. Delete columns and rows for i and j in D and add new ones corresponding to cluster (ij) with distances as computed above 8. Go to 3 until two nodes/clusters are left
  • 35. NJ NJ constructs the correct tree for additive matrices A D C B 10 3 3 3 3 10 A B C D A B C D 13 19 26 12 19 13
  • 37. Simulation studies • The true evolutionary tree is never known in practice. Simulation allows us to study accuracy of methods under biologically realistic scenarios • Mathematics behind the phylogenetics is often complex and challenging. Simulation allows us to study algorithms when not possible theoretically and also examine algorithm performance under various conditions such as different evolutionary rates, sequence lengths, or numbers of taxa
  • 38. Statistical consistency • As sequence lengths tend to infinity the distance estimation improves and eventually leads to the true additive matrix • If a method like NJ is then applied we get the true tree. • In practice, however, we have limited sequence length. Therefore we want to know how much sequence length a method requires to achieve low error
  • 39. Convergence rates • Can be studied experimentally or theoretically • Theoretical results offer loose bounds • Experiments (under simulation) provide more realistic bounds on sequence lengths
  • 43. Sequence lengths for NJ Sequence lengths required to obtain 90% accuracy

Editor's Notes