Lecture 11
Phylogenetic trees
Principles of Computational
Biology
Teresa Przytycka, PhD
Phylogenetic (evolutionary) Tree
• A tree showing the evolutionary relationships among
various biological species or other entities that
are believed to have a common ancestor.
• Each node is called a taxonomic unit.
• Internal nodes are generally called hypothetical
taxonomic units.
• Each node with descendants represents the most recent
common ancestor of those descendants.
• Edge lengths (if present) correspond to time
estimates.
Methods to construct phylogenetic
trees
• Parsimony
• Distance matrix based
• Maximum likelihood
Parsimony methods
The preferred evolutionary tree is
the one that requires
“the minimum net amount of evolution”
[Edwards and Cavalli-Sforza, 1963]
Assumptions of character-based
parsimony
• Each taxon is described by a set of characters
• Each character can be in one of a finite number
of states
• In one step, only certain changes of character
state are allowed
• Goal: find the evolutionary tree that explains the
states of the taxa with the minimal number of
changes
Example
Taxon 1 Yes Yes No
Taxon 2 Yes Yes Yes
Taxon 3 Yes No No
Taxon 4 Yes No No
Taxon 5 Yes No Yes
Taxon 6 No No Yes
[Figure: tree with inferred ancestral states; 4 changes]
Versions of parsimony models:
• Character states
– Binary: states are 0 and 1, usually interpreted as presence or
absence of an attribute (e.g. the character is a gene that can be
present or absent in a genome)
– Multistate: any number of states (e.g. characters are
positions in a multiple sequence alignment and states are
A, C, T, G)
• Type of changes:
– Characters are ordered (the changes have to happen in a
particular order) or not.
– The changes are reversible or not.
Variants of parsimony
• Fitch Parsimony – unordered, multistate characters with
reversibility
• Wagner Parsimony – ordered, multistate characters
with reversibility
• Dollo Parsimony – ordered, binary characters with
reversibility but only one insertion allowed per
character; suits characters that are relatively hard to gain but easy
to lose (like introns)
• Camin-Sokal Parsimony – no reversals, derived states
arise once only
• (binary) perfect phylogeny – binary and non-reversible;
each character changes at most once.
Perfect – No (triangle gained and then lost)
Dollo – Yes
Camin-Sokal – No (for the same reason as perfect)
3 Changes
Camin-Sokal Parsimony
Triangle inserted twice!
But this is not perfect and not Dollo
Homoplasy
Having some states arise more than once is
called homoplasy.
Example – triangle in the tree on the
previous slide
Finding most parsimonious tree
• There are exponentially many trees with n
nodes
• Finding the most parsimonious tree is NP-complete
(for most variants of parsimony
models)
• Exception: a perfect phylogeny, if it exists, can
be found quickly. Problem – perfect
phylogeny is too restrictive in practice.
Perfect phylogeny
• Each change can happen only once and is
not reversible.
• Can be directed or not
Example: Consider binary characters. Each character
corresponds to a gene.
0 – gene absent
1 – gene present
It makes sense to assume directed changes only, from 0 to 1.
The root then has to be all zeros.
Perfect phylogeny
Example: characters = genes; 0 = absent; 1 = present
Taxa: genomes (A,B,C,D,E)
genes
   1 2 3 4 5 6
A  0 0 0 1 1 0
B  1 1 0 0 0 0
C  0 0 0 1 1 1
D  1 0 1 0 0 0
E  0 0 0 1 0 0
Perfect phylogeny tree
Goal: For a given character state matrix construct a tree topology
that provides a perfect phylogeny.
[Figure: perfect phylogeny tree with leaves in the order B, D, E, A, C; each gene changes to 1 on exactly one edge]
Does there exist a perfect
phylogeny tree for our example
with geometrical shapes?
There is a simple test
Character Compatibility
• Two characters A, B are
compatible if there do
not exist four taxa
containing all four
combinations as in the
table
• Fact: there exists a perfect
phylogeny if and only if
all pairs of
characters are
compatible
    A B
T1  1 1
T2  1 0
T3  0 1
T4  0 0
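The pairwise test above can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and encoding the Yes/No table of the shapes example as 1/0 is an assumption.

```python
from itertools import combinations

def compatible(col_a, col_b):
    """Two binary characters are compatible unless all four state
    pairs (0,0), (0,1), (1,0), (1,1) occur among the taxa."""
    return len(set(zip(col_a, col_b))) < 4

def all_pairs_compatible(matrix):
    """matrix[i][j] is the state of character j in taxon i."""
    columns = list(zip(*matrix))
    return all(compatible(a, b) for a, b in combinations(columns, 2))

# The six-taxon shapes example, Yes -> 1 and No -> 0 (assumed encoding).
shapes = [
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 1],
    [0, 0, 1],
]
print(all_pairs_compatible(shapes))  # False: characters 2 and 3 clash
```

Characters 2 and 3 show all four combinations (taxa 1, 2, 3 and 5), which is why the test fails for this table.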
Are not compatible
Taxon 1 Yes Yes No
Taxon 2 Yes Yes Yes
Taxon 3 Yes No No
Taxon 4 Yes No No
Taxon 5 Yes No Yes
Taxon 6 No No Yes
?
One cannot add the triangle
to the tree so that no
character changes its state
twice:
If we add it to one of the
left branches it will be
inserted twice; if we add it to the
rightmost branch, the circle would
have to be deleted
(insertion and then
deletion of the circle)
Ordered characters and perfect
phylogeny
• Assume that in the last common
ancestor all characters had state 0.
• This assumption makes sense for many
characters, for example genes.
• Then the compatibility criterion is even
simpler: two characters are compatible if and
only if there do not exist three taxa
containing the combinations (1,0), (0,1), (1,1)
Example
Under the assumption that states are directed
from 0 to 1: if i and j are two different genes
then the set of species containing i is either
disjoint from the set of species containing j or
one of these sets contains the other.
A 0 0 0 1 1 0
B 1 1 0 0 0 0
C 0 0 0 1 1 1
D 1 0 1 0 0 0
E 0 0 0 1 0 0
• The above property is necessary and sufficient for a perfect
phylogeny under the 0-to-1 ordering
• Why it works: associated with each character is a subtree.
These subtrees have to be nested (or disjoint).
Simple test for perfect phylogeny
• Fact: there exists a perfect phylogeny if and only if
all pairs of characters are compatible
• Special case: if we assume directed parsimony (0→1
only) then two characters are compatible if and only if
there do not exist three taxa containing the combinations
(1,0), (0,1), (1,1)
• Observe that the last one is equivalent to the non-overlapping
criterion
• Optimal algorithm: Gusfield O(nm):
n = # taxa; m = # characters
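A minimal sketch of the directed test, written in Python for illustration. It checks every pair of characters, so it runs in O(nm²) time and is not the O(nm) algorithm attributed to Gusfield above; the function names are assumptions.

```python
from itertools import combinations

def directed_compatible(col_i, col_j):
    """Under 0 -> 1 changes only, two characters conflict exactly when
    the taxa exhibit all of (1,0), (0,1) and (1,1)."""
    pairs = set(zip(col_i, col_j))
    return not {(1, 0), (0, 1), (1, 1)} <= pairs

def nested_or_disjoint(col_i, col_j):
    """Equivalent set view: the taxa carrying i and the taxa carrying j
    are disjoint, or one set contains the other."""
    s_i = {t for t, x in enumerate(col_i) if x == 1}
    s_j = {t for t, x in enumerate(col_j) if x == 1}
    return s_i.isdisjoint(s_j) or s_i <= s_j or s_j <= s_i

def has_directed_perfect_phylogeny(matrix):
    columns = list(zip(*matrix))
    return all(directed_compatible(a, b) for a, b in combinations(columns, 2))

# Gene presence/absence matrix from the example (rows A, B, C, D, E).
genomes = [
    [0, 0, 0, 1, 1, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
]
print(has_directed_perfect_phylogeny(genomes))  # True
```

Both helper functions give the same answer on every pair of columns, which is the equivalence with the non-overlapping criterion noted in the slide.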
Two versions of the optimization problem:
Small parsimony: The tree is given and we want to find the
labeling that minimizes #changes – there are good
algorithms to do it.
Large parsimony: Find the tree that minimizes the number of
evolutionary changes. For most models NP-complete.
One approach to large parsimony requires:
- generating all possible trees
- finding an optimal labeling of internal nodes for each tree.
Fact 1: #tree topologies grows exponentially with #nodes
Fact 2: There may be many possible labelings leading to the
same score.
Clique method for large parsimony
• Consider the following
graph:
– nodes – characters;
– edge if two characters
are compatible
characters
   1 2 3 4 5 6
α  1 0 0 1 1 0
β  0 0 1 0 0 0
γ  1 1 0 0 0 0
δ  1 1 0 1 1 1
ε  0 0 1 1 1 0
ω  0 0 0 0 0 0
[Figure: compatibility graph on characters 1-6 with the maximum compatible set highlighted; for example, characters 3 and 5 are INCOMPATIBLE]
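As an illustration, the compatibility graph for the matrix above can be built and searched by brute force. This sketch assumes the undirected four-combination test and simply tries all subsets of characters, which is only feasible for tiny inputs; the function names are made up for this example.

```python
from itertools import combinations

def compatible(col_a, col_b):
    """Four-combination test for two binary characters."""
    return len(set(zip(col_a, col_b))) < 4

def compatibility_edges(matrix):
    """Return the set of compatible character pairs (1-based, i < j)."""
    cols = list(enumerate(zip(*matrix), start=1))
    return {(i, j) for (i, a), (j, b) in combinations(cols, 2) if compatible(a, b)}

def largest_compatible_set(matrix):
    """Exhaustive search for a largest clique in the compatibility graph."""
    m = len(matrix[0])
    edges = compatibility_edges(matrix)
    for size in range(m, 0, -1):
        for subset in combinations(range(1, m + 1), size):
            if all(pair in edges for pair in combinations(subset, 2)):
                return subset
    return ()

# Rows are the taxa alpha, beta, gamma, delta, epsilon, omega.
taxa_by_character = [
    [1, 0, 0, 1, 1, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
print((3, 5) in compatibility_edges(taxa_by_character))  # False
print(largest_compatible_set(taxa_by_character))         # (1, 2, 3, 6)
```

On this matrix the search returns characters 1, 2, 3 and 6; real data sets need a proper clique heuristic because the exhaustive loop is exponential in the number of characters.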
Clique method (Meacham 1981)
• Find a maximum compatible clique (NP-complete problem)
• Each character defines a partition of the
set of taxa into two subsets
[Figure: successive partitions of the taxa {α, β, γ, δ, ε, ω} induced by characters 1, 2 and 3, starting with character 1 splitting {α, γ, δ} from {β, ε, ω}]
Small parsimony
• Assumptions: the tree is known
• Goal: find the optimal labeling of the tree
(optimal = minimizing cost under given
parsimony assumption)
[Figure: “small” parsimony – infer the labels of the internal nodes]
Application of small parsimony
problem
• errors in data
• loss of function
• convergent evolution (a trait
developed independently along two
evolutionary pathways, e.g. wings
in birds and bats)
• lateral gene transfer (transferring
genes across species not by
inheritance)
From paper: Are There Bugs in Our Genome:
Anderson, Doolittle, Nesbo, Science 292 (2001) 1848-51
Red – gene encoding
N-acetylneuraminate
lyase
Dynamic programming algorithm for small
parsimony problem
• Sankoff (1975) came up with the DP approach
(Fitch provided an earlier non-DP algorithm)
• Assumptions
– one character with multiple states
– the cost of a change from state v to w is δ(v,w)
(note that this is a generalization; so far we have assumed that
any change costs 1)
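DP algorithm continued
$s_t(v)$ = minimum parsimony cost of the subtree rooted at node v under the
assumption that the character state at v is t.
If v is a leaf, $s_t(v) = 0$ when t is the observed state of v and
$s_t(v) = \infty$ otherwise. Otherwise let u, w be the children
of v:
$$s_t(v) = \min_i \{ s_i(u) + \delta(i,t) \} + \min_j \{ s_j(w) + \delta(j,t) \}$$
[Figure: node v in state t with children u and w; the edges carry the costs δ(i,t) and δ(j,t)]
Try all possible states in u and w.
$O(nk^2)$ cost, where n = number of nodes and
k = number of states (each of the nk table entries takes a minimum over k child states)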
Exercise
[Figure: tree with nodes numbered 1-7 and an empty table of $s_t(v)$ with one column per node 1-7 and one row per state t]
Left and right characters are independent;
we will compute the left.
Branch lengths
• Numbers that indicate the
number of changes in each
branch
• Problem – there may be many
most parsimonious trees
• Method 1: Average over all
most parsimonious trees.
• Still a problem – the branch
lengths are frequently
underestimated
[Figure: tree with branch lengths 1, 4, 6, 1]
Character patterns and parsimony
• Assume 2-state characters (0/1) and four
taxa A, B, C, D
• The possible topologies are: AB|CD, AC|BD, AD|BC
• Changes in each topology:
A B C D   AB|CD  AC|BD  AD|BC
0 0 0 0     0      0      0
0 0 0 1     1      1      1
0 0 1 0     1      1      1
0 0 1 1     1      2      2    (informative character)
An informative character helps to decide the tree topology.
Informative characters: xxyy, xyxy, xyyx
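The counts in the table can be reproduced with a small Fitch-style calculation on the three four-taxon topologies. This is a sketch for binary or multistate patterns on exactly four taxa; the names `TOPOLOGIES` and `changes_per_topology` are illustrative.

```python
def fitch_changes(left_pair, right_pair):
    """Minimum number of changes on the quartet tree that groups
    left_pair on one side of the internal edge and right_pair on the other."""
    changes = 0
    sets = []
    for a, b in (left_pair, right_pair):
        if a == b:
            sets.append({a})
        else:
            sets.append({a, b})
            changes += 1
    if not (sets[0] & sets[1]):
        changes += 1
    return changes

# Index pairs into the pattern (A, B, C, D) for AB|CD, AC|BD, AD|BC.
TOPOLOGIES = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]

def changes_per_topology(pattern):
    return [
        fitch_changes((pattern[i], pattern[j]), (pattern[k], pattern[l]))
        for (i, j), (k, l) in TOPOLOGIES
    ]

def is_informative(pattern):
    """A character is informative if the topologies differ in cost."""
    return len(set(changes_per_topology(pattern))) > 1

for pattern in ("0000", "0001", "0010", "0011"):
    print(pattern, changes_per_topology(pattern), is_informative(pattern))
```

Only the pattern 0011 (type xxyy) distinguishes the topologies, matching the last row of the table.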
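Inconsistency
• Let p and q be the character change probabilities on the branches of the tree
• Consider the three informative patterns xxyy, xyxy,
xyyx
• The tree selected by parsimony depends on
which pattern has the highest frequency
• If q(1-q) < p^2 then the most frequent pattern is
xyxy, leading to an incorrect tree.
[Figure: four-taxon tree with leaves A, B, C, D; two branches have change probability p and three have change probability q]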
Distance based methods
• When two sequences are similar they are
likely to originate from the same ancestor
• Sequence similarity can approximate
evolutionary distances
GAATC GAGTT
GA(A/G)T(C/T)
Distance Method
• Assume that for any pair of species we
have an estimate of the evolutionary
distance between them
– e.g. alignment score
• Goal: construct a tree which best
approximates these distances
Tree from distance matrix
[Figure: weighted tree with leaves A, B, C, D, E and edge weights 1, 1, 3, 2, 2, 1, 3, 5]
    A  B  C  D  E
A   0  2  7  7 12
B   2  0  7  7 12
C   7  7  0  4 11
D   7  7  4  0 11
E  12 12 11 11  0
Length of the path from A to D = 1+3+1+2 = 7
Consider weighted trees: w(e) = weight of edge e
Recall: In a tree there is a unique path between any two nodes.
Let e1,e2,…ek be the edges of the path connecting u and v then the
distance between u and v in the tree is:
d(u,v) = w(e1) + w(e2) + … + w(ek)
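Can one always represent a
distance matrix as a weighted
tree?
Matrix M:
    a  b  c  d
a   0 10  5 10
b  10  0  9  5
c   5  9  0  8
d  10  5  8  0
[Figure: tree T with leaves a, b, c attached by edges of weight 3, 7 and 2 respectively; leaf d is marked "?"]
There is no way to add d
to the tree and preserve
the distances.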
Quadrangle inequality
• A matrix that satisfies the quadrangle inequality (also called
the four-point condition) for every four taxa is
called additive.
• Theorem: A distance matrix can be represented
exactly as a weighted tree if and only if it is
additive.
• For every four taxa, labeled so that a, b and c, d are the neighboring pairs, the two largest of the three pairwise sums are equal:
d(a,c) + d(b,d) = d(a,d) + d(b,c) >= d(a,b) + d(c,d)
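A direct check of the four-point condition, sketched in Python. It tests every quadruple, so it costs O(n⁴); the function name and the use of exact equality (appropriate for integer distances, an assumption here) are illustrative choices.

```python
from itertools import combinations

def is_additive(d):
    """d is a symmetric matrix (list of lists). The matrix is additive
    if, for every four taxa, the two largest of the three pairwise
    sums are equal."""
    for a, b, c, e in combinations(range(len(d)), 4):
        sums = sorted([d[a][b] + d[c][e], d[a][c] + d[b][e], d[a][e] + d[b][c]])
        if sums[1] != sums[2]:
            return False
    return True

# Matrix M from the slide asking whether d can be added to the tree.
M = [
    [0, 10, 5, 10],
    [10, 0, 9, 5],
    [5, 9, 0, 8],
    [10, 5, 8, 0],
]

# Five-taxon matrix (A, B, C, D, E) read off the weighted tree example.
tree_matrix = [
    [0, 2, 7, 7, 12],
    [2, 0, 7, 7, 12],
    [7, 7, 0, 4, 11],
    [7, 7, 4, 0, 11],
    [12, 12, 11, 11, 0],
]

print(is_additive(M))            # False: the sums are 18, 10 and 19
print(is_additive(tree_matrix))  # True
```

With measured (floating-point) distances one would compare the two largest sums up to a tolerance instead of exactly.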
Constructing the tree representing an additive
matrix (one of several methods)
1. Start from a 2-leaf tree a,b where a,b are
any two elements
2. For i = 3 to n (iteratively add vertices)
1. Take any vertex z not yet in the tree
and consider 2 vertices x,y that are
in the tree and compute
d(z,c) = (d(z,x) + d(z,y) - d(x,y))/2
d(x,c) = (d(x,z) + d(x,y) - d(y,z))/2
2. From step 1 we know the position of c
and the length of the branch (c,z).
If c did not hit exactly a branching
point, add c and z;
else take as y any node from the sub-tree
that branches at c and repeat steps
1, 2.
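The two formulas in step 1 can be sketched as a helper that returns where the new leaf attaches on the path from x to y. The helper names and the nested-dict representation are assumptions; the distances are the ones among u, v, x and y used in the worked example on these slides.

```python
def symmetric(pairs):
    """Build a nested dict d[a][b] from {(a, b): distance} entries."""
    d = {}
    for (a, b), w in pairs.items():
        d.setdefault(a, {})[b] = w
        d.setdefault(b, {})[a] = w
    return d

def attach(d, x, y, z):
    """Return (d(x,c), d(z,c)): the attachment point c of the new leaf z
    lies on the path x..y at distance d(x,c) from x, and the new branch
    (c,z) has length d(z,c)."""
    d_zc = (d[z][x] + d[z][y] - d[x][y]) / 2
    d_xc = (d[x][z] + d[x][y] - d[y][z]) / 2
    return d_xc, d_zc

d = symmetric({
    ("u", "v"): 10,
    ("u", "x"): 5, ("v", "x"): 9,
    ("u", "y"): 9, ("v", "y"): 5,
})

print(attach(d, "u", "v", "x"))  # (3.0, 2.0)
print(attach(d, "u", "v", "y"))  # (7.0, 2.0)
```

A full implementation would also locate c on the current tree and, when c coincides with an existing branching point, recurse into the sub-tree as described in step 2.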
[Figure: the path between x and y in the current tree, with the new leaf z attached at point c]
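Example
    u  v  x  y
u   0 10  5  9
v  10  0  9  5
x   5  9  0  8
y   9  5  8  0
Start with the 2-leaf tree u, v: a single edge of length 10.
Adding x:
d(x,c) = (d(u,x) + d(v,x) – d(u,v))/2 = (5+9-10)/2 = 2
d(u,c) = (d(u,x) + d(u,v) - d(x,v))/2 = (5+10-9)/2 = 3
[Figure: path u –3– c –7– v, with x attached to c by an edge of length 2]
Adding y:
d(y,c’) = (d(u,y) + d(v,y) – d(u,v))/2 = (9+5-10)/2 = 2
d(u,c’) = (d(u,y) + d(u,v) - d(y,v))/2 = (9+10-5)/2 = 7
[Figure: path u –3– c –4– c’ –3– v, with x attached to c and y attached to c’, each by an edge of length 2]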
Real matrices are almost never
additive
• Goal: find a tree that minimizes the error.
Optimizing the error is hard.
• Heuristics:
– Unweighted Pair Group Method with
Arithmetic Mean (UPGMA)
– Neighbor Joining (NJ)
Hierarchical Clustering
• Clustering problem: Group items (e.g. genes) with
similar properties (e.g. expression pattern, sequence
similarity) so that
– The clusters are homogeneous (the items in each cluster are
highly similar, as measured by the given property –
sequence similarity, expression pattern)
– The clusters are well separated (the items in different
clusters are different)
• Hierarchical clustering: Many clusters have natural
sub-clusters which are often easier to identify, e.g. cats
are a sub-cluster of carnivores, which are a sub-cluster of mammals.
Organize the elements into a tree rather than forming
an explicit partitioning.
The basic algorithm
Input: distance array d; cluster-to-cluster distance function
Initialize:
1. Put every element in a one-element cluster
2. Initialize a forest T of one-node trees (each tree
corresponds to one cluster)
while there is more than one cluster
1. Find the two closest clusters C1 and C2 and merge them into
C
2. Compute the distance from C to all other clusters
3. Add a new vertex corresponding to C to the forest T and
make the nodes corresponding to C1, C2 children of this
node.
4. Remove from d the rows and columns corresponding to C1, C2
5. Add to d a row and column corresponding to C
A distance function
• $d_{ave}(C_1, C_2) = \frac{1}{|C_1||C_2|} \sum_{x \in C_1,\, y \in C_2} d(x,y)$
Average over all distances between an element of C1 and an element of C2
Example (on blackboard)
Distance array d:
    A  B  C  D  E
A   0  2  7  7 12
B   2  0  7  7 12
C   7  7  0  4 11
D   7  7  4  0 11
E  12 12 11 11  0
Unweighted Pair Group Method with
Arithmetic Mean (UPGMA)
• Idea:
– Combine hierarchical clustering with a method
to put weights on the edges
– Distance function used:
$d_{ave}(C_1, C_2) = \frac{1}{|C_1||C_2|} \sum_{x \in C_1,\, y \in C_2} d(x,y)$
– We need to come up with a method of
computing branch lengths
Ultrametric trees
• The distance from
any internal node C
to any of its leaves
is constant and
equal to h(C)
• For each node (v)
we keep variable h –
height of the node in
the tree. h(v) = 0 for
all leaves.
UPGMA algorithm
Initialization (as in hierarchical clustering); h(v) = 0
while there is more than one cluster
1. Find the two closest clusters C1 and C2 and
merge them into C
2. Compute d_ave from C to all other clusters
3. Add a new vertex corresponding to C to the
forest T and make the nodes corresponding to C1,
C2 children of this node.
4. Remove from d the rows and columns corresponding to
C1, C2
5. Add to d a row and column corresponding to C
6. h(C) = d(C1, C2) / 2
7. Assign length h(C)-h(C1) to edge (C1,C)
8. Assign length h(C)-h(C2) to edge (C2,C)
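The steps above can be sketched compactly in Python. This is an illustrative version, not a tuned implementation: clusters are named by nested parentheses, distances are kept in a dict keyed by unordered pairs, and the size-weighted update is one standard way to maintain the average distance d_ave.

```python
from itertools import combinations

def upgma(names, matrix):
    """Return (root, height, edges) for the UPGMA tree.

    height[c] is h(c); edges[(parent, child)] is the branch length.
    """
    dist = {
        frozenset((names[i], names[j])): float(matrix[i][j])
        for i, j in combinations(range(len(names)), 2)
    }
    size = {name: 1 for name in names}
    height = {name: 0.0 for name in names}
    edges = {}
    root = names[0]
    while len(size) > 1:
        c1, c2 = min(combinations(size, 2), key=lambda p: dist[frozenset(p)])
        root = f"({c1},{c2})"
        height[root] = dist[frozenset((c1, c2))] / 2
        edges[(root, c1)] = height[root] - height[c1]
        edges[(root, c2)] = height[root] - height[c2]
        for other in size:
            if other not in (c1, c2):
                # Size-weighted mean equals the average over all leaf pairs.
                dist[frozenset((root, other))] = (
                    size[c1] * dist[frozenset((c1, other))]
                    + size[c2] * dist[frozenset((c2, other))]
                ) / (size[c1] + size[c2])
        size[root] = size.pop(c1) + size.pop(c2)
    return root, height, edges

names = ["A", "B", "C", "D", "E"]
matrix = [
    [0, 2, 7, 7, 12],
    [2, 0, 7, 7, 12],
    [7, 7, 0, 4, 11],
    [7, 7, 4, 0, 11],
    [12, 12, 11, 11, 0],
]
root, height, edges = upgma(names, matrix)
print(root)          # (E,((A,B),(C,D)))
print(height[root])  # 5.75
```

Because UPGMA always produces an ultrametric tree, its branch lengths need not match a non-ultrametric tree that generated the distances, even when the topology is recovered.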
Neighbor Joining
• Idea:
– Construct the tree by
iteratively combining first the
nodes that are neighbors
in the tree
• Trick: Figuring out a pair
of neighboring vertices
takes a trick – the closest
pair won't always do:
• B and C are the closest
but are NOT neighbors.
    A  B  C  D
A   0  5  7 10
B   5  0  4  7
C   7  4  0  5
D  10  7  5  0
[Figure: tree with neighbors A, B on one side and C, D on the other; edge lengths A: 4, B: 1, internal edge: 2, C: 1, D: 4]
Finding Neighbors
• Let u(C) = 1/(#clusters-2) Σ d(C,C’), where the sum is over all clusters C’
• Find a pair C1, C2 that minimizes
f(C1,C2) = d(C1,C2) - (u(C1)+u(C2))
• Motivation: keep d(C1,C2) small while keeping
(u(C1)+u(C2)) large
Finding Neighbors
• Let u(C) = 1/(#clusters-2) Σ d(C,C’), where the sum is over all clusters C’
• Find a pair C1, C2 that minimizes
f(C1,C2) = d(C1,C2) - (u(C1)+u(C2))
• For the data from the example:
u(CA) = u(CD) = 1/2(5+7+10) = 11
u(CB) = u(CC) = 1/2(5+4+7) = 8
f(CA,CB) = 5 - 11 - 8 = -14
f(CB,CC) = 4 - 8 - 8 = -12
NJ algorithm
Initialization (as in hierarchical clustering); h(v) = 0
while there is more than one cluster
1. Find clusters C1 and C2 minimizing f(C1,C2) and
merge them into C
2. Compute for all C*: d(C,C*) = (d(C1,C*) + d(C2,C*) - d(C1,C2))/2
3. Add a new vertex corresponding to C to the forest T
and connect it to C1, C2
4. Remove from d the rows and columns corresponding to C1, C2
5. Add to d a row and column corresponding to C
6. Assign length ½(d(C1,C2)+u(C1)-u(C2)) to edge C1C
7. Assign length ½(d(C1,C2)+u(C2)-u(C1)) to edge C2C
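A sketch of one neighbor-selection step (computing u, f, the chosen pair and the two new branch lengths) on the four-taxon example. The function name `nj_step` is hypothetical, and ties are broken by taking the first minimizing pair.

```python
from itertools import combinations

def nj_step(names, matrix):
    """Return (u, f, best_pair, branch_lengths) for one NJ iteration."""
    n = len(names)
    u = {names[i]: sum(matrix[i]) / (n - 2) for i in range(n)}
    f = {
        (names[i], names[j]): matrix[i][j] - (u[names[i]] + u[names[j]])
        for i, j in combinations(range(n), 2)
    }
    c1, c2 = min(f, key=f.get)
    d12 = matrix[names.index(c1)][names.index(c2)]
    lengths = (0.5 * (d12 + u[c1] - u[c2]), 0.5 * (d12 + u[c2] - u[c1]))
    return u, f, (c1, c2), lengths

names = ["A", "B", "C", "D"]
matrix = [
    [0, 5, 7, 10],
    [5, 0, 4, 7],
    [7, 4, 0, 5],
    [10, 7, 5, 0],
]
u, f, pair, lengths = nj_step(names, matrix)
print(u)        # {'A': 11.0, 'B': 8.0, 'C': 8.0, 'D': 11.0}
print(pair)     # ('A', 'B')
print(lengths)  # (4.0, 1.0)
```

Although B and C are the closest pair (distance 4), f selects A and B (f = -14 versus -12), and the branch lengths 4 and 1 agree with the tree that generated the matrix.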
NJ tree is not rooted
The order of construction of the internal nodes of NJ does not suggest an
ancestral relation:
[Figure: unrooted tree with internal nodes numbered 1-5 in order of construction]
Rooting a tree
• Choose one distant organism as an out-group
[Figure: the root is placed on the branch joining the out-group to the species of interest]
Bootstrapping
• Estimating confidence in the tree topology
• Are we sure that this is
correct?
• Is there enough
evidence that A is a
successor of B and not the
other way around?
[Figure: tree with nodes A and B highlighted]
Bootstrapping, continued
• Assume that the tree is built from a multiple
sequence alignment
[Figure: initial tree containing the edge A–B, built from alignment columns 1 2 3 4 5 6 7 8 9 10 11]
Select columns randomly
(with replacement) and build a new tree from them.
Repeat, say, 1000 times.
For each edge of the initial tree calculate the % of times
it is present in the new tree (59% for the edge in the figure).
Summary
• Assume you have a multiple alignment of length N.
Let T be the NJ tree built from this alignment
• Repeat, say 1000 times, the following process:
– Select randomly with replacement N columns of the
alignment to produce a randomized alignment
– Build the tree for this randomized alignment
• For each edge of T report the % of times it was present
in a tree built from a randomized alignment. This
is called the bootstrap value.
• Trusted edges: 80% or better bootstrap.
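The resampling loop can be sketched as follows. The tree builder is passed in as a callback `build_edges` (a hypothetical interface returning the set of edges of the tree for an alignment); the `closest_pair` function is only a toy stand-in for NJ so that the example runs on its own.

```python
import random
from itertools import combinations

def resample_columns(alignment, rng):
    """Draw N column indices with replacement, N = alignment length."""
    n = len(alignment[0])
    picked = [rng.randrange(n) for _ in range(n)]
    return ["".join(seq[c] for c in picked) for seq in alignment]

def bootstrap_values(alignment, build_edges, replicates=1000, seed=0):
    """Percentage of replicates in which each edge of the reference
    tree (built from the original alignment) reappears."""
    rng = random.Random(seed)
    reference = build_edges(alignment)
    hits = {edge: 0 for edge in reference}
    for _ in range(replicates):
        edges = build_edges(resample_columns(alignment, rng))
        for edge in reference:
            hits[edge] += edge in edges
    return {edge: 100.0 * h / replicates for edge, h in hits.items()}

def closest_pair(alignment):
    """Toy 'tree builder': reports only the pair of sequences with the
    smallest Hamming distance as a single edge (NOT neighbor joining)."""
    def hamming(i, j):
        return sum(a != b for a, b in zip(alignment[i], alignment[j]))
    best = min(combinations(range(len(alignment)), 2), key=lambda p: hamming(*p))
    return {frozenset(best)}

toy = ["AAAAAA", "AAAAAA", "CCCCCC", "GGGGGG"]
print(bootstrap_values(toy, closest_pair, replicates=100))
# {frozenset({0, 1}): 100.0}
```

In practice `build_edges` would wrap a real tree-building method and return each edge as the split of taxa it induces, so that edges can be compared between trees.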
Maximum Likelihood Method
• Given a multiple sequence alignment
and a probabilistic model for substitutions
(like the PAM model), find the tree which has
the highest probability of generating the
data.
• Simplifying assumptions:
– Positions evolve independently
– After species diverge they evolve
independently.
Formally:
• Find the tree T such that, assuming evolution
model M,
Pr[Data | T, M] is maximized
• From the independence of symbols:
$\Pr[Data \mid T, M] = \prod_i \Pr[D_i \mid T, M]$
where the product is taken over all characters i
and $D_i$ is the value of character i over all taxa
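Computing Pr[Di| T,M]
$$\Pr[D_i \mid T, M] = \sum_x \sum_y \sum_z p(x)\, p(x,y,t_1)\, p(y,A,t_2)\, p(y,B,t_6)\, p(x,z,t_3)\, p(z,C,t_4)\, p(z,D,t_5)$$
[Figure: tree for column Di with leaves A, B, C, D. The root x has children y (branch t1) and z (branch t3); y has children A (t2) and B (t6); z has children C (t4) and D (t5). All possible assignments of x, y, z are considered; the vertical axis is time.]
p(x,y,t) = prob. of
mutation x to y in time t
(from the model)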
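A brute-force evaluation of this sum for a single column, as a sketch. The Jukes-Cantor substitution probabilities and the uniform prior p(x) = 1/4 are assumptions made here so that the example is runnable; the slides leave the model M unspecified.

```python
import math
from itertools import product

BASES = "ACGT"

def jukes_cantor(x, y, t):
    """Assumed model: probability that x becomes y after time t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def column_likelihood(column, times, p=jukes_cantor, prior=0.25):
    """Sum over all states x (root), y, z (internal nodes) for the tree
    ((A,B),(C,D)); times = (t1, t2, t3, t4, t5, t6) as in the formula."""
    a, b, c, d = column
    t1, t2, t3, t4, t5, t6 = times
    total = 0.0
    for x, y, z in product(BASES, repeat=3):
        total += (
            prior
            * p(x, y, t1) * p(y, a, t2) * p(y, b, t6)
            * p(x, z, t3) * p(z, c, t4) * p(z, d, t5)
        )
    return total

times = (0.1, 0.2, 0.1, 0.2, 0.2, 0.2)
print(column_likelihood("AAAA", times) > column_likelihood("ACGT", times))  # True
```

The triple loop costs k³ terms per column for this tree; for larger trees the same sum is normally organized bottom-up so that the cost stays linear in the number of nodes.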
Discovering the tree of life
• “Tree of life” – evolutionary tree of all
organisms
• Construction: choose a gene universally present
in all organisms; good examples: small rRNA
subunit, mitochondrial sequences.
• Things to keep in mind while constructing tree of
life from sequence distances:
– Lateral (or horizontal) gene transfer
– Gene duplication: a genome may contain similar genes
that may evolve along different pathways. The phylogenetic
tree needs to be derived from orthologous genes.
Where we go with it…
• We know how to compute Pr[Di| T,M] for a given column
and a given tree
• Multiply over all columns to get
Pr[Data| T,M]
Now, explore the space of possible trees
Problem:
• Bad news: the space of all possible trees is HUGE
Various heuristic approaches are used.
Metropolis algorithm: Random Walk
in Energy Space
Goal: design transition probabilities so that the
probability of arriving at state j is
P(j) = q(j) / Z
(typically, $q(S) = e^{-E(S)/kT}$, where E is the energy, T the temperature and k a constant)
Z – partition function = sum over all states S of
the terms q(S). Z cannot be computed analytically since the
space is too large.
[Figure: the network of possible trees. State i (in our case a tree) and state j (another tree) are joined by an edge when the trees are “similar”, i.e. tree i can be obtained from tree j by a “local” change.]
Monte Carlo, Metropolis
Algorithm
• At each state i choose uniformly at random
one of the neighboring conformations j.
• Compute p(i,j) = min(1, q(j)/q(i))
• With probability p(i,j) move to state j; otherwise stay at i.
• Iterate
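A minimal sketch of the walk on a toy state space. It accepts a proposed move from i to j with probability min(1, q(j)/q(i)) and assumes every state has the same number of neighbors, so that the uniform proposal is symmetric; the three-state cycle and its weights are invented for illustration.

```python
import random

def metropolis(start, neighbors, q, steps, seed=0):
    """Random walk whose visit frequencies approximate q(s)/Z.

    neighbors(s) returns the list of states adjacent to s; q(s) is the
    unnormalized weight of state s.
    """
    rng = random.Random(seed)
    state = start
    visits = {}
    for _ in range(steps):
        candidate = rng.choice(neighbors(state))
        if rng.random() < min(1.0, q(candidate) / q(state)):
            state = candidate
        visits[state] = visits.get(state, 0) + 1
    return visits

# Toy example: three states on a cycle with weights 1, 2, 1.
weights = {0: 1.0, 1: 2.0, 2: 1.0}
visits = metropolis(
    start=0,
    neighbors=lambda s: [(s - 1) % 3, (s + 1) % 3],
    q=weights.get,
    steps=20000,
)
frequencies = {s: count / 20000 for s, count in visits.items()}
print(frequencies)  # close to {0: 0.25, 1: 0.5, 2: 0.25}
```

Only ratios of q are needed, which is why the partition function Z never has to be computed.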
MrBayes
• Program developed by J.Huelsenbeck &
F.Ronquist.
• Assumption:
• q(i) = Pr[D| Ti,M]
Prior probabilities: All trees are equally
likely.
• Proportion of time a given tree is visited
approximates posterior probabilities.
Most Popular Phylogeny Software
• PAUP
• PHYLIP

  • 1. Lecture 11 Phylogenetic trees Principles of Computational Biology Teresa Przytycka, PhD
  • 2. Phylogenetic (evolutionary) Tree • showing the evolutionary relationships among various biological species or other entities that are believed to have a common ancestor. • Each node is called a taxonomic unit. • Internal nodes are generally called hypothetical taxonomic units • In a phylogenetic tree, each node with descendants represents the most recent common ancestor of the descendants, and the • edge lengths (if present) correspond to time estimates.
  • 3. Methods to construct phylogentic trees • Parsimony • Distance matrix based • Maximum likelihood
  • 4. Parsimony methods The preferred evolutionary tree is the one that requires “the minimum net amount of evolution” [Edwards and Cavalli-Sforza, 1963]
  • 5. Assumption of character based parsimony • Each taxa is described by a set of characters • Each character can be in one of finite number of states • In one step certain changes are allowed in character states • Goal: find evolutionary tree that explains the states of the taxa with minimal number of changes
  • 6. Example Taxon1 Yes Yes No Taxon 2 YES Yes Yes Taxon 3 Yes No No Taxon 4 Yes No No Taxon 5 Yes No Yes Taxon 6 No No Yes
  • 8. Version parsimony models: • Character states – Binary: states are 0 and 1 usually interpreted as presence or absence of an attribute (eg. character is a gene and can be present or absent in a genome) – Multistate: Any number of states (Eg. Characters are position in a multiple sequence alignment and states are A,C,T,G. • Type of changes: – Characters are ordered (the changes have to happen in particular order or not. – The changes are reversible or not.
  • 9. Variants of parsimony • Fitch Parsimony unordered, multistate characters with reversibility • Wagner Parsimony ordered, multistate characters with reversibility • Dollo Parsimony ordered, binary characters with reversibility but only one insertion allowed per character characters that are relatively chard to gain but easy to lose (like introns) • Camin-Sokal Parsimony- no reversals, derived states arise once only • (binary) prefect phylogeny – binary and non- reversible; each character changes at most once.
  • 10. Prefect – No (triangle gained and the lost) Dollo – Yes Camin-Sokal – No (for the same reason as perfect)
  • 11. 3 Changes Camin-Sokal Parsimony Triangle inserted twice! But this is not prefect and not Dollo
  • 12. Homoplasy Having some states arise more than once is called homoplasy. Example – triangle in the tree on the previous slide
  • 13. Finding most parsimonious tree • There are exponentially many trees with n nodes • Finding most parsimonious tree is NP- complete (for most variants of parsimony models) • Exception: Perfect phylogeny if exists can be found quickly. Problem – perfect phylogeny is to restrictive in practice.
  • 14. Perfect phylogeny • Each change can happen only once and is not reversible. • Can be directed or not Example: Consider binary characters. Each character corresponds to a gene. 0-gene absent 1-gene present It make sense to assume directed changes only form 0 to 1. The root has to be all zeros
  • 15. Perfect phylogeny Example: characters = genes; 0 = absent ; 1 = present Taxa: genomes (A,B,C,D,E) A 0 0 0 1 1 0 B 1 1 0 0 0 0 C 0 0 0 1 1 1 D 1 0 1 0 0 0 E 0 0 0 1 0 0 genes B D E A C Perfect phylogeny tree Goal: For a given character state matrix construct a tree topology that provides perfect phylogeny. 1 1 1 1 1 1
  • 16. Does there exist prefect parsimony tree for our example with geometrical shapes? There is a simple test
  • 17. Character Compatibility • Two characters A, B are compatible if there do not exits four taxa containing all four combinations as in the table • Fact: there exits perfect phylogeny if and only if and only if all pairs of characters are compatible T1 1 1 T2 1 0 T3 0 1 T4 0 0 A B
  • 18. Are not compatible Taxon1 Yes Yes No Taxon 2 YES Yes Yes Taxon 3 Yes No No Taxon 4 Yes No No Taxon 5 Yes No Yes Taxon 6 No No Yes
  • 19. ? One cannot add triangle to the tree so that no character changes it state twice: If we add it to on of the left branches it will be inserted twice if to the right most – circle would have to be deleted (insertion and the deletion of the circle)
  • 20. Ordered characters and perfect phylogeny • Assume that we in the last common ancestor all characters had state 0. • This assumption makes sense for many characters, for example genes. • Then compatibility criterion is even simpler: characters are compatible if and only if there do not exist three taxa containing combinations (1,0),(0,1),(1,1)
  • 21. Example Under assumption that states are directed form 0 to 1: if i and j are two different genes then the set of species containing i is either disjoint with set if species containing j or one of this sets contains the other. A 0 0 0 1 1 0 B 1 1 0 0 0 0 C 0 0 0 1 1 1 D 1 0 1 0 0 0 E 0 0 0 1 0 0 • The above property is necessary and sufficient for prefect phylogeny under 0 to 1 ordering • Why works: associated with each character is a subtree. These subtrees have to be nested.
  • 22. Simple test for prefect phylogeny • Fact: there exits perfect phylogeny if and only if and only if all pairs of characters are compatible • Special case: if we assume directed parsimony (0!1 only) then characters are compatible if and only if there do not exist tree taxa containing combinations (1,0),(0,1),(1,1) • Observe the last one is equivalent to non-overlapping criterion • Optimal algorithm: Gusfield O(nm): n = # taxa; m= #characters
  • 23. Two version optimization problem: Small parsimony: Tree is given and we want to find the labeling that minimizes #changes – there are good algorithms to do it. Large parsimony: Find the tree that minimize number of evolutionary changes. For most models NP complete One approach to large parsimony requires: - generating all possible trees - finding optimal labeling of internal nodes for each tree. Fact 1: #tree topologies grows exponentially with #nodes Fact 2: There may be many possible labels leading to the same score.
  • 24. Clique method for large parsimony • Consider the following graph: – nodes – characters; – edge if two characters are compatible 1 2 3 4 5 6 α 1 0 0 1 1 0 β 0 0 1 0 0 0 γ 1 1 0 0 0 0 δ 1 1 0 1 1 1 ε 0 0 1 1 1 0 ω 0 0 0 0 0 0 characters 2 1 3 4 5 6 3,5 INCOMPATIBLE Max. compatible set
  • 25. Clique method (Meacham 1981) - • Find maximal compatible clique (NP- complete problem) • Each characters defines a partition of the set into two subsets α,β,γ, δ,ε,ω α,γ, δ β ε,ω 1 2 β ε,ω γ, δ α, ε,ω γ, δ α, 3 β
  • 26. Small parsimony • Assumptions: the tree is known • Goal: find the optimal labeling of the tree (optimal = minimizing cost under given parsimony assumption)
  • 28. Application of small parsimony problem • errors in data • loss of function • convergent evolution (a trait developed independently by two evolutionary pathways, e.g. wings in birds and bats) • lateral gene transfer (transferring genes across species, not by inheritance) • From paper: Are There Bugs in Our Genome: Anderson, Doolittle, Nesbo, Science 292 (2001) 1848-51 • [Figure legend: red marks the gene encoding N-acetylneuraminate lyase.]
  • 29. Dynamic programming algorithm for small parsimony problem • Sankoff (1975) introduced the DP approach (Fitch provided an earlier non-DP algorithm). • Assumptions: – one character with multiple states – the cost of a change from state v to w is δ(v,w) (note that this is a generalization; so far we have assumed that any change costs 1).
  • 30. DP algorithm continued • s_t(v) = minimum parsimony cost of the subtree rooted at node v under the assumption that the character state at v is t. • If v is a leaf: s_t(v) = 0 if t is the observed state of v, and ∞ otherwise. • Otherwise let u, w be the children of v: s_t(v) = min_i { s_i(u) + δ(i,t) } + min_j { s_j(w) + δ(j,t) } • Try all possible states in u and w. • Cost: O(nk²), where n = number of nodes and k = number of states.
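The recurrence can be written down almost literally. The sketch below assumes a rooted binary tree stored as a dictionary from internal node to its two children, leaves scored 0 for their observed state and infinity otherwise, and a unit cost δ; the tree, leaf states and names are made up for illustration.

```python
import math

def sankoff(tree, leaf_state, states, cost, node):
    """Return {t: s_t(node)} for the subtree rooted at node.
    tree maps each internal node to its (left, right) children."""
    if node not in tree:  # leaf: only the observed state is free
        return {t: 0 if t == leaf_state[node] else math.inf for t in states}
    left, right = tree[node]
    s_u = sankoff(tree, leaf_state, states, cost, left)
    s_w = sankoff(tree, leaf_state, states, cost, right)
    return {
        t: min(s_u[i] + cost(i, t) for i in states)
           + min(s_w[j] + cost(j, t) for j in states)
        for t in states
    }

def unit_cost(a, b):
    return 0 if a == b else 1

# Hypothetical four-leaf tree: ((L1, L2), (L3, L4))
tree = {"root": ("u", "w"), "u": ("L1", "L2"), "w": ("L3", "L4")}
leaf_state = {"L1": "A", "L2": "C", "L3": "A", "L4": "G"}

scores = sankoff(tree, leaf_state, "ACGT", unit_cost, "root")
print(scores, min(scores.values()))  # best root state is A with 2 changes
```

The parsimony score of the tree is the minimum of s_t(root) over all states t; an optimal labeling can be recovered by remembering which i and j achieved each minimum.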
  • 31. Exercise • [Figure: tree with nodes 1 to 7 and an empty table of s_t(v) with one column per node 1 to 7 and one row per state t.] • Left and right characters are independent; we will compute the left.
  • 32. Branch lengths • Numbers that indicate the number of changes in each branch. • Problem: there may be many most parsimonious trees. • Method 1: average over all most parsimonious trees. • Still a problem: the branch lengths are frequently underestimated. • [Figure: tree with branch lengths 1, 4, 6, 1.]
  • 33. Character patterns and parsimony • Assume 2-state characters (0/1) and four taxa A, B, C, D. • [Figure: the three possible unrooted topologies on A, B, C, D.] • Pattern (A B C D) and number of changes in each of the three topologies: 0 0 0 0: 0, 0, 0; 0 0 0 1: 1, 1, 1; 0 0 1 0: 1, 1, 1; 0 0 1 1: 1, 2, 2 • An informative character is one that helps to decide the tree topology. • Informative characters: xxyy, xyxy, xyyx.
  • 34. Inconsistency • Let p, q be character change probabilities along the branches. • Consider the three informative patterns xxyy, xyxy, xyyx. • The tree selected by parsimony depends on which pattern has the highest frequency. • If q(1-q) < p², then the most frequent pattern is xyxy, leading to an incorrect tree. • [Figure: four-taxon tree on A, B, C, D with two branches of change probability p and three of change probability q.]
  • 35. Distance based methods • When two sequences are similar, they are likely to originate from the same ancestor. • Sequence similarity can approximate evolutionary distances. • Example: GAATC and GAGTT, with ancestor GA(A/G)T(C/T).
  • 36. Distance Method • Assume that for any pair of species we have an estimate of the evolutionary distance between them, e.g. an alignment score. • Goal: construct a tree which best approximates these distances.
  • 37. Tree from distance matrix • [Figure: weighted tree T with leaves A, B, C, D, E and edge weights 1, 1, 3, 2, 2, 1, 3, 5.] • Distance matrix M (rows and columns A, B, C, D, E): A: 0 2 7 7 12; B: 2 0 7 7 12; C: 7 7 0 4 11; D: 7 7 4 0 11; E: 12 12 11 11 0 • Length of the path from A to D = 1+3+1+2 = 7. • Consider weighted trees: w(e) = weight of edge e. • Recall: in a tree there is a unique path between any two nodes. • Let e1, e2, …, ek be the edges of the path connecting u and v; then the distance between u and v in the tree is d(u,v) = w(e1) + w(e2) + … + w(ek).
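  • 38. Can one always represent a distance matrix as a weighted tree? • Distance matrix (rows and columns a, b, c, d): a: 0 10 5 10; b: 10 0 9 5; c: 5 9 0 8; d: 10 5 8 0 • [Figure: tree on a, b, c with edge weights 3 (to a), 2 (to c), 7 (to b), and d still to be placed.] • There is no way to add d to the tree and preserve the distances.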
  • 39. Quadrangle inequality • A matrix is called additive if it satisfies the quadrangle inequality (also called the four point condition) for every four taxa: the taxa can be labeled a, b, c, d so that d(a,c) + d(b,d) = d(a,d) + d(b,c) ≥ d(a,b) + d(d,c). • Theorem: a distance matrix can be represented precisely as a weighted tree if and only if it is additive. • [Figure: tree with a, b on one side of the internal edge and c, d on the other.]
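The four point condition is easy to check directly: among the three pairwise sums for any four taxa, the two largest must be equal. The sketch below is a straightforward O(n⁴) check with a hypothetical function name; it is applied to the four-taxon matrix from the "Can one always represent a distance matrix as a weighted tree?" slide and to the five-taxon matrix M from the "Tree from distance matrix" slide.

```python
from itertools import combinations

def is_additive(d, tol=1e-9):
    """Four point condition: for every quadruple, the two largest of the
    three pairwise sums must be equal (up to tol)."""
    for a, b, c, e in combinations(range(len(d)), 4):
        sums = sorted([d[a][b] + d[c][e],
                       d[a][c] + d[b][e],
                       d[a][e] + d[b][c]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

not_a_tree = [[0, 10, 5, 10],
              [10, 0, 9, 5],
              [5, 9, 0, 8],
              [10, 5, 8, 0]]

M = [[0, 2, 7, 7, 12],
     [2, 0, 7, 7, 12],
     [7, 7, 0, 4, 11],
     [7, 7, 4, 0, 11],
     [12, 12, 11, 11, 0]]

print(is_additive(not_a_tree))  # False: the sums are 18, 10, 19
print(is_additive(M))           # True
```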
  • 40. Constructing the tree representing an additive matrix (one of several methods) 1. Start from a 2-leaf tree a, b, where a, b are any two elements. 2. For i = 3 to n (iteratively add vertices): 2.1. Take any vertex z not yet in the tree, consider two vertices x, y that are in the tree, and compute d(z,c) = (d(z,x) + d(z,y) - d(x,y))/2 and d(x,c) = (d(x,z) + d(x,y) - d(y,z))/2. 2.2. From step 2.1 we know the position of c and the length of the branch (c,z). If c did not hit exactly a branching point, add c and z; otherwise take as y any node from the sub-tree that branches at c and repeat steps 2.1 and 2.2. • [Figure: z attached at point c on the path between x and y.]
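  • 41. Example • Distance matrix (rows and columns u, v, x, y): u: 0 10 5 9; v: 10 0 9 5; x: 5 9 0 8; y: 9 5 8 0 • Start with the tree on u, v with edge length 10. • Adding x: d(x,c) = (d(u,x) + d(v,x) - d(u,v))/2 = (5+9-10)/2 = 2; d(u,c) = (d(u,x) + d(u,v) - d(x,v))/2 = (5+10-9)/2 = 3. Resulting edges: (u,c) = 3, (c,x) = 2, (c,v) = 7. • Adding y: d(y,c') = (d(u,y) + d(v,y) - d(u,v))/2 = (9+5-10)/2 = 2; d(u,c') = (d(u,y) + d(u,v) - d(y,v))/2 = (9+10-5)/2 = 7. Resulting edges: (u,c) = 3, (c,x) = 2, (c,c') = 4, (c',y) = 2, (c',v) = 3.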
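The two formulas used when adding a vertex can be checked numerically. The sketch below only implements the attachment step (not the full construction) and uses the five distances that the worked example actually needs; the function name `attach` and the dictionary layout are illustrative choices.

```python
def attach(dist, x, y, z):
    """For a new leaf z and tree vertices x, y, return (d(z,c), d(x,c)),
    where c is the point on the path x..y at which z attaches."""
    def d(a, b):
        return dist[frozenset((a, b))]
    d_zc = (d(z, x) + d(z, y) - d(x, y)) / 2
    d_xc = (d(x, z) + d(x, y) - d(y, z)) / 2
    return d_zc, d_xc

dist = {frozenset(pair): w for pair, w in [
    (("u", "v"), 10), (("u", "x"), 5), (("v", "x"), 9),
    (("u", "y"), 9), (("v", "y"), 5),
]}

print(attach(dist, "u", "v", "x"))  # (2.0, 3.0): x hangs 2 below c, c is 3 from u
print(attach(dist, "u", "v", "y"))  # (2.0, 7.0): c' is 7 from u, i.e. 4 past c
```

Because 7 > 3, the second attachment point c' lies beyond c on the path from u to v, which gives the internal edge (c,c') of length 4.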
  • 42. Real matrices are almost never additive • Goal: find a tree that minimizes the error. • Optimizing the error is hard. • Heuristics: – Unweighted Pair Group Method with Arithmetic Mean (UPGMA) – Neighbor Joining (NJ)
  • 43. Hierarchical Clustering • Clustering problem: group items (e.g. genes) with similar properties (e.g. expression pattern, sequence similarity) so that – the clusters are homogeneous (the items in each cluster are highly similar, as measured by the given property: sequence similarity, expression pattern) – the clusters are well separated (the items in different clusters are different). • Hierarchical clustering: many clusters have natural sub-clusters which are often easier to identify, e.g. cats are a sub-cluster of carnivores, which are a sub-cluster of mammals. • Organize the elements into a tree rather than forming an explicit partitioning.
  • 44. The basic algorithm • Input: distance array d; cluster-to-cluster distance function. • Initialize: 1. Put every element in a one-element cluster. 2. Initialize a forest T of one-node trees (each tree corresponds to one cluster). • While there is more than one cluster: 1. Find the two closest clusters C1 and C2 and merge them into C. 2. Compute the distance from C to all other clusters. 3. Add a new vertex corresponding to C to the forest T and make the nodes corresponding to C1, C2 children of this node. 4. Remove from d the columns corresponding to C1, C2. 5. Add to d a column corresponding to C.
  • 45. A distance function • d_ave(C1,C2) = 1/(|C1||C2|) Σ d(x,y), where the sum is over all x in C1 and y in C2. • This is the average over all distances between the two clusters.
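As a small illustration, the average-linkage distance can be computed directly from its definition. The sketch assumes clusters are given as lists of row indices into the distance matrix; the matrix is M from the "Tree from distance matrix" slide and the function name is hypothetical.

```python
def d_ave(d, c1, c2):
    """Average of d[x][y] over all x in cluster c1 and y in cluster c2."""
    total = sum(d[x][y] for x in c1 for y in c2)
    return total / (len(c1) * len(c2))

M = [[0, 2, 7, 7, 12],
     [2, 0, 7, 7, 12],
     [7, 7, 0, 4, 11],
     [7, 7, 4, 0, 11],
     [12, 12, 11, 11, 0]]
A, B, C, D, E = range(5)

print(d_ave(M, [A, B], [C, D]))     # 7.0
print(d_ave(M, [A, B, C, D], [E]))  # 11.5
```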
  • 46. Example (on blackboard) • Distance matrix d (rows and columns A, B, C, D, E): A: 0 2 7 7 12; B: 2 0 7 7 12; C: 7 7 0 4 11; D: 7 7 4 0 11; E: 12 12 11 11 0
  • 47. Unweighted Pair Group Method with Arithmetic Mean (UPGMA) • Idea: – Combine hierarchical clustering with a method to put weights on the edges. – Distance function used: d_ave(C1,C2) = 1/(|C1||C2|) Σ d(x,y), summing over x in C1 and y in C2. – We need to come up with a method of computing branch lengths.
  • 48. Ultrametric trees • The distance from any internal node C to any of its leaves is constant and equal to h(C) • For each node (v) we keep variable h – height of the node in the tree. h(v) = 0 for all leaves.
  • 49. UPGMA algorithm • Initialization (as in hierarchical clustering); h(v) = 0 for every leaf v. • While there is more than one cluster: 1. Find the two closest clusters C1 and C2 and merge them into C. 2. Compute d_ave from C to all other clusters. 3. Add a new vertex corresponding to C to the forest T and make the nodes corresponding to C1, C2 children of this node. 4. Remove from d the columns corresponding to C1, C2. 5. Add to d a column corresponding to C. 6. Set h(C) = d(C1,C2)/2. 7. Assign length h(C)-h(C1) to edge (C1,C). 8. Assign length h(C)-h(C2) to edge (C2,C).
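The steps above can be sketched as follows. This is a minimal illustration under a few assumptions: clusters are represented as nested tuples of taxon names, the distance table is a dictionary keyed by unordered pairs, and d_ave to a merged cluster is updated as a size-weighted mean of the two old distances (which equals the average over all leaf pairs). The input is the five-taxon matrix from the blackboard example.

```python
from itertools import combinations

def upgma(names, d):
    """Return (root, height, edges); edges are (child, parent, length)."""
    size = {n: 1 for n in names}
    height = {n: 0.0 for n in names}
    dist = {frozenset((names[i], names[j])): d[i][j]
            for i, j in combinations(range(len(names)), 2)}
    edges = []
    new = names[0]
    while len(size) > 1:
        # Step 1: closest pair of clusters (first pair wins on ties)
        c1, c2 = min(combinations(size, 2), key=lambda p: dist[frozenset(p)])
        new = (c1, c2)
        # Steps 6-8: height of the new node and the two branch lengths
        height[new] = dist[frozenset((c1, c2))] / 2
        edges.append((c1, new, height[new] - height[c1]))
        edges.append((c2, new, height[new] - height[c2]))
        # Step 2: size-weighted update keeps dist equal to d_ave
        for other in size:
            if other != c1 and other != c2:
                dist[frozenset((new, other))] = (
                    size[c1] * dist[frozenset((c1, other))]
                    + size[c2] * dist[frozenset((c2, other))]
                ) / (size[c1] + size[c2])
        size[new] = size.pop(c1) + size.pop(c2)
    return new, height, edges

M = [[0, 2, 7, 7, 12],
     [2, 0, 7, 7, 12],
     [7, 7, 0, 4, 11],
     [7, 7, 4, 0, 11],
     [12, 12, 11, 11, 0]]
root, height, edges = upgma(list("ABCDE"), M)
print(root)          # ('E', (('A', 'B'), ('C', 'D')))
print(height[root])  # 5.75
```

On this input A and B merge first (height 1), then C and D (height 2), then the two pairs (height 3.5), and finally E joins at height 5.75.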
  • 50. Neighbor Joining • Idea: – Construct the tree by iteratively combining first the nodes that are neighbors in the tree. • Trick: figuring out a pair of neighboring vertices takes a trick; the closest pair won't always do. • Example distance matrix (rows and columns A, B, C, D): A: 0 5 7 10; B: 5 0 4 7; C: 7 4 0 5; D: 10 7 5 0 • B and C are the closest but are NOT neighbors. • [Figure: the corresponding tree on A, B, C, D with its edge lengths.]
  • 51. Finding Neighbors • Let u(C) = 1/(#clusters-2) Σ d(C,C'), where the sum is over all clusters C'. • Find a pair C1, C2 that minimizes f(C1,C2) = d(C1,C2) - (u(C1)+u(C2)). • Motivation: keep d(C1,C2) small while (u(C1)+u(C2)) is large.
  • 52. Finding Neighbors • Let u(C) = 1/(#clusters-2) Σ d(C,C'), where the sum is over all clusters C'. • Find a pair C1, C2 that minimizes f(C1,C2) = d(C1,C2) - (u(C1)+u(C2)). • For the data from the example: u(CA) = u(CD) = 1/2(5+7+10) = 11; u(CB) = u(CC) = 1/2(5+4+7) = 8; f(CA,CB) = 5-11-8 = -14; f(CB,CC) = 4-8-8 = -12.
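The numbers in this example can be reproduced with a few lines. The sketch assumes the four-taxon matrix from the Neighbor Joining slide (rows A, B, C, D); the function name is hypothetical.

```python
from itertools import combinations

names = ["A", "B", "C", "D"]
d = [[0, 5, 7, 10],
     [5, 0, 4, 7],
     [7, 4, 0, 5],
     [10, 7, 5, 0]]

def nj_scores(d):
    """Return (u, f): u[i] = row sum / (n - 2), f[i, j] = d[i][j] - u[i] - u[j]."""
    n = len(d)
    u = [sum(row) / (n - 2) for row in d]
    f = {(i, j): d[i][j] - (u[i] + u[j]) for i, j in combinations(range(n), 2)}
    return u, f

u, f = nj_scores(d)
print(u)                     # [11.0, 8.0, 8.0, 11.0]
print(f[(0, 1)], f[(1, 2)])  # -14.0 -12.0
```

The pairs (A,B) and (C,D) both reach the minimum of -14, while the closest pair (B,C) only scores -12, so the criterion picks true neighbors.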
  • 53. NJ algorithm • Initialization (as in hierarchical clustering); h(v) = 0. • While there is more than one cluster: 1. Find clusters C1 and C2 minimizing f(C1,C2) and merge them into C. 2. Compute for all C*: d(C,C*) = (d(C1,C*) + d(C2,C*) - d(C1,C2))/2. 3. Add a new vertex corresponding to C to the forest T and connect it to C1, C2. 4. Remove from d the columns corresponding to C1, C2. 5. Add to d a column corresponding to C. 6. Assign length ½(d(C1,C2)+u(C1)-u(C2)) to edge (C1,C). 7. Assign length ½(d(C1,C2)+u(C2)-u(C1)) to edge (C2,C).
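A compact sketch of the whole procedure follows. It makes some assumptions that are not on the slide: merged clusters are nested tuples, the distance from the merged cluster uses the standard update (d(C1,C*) + d(C2,C*) - d(C1,C2))/2, and the loop stops when two clusters remain, which are then joined by a single edge. It is run on the four-taxon example matrix.

```python
from itertools import combinations

def neighbor_joining(names, d):
    """Return the edges (node, node, length) of the unrooted NJ tree."""
    dist = {frozenset((names[i], names[j])): d[i][j]
            for i, j in combinations(range(len(names)), 2)}

    def get(a, b):
        return dist[frozenset((a, b))]

    active = list(names)
    edges = []
    while len(active) > 2:
        k = len(active)
        u = {a: sum(get(a, b) for b in active if b != a) / (k - 2)
             for a in active}
        # Step 1: pair minimizing f = d - (u1 + u2); first pair wins on ties
        c1, c2 = min(combinations(active, 2),
                     key=lambda p: get(*p) - u[p[0]] - u[p[1]])
        new = (c1, c2)
        # Steps 6-7: branch lengths to the new internal node
        edges.append((c1, new, (get(c1, c2) + u[c1] - u[c2]) / 2))
        edges.append((c2, new, (get(c1, c2) + u[c2] - u[c1]) / 2))
        # Step 2: distances from the new node to the remaining clusters
        for other in active:
            if other != c1 and other != c2:
                dist[frozenset((new, other))] = (
                    get(c1, other) + get(c2, other) - get(c1, c2)) / 2
        active = [a for a in active if a != c1 and a != c2] + [new]
    edges.append((active[0], active[1], get(active[0], active[1])))
    return edges

d = [[0, 5, 7, 10],
     [5, 0, 4, 7],
     [7, 4, 0, 5],
     [10, 7, 5, 0]]
for edge in neighbor_joining(list("ABCD"), d):
    print(edge)
```

For this additive matrix the result has leaf branches A = 4, B = 1, C = 1, D = 4 and an internal edge of length 2, which reproduces all six input distances.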
  • 54. NJ tree is not rooted • The order of construction of internal nodes of NJ does not suggest an ancestral relation. • [Figure: unrooted tree with nodes labeled 1 to 5.]
  • 55. Rooting a tree • Choose one distant organism as an out-group. • [Figure: tree with the root placed between the species of interest and the out-group.]
  • 56. Bootstrapping • Estimating confidence in the tree topology. • Are we sure that this is correct? • Is there enough evidence that A is a successor of B and not the other way around? • [Figure: tree with nodes A and B.]
  • 57. Bootstrapping, continued • Assume that the tree is built from a multiple sequence alignment. • Select columns of the alignment randomly (with replacement) and build a new tree from them. • Repeat, say, 1000 times. • For each edge of the initial tree, calculate the percentage of times it is present in the new tree (59% in the pictured example). • [Figure: alignment columns 1 to 11, a random selection of columns, and the initial tree and a new tree, each with nodes A and B.]
  • 58. Summary • Assume you have a multiple alignment of length N. Let T be the NJ tree built from this alignment. • Repeat, say, 1000 times the following process: – Select randomly with replacement N columns of the alignment to produce a randomized alignment. – Build the tree for this randomized alignment. • For each edge of T, report the percentage of times it was present in a tree built from a randomized alignment. This is called the bootstrap value. • Trusted edges: 80% or better bootstrap.
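The resampling loop can be sketched independently of the tree-building method. In the code below, `build_edges` is a placeholder for any procedure that turns an alignment into a set of hashable edges (splits). The toy builder `closest_pair_split`, which reports only the pair of sequences with the smallest Hamming distance, is a hypothetical stand-in for NJ, and the alignment is made-up data.

```python
import random
from itertools import combinations

def resample_columns(alignment, rng):
    """Draw N columns with replacement from an alignment of length N."""
    n = len(alignment[0])
    cols = [rng.randrange(n) for _ in range(n)]
    return ["".join(seq[c] for c in cols) for seq in alignment]

def bootstrap_support(alignment, build_edges, n_rep=1000, seed=0):
    """Percentage of replicates in which each edge of the original tree reappears."""
    rng = random.Random(seed)
    reference = build_edges(alignment)
    counts = {edge: 0 for edge in reference}
    for _ in range(n_rep):
        replicate = build_edges(resample_columns(alignment, rng))
        for edge in reference & replicate:
            counts[edge] += 1
    return {edge: 100.0 * c / n_rep for edge, c in counts.items()}

def closest_pair_split(alignment):
    """Toy tree builder (assumption): the closest pair of sequences forms an edge."""
    def hamming(i, j):
        return sum(a != b for a, b in zip(alignment[i], alignment[j]))
    pair = min(combinations(range(len(alignment)), 2), key=lambda p: hamming(*p))
    return {frozenset(pair)}

alignment = ["ACGTACGT", "ACGTACGT", "TTTTGGGG", "GGGGCCCC"]
print(bootstrap_support(alignment, closest_pair_split, n_rep=200))
```

Since sequences 0 and 1 are identical, their grouping is recovered in every replicate and gets 100% support; with real data one would plug in an actual tree builder and compare splits of the taxon set.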
  • 59. Maximum Likelihood Method • Given a multiple sequence alignment and a probabilistic model of substitutions (like the PAM model), find the tree which has the highest probability of generating the data. • Simplifying assumptions: – Positions evolve independently. – After species diverged they evolve independently.
  • 60. Formally: • Find the tree T such that, assuming evolution model M, Pr[Data | T,M] is maximized. • From the independence of symbols: Pr[Data | T,M] = Π_i Pr[Di | T,M], where the product is taken over all characters i and Di is the value of character i over all taxa.
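  • 61. Computing Pr[Di | T,M] • Pr[Di | T,M] = Σ_x Σ_y Σ_z p(x) p(x,y,t1) p(y,A,t2) p(y,B,t6) p(x,z,t3) p(z,C,t4) p(z,D,t5) • The sums consider all possible assignments of states to the internal nodes x, y, z. • p(x,y,t) = probability of mutation x to y in time t (from the model). • [Figure: tree for column Di with root x, internal nodes y and z, leaves A, B below y and C, D below z, and branch times t1 to t6.]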
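The triple sum can be evaluated by brute force for a single column. The slide leaves the substitution model open; the sketch below assumes the Jukes-Cantor model, with a uniform root distribution p(x) = 1/4 and branch times measured in expected substitutions per site. The function names are hypothetical.

```python
import math
from itertools import product

BASES = "ACGT"

def p_jc(x, y, t):
    """Jukes-Cantor probability of seeing y after time t when starting from x."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def column_likelihood(a, b, c, d, times):
    """Pr[Di | T, M] for leaf states a, b, c, d; times = (t1, ..., t6) as on the slide."""
    t1, t2, t3, t4, t5, t6 = times
    total = 0.0
    for x, y, z in product(BASES, repeat=3):  # all internal assignments
        total += (0.25 * p_jc(x, y, t1) * p_jc(y, a, t2) * p_jc(y, b, t6)
                  * p_jc(x, z, t3) * p_jc(z, c, t4) * p_jc(z, d, t5))
    return total

times = (0.1,) * 6
print(column_likelihood("A", "A", "A", "A", times))
print(column_likelihood("A", "C", "G", "T", times))
```

A useful sanity check is that the likelihoods of all 4⁴ possible columns sum to 1. For larger trees the same quantity is computed bottom-up with dynamic programming rather than by enumerating internal assignments.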
  • 62. Discovering the tree of life • “Tree of life”: evolutionary tree of all organisms. • Construction: choose a gene universally present in all organisms; good examples: small rRNA subunit, mitochondrial sequences. • Things to keep in mind while constructing the tree of life from sequence distances: – Lateral (or horizontal) gene transfer. – Gene duplication: a genome may contain similar genes that may evolve along different pathways. The phylogenetic tree needs to be derived from orthologous genes.
  • 63. Where we go with it… • We know how to compute Pr[Di | T,M] for a given column and a given tree. • Multiply over all columns to get Pr[Data | T,M]. • Now, explore the space of possible trees. • Bad news: the space of all possible trees is HUGE. • Various heuristic approaches are used.
  • 64. Metropolis algorithm: Random Walk in Energy Space • Goal: design transition probabilities so that the probability of arriving at state j is P(j) = q(j) / Z (typically, q(S) = e^(-E(S)/kT), where E is the energy, T the temperature and k a constant). • Z is the partition function, the sum over all states S of the terms q(S). Z cannot be computed analytically since the space is too large. • States i and j are in our case trees: tree i can be obtained from tree j by a “local” change. • [Figure: the network of possible trees; there is an edge between “similar” trees.]
  • 65. Monte Carlo, Metropolis Algorithm • At each state i, choose uniformly at random one of the neighboring conformations j. • Compute p(i,j) = min(1, q(j)/q(i)). • With probability p(i,j) move to state j; otherwise stay at i. • Iterate.
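A toy run shows that the walk visits states in proportion to q without ever computing Z. The sketch uses the standard Metropolis acceptance rule min(1, q(j)/q(i)) for a proposed move from i to j, on a made-up state space: five states arranged in a ring, so every state has the same number of neighbors and the uniform proposal is symmetric. In the phylogenetic setting the states would be trees and the neighbors trees that differ by a local change.

```python
import random

def metropolis(q, neighbors, start, n_steps, seed=0):
    """Random walk with Metropolis acceptance; returns visit frequencies."""
    rng = random.Random(seed)
    state = start
    visits = {s: 0 for s in q}
    for _ in range(n_steps):
        proposal = rng.choice(neighbors[state])
        if rng.random() < min(1.0, q[proposal] / q[state]):
            state = proposal          # accept the move
        visits[state] += 1            # a rejected move counts as a visit to the old state
    return {s: v / n_steps for s, v in visits.items()}

q = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0, 4: 5.0}          # unnormalized weights
neighbors = {s: [(s - 1) % 5, (s + 1) % 5] for s in q}  # ring of five states
freq = metropolis(q, neighbors, start=0, n_steps=200_000)

Z = sum(q.values())
for s in q:
    print(s, round(freq[s], 3), round(q[s] / Z, 3))
```

The empirical frequencies should come out close to q(s)/Z, although the normalizing constant Z was only used afterwards for comparison.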
  • 66. MrBayes • Program developed by J. Huelsenbeck & F. Ronquist. • Assumptions: q(i) = Pr[D | Ti,M]; prior probabilities: all trees are equally likely. • The proportion of time a given tree is visited approximates its posterior probability.
  • 67. Most Popular Phylogeny Software • PAUP • PHYLIP