6238578.ppt

Phylogenetic Analysis
Introduction to bioinformatics
Stinus Lindgreen
stinus@binf.ku.dk
Bioinformatics Centre, University of Copenhagen

Outline of the lecture
 What is a phylogeny?
 Why and how to interpret them
 Programs: PHYLIP, PAUP* and BioEdit
 Building a tree 1: Multiple alignment
 Building a tree 2: The model
 Building a tree 3: Construction
 Building a tree 4: Evaluation

Nothing in Biology Makes Sense
Except in the Light of Evolution
Theodosius Dobzhansky (1900-1975)

Phylogeny
 Phylogenetic inference predicts a tree based on
characters (of some sort)
 Some variation needed
 Group together similar species/genes
 Connect to most common ancestor
 Unrooted tree: Just show connections
 Rooted tree: Direction of evolution
 Branch lengths can show divergence

Before sequences
 Phylogenetic trees show evolutionary relationships
 Existed longer than sequencing methods
 Previously based on morphological characters
 Still partly today – at least for checking
 Mainly based on biological sequences
 DNA or protein
 Base phylogeny on mutations

Some pitfalls
 Determining phylogeny is important for
understanding biology
 But also a very difficult problem
 Beware of incorrect trees
 Important to understand models and methods
 The programs are helpful tools
The result is only as good as the alignment

Assumptions
Basic concepts of evolutionary theory
 Relation to common ancestor
 Phylogenetics represented by bifurcating tree
 Mutations occur over evolutionary time
Necessary to make phylogenetic inference possible

Interpretation
 Know your model
 Both evolutionary and for tree construction
 Know the assumptions of the model
 Evolution independent? Identical between sites? The same
for all sequences?
 Are the sequences correct?
 And are they representative?
 And are they homologous?
 Is the multiple alignment correct?
What you get out is no better than what you put in

Some biological pitfalls
Don’t make hasty conclusions!
 Does your tree contradict common sense?
 Then it’s probably wrong!
 Differentiate between the homologs
 Orthologs
 Speciation, common ancestor, similar function
 Paralogs
 Gene duplication, within 1 organism, differing functions
 Xenologs
 Horizontal gene transfer – hard to tell, similar function

Software
Today we’ll look at the programs before the methods
Some programs for phylogenetic analysis
 A multiple alignment program:
Clustal, T-Coffee, MAFFT, Muscle…
 A phylogenetic program:
Phylip, PAUP*, MacClade, BioEdit…
 Visualizing the tree:
TreeView, NJplot

PAUP*
 Commercial package
 Apparently good
 Many different methods and analysis methods
 But since we don’t own a copy…
 Similarly: MacClade only works on Macintosh…

PHYLIP
 Free package
 Many programs
 Both distance and character based
 Bootstrapping possible
 But:
 It can be a little difficult
 No graphical user interface
 And you will need to run many programs

BioEdit
 Has phylogeny methods built in
 Can call Phylip routines
 No need for you to learn the command line
 But no bootstrapping… (as far as I know)
 Point and click:
 Select the sequences in the alignment
 Choose the wanted phylogeny
 Voila!

PhyloWin
 Another free program
 Simple, not many possibilities
 But you can make bootstrapping

Getting the software
 Install BioEdit, PHYLIP, PhyloWin and NJplot
 Links on the wiki

Constructing a tree
To make a phylogenetic tree, four steps are needed:
1. Perform multiple alignment
2. Choose your model
3. Build the tree
4. Evaluate the quality
A brief note:
Ideally: Parallel alignment and phylogenetic inference
 Very difficult – but it has been pursued

1) The multiple alignment
Already discussed
Some notes:
 Recall that MA programs are not exact
 Some manual editing often necessary
 Consider the algorithm used
 Does it consider the phylogeny of the data?
 Clustal’s guide tree: Not correct phylogeny
 What parameters are used?
 Solve ambiguities, remove near-identical sequences
 Gappy regions, identical sequences can bias the result

2) The model
The model describes the data
 Evolutionary events
 Overall mutability
 Evolutionary model?
 Crucial – both for alignment and tree building
 Are you looking at nucleotides or amino acids?
 Where do we get most information?
 Know the basis for the chosen model

Nucleotide models
 Create 4×4 matrix
 Either fixed cost
 Character state
 Or rate matrices
 Probabilities
 Used for different kinds of tree estimations
 Include site specific information
 Third codon position more variable

Nucleotide model 1
 Fixed cost for transitions and transversion
 E.g. transversions are twice as costly as
transitions
 For a tree: Count the number of
transitions/transversions
 Calculate cost
 Tends to minimize number of
transversion
 Cluster transitions
A C G T
A - 2 1 2
C 2 - 2 1
G 1 2 - 2
T 2 1 2 -

Nucleotide model 2
 Simple substitution rate matrix
 Assume same rates AB and BA
 Assume all mutations equally likely: Rate α
 The Jukes-Cantor model
A C G T
A -3α α α α
C α -3α α α
G α α -3α α
T α α α -3α

Nucleotide model 3
A C G T
A -(α2+α1) α2 α1 α2
C α2 -(α2+α1) α2 α1
G α1 α2 -(α2+α1) α2
T α2 α1 α2 -(α2+α1)
 More advanced rate matrix
 Include transitions/tranversions
 Rates α1 and α2
 The Kimura 2-parameter model

Amino acid models
 A 20×20 substitution matrix
 The BLOSUM matrices
 Fixed cost matrices
 Or the PAM matrices
 Rate matrices
 Described last week

3) Building the tree
We have the sequences, the alignment and the model
 Find the best tree
 What is the best tree?
 Two main strategies:
 Distance based
 Look at dissimilarities (=distances)
 Character based
 Look at the data

Problems with trees
 The number of possible trees grows exponentially
 For 15 taxa: 2.13·1014 possibilities…
 How to search?
 Branch and Bound
 Branch swapping
 Rooting the tree
 Not a simple problem
 All the following methods produce unrooted trees
 Use an outgroup
 Midpoint of longest branch

Distance methods
 Some sequences more similar than others
 Closely related sequences should be close in the tree
 Abstract view on the data
 Loss of information is usually a bad sign
 Only use the distances between sequences
 Recall Clustal
 All methods start with a distance matrix

Distance methods
 Can we get the correct answer?
 Yes, if all mutation events were present
 But: After one mutation, the site is ”saturated”
 Additional mutations do not give additional info
A B C: Distance 2
A C: Distance 1
 And mutations back will fool the method
A B A: Distance 2
A A: Distance 0

UPGMA
Unweighted Pair Group Method with Arithmetic Mean
 Unweighted: The distances are used as they are
 Pair: Find the two closest elements
 Group: Put them together in a new group
 Arithmetic Mean: Gives distances from the new group
 Correct tree assuming a molecular clock
 Evolutionary divergence time can be found from mutations
 Mutation rates are constant

UPGMA illustrated
 Find two closest: A and D
 Create a new group [A+D]
 Update distances:
7
2
6
8
2
B
D
B
A
B
D]
[A









A B C D E
A - 8 3 2 5
B - - 5 6 6
C - - - 7 5
D - - - - 3
E - - - - -
A+D B C E
A+D - 7 5 4
B - - 5 6
C - - - 5
E - - - -
 Repeat for all sequences
 Next time: Connect [A+D] with E

Trying UPGMA
 Go to the wiki and do the UPGMA exercise

Neighbour joining
 A little like UPGMA
 Difference: NJ does not assume a molecular clock
 But it assumes an additive tree
 Distance between two leaves is the sum of the edges
 Find the closest pair that is most apart from the rest
of the tree
 Connect pair and update distances
 A little advanced: Take the overall distance to the rest of the
tree into account
 Corrects for varying mutation
 Fast and can give good results

Fitch-Margoliash
FM method
 We have the pairwise distances
 Each branch in the tree has a length
 The length of all paths can be found
 Optimize tree by moving internal nodes around
 The best fit minimizes the overall error
 The minimum squared deviation
 
ij
2
ij
ij )
p
(d

Minimum Evolution
The ME method
 Find the shortest tree
 Count number of changes
 Similar to FM but only looks at branches
FM
 
ij
2
ij
ij )
l
(d
A
B
B
A
ME

Trying NJ
 Go to the wiki and do the NJ exercise

Character methods
 Use the data (the actual characters)
 All information at hand
 More advanced, slower, but also more accurate
 Maximum Parsimony (MP)
 Occam’s razor: Simplest explanation
 Maximum Likelihood (ML)
 Advanced statistical method
 Most probable tree given the data and the model

Maximum parsimony
 How does evolution work?
 Assumption: Path of least resistance
 True evolution gives rise to fewest changes
 The tree we want:
 Describe the given sequences by fewest changes
 The ancestral nodes must be as similar as possible
 Predict a tree
 Count the number of changes needed

MP illustrated
A C G G C
{A,C} {G}
{A,C,G}
{C}

MP illustrated
A C G G C
{A,C} {G}
{A,C,G}
{C}
X
X
Cost: 2 changes

MP illustrated
A C G G C
C G
C
C
CG
CA

Maximum Likelihood
 Given the data, predict the most probable model
 Can optimize both tree and substitution model
 We know the sequences
 What is the most likely substitution rates?
 Estimate from the alignment (and the phylogeny)
 And what is the most likely tree?
 Estimate from alignment and substitution rates
 Computationally heavy and rather slow
 Normally good results

Maximum Likelihood
 General practice: Optimize model then tree
 Calculate probability for each alignment column
 Combine to probability for entire alignment
 Averages over low and high probability sites
 Likelihood of column given tree
A A C
A
A
A A C
C
A
A A C
G
A
L=P +P +P +…

Maximum likelihood
 Then repeat this for all possible tree topologies
 And all possible assignments to internal nodes
 And then choose the combination that gives the
highest probability…
 Clearly very difficult

MP and ML exercise
 Go to the wiki and do the MP and ML exercises

Summary of methods
Distance Character based
Clustering UPGMA
Neighbour Joining
Optimality
criterion
Least Squares
Minimum evolution
Maximum parsimony
Maximum likelihood
(Bayesian statistics)

The differences
 Sometimes the differences can seem minimal
 They affect the tree – but the same result is possible
UPGMA and NJ
 Minimize the overall length of the tree
Maximum parsimony
 Finds tree with fewest changes
Maximum likelihood
 Maximizes the probability of the tree given the data

4) Evaluating trees
How good is the predicted tree?
Some sequence variation needed
 Is the signal strong enough?
There are so many possible trees
 Are there many trees similar to the prediction?
 Which one to choose?
 Is the tree robust?
 Does it change much when e.g. removing a sequence?

Randomization
 Is it possible that tree is just random?
 Permute the columns of the alignment
 i.e. shuffle the characters in a column
 Build a new tree
 Is it (partly) identical?
 If the tree is just as likely to be random, then don’t
put too much faith in it

Bootstrapping
 The story of Baron von Münchausen
 He pulled himself out of a swamp by his bootstraps
 The idea: Evaluate the quality of the result using the
same data all over again
 Make a large number of new datasets
 Create phylogenetic tree
 Observe the number of times clades are made

Bootstrapping
 The datasets should be similar
 Thereby: The trees are comparable
 Alignments of same size (length and sequences)
 Non-parametric: Sample with replacement
 Choose a random column and add new alignment
 Parametric: Simulate new datasets
 Use model that look like your data
 Characteristics are preserved (unlike randomization)

Bootstrap example
 Non-parametric bootstrapping
 We have an alignment:
A: A G G C U C C A A A
B: A G G U U C G A A A
C: A G C C C C G A A A
D: A U U U C C G A A C
#: 0 1 2 0 3 0 1 2 0 1
 Sample columns:
A: G G G U U U C A A A
B: G G G U U U G A A A
C: G C C C C C G A A A
D: U U U C C C G A A C
A B C D
A - - - -
B 1 - - -
C 5 5 - -
D 8 7 4 -
A
B
C
D

Bootstrap example
 Sample 2:
A: A U U C C C C A A A
B: A U U C C G G A A A
C: A C C C C G G A A A
D: A C C C C G G C C C
A B C D
A - - - -
B 2 - - -
C 4 2 - -
D 7 5 3 -
A
B
C
D

Bootstrap example
 Sample 3:
A: A C C C A A G G C C
B: A C C G A A G G U U
C: A C C G A A C C C C
D: A C C G C C U U U U
A B C D
A - - - -
B 3 - - -
C 3 4 - -
D 7 4 6 -
A
B
C
D

Bootstrap example
 Calculate consensus tree
 Can be done on many ways
 Put the bootstrap number at each branch point
 The proportions of times this branch is observed
 Of course, more than three samples needed
A
B
C
D
1.0
0.66

Bootstrapping exercise
 Do the bootstrapping exercise on the wiki

Summary
 What is phylogenetic inference?
 What can a phylogenetic tree be used for?
 Be aware of the multiple alignment
 The different models
 Tree building methods: NJ, UPGMA, ML and MP
 Evaluating trees: Bootstrapping
 Programs: Phylip, PAUP*,PhyloWin and BioEdit
Next time: Gene finding (with Anders Krogh)
Then RNA structure prediction with me again 

6238578.ppt

More Related Content

Similar to 6238578.ppt (20)

More from ChijiokeNsofor (20)

Recently uploaded (20)

6238578.ppt