Phylogenetic Analysis
Introduction to bioinformatics
Stinus Lindgreen
stinus@binf.ku.dk
Bioinformatics Centre, University of Copenhagen
Outline of the lecture
 What is a phylogeny?
 Why and how to interpret them
 Programs: PHYLIP, PAUP* and BioEdit
 Building a tree 1: Multiple alignment
 Building a tree 2: The model
 Building a tree 3: Construction
 Building a tree 4: Evaluation
Nothing in Biology Makes Sense
Except in the Light of Evolution
Theodosius Dobzhansky (1900-1975)
Phylogeny
 Phylogenetic inference predicts a tree based on
characters (of some sort)
 Some variation needed
 Group together similar species/genes
 Connect to most recent common ancestor
 Unrooted tree: Just show connections
 Rooted tree: Direction of evolution
 Branch lengths can show divergence
Before sequences
 Phylogenetic trees show evolutionary relationships
 Existed longer than sequencing methods
 Previously based on morphological characters
 Still partly today – at least for checking
 Mainly based on biological sequences
 DNA or protein
 Base phylogeny on mutations
Morphological tree
Modern tree
[Figure: tree with leaves A, A, G, C, G and two branches marked X]
Some pitfalls
 Determining phylogeny is important for
understanding biology
 But also a very difficult problem
 Beware of incorrect trees
 Important to understand models and methods
 The programs are helpful tools
The result is only as good as the alignment
Assumptions
Basic concepts of evolutionary theory
 Relation to common ancestor
 Phylogeny represented by a bifurcating tree
 Mutations occur over evolutionary time
Necessary to make phylogenetic inference possible
Tree of Life
Interpretation
 Know your model
 Both evolutionary and for tree construction
 Know the assumptions of the model
 Evolution independent? Identical between sites? The same
for all sequences?
 Are the sequences correct?
 And are they representative?
 And are they homologous?
 Is the multiple alignment correct?
What you get out is no better than what you put in
Some biological pitfalls
Don’t make hasty conclusions!
 Does your tree contradict common sense?
 Then it’s probably wrong!
 Differentiate between the homologs
 Orthologs
 Speciation, common ancestor, similar function
 Paralogs
 Gene duplication, within 1 organism, differing functions
 Xenologs
 Horizontal gene transfer – hard to tell, similar function
Software
Today we’ll look at the programs before the methods
Some programs for phylogenetic analysis
 A multiple alignment program:
Clustal, T-Coffee, MAFFT, Muscle…
 A phylogenetic program:
Phylip, PAUP*, MacClade, BioEdit…
 Visualizing the tree:
TreeView, NJplot
PAUP*
 Commercial package
 Apparently good
 Many different tree-building and analysis methods
 But since we don’t own a copy…
 Similarly: MacClade only works on Macintosh…
PHYLIP
 Free package
 Many programs
 Both distance and character based
 Bootstrapping possible
 But:
 It can be a little difficult
 No graphical user interface
 And you will need to run many programs
BioEdit
 Has phylogeny methods built in
 Can call Phylip routines
 No need for you to learn the command line
 But no bootstrapping… (as far as I know)
 Point and click:
 Select the sequences in the alignment
 Choose the desired phylogeny method
 Voila!
PhyloWin
 Another free program
 Simple, not many possibilities
 But it can do bootstrapping
Getting the software
 Install BioEdit, PHYLIP, PhyloWin and NJplot
 Links on the wiki
Constructing a tree
To make a phylogenetic tree, four steps are needed:
1. Perform multiple alignment
2. Choose your model
3. Build the tree
4. Evaluate the quality
A brief note:
Ideally: Parallel alignment and phylogenetic inference
 Very difficult – but it has been pursued
1) The multiple alignment
Already discussed
Some notes:
 Recall that MA programs are not exact
 Some manual editing often necessary
 Consider the algorithm used
 Does it consider the phylogeny of the data?
 Clustal’s guide tree: Not correct phylogeny
 What parameters are used?
 Solve ambiguities, remove near-identical sequences
 Gappy regions, identical sequences can bias the result
2) The model
The model describes the data
 Evolutionary events
 Overall mutability
 Evolutionary model?
 Crucial – both for alignment and tree building
 Are you looking at nucleotides or amino acids?
 Where do we get most information?
 Know the basis for the chosen model
Nucleotide models
 Create 4×4 matrix
 Either fixed cost
 Character state
 Or rate matrices
 Probabilities
 Used for different kinds of tree estimations
 Include site specific information
 Third codon position more variable
Nucleotide model 1
 Fixed cost for transitions and transversions
 E.g. transversions are twice as costly as transitions
 For a tree: Count the number of transitions/transversions
 Calculate cost
 Tends to minimize number of transversions
 Cluster transitions
A C G T
A - 2 1 2
C 2 - 2 1
G 1 2 - 2
T 2 1 2 -
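The fixed-cost matrix above can be turned into a small scoring function. This is a minimal sketch, assuming the costs from the slide (transition 1, transversion 2); the function names and the two example sequences are made up for illustration.

```python
# Purines and pyrimidines: a change within a class is a transition,
# a change between classes is a transversion.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_cost(x, y, transition_cost=1, transversion_cost=2):
    """Cost of changing nucleotide x into y under the fixed-cost model."""
    if x == y:
        return 0
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition_cost if same_class else transversion_cost


def pairwise_cost(seq1, seq2):
    """Total cost of the differences between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same aligned length")
    return sum(substitution_cost(a, b) for a, b in zip(seq1, seq2))


# Illustrative sequences: two transitions (A/G, C/T) and one transversion (G/C).
example_cost = pairwise_cost("AAGCG", "GAGTC")
print(example_cost)  # 1 + 0 + 0 + 1 + 2 = 4
```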
Nucleotide model 2
 Simple substitution rate matrix
 Assume same rates A→B and B→A
 Assume all mutations equally likely: Rate α
 The Jukes-Cantor model
A C G T
A -3α α α α
C α -3α α α
G α α -3α α
T α α α -3α
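One common use of the Jukes-Cantor model is to correct the observed proportion of differing sites for multiple substitutions at the same site. The sketch below assumes the standard correction d = -3/4 · ln(1 - 4p/3), where p is the observed proportion of differences; the function name is illustrative and not taken from any of the programs mentioned in the lecture.

```python
import math


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance from an observed proportion p of differing sites.

    The correction is only defined for 0 <= p < 0.75; at p = 0.75 the
    sequences look random with respect to each other.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# The corrected distance is always at least as large as the observed one.
for observed in (0.0, 0.1, 0.3, 0.6):
    print(observed, round(jukes_cantor_distance(observed), 4))
```

The gap between the observed and corrected values widens as p grows, which is the effect of hidden substitutions.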
Nucleotide model 3
A C G T
A -(α1+2α2) α2 α1 α2
C α2 -(α1+2α2) α2 α1
G α1 α2 -(α1+2α2) α2
T α2 α1 α2 -(α1+2α2)
 More advanced rate matrix
 Include transitions/transversions
 Rates α1 (transitions) and α2 (transversions)
 The Kimura 2-parameter model
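As a sketch of how the Kimura 2-parameter matrix is put together, the code below builds it from the two rates. It assumes the usual convention that every row of a rate matrix sums to zero, so each diagonal entry is minus the sum of the other entries in its row, i.e. -(α1 + 2α2). The function and variable names are illustrative.

```python
BASES = "ACGT"
PURINES = {"A", "G"}


def is_transition(x, y):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return x != y and ((x in PURINES) == (y in PURINES))


def k2p_rate_matrix(alpha1, alpha2):
    """Rate matrix as nested dicts: q[x][y] is the rate of x -> y.

    alpha1 is the transition rate, alpha2 the transversion rate.
    """
    q = {}
    for x in BASES:
        row = {y: (alpha1 if is_transition(x, y) else alpha2)
               for y in BASES if y != x}
        row[x] = -sum(row.values())  # rows of a rate matrix sum to zero
        q[x] = row
    return q


q = k2p_rate_matrix(alpha1=2.0, alpha2=0.5)
for x in BASES:
    print(x, [q[x][y] for y in BASES])
```

Setting alpha1 equal to alpha2 reduces the matrix to the Jukes-Cantor form with -3α on the diagonal.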
Amino acid models
 A 20×20 substitution matrix
 The BLOSUM matrices
 Fixed cost matrices
 Or the PAM matrices
 Rate matrices
 Described last week
3) Building the tree
We have the sequences, the alignment and the model
 Find the best tree
 What is the best tree?
 Two main strategies:
 Distance based
 Look at dissimilarities (=distances)
 Character based
 Look at the data
Problems with trees
 The number of possible trees grows faster than exponentially with the number of taxa
 For 15 taxa: 2.13·10^14 possible rooted trees…
 How to search?
 Branch and Bound
 Branch swapping
 Rooting the tree
 Not a simple problem
 All the following methods produce unrooted trees
 Use an outgroup
 Midpoint of longest branch
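To get a feeling for the size of the search space, the number of bifurcating trees can be counted directly. The sketch below uses the standard counts (2n-3)!! for rooted and (2n-5)!! for unrooted trees on n labelled taxa; for 15 taxa the rooted count is about 2.13·10^14. The function names are illustrative.

```python
def double_factorial(k):
    """k!! = k * (k-2) * (k-4) * ... down to 1 or 2 (and 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_rooted_trees(n):
    """Number of rooted bifurcating trees on n labelled taxa (n >= 2)."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return double_factorial(2 * n - 3)


def num_unrooted_trees(n):
    """Number of unrooted bifurcating trees on n labelled taxa (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return double_factorial(2 * n - 5)


for n in (4, 10, 15):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

Exhaustive search is therefore only realistic for small data sets, which is why search strategies such as branch and bound and branch swapping are needed.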
Distance methods
 Some sequences more similar than others
 Closely related sequences should be close in the tree
 Abstract view on the data
 Loss of information is usually a bad sign
 Only use the distances between sequences
 Recall Clustal
 All methods start with a distance matrix
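A minimal sketch of how such a distance matrix could be built, assuming the simplest possible distance: the number of differing positions between two aligned sequences. The alignment is the four-sequence example used in the bootstrap slides later in the lecture; the function names are illustrative.

```python
from itertools import combinations


def hamming(seq1, seq2):
    """Number of alignment positions at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same aligned length")
    return sum(a != b for a, b in zip(seq1, seq2))


def distance_matrix(alignment):
    """Pairwise distances as a dict keyed by (name1, name2)."""
    return {(x, y): hamming(alignment[x], alignment[y])
            for x, y in combinations(alignment, 2)}


alignment = {
    "A": "AGGCUCCAAA",
    "B": "AGGUUCGAAA",
    "C": "AGCCCCGAAA",
    "D": "AUUUCCGAAC",
}
dm = distance_matrix(alignment)
for pair, d in dm.items():
    print(pair, d)
```

Everything a distance method sees from here on is this table; the individual columns of the alignment are no longer used.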
Distance methods
 Can we get the correct answer?
 Yes, if all mutation events were present
 But: After one mutation, the site is ”saturated”
 Additional mutations do not give additional info
A → B → C: Actual distance 2
A → C: Observed distance 1
 And back mutations will fool the method
A → B → A: Actual distance 2
A → A: Observed distance 0
UPGMA
Unweighted Pair Group Method with Arithmetic Mean
 Unweighted: The distances are used as they are
 Pair: Find the two closest elements
 Group: Put them together in a new group
 Arithmetic Mean: Gives distances from the new group
 Correct tree assuming a molecular clock
 Evolutionary divergence time can be found from mutations
 Mutation rates are constant
UPGMA illustrated
 Find two closest: A and D
 Create a new group [A+D]
 Update distances:
$d_{[A+D],B} = \frac{d_{A,B} + d_{D,B}}{2} = \frac{8 + 6}{2} = 7$
A B C D E
A - 8 3 2 5
B - - 5 6 6
C - - - 7 5
D - - - - 3
E - - - - -
A+D B C E
A+D - 7 5 4
B - - 5 6
C - - - 5
E - - - -
 Repeat for all sequences
 Next time: Connect [A+D] with E
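The procedure above can be sketched in a few lines. This is an illustrative implementation, not the one used by PHYLIP or BioEdit. It assumes that the distance from a merged group is the average over all its members (weighted by group size), which for two single sequences is the plain mean used in the update above. Group labels such as [A+D] are just strings.

```python
from itertools import combinations


def upgma(names, dist):
    """Return the list of merges (group1, group2, distance) in order.

    dist maps frozenset({x, y}) to the distance between x and y.
    """
    sizes = {name: 1 for name in names}  # current groups and their sizes
    d = dict(dist)
    merges = []
    while len(sizes) > 1:
        # Pair: find the two closest groups.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merges.append((a, b, d[frozenset((a, b))]))
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = f"[{a}+{b}]"
        # Arithmetic mean: distance from the new group to every other group.
        for other in sizes:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return merges


# Distance matrix from the slide.
upper = {
    ("A", "B"): 8, ("A", "C"): 3, ("A", "D"): 2, ("A", "E"): 5,
    ("B", "C"): 5, ("B", "D"): 6, ("B", "E"): 6,
    ("C", "D"): 7, ("C", "E"): 5,
    ("D", "E"): 3,
}
dist = {frozenset(pair): value for pair, value in upper.items()}
merges = upgma(["A", "B", "C", "D", "E"], dist)
for step in merges:
    print(step)
```

The first two merges reproduce the slide: A and D are joined at distance 2, then E is joined with [A+D] at distance 4. Later steps contain ties, which this sketch breaks arbitrarily by taking the first minimal pair.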
Trying UPGMA
 Go to the wiki and do the UPGMA exercise
Neighbour joining
 A little like UPGMA
 Difference: NJ does not assume a molecular clock
 But it assumes an additive tree
 Distance between two leaves is the sum of the edges
 Find the pair that is close together and far from the rest of the tree
 Connect pair and update distances
 A little advanced: Take the overall distance to the rest of the tree into account
 Corrects for varying mutation rates
 Fast and can give good results
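The pair selection in neighbour joining can be sketched as follows. The code assumes the standard NJ criterion Q(i,j) = (n-2)·d(i,j) - Σ_k d(i,k) - Σ_k d(j,k), where the pair with the lowest Q is joined. It reuses the five-sequence distance matrix from the UPGMA slide, and the function name is illustrative.

```python
from itertools import combinations


def nj_q_matrix(names, dist):
    """Q value for every pair; dist maps frozenset({x, y}) to a distance."""
    n = len(names)
    # Total distance from each sequence to all the others.
    total = {i: sum(dist[frozenset((i, k))] for k in names if k != i)
             for i in names}
    return {(i, j): (n - 2) * dist[frozenset((i, j))] - total[i] - total[j]
            for i, j in combinations(names, 2)}


upper = {
    ("A", "B"): 8, ("A", "C"): 3, ("A", "D"): 2, ("A", "E"): 5,
    ("B", "C"): 5, ("B", "D"): 6, ("B", "E"): 6,
    ("C", "D"): 7, ("C", "E"): 5,
    ("D", "E"): 3,
}
dist = {frozenset(pair): value for pair, value in upper.items()}
q = nj_q_matrix(["A", "B", "C", "D", "E"], dist)
for pair, value in sorted(q.items(), key=lambda item: item[1]):
    print(pair, value)
```

On this matrix the pairs (A, D) and (B, C) tie for the lowest Q. B and C are not the closest pair by raw distance, but both are far from everything else, which is exactly what the correction rewards.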
Fitch-Margoliash
FM method
 We have the pairwise distances
 Each branch in the tree has a length
 The length of all paths can be found
 Optimize tree by moving internal nodes around
 The best fit minimizes the overall error
 The minimum squared deviation
$\sum_{ij} (d_{ij} - p_{ij})^2$
(d_ij: observed pairwise distance, p_ij: path length between i and j in the tree)
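The error being minimized can be sketched as a function of a tree with branch lengths. This is only the scoring step, using the unweighted sum of squared differences between observed distances and path lengths; the search over trees and branch lengths is not shown. The three-leaf tree, the node name X and the distances are made up for illustration.

```python
from itertools import combinations


def path_lengths(edges, leaves):
    """Path length between every pair of leaves in a tree.

    edges maps (node1, node2) to the length of the branch between them.
    """
    adjacent = {}
    for (u, v), length in edges.items():
        adjacent.setdefault(u, {})[v] = length
        adjacent.setdefault(v, {})[u] = length

    def distances_from(start):
        seen = {start: 0.0}
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbour, length in adjacent[node].items():
                if neighbour not in seen:
                    seen[neighbour] = seen[node] + length
                    stack.append(neighbour)
        return seen

    return {(a, b): distances_from(a)[b] for a, b in combinations(leaves, 2)}


def squared_deviation(observed, edges, leaves):
    """Sum over leaf pairs of (observed distance - path length)^2."""
    p = path_lengths(edges, leaves)
    return sum((observed[pair] - p[pair]) ** 2 for pair in p)


leaves = ["A", "B", "C"]
observed = {("A", "B"): 3, ("A", "C"): 4, ("B", "C"): 5}
good_tree = {("A", "X"): 1, ("B", "X"): 2, ("C", "X"): 3}  # fits exactly
worse_tree = {("A", "X"): 2, ("B", "X"): 2, ("C", "X"): 3}  # A branch too long
print(squared_deviation(observed, good_tree, leaves))
print(squared_deviation(observed, worse_tree, leaves))
```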
Minimum Evolution
The ME method
 Find the shortest tree
 Count number of changes
 Similar to FM but only looks at branches
FM: $\sum_{ij} (d_{ij} - l_{ij})^2$
[Figure: trees over leaves A and B illustrating FM and ME]
Trying NJ
 Go to the wiki and do the NJ exercise
Character methods
 Use the data (the actual characters)
 All information at hand
 More advanced, slower, but also more accurate
 Maximum Parsimony (MP)
 Occam’s razor: Simplest explanation
 Maximum Likelihood (ML)
 Advanced statistical method
 The tree under which the data are most probable, given the model
Maximum parsimony
 How does evolution work?
 Assumption: Path of least resistance
 True evolution gives rise to fewest changes
 The tree we want:
 Describe the given sequences by fewest changes
 The ancestral nodes must be as similar as possible
 Predict a tree
 Count the number of changes needed
MP illustrated
A C G G C
{A,C} {G}
{A,C,G}
{C}
MP illustrated
A C G G C
{A,C} {G}
{A,C,G}
{C}
X
X
Cost: 2 changes
MP illustrated
A C G G C
C G
C
C
CG
CA
Maximum Likelihood
 Given the data, find the model under which they are most probable
 Can optimize both tree and substitution model
 We know the sequences
 What are the most likely substitution rates?
 Estimate from the alignment (and the phylogeny)
 And what is the most likely tree?
 Estimate from alignment and substitution rates
 Computationally heavy and rather slow
 Normally good results
Maximum Likelihood
 General practice: Optimize model then tree
 Calculate probability for each alignment column
 Combine to probability for entire alignment
 Averages over low and high probability sites
 Likelihood of column given tree
A A C
A
A
A A C
C
A
A A C
G
A
L = P1 + P2 + P3 + … (one term per assignment of nucleotides to the internal nodes)
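The sum over internal-node assignments can be written out by brute force for a tiny tree. The sketch below makes several assumptions that are not on the slide: a rooted tree with three leaves and two internal nodes, the Jukes-Cantor model with equal base frequencies, and the same length t for every branch. The function names are illustrative.

```python
import math
from itertools import product

BASES = "ACGT"


def jc_prob(x, y, t, alpha=1.0):
    """Jukes-Cantor probability that x has become y after time t."""
    decay = math.exp(-4 * alpha * t)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def column_likelihood(column, t=0.1):
    """Likelihood of one alignment column on the tree ((leaf1, leaf2), leaf3)."""
    leaf1, leaf2, leaf3 = column
    total = 0.0
    # One term per assignment of nucleotides to the two internal nodes.
    for root, inner in product(BASES, repeat=2):
        total += (0.25                       # base frequency at the root
                  * jc_prob(root, inner, t)
                  * jc_prob(inner, leaf1, t)
                  * jc_prob(inner, leaf2, t)
                  * jc_prob(root, leaf3, t))
    return total


def alignment_log_likelihood(columns, t=0.1):
    """Columns are treated as independent, so their log-likelihoods add."""
    return sum(math.log(column_likelihood(col, t)) for col in columns)


print(column_likelihood("AAC"))
print(alignment_log_likelihood(["AAC", "AAA", "GGG"]))
```

With 4 states and 2 internal nodes there are only 16 terms per column, but the number of terms grows exponentially with the number of internal nodes, so real programs use a dynamic programming scheme instead of this direct sum.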
Maximum likelihood
 Then repeat this for all possible tree topologies
 And all possible assignments to internal nodes
 And then choose the combination that gives the
highest probability…
 Clearly very difficult
MP and ML exercise
 Go to the wiki and do the MP and ML exercises
Summary of methods
Distance Character based
Clustering UPGMA
Neighbour Joining
Optimality
criterion
Least Squares
Minimum evolution
Maximum parsimony
Maximum likelihood
(Bayesian statistics)
The differences
 Sometimes the differences can seem minimal
 They affect the tree – but the same result is possible
UPGMA and NJ
 Minimize the overall length of the tree
Maximum parsimony
 Finds tree with fewest changes
Maximum likelihood
 Maximizes the probability of the data given the tree and the model
4) Evaluating trees
How good is the predicted tree?
Some sequence variation needed
 Is the signal strong enough?
There are so many possible trees
 Are there many trees similar to the prediction?
 Which one to choose?
 Is the tree robust?
 Does it change much when e.g. removing a sequence?
Randomization
 Is it possible that the tree is just random?
 Permute within the columns of the alignment
 i.e. shuffle the characters in each column
 Build a new tree
 Is it (partly) identical?
 If the tree is just as likely to be random, then don’t
put too much faith in it
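The shuffling step can be sketched as follows. Each column keeps its characters but loses the information about which sequence they came from. The alignment is the example from the bootstrap slides; the function name and the fixed seed are illustrative choices.

```python
import random


def shuffle_within_columns(alignment, seed=None):
    """Return a new alignment where the characters in each column are permuted."""
    rng = random.Random(seed)
    names = list(alignment)
    columns = [list(col) for col in zip(*alignment.values())]
    for col in columns:
        rng.shuffle(col)
    return {name: "".join(col[i] for col in columns)
            for i, name in enumerate(names)}


alignment = {
    "A": "AGGCUCCAAA",
    "B": "AGGUUCGAAA",
    "C": "AGCCCCGAAA",
    "D": "AUUUCCGAAC",
}
shuffled = shuffle_within_columns(alignment, seed=1)
for name, seq in shuffled.items():
    print(name, seq)
```

A tree built from the shuffled alignment gives a baseline for what a tree without phylogenetic signal looks like for data of this composition.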
Bootstrapping
 The story of Baron von Münchausen
 He pulled himself out of a swamp by his bootstraps
 The idea: Evaluate the quality of the result using the
same data all over again
 Make a large number of new datasets
 Create phylogenetic tree
 Observe the number of times clades are made
Bootstrapping
 The datasets should be similar
 Thereby: The trees are comparable
 Alignments of same size (length and sequences)
 Non-parametric: Sample with replacement
 Choose a random column and add it to the new alignment
 Parametric: Simulate new datasets
 Use a model that resembles your data
 Characteristics are preserved (unlike randomization)
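Non-parametric bootstrapping of an alignment can be sketched like this: draw as many columns as the original has, with replacement, and keep each drawn column intact across all sequences. The alignment is the example used in the following slides; the function name and the seed are illustrative.

```python
import random


def bootstrap_alignment(alignment, seed=None):
    """Return one bootstrap replicate: columns sampled with replacement."""
    rng = random.Random(seed)
    length = len(next(iter(alignment.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in picks)
            for name, seq in alignment.items()}


alignment = {
    "A": "AGGCUCCAAA",
    "B": "AGGUUCGAAA",
    "C": "AGCCCCGAAA",
    "D": "AUUUCCGAAC",
}
replicate = bootstrap_alignment(alignment, seed=42)
for name, seq in replicate.items():
    print(name, seq)
```

Because whole columns are sampled, every column in the replicate also occurs in the original alignment; only the number of times each one appears changes.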
Bootstrap example
 Non-parametric bootstrapping
 We have an alignment:
A: A G G C U C C A A A
B: A G G U U C G A A A
C: A G C C C C G A A A
D: A U U U C C G A A C
#: 0 1 2 0 3 0 1 2 0 1
 Sample columns:
A: G G G U U U C A A A
B: G G G U U U G A A A
C: G C C C C C G A A A
D: U U U C C C G A A C
A B C D
A - - - -
B 1 - - -
C 6 5 - -
D 8 7 4 -
A
B
C
D
Bootstrap example
 Sample 2:
A: A U U C C C C A A A
B: A U U C C G G A A A
C: A C C C C G G A A A
D: A C C C C G G C C C
A B C D
A - - - -
B 2 - - -
C 4 2 - -
D 7 5 3 -
A
B
C
D
Bootstrap example
 Sample 3:
A: A C C C A A G G C C
B: A C C G A A G G U U
C: A C C G A A C C C C
D: A C C G C C U U U U
A B C D
A - - - -
B 3 - - -
C 3 4 - -
D 7 4 6 -
A
B
C
D
Bootstrap example
 Calculate consensus tree
 Can be done in many ways
 Put the bootstrap number at each branch point
 The proportion of times this branch is observed
 Of course, more than three samples needed
A
B
C
D
1.0
0.66
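The bootstrap numbers can be sketched as a simple count of how often each clade (group of leaves under one branch point) appears among the bootstrap trees. The three trees below are hypothetical and only chosen to give values like those in the figure; they are not derived from the sample alignments. Trees are written as nested tuples and treated as rooted.

```python
def clades(tree):
    """Return (leaves under this node, set of all clades in this subtree)."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)
    return leaves, found


def clade_support(trees):
    """Proportion of trees in which each clade is observed."""
    counts = {}
    for tree in trees:
        _, found = clades(tree)
        for clade in found:
            counts[clade] = counts.get(clade, 0) + 1
    return {clade: n / len(trees) for clade, n in counts.items()}


# Hypothetical bootstrap trees for the four sequences.
trees = [
    ((("A", "B"), "C"), "D"),
    ((("A", "B"), "C"), "D"),
    (("A", "B"), ("C", "D")),
]
support = clade_support(trees)
for clade, value in support.items():
    print(sorted(clade), round(value, 2))
```

Here the clade {A, B} is found in all three trees and {A, B, C} in two of three. In practice hundreds or thousands of replicates are used.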
Bootstrapping exercise
 Do the bootstrapping exercise on the wiki
Summary
 What is phylogenetic inference?
 What can a phylogenetic tree be used for?
 Be aware of the multiple alignment
 The different models
 Tree building methods: NJ, UPGMA, ML and MP
 Evaluating trees: Bootstrapping
 Programs: Phylip, PAUP*, PhyloWin and BioEdit
Next time: Gene finding (with Anders Krogh)
Then RNA structure prediction with me again 
