Phylogenetics Workshop 
Part II : Model-based tree building 
De Landtsheer Sébastien, University of Luxembourg 
Ahead of the BeNeLux Bioinformatics Conference 2011
Outline of the Workshop 
Part I : 
• General introduction 
• Alignments 
• Distance-based methods 
Part II : 
• Maximum likelihood trees 
• Bayesian trees 
Part III : 
• Advanced Bayesian phylogenetics 
• Hypothesis testing
Outline of the Workshop 
Programs featured : 
- Topali (http://www.topali.org/download.shtml) 
- ModelTest in Java GUI 
(http://darwin.uvigo.es/software/jmodeltest.html) 
- BEAUti, BEAST and Tracer 
(http://beast.bio.ed.ac.uk/Main_Page) 
Model-based tree building 
Reminder : the main problem in phylogenetics is the HUGE 
number of possible trees to choose from 
What makes a good tree ?
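To give a feeling for how huge the tree space is, here is a minimal sketch that counts the unrooted, fully bifurcating topologies for n taxa using the standard formula (2n - 5)!!. The function name is ours and is only illustrative.

```python
def count_unrooted_trees(n_taxa: int) -> int:
    """Number of distinct unrooted, fully bifurcating topologies: (2n - 5)!!"""
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product for n <= 3).
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count


for n in (4, 5, 10, 20):
    print(n, "taxa :", count_unrooted_trees(n), "unrooted trees")
```

With 10 taxa there are already about two million topologies, which is why exhaustive searches stop being practical very quickly.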
Model-based tree building 
• We need to define an optimality criterion : 
– Parsimony (Occam’s razor) : 
• "Pluralitas non est ponenda sine necessitate" 
• Keep things simple and avoid unnecessary explanations 
• The most plausible explanations are preferred 
– Likelihood : 
• Given the Model M, what is the probability P of observing the 
data D ?
Parsimony
Parsimony 
• What tree represents the minimal number of changes 
along the phylogeny ? 
• Natural way to find the most plausible (least expensive) 
explanation 
• Example 1 : 
ATGTAACGTA 
ATGTCAGGTA 
ATGTCACGTT
Parsimony 
Possible tree 1 : 
[Tree diagram; node labels as extracted : ACA, CCA, CCT, CGA, ACT; changes marked on the branches : C->A, T->A, C->G, C->A] 
4 changes
Parsimony 
Possible tree 2 : 
[Tree diagram; node labels as extracted : ACA, ACA, AGA, ACT, CGA; branches marked : no change, G->C, A->T, A->C] 
3 changes
Parsimony 
Parsimony methods will search for the most 
parsimonious tree using a simple scoring system. Two 
are commonly used : 
- equal : 
0 1 1 1 
1 0 1 1 
1 1 0 1 
1 1 1 0 
- tv4 : 
0 4 1 4 
4 0 4 1 
1 4 0 4 
4 1 4 0
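As an illustration of the two scoring systems, the sketch below scores the differences between two aligned sequences under each matrix. The row and column order A, C, G, T is an assumption (it is not stated on the slide, but it is consistent with the transitions A<->G and C<->T costing 1 in tv4). Note that this is only a pairwise cost, not a full parsimony score on a tree.

```python
BASES = "ACGT"  # assumed row/column order of the matrices

EQUAL = [[0, 1, 1, 1],
         [1, 0, 1, 1],
         [1, 1, 0, 1],
         [1, 1, 1, 0]]

TV4 = [[0, 4, 1, 4],
       [4, 0, 4, 1],
       [1, 4, 0, 4],
       [4, 1, 4, 0]]


def pairwise_cost(seq_a: str, seq_b: str, matrix) -> int:
    """Sum the cost of every site where the two aligned sequences differ."""
    return sum(matrix[BASES.index(a)][BASES.index(b)]
               for a, b in zip(seq_a, seq_b))


seq1, seq2 = "ATGTAACGTA", "ATGTCAGGTA"  # first two sequences of Example 1
print("equal :", pairwise_cost(seq1, seq2, EQUAL))
print("tv4   :", pairwise_cost(seq1, seq2, TV4))
```

Both differences between these two sequences are transversions, so the tv4 scheme makes them four times more expensive than the equal scheme.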
Parsimony 
How to search the tree space ? 
- Brute force : trying all possible trees. Possible up to ~ 11 
taxa on modern computers 
- Heuristic search : trying arrangements of the (so far) 
best tree by making tiny changes. If the new tree is 
better it becomes the (so far) best tree. If not it is 
rejected and another rearrangement is tried
Tree rearrangements 
• Nearest-Neighbor Interchange (NNI)
Tree rearrangements 
• Subtree Pruning and Regrafting (SPR)
Tree rearrangements 
• Tree Bisection and Reconnection (TBR)
Likelihood 
• Probability to observe the data under the model 
– Data : already collected, won’t change, holy material 
– Model : idea about how things happened, with which 
we can do whatever we want : 
– Tree topology + branch lengths = « the tree » 
– Mechanism of sequence evolution = « the model » : 
- nucleotide composition 
- substitution matrix
Likelihood 
• The coin-flipping analogy : 
– Experimental setting : flipping a coin once 
– Data : heads 
– Model 1 : fair coin; Likelihood = 0.5 
– Model 2 : 2-headed coin; Likelihood = 1
Likelihood 
• The coin-flipping analogy : 
– Experimental setting : flipping a coin 10 times 
– Data : h=6, t=4 
– Model 1 : fair coin; Likelihood = 0.205 
– Model 2 : 2-headed coin; Likelihood = 0 
– Model 3 : unfair coin with p=0.6; Likelihood = 0.251
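The likelihoods on the coin-flipping slides can be reproduced with the binomial formula. This is a minimal sketch; the function name is ours.

```python
from math import comb


def binomial_likelihood(heads: int, flips: int, p_heads: float) -> float:
    """Probability of observing `heads` heads in `flips` flips of a coin
    that lands heads with probability `p_heads`."""
    tails = flips - heads
    return comb(flips, heads) * p_heads ** heads * (1.0 - p_heads) ** tails


for label, p in (("fair coin", 0.5), ("2-headed coin", 1.0), ("unfair coin", 0.6)):
    print(label, ":", round(binomial_likelihood(6, 10, p), 3))
```

The data (6 heads, 4 tails) never change; only the model does, and the model with p=0.6 gives the data the highest probability.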
Likelihood 
• Likelihood with DNA : 
– Let’s assume a nucleotide composition of : 
• A : 25% 
• C : 25% 
• G : 25% 
• T : 25% 
– What is the likelihood of the sequence « A » ? 
25% 
– What is the likelihood of the sequence « ATGTAACA » ? 
25%^8 = 0.00001526
Likelihood 
• Likelihood with DNA : 
– Let’s assume a nucleotide composition of : 
• A : 50% 
• C : 10% 
• G : 10% 
• T : 30% 
– What is the likelihood of the sequence « A » ? 
50% 
– What is the likelihood of the sequence « ATGTAACA »? 
50%^4 * 10% * 10% * 30%^2 = 0.00005625
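Under a pure composition model, the likelihood of a sequence is simply the product of the frequencies of its nucleotides. A minimal sketch (function and variable names are ours) : the sequence « ATGTAACA » contains four A, two T, one G and one C, so under the skewed composition the product is 0.5^4 * 0.3^2 * 0.1 * 0.1, about 5.6e-5.

```python
from math import prod


def sequence_likelihood(sequence: str, composition: dict) -> float:
    """Product of the frequencies of each nucleotide in the sequence."""
    return prod(composition[base] for base in sequence)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
skewed = {"A": 0.50, "C": 0.10, "G": 0.10, "T": 0.30}

print(sequence_likelihood("ATGTAACA", uniform))
print(sequence_likelihood("ATGTAACA", skewed))
```

The A-rich sequence is more likely under the A-rich composition than under equal frequencies.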
Likelihood 
• Likelihood with DNA : 
– Under any model, the likelihoods of all nucleotides, or of 
all dinucleotides, or of all possible data sets of a given size, sum to 1 : 
• Let A=0.1; C=0.4; G=0.2; T=0.3 
• => L(A)+ L(C)+ L(G)+ L(T)=1 
=> L(AA)+ L(AC)+ L(AG)+ L(AT)+ L(CA)+ L(CC)+ L(CG)+ 
L(CT)+ L(GA)+ L(GC)+ L(GG)+ L(GT)+ L(TA)+ L(TC)+ 
L(TG)+ L(TT)=1 
=> The probability that some data exist is 1 (and that’s reassuring)
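This property is easy to check numerically by enumerating every possible sequence of a given length. A short sketch, with a function name of our own choosing :

```python
from itertools import product

composition = {"A": 0.1, "C": 0.4, "G": 0.2, "T": 0.3}


def total_probability(length: int) -> float:
    """Sum the likelihoods of all 4**length possible sequences."""
    total = 0.0
    for word in product(composition, repeat=length):
        likelihood = 1.0
        for base in word:
            likelihood *= composition[base]
        total += likelihood
    return total


for length in (1, 2, 3):
    print(length, total_probability(length))
```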
Likelihood 
• Likelihood with DNA : 
– Nucleotide composition model : 
A=0.1; C=0.4; G=0.2; T=0.3 
– Nucleotide substitution model : 
A C G T 
A 0.976 0.01 0.007 0.007 
C 0.002 0.983 0.005 0.01 
G 0.003 0.01 0.979 0.007 
T 0.002 0.013 0.005 0.979 
– Tree model : one branch 
– Data : « A » and « C »
Likelihood 
• Likelihood with DNA : 
[Diagram : one branch leading from « A » to « C »] 
Likelihood of « A » = 0.1 
Likelihood of a change from A to C = 0.01 
Likelihood of our Data given the Model = 0.1 * 0.01 = 0.001
Likelihood 
• Let’s try with a more complicated alignment : 
CCAT CCGT 
Likelihood of our Data given the Model = 
0.4 * 0.983 * 0.4 * 0.983 * 0.1 * 0.007 * 0.3 * 0.979 = 
0.0000300 
But this is for a branch length of arbitrary unit 1
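The calculation on the last two slides can be sketched as follows : for each site, multiply the frequency of the starting nucleotide by the probability of the observed change (or absence of change) along the branch. The row/column order A, C, G, T is taken from the matrix slide; the function name is ours. Note that with the rounded matrix entries printed on the slides the product for CCAT / CCGT comes out at about 0.0000318, slightly above the 0.0000300 shown, presumably because the slide was computed with unrounded values.

```python
BASES = "ACGT"
PI = {"A": 0.1, "C": 0.4, "G": 0.2, "T": 0.3}  # nucleotide composition

# One-unit substitution matrix as printed on the slide (rows: from, columns: to).
P = [[0.976, 0.010, 0.007, 0.007],
     [0.002, 0.983, 0.005, 0.010],
     [0.003, 0.010, 0.979, 0.007],
     [0.002, 0.013, 0.005, 0.979]]


def branch_likelihood(start: str, end: str, matrix=P) -> float:
    """Likelihood of two aligned sequences joined by a single branch."""
    likelihood = 1.0
    for a, b in zip(start, end):
        likelihood *= PI[a] * matrix[BASES.index(a)][BASES.index(b)]
    return likelihood


print(branch_likelihood("A", "C"))
print(branch_likelihood("CCAT", "CCGT"))
```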
Likelihood 
• We can get the probabilities for longer branches 
by multiplying the matrix by itself : 
P = 
0.976 0.01 0.007 0.007 
0.002 0.983 0.005 0.01 
0.003 0.01 0.979 0.007 
0.002 0.013 0.005 0.979 
P^2 = 
0.953 0.02 0.013 0.015 
0.005 0.966 0.01 0.02 
0.007 0.02 0.959 0.015 
0.005 0.026 0.01 0.959
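A minimal sketch of this matrix multiplication in plain Python (helper names are ours). Because the entries printed on the slide are rounded to three decimals, the result agrees with the squared matrix above only to within about 0.001 per entry.

```python
P = [[0.976, 0.010, 0.007, 0.007],
     [0.002, 0.983, 0.005, 0.010],
     [0.003, 0.010, 0.979, 0.007],
     [0.002, 0.013, 0.005, 0.979]]


def matmul(a, b):
    """Plain matrix product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matrix_power(matrix, steps: int):
    """Multiply the matrix by itself `steps` times (steps >= 0)."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(steps):
        result = matmul(result, matrix)
    return result


for row in matrix_power(P, 2):
    print(" ".join(f"{value:.3f}" for value in row))
```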
Likelihood 
• We can calculate the likelihood of D given M 
under different branch lengths : 
Branch length likelihood 
1 0.0000300 
2 0.0000559 
3 0.0000782 
10 0.000162 
15 0.000177 
20 0.000175 
30 0.000152
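The table can be approximated by raising the one-unit matrix to the power of the branch length and recomputing the likelihood of CCAT / CCGT each time. This sketch uses the rounded matrix entries printed on the slides, so the numbers come out a few percent higher than in the table; the overall shape (the likelihood rises, then falls again for very long branches) is what matters. Function names are ours.

```python
BASES = "ACGT"
PI = {"A": 0.1, "C": 0.4, "G": 0.2, "T": 0.3}
P = [[0.976, 0.010, 0.007, 0.007],
     [0.002, 0.983, 0.005, 0.010],
     [0.003, 0.010, 0.979, 0.007],
     [0.002, 0.013, 0.005, 0.979]]


def matmul(a, b):
    """Plain matrix product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def likelihood_at(branch_length: int, start: str = "CCAT", end: str = "CCGT") -> float:
    """Likelihood of the two sequences for a branch of `branch_length` units (>= 1)."""
    p_t = P
    for _ in range(branch_length - 1):
        p_t = matmul(p_t, P)
    likelihood = 1.0
    for a, b in zip(start, end):
        likelihood *= PI[a] * p_t[BASES.index(a)][BASES.index(b)]
    return likelihood


for t in (1, 2, 3, 10, 15, 20, 30):
    print(t, f"{likelihood_at(t):.3g}")
```

Optimizing a branch length in a maximum likelihood program amounts to finding the top of this curve.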
Likelihood 
• What do the programs do ? 
– Programs : PAUP, PAML, IQPNNI 
– General scheme : 
• Generate a starting tree (Neighbor-joining) 
• Calculate its likelihood 
• Optimize the branch lengths 
• Perform a tree rearrangement (NNI, SPR, TBR) 
• Evaluate the new tree 
– No guarantee of finding the single best tree, but the search usually finds a 
nearly-optimal one
Bayesian Trees 
• Are we happy with one tree ? 
– Since we cannot test all trees to find the best, we 
need to find a nearly-optimal one, but how good is a 
nearly optimum ? 
– We would like to find a collection of reasonably good 
trees and evaluate these 
– Probabilistic view : what is the probability of the Data 
given the Model 
– Bayesian view : given the Data, what is the probability 
of the Model ?
Bayesian Trees 
• Example : red and black balls in an urn 
We know p=0.5 
What is the chance of drawing exactly 
4 black and 6 red ? 
We can solve this with a binomial 
distribution 
[Plot : binomial distribution, x-axis from 1 to 10] 
P(D given M) = 20.5%
Bayesian Trees 
• Example : red and black balls in an urn 
We do not know p 
We actually draw exactly 4 black and 
6 red 
What is p ? With what confidence ? 
We draw 10 more balls. They are all 
black. How does the new 
information change our estimate ?
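One way to answer these questions is the conjugate Beta-binomial update, sketched below. This is our illustration, not necessarily the calculation behind the slides : it assumes that p is the proportion of black balls and that we start from a flat Beta(1, 1) prior. Each black ball adds 1 to the first Beta parameter and each red ball adds 1 to the second.

```python
from math import sqrt


def update_beta(alpha: float, beta: float, black: int, red: int):
    """Posterior Beta parameters after observing `black` and `red` balls."""
    return alpha + black, beta + red


def beta_mean(alpha: float, beta: float) -> float:
    return alpha / (alpha + beta)


def beta_sd(alpha: float, beta: float) -> float:
    total = alpha + beta
    return sqrt(alpha * beta / (total ** 2 * (total + 1.0)))


prior = (1.0, 1.0)                                         # flat prior (assumption)
after_first = update_beta(*prior, black=4, red=6)          # Beta(5, 7)
after_second = update_beta(*after_first, black=10, red=0)  # Beta(15, 7)

for label, params in (("prior", prior), ("4 black, 6 red", after_first),
                      ("10 more black", after_second)):
    print(label, ": mean", round(beta_mean(*params), 3),
          "sd", round(beta_sd(*params), 3))
```

The estimate of p moves towards the black side after the second draw, and its standard deviation shrinks as data accumulate.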
Bayesian Trees 
• Bayesian phylogenetics : the prior 
– We need a prior assumption, which can be strong or not. It 
represents our A PRIORI HYPOTHESIS ABOUT THE 
PHENOMENON. It needs to be formulated as a prior 
distribution of the possible values of the parameters 
– Our inference will depend on both the prior and the data. The 
more data we possess, the less the final result will depend on the 
prior
Bayesian Trees 
• Bayesian phylogenetics : example priors 
– 1) « I have no idea whatsoever what the proportion of the red and black 
balls could be » 
[Plot : prior distribution of p, x-axis from 0 to 1]
Bayesian Trees 
• Bayesian phylogenetics : example priors 
– 2) « I am nearly sure the proportion of red and black balls is 0.5 » 
[Plot : prior distribution of p, x-axis from 0 to 1]
Bayesian Trees 
• Bayesian phylogenetics : example priors 
– 3) « I know the balls in the urn are sampled from a big population of 
balls, and this global population has a p of 0.5 » 
[Plot : prior distribution of p, x-axis from 0 to 1]
Bayesian Trees 
• Bayesian phylogenetics : example priors 
– 4) « I know the proportion of balls is around 0.82, and deviations are 
more likely to be in high values than low ones. I also know that values 
below 0.5 are impossible for some reason » 
[Plot : prior distribution of p, x-axis from 0 to 1]
Bayesian Trees 
• Bayesian phylogenetics : the posterior 
– We get as a result a posterior distribution of the possible values 
of p 
– From this distribution (probability density function) we can 
directly read the confidence we have in our estimate, expressed 
as the 95% Highest Posterior Density interval (95% HPD) 
=> We do not need to subsample (bootstrap) 
We can directly see the uncertainty in the data
Bayesian Trees 
• Bayesian phylogenetics : example posteriors 
– 1) « I know the balls in the urn are sampled from a big population of 
balls, and this global population has a p of 0.5 ». Data = 6 blacks, 4 reds 
(orange) or 60 blacks, 40 reds (purple) 
[Plot : prior and posterior distributions of p, x-axis from 0 to 1; legend : Data, Prior]
Bayesian Trees 
• Bayesian phylogenetics : example posteriors 
– 2) « I am nearly sure the proportion of red and black balls is 0.5 ». Data 
= 6 blacks, 4 reds (orange) or 60 blacks, 40 reds (purple) 
[Plot : prior and posterior distributions of p, x-axis from 0 to 1; legend : Data, Prior]
Bayesian Trees 
• How to deal with phylogenetic trees ? 
[Bar chart : probability p (0 to 1) of topologies 1, 2 and 3] 
Prior assumption : no topology is preferred
Bayesian Trees 
• After analysis of the data 
[Bar chart : probability p (0 to 1) of topologies 1, 2 and 3] 
Posterior : one of the topologies is preferred
Bayesian Trees 
• Markov Chains : how to find the top of the mountain if you 
are blind ? (the Metropolis-Hastings algorithm) 
– MC are algorithms that will start at some arbitrary point in the parameter 
space and make a small random step 
– The step is accepted with a high probability if it increases the posterior 
probability, low otherwise 
– If the step is rejected, the chain returns to its previous state and tries a 
new random move
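The three steps above can be sketched on the urn example (6 black, 4 red, flat prior on p). This is a minimal illustration with a symmetric random-walk proposal, i.e. the Metropolis special case of Metropolis-Hastings; the step size, seed, chain length and burn-in are arbitrary choices of ours.

```python
import math
import random


def log_posterior(p: float, black: int = 6, red: int = 4) -> float:
    """Unnormalised log posterior of p under a flat prior on (0, 1)."""
    if not 0.0 < p < 1.0:
        return float("-inf")
    return black * math.log(p) + red * math.log(1.0 - p)


def metropolis(n_steps: int, step_size: float = 0.2, start: float = 0.5, seed: int = 1):
    rng = random.Random(seed)
    current, current_lp = start, log_posterior(start)
    samples = []
    for _ in range(n_steps):
        # Small random step from the current position.
        proposal = current + rng.uniform(-step_size, step_size)
        proposal_lp = log_posterior(proposal)
        # Uphill moves are always accepted, downhill moves with
        # probability equal to the ratio of posterior densities.
        accept_probability = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_probability:
            current, current_lp = proposal, proposal_lp
        # If rejected, the chain stays where it was.
        samples.append(current)
    return samples


samples = metropolis(20000)
kept = samples[2000:]  # discard the first 10% as burn-in
posterior_mean = sum(kept) / len(kept)
print("posterior mean of p :", round(posterior_mean, 3))
```

For these data the exact posterior is Beta(7, 5) with mean 7/12, so the sampled mean should land close to 0.58.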
Bayesian Trees 
• Markov Chains : how to find the top of the mountain if you 
are blind ? 
New altitude is higher : accept always 
New altitude is lower : accept rarely
Bayesian Trees 
• Markov Chains : how to avoid getting stuck in local 
maxima ? (Metropolis Coupling) 
=> Run multiple « heated » chains and jump from one to another
Bayesian Trees 
• Markov Chains : how to get the posterior distribution 
(Monte-Carlo sampling)
Bayesian Trees 
• MCMCMC : what makes a good run ? (a hairy caterpillar) 
– No trend in the stable state (convergence) 
• Sufficient run time 
• Good data 
– Good mixing in order to have a good sample from the posterior 
(adequacy of the steps) 
• If steps are too close to the current state : always accepted 
• If steps are too different : too often rejected 
• Best mixing is at intermediate values
Bayesian Trees 
• MCMCMC : what makes a good run ? (a hairy caterpillar) 
– 1) Steps too small
Bayesian Trees 
• MCMCMC : what makes a good run ? (a hairy caterpillar) 
– 2) Steps too big
Bayesian Trees 
• MCMCMC : what makes a good run ? (a hairy caterpillar) 
– 3) Steps of acceptable length
Bayesian Trees 
• How to set up a Bayesian phylogenetic analysis ? 
2 main programs : 
• MrBayes : command-line 
• BEAST : User interface
Molecular Clocks 
Global clock No clock 
Local clock Relaxed clock
BEASTING 
Flexible package : 
- BEAUti : to make the xml files 
- BEAST : actual calculations 
Metropolis – Hastings MCMC sampling 
- Tracer : to analyse results 
- TreeAnnotator : to sum trees
BEASTING 
BEAUti : 
Allows dated tips via the Guess dates function 
A priori sub-taxa 
Setting a model : 
- Substitution model (HKY, GTR, TN93) 
- Site heterogeneity (gamma distribution) 
- codon partition : (1, 2, 3) (1+2, 3) or none 
- codon position independence : model parameters and/or rates 
- Molecular clock : from strict to uncorrelated relaxed
BEASTING 
Setting the model priors : 
Difficult task, try several approaches; usually the defaults are good enough 
If the data are good, the priors do not matter… 
Setting the operators and the tuning parameters : 
Determines the « moves » made by the chain 
→ auto-optimize function 
MCMC settings : 
- chain length 10 000 000, sampling every 1000 states is a good default 
- in any case we want about 10 000 points
BEASTING 
Tracer : result analysis 
1) chain convergence diagnostic 
Bad convergence
BEASTING 
Tracer : result analysis 
1) chain convergence diagnostic 
Good convergence 
Sampling from the 
converged chain 
provides the 
posterior 
distribution of the 
parameters
BEASTING 
Auto-correlation time (ACT) measures the independence of the samples 
Effective sample size (ESS) = chain length/ACT 
Rule of thumb : ESS < 100 is very bad 
ESS > 200 is good 
Both chain convergence and ESS have to be checked for each parameter 
Depending on the problem : 
- increase chain length and decrease sampling frequency 
- increase burn-in 
- question models, alignment, priors, tuning
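To make the ESS idea concrete, here is a rough sketch that estimates the auto-correlation time of a trace from its sample autocorrelations and divides the number of samples by it. Here the ACT is expressed in logged samples rather than chain states. This is a simple textbook-style estimator written for illustration; the exact estimator used by Tracer may differ.

```python
import random


def autocorrelation(trace, lag: int) -> float:
    """Sample autocorrelation of the trace at the given lag."""
    n = len(trace)
    mean = sum(trace) / n
    variance = sum((x - mean) ** 2 for x in trace) / n
    covariance = sum((trace[i] - mean) * (trace[i + lag] - mean)
                     for i in range(n - lag)) / n
    return covariance / variance


def effective_sample_size(trace) -> float:
    """ESS = number of samples / ACT, with ACT = 1 + 2 * sum of the
    autocorrelations up to the first non-positive one."""
    n = len(trace)
    act = 1.0
    for lag in range(1, n // 2):
        rho = autocorrelation(trace, lag)
        if rho <= 0.0:
            break
        act += 2.0 * rho
    return n / act


rng = random.Random(0)
independent = [rng.gauss(0.0, 1.0) for _ in range(2000)]

correlated, x = [], 0.0
for _ in range(2000):
    x = 0.9 * x + rng.gauss(0.0, 1.0)  # each sample strongly depends on the previous one
    correlated.append(x)

ess_independent = effective_sample_size(independent)
ess_correlated = effective_sample_size(correlated)
print("independent samples : ESS ~", round(ess_independent))
print("correlated samples  : ESS ~", round(ess_correlated))
```

Both traces contain 2000 samples, but the correlated one carries far less information, which is exactly what a low ESS in Tracer is warning about.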
BEASTING 
Combining runs : 
- different runs with the same xml file 
- longer runs, different burn-in or sampling 
- different priors given to the parameters 
- different operators 
- LogCombiner even allows resampling from the chain 
It does not make sense to combine runs from different data sets or (even 
slightly) different models
BEASTING 
TreeAnnotator output :
Model Selection
Model Selection : LRTs 
Likelihood Ratio Tests are a standard way to compare the 
fit of two models : 
LRT = 2 (logL1 – logL0) 
When the models are nested, this statistic is approximately 
distributed as a χ2 distribution 
- only for nested models (H0 is a special case of H1) 
- d.f. = difference in the number of parameters between the 2 models 
- advantage : statistically sound 
- disadvantage : the χ2 distribution only holds for enough data
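A minimal sketch of such a test. The log-likelihood values below are hypothetical, chosen only for illustration; the comparison is HKY (H0) against GTR (H1), which differ by 4 free parameters. To stay free of external libraries, the statistic is compared with the standard 5% critical values of the χ2 distribution instead of computing a p-value.

```python
# 5% critical values of the chi-square distribution for 1 to 5 degrees of freedom.
CHI2_CRITICAL_5PCT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def likelihood_ratio_test(log_l0: float, log_l1: float, extra_parameters: int):
    """Return the LRT statistic and whether H0 is rejected at the 5% level."""
    statistic = 2.0 * (log_l1 - log_l0)
    return statistic, statistic > CHI2_CRITICAL_5PCT[extra_parameters]


# Hypothetical log-likelihoods for the same alignment under two nested models.
log_l_hky = -2345.6  # H0 : simpler model
log_l_gtr = -2338.1  # H1 : more general model

statistic, reject_h0 = likelihood_ratio_test(log_l_hky, log_l_gtr, extra_parameters=4)
print("LRT =", round(statistic, 2), "-> reject H0 :", reject_h0)
```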
Model Selection : AIC 
Akaike information criterion is another goodness-of-fit 
measure 
AIC = 2k – 2 logL(Data, Model) 
where k is the number of parameters in the model 
- valid for non-nested models 
- a sort of likelihood that is « penalized » for using too many 
parameters 
- more criteria exist (AICc, BIC, etc…)
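The AIC formula is simple enough to apply by hand. In the sketch below the log-likelihoods and parameter counts are hypothetical values used only for illustration; the model with the lowest AIC is preferred.

```python
def aic(log_likelihood: float, n_parameters: int) -> float:
    """Akaike information criterion : AIC = 2k - 2 logL."""
    return 2.0 * n_parameters - 2.0 * log_likelihood


# Hypothetical (logL, number of parameters) pairs for three candidate models.
candidates = {
    "JC": (-2400.0, 0),
    "HKY": (-2345.6, 4),
    "GTR": (-2338.1, 8),
}

scores = {name: aic(log_l, k) for name, (log_l, k) in candidates.items()}
best = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(name, round(score, 1))
print("preferred model :", best)
```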
Bayes Factors 
BF = Marginal Likelihood Model 1 / Marginal Likelihood Model 2 
log BF = log-L1 – log-L2 (difference of the log marginal likelihoods) 
In practice, the marginal likelihoods are difficult to compute, but are 
approximated by taking the harmonic mean of the likelihoods sampled from the posterior (might 
be better in later versions of BEAST) 
Rule of thumb : 
BF < 3 : no significant improvement with model 1 compared with 
model 2 
3 < BF < 10 : significant improvement with model 1 
BF > 10 : decisive improvement with model 1 
Big advantage of Bayes Factors compared to LRTs : direct comparison of 
two non-nested models
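The harmonic mean approximation can be sketched in log space, which avoids numerical underflow with very small likelihoods. The log-likelihood samples below are hypothetical stand-ins for the likelihood column of two .log files; the function name is ours. The harmonic mean estimator is known to be unstable, so treat this as an illustration of the arithmetic only.

```python
import math


def log_harmonic_mean(log_likelihoods) -> float:
    """Log of the harmonic mean of the likelihoods, computed from their logs :
    log HM = log(n) - log(sum(exp(-logL_i)))."""
    n = len(log_likelihoods)
    negated = [-value for value in log_likelihoods]
    largest = max(negated)
    # log-sum-exp trick : factor out the largest term before exponentiating.
    log_sum = largest + math.log(sum(math.exp(value - largest) for value in negated))
    return math.log(n) - log_sum


# Hypothetical posterior samples of the log-likelihood under two models.
model1_samples = [-2338.4, -2339.1, -2337.8, -2340.2, -2338.9]
model2_samples = [-2345.0, -2346.3, -2344.7, -2347.1, -2345.5]

log_bf = log_harmonic_mean(model1_samples) - log_harmonic_mean(model2_samples)
print("log BF (model 1 vs model 2) =", round(log_bf, 2))
```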
Model Selection : Topali 
Click « Run » : runs remotely 
CTRL-click « Run » : runs locally
BEASTing : MCMC robot 
http://hydrodictyon.eeb.uconn.edu/people/plewis/software.php
Model Selection 
10 000 samples
Beasting 
.fst -> BEAUti -> .beast (.xml) -> .log + .trees 
.log -> Tracer -> Parameters of interest 
.trees -> TreeAnnotator -> Tree
Beasting : practicals 
The file « FPV.fas » contains an alignment of 7 feline papilloma virus sequences. Our goal is to 
determine the rate of evolution of each lineage based on the divergence dates of their host species. 
We have some estimates of the speciation times from the literature. 
The file « RSVA.fas » contains an alignment of 35 human respiratory syncytial virus sequences, group A. Our goal is 
to estimate the rate of evolution and the date of the most recent common ancestor, and to get a good tree with 
measures of statistical support. 
The file « Flu.fas » contains an alignment of 21 Southeast Asian H5N1 sequences, from 
different hosts. Our goal is to reconstruct the history of population parameters to draw 
epidemiological conclusions. 
1) Open Topali. Load the alignment « FPV.fas ». Go to the « Analysis » menu and select 
« Phylogenetics/Model Selection ». Set up two model selection runs : one with PhyML and one with 
MrBayes (these will perform the model evaluation using likelihood-based methods and Bayesian 
methods, respectively). Keep the default values and click « Run ». The analyses are run remotely on 
the Scottish Crop Research Institute HPC cluster. One can also CTRL-click on « Run » to run locally. 
2) Open JModelTest. Load the file « FPV.fas » and go to the « Analysis » menu. Go to « Compute 
Likelihood Scores », keep the default values and click « Compute ». Admire the different models 
being evaluated. When the computations are finished, go to the « Analysis » menu, and click « Do 
AIC calculations ». Then go to the « Results » menu and get results. 
3) Compare the results given with these different methods. Does it make sense ? Which model is the 
most appropriate for further analysis ? 
4) Do the same with the files « RSVA.fas » and « Flu.fas ».
Beasting : practicals 
5) Open BEAUti. Load the file « FPV.fas ». Now we will define the calibration nodes based on fossil 
data. 
6) Open the « Taxon » tab. Click the little « + » button and create a new taxon containing the FelixPV1, 
LynxPV1, and PumaPV1 sequences. Call it « Felis/Lynx/Puma » 
7) Repeat the procedure with the SnowLeopardPV1 and AsianLionPV1 sequences. Call this taxon 
« Lion/Leopard » 
8) Repeat once again with all sequences except the RacoonPV1 and CanineOralPV1. Call this taxon 
« Cats » 
9) These are contemporary sequences, therefore no action is taken in the « Tip dates » tab. 
10) Open the « Site models » tab. Based on our model selection, we will choose the model « GTR », set 
the base frequencies to « All equal » and include a « Gamma » site heterogeneity model to take into 
account rate variation between sites. 
11) In the « Clock models » tab, select « Uncorrelated Lognormal ». Check the « Estimate » 
checkbox (otherwise the clock rate is calculated from the sample dates, but in this analysis we do not 
have any dated sample) 
12) In the « Trees » tab, select « Speciation, Yule Process ». 
13) Now the cool part : go to the « Priors » tab and select the following dates for the most recent 
ancestors of our taxa : 
- Felis/Lynx/Puma : normal prior (7.15 My +/- 1.36 My) which gives a 95% CI of 4.5 - 9.8 My, 
as indicated by the literature (Rector et al., 2007) 
- Lion/Leopard : normal prior (3.72 My +/- 1.05 My) which gives a 95% CI of 1.6 - 5.8 My 
- We will not put an informative prior for the Cats taxon. 
14) In the « MCMC » tab, set the chain length to 1000000 (one million) and log parameters every 100 
states. Generate the .xml file. When the program warns you about inappropriate priors, set them 
either to their default value if possible by clicking OK, or change the « +INF » to « MAX_VALUE ».
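The 95% intervals quoted for the calibration priors in step 13 can be checked with the usual normal approximation, mean +/- 1.96 standard deviations. This sketch assumes that the « +/- » values are standard deviations, which is our reading of the slide.

```python
def normal_95_interval(mean: float, sd: float):
    """Central 95% interval of a normal distribution (mean +/- 1.96 sd)."""
    half_width = 1.96 * sd
    return mean - half_width, mean + half_width


calibrations = {
    "Felis/Lynx/Puma": (7.15, 1.36),
    "Lion/Leopard": (3.72, 1.05),
}

for taxon, (mean, sd) in calibrations.items():
    low, high = normal_95_interval(mean, sd)
    print(f"{taxon} : {low:.2f} - {high:.2f} My")
```

The results (about 4.48 - 9.82 My and 1.66 - 5.78 My) match the intervals given above up to rounding.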
Beasting : practicals 
15) Open BEAST. Select as input file the .xml file you just generated. Click « Run » and wait a few 
minutes. This procedure will produce a .log file and a .trees file. Open Tracer and import the .log file 
16) Look at the trace of the posterior likelihood. Does it look ok (remember the hairy caterpillar) ? 
17) Look at the nucleotide substitution rates. Why do they look like this ? Switch from the « trace » tab to 
the « Estimates » tabs for the different parameters, and get a feeling of how good the ESS is as an 
indicator of the confidence you can have in the estimates of the parameters. What do you 
recommend ? 
18) Check the « treeModel.rootHeight » parameter. How old is the root of the tree ? 
19) Check the « meanRate » parameter. What is the rate of molecular evolution of FPV ? Include units. 
20) Check the « CoefficientOfVariation » parameter. Does the rate of evolution differ statistically among 
lineages ? Is this a good model ? What do you recommend ? 
21) Open the TreeAnnotator program. Select the .trees file you generated with BEAST as input and fill a 
name like « FPV_MCC.tree » for output. 
22) As burn-in, we want to discard 10% of our run. We have 10000 trees so we enter « 1000 » in the 
« Burnin » field. 
23) Set the Posterior probability limit to 0. Like this we will annotate all nodes, not only the ones above a 
certain threshold. 
24) The default « Maximum Clade Credibility Tree » option will find the sampled tree with the highest product of posterior clade 
probabilities and annotate it according to the other trees in the .trees file. Select « Mean Heights » and run. 
25) Once done, you can open the tree with the FigTree program. Try to get error bars on the nodes, 
annotate the branches with the posterior probability and color the branches by evolutionary rate. 
26) Which branch has the fastest rate of evolution ? 
27) For comparison, examine the file « FPV-max.fas.log.txt », which was run for 100 times longer.
Beasting : practicals 
28) Now we will examine the RSVA data set. Open BEAUti and load the « RSVA.fas » file. The names of 
the sequences contain the date of sampling. In the « Tip dates » tab, check the « Use tip dates » 
checkbox, click on « Guess dates » and find a way to extract the year of sampling. Ultimately we 
want dates ranging from 1956 to 2002. 
29) In the « Site Models » tab, select « GTR » for the model. Select « Gamma » for site heterogeneity 
and select « 3 partitions » to let each codon position have its own rate of evolution. 
30) We will assume a strict molecular clock and a constant population size. Select a chain length of 
1000000 (one million) and sample every 100 states. Deal with the prior settings as before and 
generate the .xml file. Run it with BEAST. 
31) Open the .log file with Tracer. Examine the posterior trace. Notice that the samples are highly 
correlated; the resulting low ESS indicates that our effective sample size is small, and that we cannot trust the 
estimates of the other parameters. What are the options at this point ? 
32) Return to BEAUti and change the model to « TN93 ». Given the model comparison we made, how 
big an approximation is this (check with Topali) ? Run this file and compare the results of the two 
models (you can select multiple files and multiple parameters in Tracer) 
33) Check the mutation rates for the different codon positions. Do this by control-clicking the CP1.mu, 
CP2.mu and CP3.mu parameters. Select the « Marginal density » tab. What do you observe ? Is this 
logical ? What could be the explanation ? 
34) In what year did the common ancestor of all RSVA viruses live ? What is the rate of evolution of 
RSVA ? Include units. 
35) Use TreeAnnotator to summarize and annotate the trees as before.
Beasting : practicals 
36) Now we will analyse the H5N1 data set. Open BEAUti and load the « flu.fas » file. In the « Tip 
Dates » tab use the « Guess Dates » tool to set the sampling date for each sequence. 
37) In the « Site Models » tab, leave « HKY » as substitution model. From the model comparison we 
made in Topali, do you think this is very wrong ? Select « 3 partitions », but leave « None » as site 
heterogeneity model. 
38) In the « Trees » tab select « Bayesian Skyline » and reduce the number of bins from 10 to 5, 
because we have only 21 sequences and we do not want to overparameterize the model. 
39) In the « MCMC » tab set the chain length to 5000000 (5 million) and the sampling frequency to 500. 
Deal with the priors as we are used to and generate the .xml file. Run it with BEAST. 
40) Once finished, open the .log file with Tracer. Check the trace of the posterior for convergence and 
mixing. Does it look right ? 
41) Check the mutation rates for the different codon positions. Do this by control-clicking the CP1.mu, 
CP2.mu and CP3.mu parameters. Select the « Marginal density » tab. What do you observe ? What 
does this tell you about the selective forces acting on this virus ? What would you change in the 
settings for further analysis ? Do you understand why we did not select a site heterogeneity model 
but rather accounted for 3 codon position dependent rates ? 
42) In the « Analysis » menu, click « Bayesian skyline reconstruction ». Select the « flu.fas.trees.txt » file. 
Leave the default options, enter « 2005 » for the age of the youngest tip and click OK. What can you 
say about the effective population size of H5N1 over time ? What assumptions have been made in 
the model to generate this graph ? How confident are you with these ? 
43) Use TreeAnnotator to summarize and annotate the trees as before. 
44) Enjoy your newly acquired proficiency in Bayesian phylogenetics.

More Related Content

PPTX
Algorithm research project neighbor joining
PPT
Phylogenetics1
PPTX
PPTX
PPTX
Upgma
PPT
Presentation 2009 Journal Club Azhar Ali Shah
PPTX
Tree building
PPT
Prediction of transcription factor binding to DNA using rule induction methods
Algorithm research project neighbor joining
Phylogenetics1
Upgma
Presentation 2009 Journal Club Azhar Ali Shah
Tree building
Prediction of transcription factor binding to DNA using rule induction methods

Viewers also liked (13)

PPTX
2015 bioinformatics phylogenetics_wim_vancriekinge
PDF
Introduction to Probabilistic Models for Bioinformatics
PDF
BIS2C. Biodiversity and the Tree of Life. 2014. L4. Inferring Phylogenetic Trees
PDF
Evolution lectures1&2 2012 slideshare
PPT
Phylogeny
PPS
Phylogenetic tree
PPT
What is a phylogenetic tree
PPT
Chapter 18 Lecture- Classification
PPT
Phylogenetic analysis
PPT
Phylogenetic trees
PDF
Phylogenetics Analysis in R
PPTX
Parsimony analysis
2015 bioinformatics phylogenetics_wim_vancriekinge
Introduction to Probabilistic Models for Bioinformatics
BIS2C. Biodiversity and the Tree of Life. 2014. L4. Inferring Phylogenetic Trees
Evolution lectures1&2 2012 slideshare
Phylogeny
Phylogenetic tree
What is a phylogenetic tree
Chapter 18 Lecture- Classification
Phylogenetic analysis
Phylogenetic trees
Phylogenetics Analysis in R
Parsimony analysis
Ad

Similar to Phylogenetics2 (20)

PPTX
Lec13_Bayes.pptx
PPTX
The application of artificial intelligence
PPTX
Supervised learning: Types of Machine Learning
PPT
Data mining maximumlikelihood
PPT
Data mining maximumlikelihood
PPT
Data mining maximumlikelihood
PPT
Data mining maximumlikelihood
PPT
Data mining maximumlikelihood
PPT
Data mining maximumlikelihood
PPT
Data miningmaximumlikelihood
PPTX
Artificial Intelligence Notes Unit 3
PPT
Artificial Intelligence Bayesian Reasoning
PPT
Earthquake dhnjggbnkkkknvcxsefghjk gyjhvcdyj
PPT
8. Hypothesis Testing.ppt
PDF
Subject-2---Unidimensional-Data-2024.pdf
PPTX
artificial intelligence and uncertain reasoning
PDF
Statistical thinking
PDF
SPIE Conference V3.0
PPT
Introduction to Bayesian Statistics.ppt
PPTX
probability.pptx
Lec13_Bayes.pptx
The application of artificial intelligence
Supervised learning: Types of Machine Learning
Data mining maximumlikelihood
Data mining maximumlikelihood
Data mining maximumlikelihood
Data mining maximumlikelihood
Data mining maximumlikelihood
Data mining maximumlikelihood
Data miningmaximumlikelihood
Artificial Intelligence Notes Unit 3
Artificial Intelligence Bayesian Reasoning
Earthquake dhnjggbnkkkknvcxsefghjk gyjhvcdyj
8. Hypothesis Testing.ppt
Subject-2---Unidimensional-Data-2024.pdf
artificial intelligence and uncertain reasoning
Statistical thinking
SPIE Conference V3.0
Introduction to Bayesian Statistics.ppt
probability.pptx
Ad

Recently uploaded (20)

PPTX
importance of Data-Visualization-in-Data-Science. for mba studnts
PDF
Optimise Shopper Experiences with a Strong Data Estate.pdf
PDF
Microsoft Core Cloud Services powerpoint
PDF
Data Engineering Interview Questions & Answers Batch Processing (Spark, Hadoo...
PPTX
retention in jsjsksksksnbsndjddjdnFPD.pptx
PPTX
IBA_Chapter_11_Slides_Final_Accessible.pptx
PPTX
Topic 5 Presentation 5 Lesson 5 Corporate Fin
PPTX
SAP 2 completion done . PRESENTATION.pptx
PPTX
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
PPTX
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
PPTX
Qualitative Qantitative and Mixed Methods.pptx
PPTX
Pilar Kemerdekaan dan Identi Bangsa.pptx
PDF
Capcut Pro Crack For PC Latest Version {Fully Unlocked 2025}
PDF
OneRead_20250728_1808.pdfhdhddhshahwhwwjjaaja
PDF
Introduction to the R Programming Language
PDF
Introduction to Data Science and Data Analysis
PDF
Votre score augmente si vous choisissez une catégorie et que vous rédigez une...
PPT
lectureusjsjdhdsjjshdshshddhdhddhhd1.ppt
PDF
Systems Analysis and Design, 12th Edition by Scott Tilley Test Bank.pdf
PDF
annual-report-2024-2025 original latest.
importance of Data-Visualization-in-Data-Science. for mba studnts
Optimise Shopper Experiences with a Strong Data Estate.pdf
Microsoft Core Cloud Services powerpoint
Data Engineering Interview Questions & Answers Batch Processing (Spark, Hadoo...
retention in jsjsksksksnbsndjddjdnFPD.pptx
IBA_Chapter_11_Slides_Final_Accessible.pptx
Topic 5 Presentation 5 Lesson 5 Corporate Fin
SAP 2 completion done . PRESENTATION.pptx
mbdjdhjjodule 5-1 rhfhhfjtjjhafbrhfnfbbfnb
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
Qualitative Qantitative and Mixed Methods.pptx
Pilar Kemerdekaan dan Identi Bangsa.pptx
Capcut Pro Crack For PC Latest Version {Fully Unlocked 2025}
OneRead_20250728_1808.pdfhdhddhshahwhwwjjaaja
Introduction to the R Programming Language
Introduction to Data Science and Data Analysis
Votre score augmente si vous choisissez une catégorie et que vous rédigez une...
lectureusjsjdhdsjjshdshshddhdhddhhd1.ppt
Systems Analysis and Design, 12th Edition by Scott Tilley Test Bank.pdf
annual-report-2024-2025 original latest.

Phylogenetics2

  • 1. Phylogenetics Workshop Part II : Model-based tree building De Landtsheer Sébastien, University of Luxemburg Ahead of the BeNeLux Bioinformatics Conference 2011
  • 2. Outline of the Workshop Part I : • General introduction • Alignments • Distance-based methods Part II : • Maximum likelihood trees • Bayesian trees Part III : • Advanced bayesian phylogenetics • Hypothesis testing
  • 3. Outline of the Workshop Programs featured : - Topali (http://www.topali.org/download.shtml) - ModelTest in Java GUI (http://darwin.uvigo.es/software/jmodeltest.html) - BEAUti, BEAST and Tracer (http://beast.bio.ed.ac.uk/Main_Page) -
  • 4. Model-based tree building Reminder : the main problem in phylogenetics is the HUGE number of possible trees to choose from What makes a good tree ?
  • 5. Model-based tree building • We need to define an optimality criterion : – Parsimony (Occam’s razor) : • "Pluralitas non est ponenda sine necessitate“ • Keep things simple and avoid unnessessary explanations • The most plausible explanations are prefered – Likelihood : • Given the Model M, what is the probability P to observe the data D ?
  • 7. Parsimony • What tree represents the minimal number of changes along the phylogeny ? • Natural way to find the most plausible (less expensive) explanation • Example 1 : ATGTAACGTA ATGTCAGGTA ATGTCACGTT
  • 8. Parsimony Possible tree 1 : ACA C->A CCA T->A C->G CCT CGA ACT C->A 4 changes
  • 9. Parsimony Possible tree 2 : ACA No change ACA G->C A->T AGA ACT CGA A->C 3 changes
  • 10. Parsimony Parsimony methods will search for the most parsimonious tree using a simple scoring system. Two are commonly used : - equal : 0 1 1 1 - tv4 : 0 4 1 4 1 0 1 1 4 0 4 1 1 1 0 1 1 4 0 4 1 1 1 0 4 1 4 0
  • 11. Parsimony How to search the tree space ? - Brute force : trying all possible trees. Possible up to ~ 11 taxa on modern computers - Heuristic search : trying arrangements of the (so far) best tree by making tiny changes. If the new tree is better it becomes the (so far) best tree. If not it is rejected and another rearrangement is tried
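To see why brute force stops being practical so quickly, it helps to count the candidates. The sketch below uses the standard result that n labelled taxa admit (2n-5)!! distinct unrooted, fully bifurcating trees; the function name is only illustrative.

```python
def num_unrooted_trees(n_taxa: int) -> int:
    """Number of distinct unrooted, fully bifurcating trees for n_taxa labelled tips."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # (2n-5)!! is the product of the odd numbers 3, 5, ..., 2n-5
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 11, 20):
    print(n, "taxa:", num_unrooted_trees(n), "trees")
```

With 11 taxa there are already more than 34 million trees, which is roughly where exhaustive evaluation stops being comfortable.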
  • 12. Tree rearrangements • Nearest-Neighbor Interchange (NNI)
  • 13. Tree rearrangements • Subtree Pruning and Regrafting (SPR)
  • 14. Tree rearrangements • Tree Bisection and Reconnection (TBR)
  • 15. Likelihood • Probability to observe the data under the model – Data : already collected, won’t change, holy material – Model : idea about how things happened, with which we can do whatever we want : – Tree topology + branch lengths = « the tree » – Mechanism of sequence evolution = « the model » : - nucleotide composition - substitution matrix
  • 16. Likelihood • The coin-flipping analogy : – Experimental setting : flipping a coin once – Data : heads – Model 1 : fair coin; Likelihood = 0.5 – Model 2 : 2-headed coin; Likelihood = 1
  • 17. Likelihood • The coin-flipping analogy : – Experimental setting : flipping a coin 10 times – Data : h=6, t=4 – Model 1 : fair coin; Likelihood = 0.205 – Model 2 : 2-headed coin; Likelihood = 0 – Model 3 : unfair coin with p=0.6; Likelihood = 0.251
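The coin-flipping likelihoods can be reproduced with the binomial formula. This is a minimal sketch; the function name is an assumption made for illustration.

```python
from math import comb


def binomial_likelihood(heads: int, flips: int, p_heads: float) -> float:
    """Probability of observing `heads` heads in `flips` flips if P(heads) = p_heads."""
    return comb(flips, heads) * p_heads**heads * (1 - p_heads) ** (flips - heads)


# Data from the 10-flip slide: 6 heads, 4 tails
for label, p in (("fair coin", 0.5), ("two-headed coin", 1.0), ("unfair coin, p=0.6", 0.6)):
    print(f"{label}: {binomial_likelihood(6, 10, p):.3f}")
```

The model with p=0.6 has the highest likelihood here simply because it matches the observed frequency of heads.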
  • 18. Likelihood • Likelihood with DNA : – Let’s assume a nucleotide composition of : • A : 25% • C : 25% • G : 25% • T : 25% – What is the likelihood of the sequence « A » ? – What is the likelihood of the sequence « ATGTAACA »?
  • 19. Likelihood • Likelihood with DNA : – Let’s assume a nucleotide composition of : • A : 25% • C : 25% • G : 25% • T : 25% – What is the likelihood of the sequence « A » ? 25% – What is the likelihood of the sequence « ATGTAACA » ? 25%^8 = 0.00001526
  • 20. Likelihood • Likelihood with DNA : – Let’s assume a nucleotide composition of : • A : 50% • C : 10% • G : 10% • T : 30% – What is the likelihood of the sequence « A » ? – What is the likelihood of the sequence « ATGTAACA »?
  • 21. Likelihood • Likelihood with DNA : – Let’s assume a nucleotide composition of : • A : 50% • C : 10% • G : 10% • T : 30% – What is the likelihood of the sequence « A » ? 50% – What is the likelihood of the sequence « ATGTAACA »? 50%^4 * 10% * 10% * 30%^2 = 0.00005625
  • 22. Likelihood • Likelihood with DNA : – Under any model, the sum of the likelihoods of all nucleotides, or all dinucleotides, or all possible data is 1 : • Let A=0.1; C=0.4; G=0.2; T=0.3 • => L(A)+ L(C)+ L(G)+ L(T)=1 => L(AA)+ L(AC)+ L(AG)+ L(AT)+ L(CA)+ L(CC)+ L(CG)+ L(CT)+ L(GA)+ L(GC)+ L(GG)+ L(GT)+ L(TA)+ L(TC)+ L(TG)+ L(TT)=1 => The probability that some data exist is 1 (and that’s reassuring)
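Under a composition-only model, the likelihood of a sequence is the product of the frequencies of its nucleotides. The sketch below checks the two worked examples and the "sums to 1" property; function and variable names are illustrative.

```python
from itertools import product
from math import prod


def sequence_likelihood(seq: str, freqs: dict) -> float:
    """Likelihood of a sequence when each site is drawn independently from freqs."""
    return prod(freqs[base] for base in seq)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
skewed = {"A": 0.5, "C": 0.1, "G": 0.1, "T": 0.3}

print(sequence_likelihood("ATGTAACA", uniform))  # 0.25^8
print(sequence_likelihood("ATGTAACA", skewed))   # 0.5^4 * 0.1 * 0.1 * 0.3^2

# The likelihoods of all 16 possible dinucleotides add up to 1
freqs = {"A": 0.1, "C": 0.4, "G": 0.2, "T": 0.3}
total = sum(sequence_likelihood(a + b, freqs) for a, b in product("ACGT", repeat=2))
print(total)
```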
  • 23. Likelihood • Likelihood with DNA : – Nucleotide composition model : A=0.1; C=0.4; G=0.2; T=0.3 – Nucleotide substitution model (row = from, column = to) :
              A      C      G      T
          A   0.976  0.01   0.007  0.007
          C   0.002  0.983  0.005  0.01
          G   0.003  0.01   0.979  0.007
          T   0.002  0.013  0.005  0.979
      – Tree model : one branch – Data : « A » and « C »
  • 24. Likelihood • Likelihood with DNA : A C Likelihood of « A » = 0.1 Likelihood of a change from A to C = 0.01 Likelihood of our Data given the Model = 0.1 * 0.01 = 0.001
  • 25. Likelihood • Let’s try with a more complicated alignment : CCAT CCGT Likelihood of our Data given the Model = 0.4 * 0.983 * 0.4 * 0.983 * 0.1 * 0.007 * 0.3 * 0.979 = 0.0000300 But this is for a branch length of arbitrary unit 1
  • 26. Likelihood • We can get the probabilities for longer branches by multiplying the matrix by itself :
      $$\begin{pmatrix} 0.976 & 0.01 & 0.007 & 0.007 \\ 0.002 & 0.983 & 0.005 & 0.01 \\ 0.003 & 0.01 & 0.979 & 0.007 \\ 0.002 & 0.013 & 0.005 & 0.979 \end{pmatrix}^{2} = \begin{pmatrix} 0.953 & 0.02 & 0.013 & 0.015 \\ 0.005 & 0.966 & 0.01 & 0.02 \\ 0.007 & 0.02 & 0.959 & 0.015 \\ 0.005 & 0.026 & 0.01 & 0.959 \end{pmatrix}$$
  • 27. Likelihood • We can calculate the likelihood of D given M under different branch lengths :
          Branch length   Likelihood
          1               0.0000300
          2               0.0000559
          3               0.0000782
          10              0.000162
          15              0.000177
          20              0.000175
          30              0.000152
  • 28. Likelihood • We can calculate the likelihood of D given M under different branch lengths :
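The branch-length scan can be sketched by raising the one-unit substitution matrix to the power t and multiplying the per-site terms for the alignment CCAT / CCGT. This is a rough illustration that uses the matrix as printed on the slides; because those entries are rounded to three decimals, the numbers it prints can differ slightly from the slide table (for branch length 1 it gives about 0.0000318 rather than 0.0000300). All names are illustrative.

```python
BASES = "ACGT"
PI = {"A": 0.1, "C": 0.4, "G": 0.2, "T": 0.3}  # nucleotide composition
P1 = [  # substitution probabilities for one branch-length unit (row = from, column = to)
    [0.976, 0.010, 0.007, 0.007],
    [0.002, 0.983, 0.005, 0.010],
    [0.003, 0.010, 0.979, 0.007],
    [0.002, 0.013, 0.005, 0.979],
]


def matmul(a, b):
    """Plain 4x4 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def matrix_power(p, t):
    """p multiplied by itself t times (t is a non-negative integer branch length)."""
    result = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for _ in range(t):
        result = matmul(result, p)
    return result


def alignment_likelihood(seq_from, seq_to, t):
    """Product over sites of: frequency of the start base * P(start -> end) at length t."""
    pt = matrix_power(P1, t)
    like = 1.0
    for x, y in zip(seq_from, seq_to):
        like *= PI[x] * pt[BASES.index(x)][BASES.index(y)]
    return like


for t in (1, 2, 3, 10, 15, 20, 30):
    print(t, f"{alignment_likelihood('CCAT', 'CCGT', t):.7f}")
```

The likelihood first rises with branch length (one change among four sites is unlikely on a very short branch) and then falls again, so there is an optimum in between; this is what the branch-length optimisation step of a likelihood program looks for.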
  • 29. Likelihood • What do the programs do ? – Programs : PAUP, PAML, IQPNNI – General scheme : • Generate a starting tree (Neighbor-joining) • Calculate its likelihood • Optimize the branch lengths • Perform a tree rearrangement (NNI, SPR, TBR) • Evaluate the new tree – No guarantee of finding the single best tree, but usually finds a nearly optimal one
  • 30. Bayesian Trees • Are we happy with one tree ? – Since we cannot test all trees to find the best, we have to settle for a nearly optimal one, but how good is a nearly optimal tree ? – We would like to find a collection of reasonably good trees and evaluate these – Likelihood view : what is the probability of the Data given the Model ? – Bayesian view : given the Data, what is the probability of the Model ?
  • 31. Bayesian Trees • Example : red and black balls in an urn We know p=0.5 What is the chance to take exactly 4 black and 6 red ? We can solve with a binomial distribution : P(D given M) = 20.5% [Figure : plot of the binomial probabilities for 10 draws]
  • 32. Bayesian Trees • Example : red and black balls in an urn We do not know p We actually take exactly 4 black and 6 red What is p ? With what confidence ? We take 10 more balls. They are all black. How does the new information change our estimate ?
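One way to make the urn question concrete is a conjugate Beta-binomial update. The sketch below assumes that p is the proportion of black balls and that we start from a flat Beta(1, 1) prior; both are choices made here for illustration, not something stated on the slide.

```python
def update_beta(alpha, beta, black, red):
    """Posterior Beta parameters after observing `black` black and `red` red balls."""
    return alpha + black, beta + red


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


prior = (1, 1)                                                     # flat prior on p (assumption)
after_first = update_beta(*prior, black=4, red=6)                  # 4 black, 6 red
after_second = update_beta(*after_first, black=10, red=0)          # 10 more, all black

print("after first draw :", after_first, round(beta_mean(*after_first), 3))
print("after second draw:", after_second, round(beta_mean(*after_second), 3))
```

The posterior from the first draw acts as the prior for the second, and the estimate of p moves from below 0.5 to above it once the ten black balls are seen.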
  • 33. Bayesian Trees • Bayesian phylogenetics : the prior – We need a prior assumption, which can be strong or weak. It represents our A PRIORI HYPOTHESIS ABOUT THE PHENOMENON. It needs to be formulated as a prior distribution of the possible values of the parameters – Our inference will depend on both the prior and the data. The more data we possess, the less the final result will depend on the prior
  • 34. Bayesian Trees • Bayesian phylogenetics : example priors – 1) « I have no idea whatsoever what the proportion of the red and black balls could be » [Figure : prior distribution of p between 0 and 1]
  • 35. Bayesian Trees • Bayesian phylogenetics : example priors – 2) « I am nearly sure the proportion of red and black balls is 0.5 » [Figure : prior distribution of p between 0 and 1]
  • 36. Bayesian Trees • Bayesian phylogenetics : example priors – 3) « I know the balls in the urn are sampled from a big population of balls, and this global population has a p of 0.5 » [Figure : prior distribution of p between 0 and 1]
  • 37. Bayesian Trees • Bayesian phylogenetics : example priors – 4) « I know the proportion of balls is around 0.82, and deviations are more likely to be in high values than low ones. I also know that values below 0.5 are impossible for some reason » [Figure : prior distribution of p between 0 and 1]
  • 38. Bayesian Trees • Bayesian phylogenetics : the posterior – We get as a result a posterior distribution of the possible values of p – From this distribution (probability density function) we can directly read the confidence we have in our estimate, expressed as the 95% Highest Posterior Density interval (95% HPD) => We do not need to subsample (bootstrap) We can directly see the uncertainty in the data
  • 39. Bayesian Trees • Bayesian phylogenetics : example posteriors – 1) « I know the balls in the urn are sampled from a big population of balls, and this global population has a p of 0.5 ». Data = 6 blacks, 4 reds (orange) or 60 blacks, 40 reds (purple) [Figure : prior and posterior distributions of p for the two data sets]
  • 40. Bayesian Trees • Bayesian phylogenetics : example posteriors – 2) « I am nearly sure the proportion of red and black balls is 0.5 ». Data = 6 blacks, 4 reds (orange) or 60 blacks, 40 reds (purple) [Figure : prior and posterior distributions of p for the two data sets]
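The effect of the amount of data on the posterior can be sketched on a grid of p values. The prior used here, proportional to p(1-p), is only a stand-in for a weak belief centred on 0.5 (it is not the prior drawn on the slides), and the HPD routine assumes a unimodal posterior.

```python
def grid_posterior(black, red, prior, n_grid=1001):
    """Posterior of p on a regular grid: prior(p) * p^black * (1-p)^red, normalised."""
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    unnorm = [prior(p) * p**black * (1 - p) ** red for p in grid]
    total = sum(unnorm)
    return grid, [u / total for u in unnorm]


def hpd_interval(grid, post, mass=0.95):
    """Smallest set of grid points holding `mass` of the posterior (unimodal case)."""
    order = sorted(range(len(grid)), key=lambda i: post[i], reverse=True)
    kept, acc = [], 0.0
    for i in order:
        kept.append(grid[i])
        acc += post[i]
        if acc >= mass:
            break
    return min(kept), max(kept)


def weak_prior(p):
    return p * (1 - p)  # illustrative prior, mildly favouring p near 0.5


small = hpd_interval(*grid_posterior(6, 4, weak_prior))
large = hpd_interval(*grid_posterior(60, 40, weak_prior))
print("6 black / 4 red   95% HPD:", small)
print("60 black / 40 red 95% HPD:", large)
```

Both intervals contain 0.6, but the one based on 100 balls is much narrower: with more data, the prior matters less and the uncertainty shrinks.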
  • 41. Bayesian Trees • How to deal with phylogenetic trees ? Prior assumption : no topology is preferred [Figure : prior probability p of topologies 1, 2 and 3]
  • 42. Bayesian Trees • After analysis of the data Posterior : one of the topologies is preferred [Figure : posterior probability p of topologies 1, 2 and 3]
  • 43. Bayesian Trees • Markov Chains : how to find the top of the mountain if you are blind ? (the Metropolis-Hastings algorithm) – MC are algorithms that will start at some arbitrary point in the parameter space and make a small random step – The step is accepted with a high probability if it increases the posterior probability, low otherwise – If the step is rejected, the chain returns to its previous state and tries a new random move
  • 44. Bayesian Trees • Markov Chains : how to find the top of the mountain if you are blind ? New altitude is higher : accept always New altitude is lower : accept rarely
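The "blind climber" can be sketched with a random-walk Metropolis sampler on the urn problem instead of a tree. This is a toy illustration under stated assumptions: a flat prior on p, data of 6 black and 4 red balls, a Gaussian proposal, and a fixed seed so the run is repeatable. It is not how BEAST or MrBayes are implemented.

```python
import math
import random


def log_posterior(p, black=6, red=4):
    """Log posterior of p up to a constant, with a flat prior on (0, 1)."""
    if not 0.0 < p < 1.0:
        return -math.inf
    return black * math.log(p) + red * math.log(1.0 - p)


def metropolis(n_steps, step_sd, start=0.5, seed=1):
    rng = random.Random(seed)
    current, current_lp = start, log_posterior(start)
    samples, accepted = [], 0
    for _ in range(n_steps):
        proposal = current + rng.gauss(0.0, step_sd)  # small random step
        proposal_lp = log_posterior(proposal)
        # Uphill moves are always accepted; downhill moves are accepted
        # with probability equal to the ratio of posterior densities.
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
            accepted += 1
        samples.append(current)  # a rejected move repeats the current state
    return samples, accepted / n_steps


samples, acceptance = metropolis(n_steps=20000, step_sd=0.2)
kept = samples[2000:]  # discard the first 10% as burn-in
print("posterior mean of p:", round(sum(kept) / len(kept), 3))
print("acceptance rate    :", round(acceptance, 2))
```

For this toy problem the exact posterior is Beta(7, 5) with mean 7/12, so the sampled mean should land close to 0.58.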
  • 45. Bayesian Trees • Markov Chains : how to avoid getting stuck in local maxima ? (Metropolis Coupling) => Run multiple « heated » chains and jump from one to another
  • 46. Bayesian Trees • Markov Chains : how to get the posterior distribution (Monte-Carlo sampling)
  • 47. Bayesian Trees • Markov Chains : how to get the posterior distribution (Monte-Carlo sampling)
  • 48. Bayesian Trees • MCMCMC : what makes a good run ? (a hairy caterpillar) – No trend in the stable state (convergence) • Sufficient run time • Good data – Good mixing in order to have a good sample from the posterior (adequacy of the steps) • If steps are too close to the current state : always accepted • If steps are too different : too often rejected • Best mixing is at intermediate values
  • 49. Bayesian Trees • MCMCMC : what makes a good run ? (a hairy caterpillar) – 1) Steps too small
  • 50. Bayesian Trees • MCMCMC : what makes a good run ? (a hairy caterpillar) – 2) Steps too big
  • 51. Bayesian Trees • MCMCMC : what makes a good run ? (a hairy caterpillar) – 3) Steps of acceptable length
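The three step-size regimes can be compared on the same toy urn posterior (6 black, 4 red, flat prior). The sketch below reports the acceptance rate and the average squared jump per iteration, used here as a rough indicator of how fast the chain moves; the step sizes 0.005, 0.3 and 20 are arbitrary illustrative choices.

```python
import math
import random


def log_target(p, black=6, red=4):
    """Log posterior of p up to a constant, flat prior on (0, 1)."""
    if not 0.0 < p < 1.0:
        return -math.inf
    return black * math.log(p) + red * math.log(1.0 - p)


def run_chain(step_sd, n_steps=5000, seed=1):
    """Return (acceptance rate, mean squared jump per iteration)."""
    rng = random.Random(seed)
    current = 0.5
    accepted, squared_jumps = 0, 0.0
    for _ in range(n_steps):
        proposal = current + rng.gauss(0.0, step_sd)
        log_ratio = log_target(proposal) - log_target(current)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            squared_jumps += (proposal - current) ** 2
            current = proposal
            accepted += 1
    return accepted / n_steps, squared_jumps / n_steps


results = {}
for label, step in (("too small", 0.005), ("acceptable", 0.3), ("too big", 20.0)):
    results[label] = run_chain(step)
    rate, jump = results[label]
    print(f"steps {label:10s} acceptance={rate:.2f} mean squared jump={jump:.6f}")
```

Tiny steps are almost always accepted but barely move; huge steps are almost always rejected; the intermediate step size moves the chain the most per iteration.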
  • 52. Bayesian Trees • MCMCMC : what makes a good run ? (a hairy caterpillar)
  • 53. Bayesian Trees • MCMCMC : what makes a good run ? (a hairy caterpillar)
  • 54. Bayesian Trees • How to set up a bayesian phylogenetic analysis ? 2 main programs : • MrBayes : command-line • BEAST : User interface
  • 55. Molecular Clocks Global clock No clock Local clock Relaxed clock
  • 56. BEASTING Flexible package : - BEAUti : to make the xml files - BEAST : actual calculations Metropolis – Hastings MCMC sampling - Tracer : to analyse results - TreeAnnotator : to sum trees
  • 57. BEASTING BEAUti : Allows dated tips via the Guess dates function A priori sub-taxa Setting a model : - Substitution model (HKY, GTR, TN93) - Site heterogeneity (gamma distribution) - codon partition : (1, 2, 3) (1+2, 3) or none - codon position independence : model parameters and/or rates - Molecular clock : from strict to uncorrelated relaxed
  • 58. BEASTING Setting the model priors : Difficult task, try several approaches, the defaults are usually good enough If the data is good, the priors do not matter… Setting the operators and the tuning parameters : Determines the « moves » made by the chain → auto-optimize function MCMC settings : - chain length 10 000 000, sampling every 1000 states is a good default - in any case we want about 10 000 points
  • 59. BEASTING Tracer : result analysis 1) chain convergence diagnostic Bad convergence
  • 60. BEASTING Tracer : result analysis 1) chain convergence diagnostic Good convergence Sampling from the converged chain provides the posterior distribution of the parameters
  • 61. BEASTING Auto-correlation time (ACT) measures how independent successive samples are Effective sample size (ESS) = chain length/ACT Rule of thumb : ESS < 100 is very bad ESS > 200 is good Both chain convergence and ESS have to be checked for each parameter Depending on the problem : - increase chain length and decrease sampling frequency - increase burn-in - question models, alignment, priors, tuning
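The relation ESS = chain length / ACT can be illustrated with a simple estimator that sums the positive autocorrelations of a trace. This is a sketch of the general idea, with the ACT measured in logged samples; Tracer's own estimator may differ in its details, and the two synthetic chains below are made-up test data.

```python
import random


def autocorrelation(chain, lag):
    """Sample autocorrelation of the chain at the given lag."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    cov = sum((chain[i] - mean) * (chain[i + lag] - mean) for i in range(n - lag)) / n
    return cov / var


def effective_sample_size(chain, max_lag=200):
    """Return (ESS, ACT) with ACT = 1 + 2 * sum of the leading positive autocorrelations."""
    act = 1.0
    for lag in range(1, min(max_lag, len(chain) - 1) + 1):
        rho = autocorrelation(chain, lag)
        if rho <= 0.0:  # stop at the first non-positive autocorrelation
            break
        act += 2.0 * rho
    return len(chain) / act, act


rng = random.Random(0)
independent = [rng.gauss(0.0, 1.0) for _ in range(5000)]

sticky, x = [], 0.0
for _ in range(5000):
    x = 0.9 * x + rng.gauss(0.0, 1.0)  # strongly autocorrelated chain
    sticky.append(x)

ess_ind, act_ind = effective_sample_size(independent)
ess_sticky, act_sticky = effective_sample_size(sticky)
print(f"independent samples: ACT={act_ind:.1f} ESS={ess_ind:.0f}")
print(f"correlated samples : ACT={act_sticky:.1f} ESS={ess_sticky:.0f}")
```

Both chains contain 5000 samples, but the correlated one carries far fewer effectively independent draws, which is why a long run can still have a low ESS.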
  • 62. BEASTING Combining runs : - different runs with the same xml file - longer runs, different burn-in or sampling - different priors given to the parameters - different operators - LogCombiner even allows resampling from the chain It does not make sense to combine runs from different data sets or (even slightly) different models
  • 65. Model Selection : LRTs Likelihood Ratio Tests are a standard way to compare the fit of two models : LRT = 2 (logL1 – logL0) When the models are nested, this statistic approximately follows a χ2 distribution - only for nested models (H0 is a special case of H1) - d.f. = difference in the number of parameters between the 2 models - advantage : statistically sound - disadvantage : the χ2 approximation only holds with enough data
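A likelihood ratio test can be sketched with the standard library alone. The chi-square tail probability below uses the closed-form series for integer degrees of freedom; the two log-likelihood values are hypothetical numbers chosen for illustration, and the 4 degrees of freedom correspond to the 4 extra exchangeability parameters of GTR compared with HKY.

```python
import math


def chi2_sf(x, df):
    """P(X >= x) for a chi-square variable with integer degrees of freedom df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus a finite series in chi = sqrt(x)
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term              # term = chi^(2r-1) / (1*3*...*(2r-1))
        term *= x / (2 * r + 1)
    return math.erfc(chi / math.sqrt(2.0)) + math.sqrt(2.0 / math.pi) * math.exp(-half) * total


def likelihood_ratio_test(log_l0, log_l1, df):
    """Return (LRT statistic, p-value) for nested models; log_l1 is the richer model."""
    stat = 2.0 * (log_l1 - log_l0)
    return stat, chi2_sf(stat, df)


# Hypothetical log-likelihoods: HKY as H0, GTR as H1
stat, p_value = likelihood_ratio_test(log_l0=-1234.5, log_l1=-1230.2, df=4)
print(f"LRT = {stat:.2f}, p = {p_value:.4f}")
```

A small p-value means the extra parameters of the richer model improve the fit more than expected by chance.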
  • 66. Model Selection : AIC The Akaike information criterion is another goodness-of-fit measure AIC = 2k – 2 logL(Data, Model) where k is the number of parameters in the model; the model with the lowest AIC is preferred - valid for non-nested models - a sort of likelihood that is « penalized » for using too many parameters - more criteria exist (AICc, BIC, etc…)
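The AIC comparison is simple enough to write out directly. The log-likelihoods below are hypothetical, and k counts only the free substitution-model parameters (0 for JC, 4 for HKY, 8 for GTR), which is a simplification made for this sketch.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 logL (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical (log-likelihood, number of free parameters) per model
candidates = {"JC": (-1250.0, 0), "HKY": (-1234.5, 4), "GTR": (-1230.2, 8)}

scores = {name: aic(log_l, k) for name, (log_l, k) in candidates.items()}
best = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: AIC = {score:.1f}")
print("preferred model:", best)
```

With these made-up numbers GTR only narrowly beats HKY: its better fit is almost entirely offset by the penalty for four extra parameters.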
  • 67. Bayes Factors BF = Marginal Likelihood Model 1 / Marginal Likelihood Model 2, so that log BF = log-ML1 – log-ML2 In practice, the marginal likelihoods are difficult to compute, but are approximated by taking the harmonic mean of the likelihoods sampled from the posterior (might be better in later versions of BEAST) Rule of thumb : BF < 3 : no significant improvement with model 1 compared with model 2 3 < BF < 10 : significant improvement with model 1 BF > 10 : decisive improvement with model 1 Big advantage of Bayes Factors compared to LRTs : direct comparison of two non-nested models
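The harmonic mean approximation can be sketched from a list of log-likelihood values sampled by the chain, working in log space to avoid numerical underflow. Function names are illustrative and the two short input lists are made-up values; the estimator itself is known to be unstable, so this only shows the arithmetic.

```python
import math


def log_marginal_likelihood_hme(log_likelihoods):
    """Log of the harmonic mean of the likelihoods: log n - log(sum exp(-logL_i))."""
    n = len(log_likelihoods)
    negated = [-ll for ll in log_likelihoods]
    m = max(negated)
    log_sum = m + math.log(sum(math.exp(v - m) for v in negated))  # log-sum-exp
    return math.log(n) - log_sum


def log_bayes_factor(log_liks_model1, log_liks_model2):
    """log BF of model 1 over model 2 from posterior samples of the log-likelihood."""
    return log_marginal_likelihood_hme(log_liks_model1) - log_marginal_likelihood_hme(log_liks_model2)


# Made-up posterior samples of the log-likelihood under two models
model1 = [-1230.4, -1231.0, -1229.8, -1232.1]
model2 = [-1235.2, -1234.7, -1236.0, -1235.5]
print("log BF (model 1 vs model 2):", round(log_bayes_factor(model1, model2), 2))
```

Because the harmonic mean is dominated by the lowest sampled likelihoods, a few poor samples can shift the result considerably.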
  • 70. Model Selection : Topali Click « Run » : runs remotely CTRL-click « Run » : runs locally
  • 74. BEASTing : MCMC robot http://hydrodictyon.eeb.uconn.edu/people/plewis/software.php
  • 82. Model Selection 10 000 samples
  • 85. Beasting .fst BEAUti .beast (.xml) .log .trees Tracer TreeAnnotator Parameters of interest Tree
  • 101. Beasting : practicals
      The file « FPV.fas » contains an alignment of 7 feline papilloma virus sequences. Our goal is to determine the rate of evolution of each lineage based on the divergence dates of their host species. We have some estimates of the speciation times from the literature.
      The file « RSVA.fas » contains an alignment of 35 human respiratory syncytial virus (group A) sequences. Our goal is to estimate the rate of evolution and the date of the most recent common ancestor, and to get a good tree with measures of statistical support.
      The file « Flu.fas » contains an alignment of 21 Southeast Asian H5N1 sequences from different hosts. Our goal is to reconstruct the history of population parameters to draw epidemiological conclusions.
      1) Open Topali. Load the alignment « FPV.fas ». Go to the « Analysis » menu and choose « Phylogenetics/Model Selection ». Set up two model selection runs : one with PhyML and one with MrBayes (these will perform the model evaluation using likelihood-based methods and Bayesian methods, respectively). Keep the default values and click « Run ». The analyses are run remotely on the Scottish Crop Research Institute HPC cluster. One can also CTRL-click on « Run » to run locally.
      2) Open JModelTest. Load the file « FPV.fas » and go to the « Analysis » menu. Go to « Compute Likelihood Scores », keep the default values and click « Compute ». Admire the different models being evaluated. When the computations are finished, go to the « Analysis » menu, and click « Do AIC calculations ». Then go to the « Results » menu and get the results.
      3) Compare the results given by these different methods. Do they make sense ? Which model is the most appropriate for further analysis ?
      4) Do the same with the files « RSVA.fas » and « Flu.fas ».
  • 102. Beasting : practicals
      5) Open BEAUti. Load the file « FPV.fas ». Now we will define the calibration nodes based on fossil data.
      6) Open the « Taxon » tab. Click the little « + » button and create a new taxon containing the FelixPV1, LynxPV1, and PumaPV1 sequences. Call it « Felis/Lynx/Puma ».
      7) Repeat the procedure with the SnowLeopardPV1 and AsianLionPV1 sequences. Call this taxon « Lion/Leopard ».
      8) Repeat once again with all sequences except RacoonPV1 and CanineOralPV1. Call this taxon « Cats ».
      9) These are contemporary sequences, therefore no action is taken in the « Tip dates » tab.
      10) Open the « Site models » tab. Based on our model selection, we will choose the model « GTR », set the base frequencies to « All equal » and include a « Gamma » site heterogeneity model to take into account rate variation between sites.
      11) In the « Clock models » tab, select « Uncorrelated Lognormal ». Check the « Estimate » checkbox (otherwise the clock rate is calculated from the sample dates, but in this analysis we do not have any dated sample).
      12) In the « Trees » tab, select « Speciation, Yule Process ».
      13) Now the cool part : go to the « Priors » tab and select the following dates for the most recent ancestors of our taxa :
          - Felis/Lynx/Puma : normal prior (7.15 My +/- 1.36 My) which gives a 95% CI of 4.5 - 9.8 My, as indicated by the literature (Rector et al., 2007)
          - Lion/Leopard : normal prior (3.72 My +/- 1.05 My) which gives a 95% CI of 1.6 - 5.8 My
          - We will not put an informative prior on the Cats taxon.
      14) In the « MCMC » tab, set the chain length to 1000000 (one million) and log parameters every 100 states. Generate the .xml file. When the program warns you about inappropriate priors, either set them to their default value if possible by clicking OK, or change the « +INF » to « MAX_VALUE ».
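The 95% intervals quoted for the two calibration priors follow from the mean and standard deviation of a normal distribution (mean ± 1.96 standard deviations). A quick check with the standard library, where the function name is only illustrative:

```python
from statistics import NormalDist


def central_interval(mean, sd, mass=0.95):
    """Central interval of a normal prior containing `mass` of the probability."""
    dist = NormalDist(mean, sd)
    tail = (1.0 - mass) / 2.0
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)


for name, mean, sd in (("Felis/Lynx/Puma", 7.15, 1.36), ("Lion/Leopard", 3.72, 1.05)):
    low, high = central_interval(mean, sd)
    print(f"{name}: {low:.2f} - {high:.2f} My")
```

The same function can be used to translate any published date range into the mean and standard deviation that BEAUti asks for, by trying values until the interval matches.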
  • 103. Beasting : practicals
      15) Open BEAST. Select as input file the .xml file you just generated. Click « Run » and wait a few minutes. This procedure will produce a .log file and a .trees file. Open Tracer and import the .log file.
      16) Look at the trace of the posterior. Does it look OK (remember the hairy caterpillar) ?
      17) Look at the nucleotide substitution rates. Why do they look like this ? Switch from the « Trace » tab to the « Estimates » tab for the different parameters, and get a feeling for how good the ESS is as an indicator of the confidence you can have in the estimates of the parameters. What do you recommend ?
      18) Check the « treeModel.rootHeight » parameter. How old is the root of the tree ?
      19) Check the « meanRate » parameter. What is the rate of molecular evolution of FPV ? Include units.
      20) Check the « CoefficientOfVariation » parameter. Does the rate of evolution differ statistically among lineages ? Is this a good model ? What do you recommend ?
      21) Open the TreeAnnotator program. Select the .trees file you generated with BEAST as input and enter a name like « FPV_MCC.tree » for the output.
      22) As burn-in, we want to discard 10% of our run. We have 10000 trees so we enter « 1000 » in the « Burnin » field.
      23) Set the Posterior probability limit to 0. This way we will annotate all nodes, not only the ones above a certain threshold.
      24) The default « Maximum Clade Credibility Tree » option will find the tree with the highest product of posterior clade probabilities and annotate it according to the other trees in the .trees file. Select « Mean Heights » and run.
      25) Once done, you can open the tree with the FigTree program. Try to get error bars on the nodes, annotate the branches with the posterior probability and color the branches by evolutionary rate.
      26) Which branch has the fastest rate of evolution ?
      27) For comparison, examine the file « FPV-max.fas.log.txt », which comes from a run 100 times longer.
  • 104. Beasting : practicals
      28) Now we will examine the RSVA data set. Open BEAUti and load the « RSVA.fas » file. The name of each sequence contains the date of sampling. In the « Tip dates » tab, check the « Use tip dates » checkbox, click on « Guess dates » and find a way to extract the year of sampling. Ultimately we want dates ranging from 1956 to 2002.
      29) In the « Site Models » tab, select « GTR » for the model. Select « Gamma » for site heterogeneity and select « 3 partitions » to let each codon position have its own rate of evolution.
      30) We will assume a strict molecular clock and a constant population size. Select a chain length of 1000000 (one million) and sample every 100 states. Deal with the prior settings as before and generate the .xml file. Run it with BEAST.
      31) Open the .log file with Tracer. Examine the posterior trace. Notice that the samples are highly correlated : the low ESS indicates that our effective sample size is small, and that we cannot trust the estimates of the other parameters. What are the options at this point ?
      32) Return to BEAUti and change the model to « TN93 ». Given the model comparison we made, how big an approximation is this (check with Topali) ? Run this file and compare the results of the two models (you can select multiple files and multiple parameters in Tracer).
      33) Check the mutation rates for the different codon positions. Do this by control-clicking the CP1.mu, CP2.mu and CP3.mu parameters. Select the « Marginal density » tab. What do you observe ? Is this logical ? What could be the explanation ?
      34) In what year did the common ancestor of all RSVA viruses live ? What is the rate of evolution of RSVA ? Include units.
      35) Use TreeAnnotator to summarize and annotate the trees as before.
  • 105. Beasting : practicals
      36) Now we will analyse the H5N1 data set. Open BEAUti and load the « flu.fas » file. In the « Tip Dates » tab use the « Guess Dates » tool to set the sampling date for each sequence.
      37) In the « Site Models » tab, leave « HKY » as substitution model. From the model comparison we made in Topali, do you think this is very wrong ? Select « 3 partitions », but leave « None » as site heterogeneity model.
      38) In the « Trees » tab select « Bayesian Skyline » and reduce the number of bins from 10 to 5, because we have only 21 sequences and we do not want to overparameterize the model.
      39) In the « MCMC » tab set the chain length to 5000000 (5 million) and the sampling frequency to 500. Deal with the priors as before and generate the .xml file. Run it with BEAST.
      40) Once finished, open the .log file with Tracer. Check the trace of the posterior for convergence and mixing. Does it look right ?
      41) Check the mutation rates for the different codon positions. Do this by control-clicking the CP1.mu, CP2.mu and CP3.mu parameters. Select the « Marginal density » tab. What do you observe ? What does this tell you about the selective forces acting on this virus ? What would you change in the settings for further analysis ? Do you understand why we did not select a site heterogeneity model but rather accounted for 3 codon-position-dependent rates ?
      42) In the « Analysis » menu, click « Bayesian skyline reconstruction ». Select the « flu.fas.trees.txt » file. Leave the default options, enter « 2005 » for the age of the youngest tip and click OK. What can you say about the effective population size of H5N1 over time ? What assumptions have been made in the model to generate this graph ? How confident are you in these ?
      43) Use TreeAnnotator to summarize and annotate the trees as before.
      44) Enjoy your newly acquired proficiency in Bayesian phylogenetics.