Deciphering the story in our DNA
PhD defense Paula Tataru
AARHUS
UNIVERSITY
Bioinformatics
Research Centre
Aarhus, January 23rd 2015
Inference of population history
and patterns from molecular data
Supervisors: Christian N.S. Pedersen & Asger Hobolth
My PhD studies
› Math, Computer science, Bioinformatics
› Mathematical modeling, Implementation
› Population genetics, Sequence analysis, Pattern matching, Structural bioinformatics
Mathematical modeling toolbox
› SCFG: Stochastic Context-Free Grammar
› DFA: Deterministic Finite Automaton
› DTMC: Discrete Time Markov Chain
› HMM: Hidden Markov Model
› CTMC: Continuous Time Markov Chain
Research overview
• 2012, BMC Bioinformatics: Evolving stochastic context-free grammars for RNA secondary structure prediction [SCFG]
• 2013, Bioinformatics: Oxfold: kinetic folding of RNA using stochastic context-free grammars and evolutionary information [SCFG]
• 2011, BMC Bioinformatics: Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains [CTMC]
• 2013, Biology: Algorithms for hidden Markov models restricted to occurrences of regular expressions [HMM & DFA]
• 2014, Bioinformatics: diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals [CTMC & HMM]
• In preparation: Motif discovery in ranked lists of sequences [DTMC & DFA]
• In preparation: Modeling allele frequency data under a Wright-Fisher model of drift, mutation and selection: the Beta distribution approach [DTMC]
Research areas: Population genetics, Sequence analysis, Pattern matching, Structural bioinformatics
Outline
• 2012, BMC Bioinformatics: Evolving stochastic context-free grammars for RNA secondary structure prediction [SCFG]
• 2013, Bioinformatics: Oxfold: kinetic folding of RNA using stochastic context-free grammars and evolutionary information [SCFG]
Research areas: Population genetics, Sequence analysis, Pattern matching, Structural bioinformatics
1. Modeling toolbox: SCFG, DFA, DTMC, HMM, CTMC
2. Brief overview
  • 2011, BMC Bioinformatics: Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains [CTMC]
  • 2013, Biology: Algorithms for hidden Markov models restricted to occurrences of regular expressions [HMM & DFA]
  • In preparation: Motif discovery in ranked lists of sequences [DTMC & DFA]
  • In preparation: Modeling allele frequency data under a Wright-Fisher model of drift, mutation and selection: the Beta distribution approach [DTMC]
3. Overview: diCal-IBD
  • 2014, Bioinformatics: diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals [CTMC & HMM]
1. Modeling toolbox
Population genetics: the Wright-Fisher model
› Evolution of a population forward in time
› Follow the change of the allele count
[Figure: individuals across generations (time); allele count per generation: 3, 2, 4, 3, 4, 5, 5]
Discrete Time Markov Chain (DTMC)
› Allele count per generation in a population of size N
› States: {0, 1, …, N}, with 0 ≤ i, j ≤ N
› Transitions: binomial
  › i → j: Bin(j | N, i/N)
  › j → i: Bin(i | N, j/N)
  › i → i: Bin(i | N, i/N)
  › j → j: Bin(j | N, j/N)
[Figure: allele count path 3, 2, 4, 3, 4, 5, 5 with transition probabilities 0.23, 0.08, 0.20, 0.26, 0.33, 1 along the path]
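The binomial transition rule of the Wright-Fisher DTMC can be sketched in a few lines of Python. This is a minimal illustration, not code from the thesis: the function name `wf_transition_matrix` is hypothetical, and N = 5 is an assumption, chosen because it reproduces the transition probabilities shown along the example path 3, 2, 4, 3, 4, 5, 5.

```python
from math import comb


def wf_transition_matrix(N):
    """Return P with P[i][j] = Bin(j | N, i/N) for allele counts 0..N."""
    P = []
    for i in range(N + 1):
        p = i / N  # allele frequency in the current generation
        P.append([comb(N, j) * p**j * (1 - p) ** (N - j) for j in range(N + 1)])
    return P


# Assumed population size; not stated on the slide.
P = wf_transition_matrix(5)
path = [3, 2, 4, 3, 4, 5, 5]
# Probability of each one-generation step along the example path.
steps = [round(P[a][b], 2) for a, b in zip(path, path[1:])]
print(steps)  # [0.23, 0.08, 0.2, 0.26, 0.33, 1.0]
```

States 0 and N are absorbing in this matrix: once the allele is lost or fixed, the count no longer changes.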
Hidden Markov Model
› Allele count per generation in a population of size N
  › Difficult to measure
  › Directly affects allele count of a sample of size n
[Figure: individuals across generations (time). Hidden Markov chain, allele count per generation: 3, 2, 4, 3, 4, 5, 5. Observable data, sample allele count per generation: 2, 1, 2, 2, 2, 3, 3]
Hidden Markov Model: transitions and emissions
› Transitions (DTMC): binomial, i → j with probability Bin(j | N, i/N)
› Emissions (data): binomial, state i emits sample count k with probability Bin(k | n, i/N)
[Figure: states i and j with transitions Bin(j | N, i/N), Bin(i | N, j/N), Bin(i | N, i/N), Bin(j | N, j/N); emissions shown for k = 0, 1, 2, 3]
Hidden Markov Model: standard algorithms
› Forward: likelihood of data
› Viterbi: global decoding
› Posterior decoding: local decoding
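As an illustration of the forward algorithm on this model, the sketch below computes the likelihood of a sequence of sample allele counts. It is a minimal example under stated assumptions, not the implementation used in the thesis: the helper names are hypothetical, N = 5 and n = 3 are assumed values, and the uniform initial distribution is an assumption (the slides do not specify one).

```python
from math import comb


def binom_pmf(k, n, p):
    """Bin(k | n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def forward_likelihood(obs, N, n, init):
    """Likelihood of observed sample counts under the binomial HMM.

    Hidden state i = population allele count (0..N).
    Transition i -> j: Bin(j | N, i/N); emission of k in state i: Bin(k | n, i/N).
    """
    states = range(N + 1)
    trans = [[binom_pmf(j, N, i / N) for j in states] for i in states]
    emit = [[binom_pmf(k, n, i / N) for k in range(n + 1)] for i in states]
    # Initialisation: alpha[i] = P(first observation, first state = i).
    alpha = [init[i] * emit[i][obs[0]] for i in states]
    # Recursion: fold in one observation per generation.
    for y in obs[1:]:
        alpha = [
            emit[j][y] * sum(alpha[i] * trans[i][j] for i in states)
            for j in states
        ]
    return sum(alpha)


N, n = 5, 3                        # assumed sizes
uniform = [1 / (N + 1)] * (N + 1)  # assumed initial distribution
sample = [2, 1, 2, 2, 2, 3, 3]     # observable data from the example
print(forward_likelihood(sample, N, n, uniform))
```

For long sequences the forward variables underflow, so a practical version would rescale `alpha` at each step or work in log space.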
Continuous Time Markov Chain (CTMC)
› Allele count in time in a population of size N
› States: {0, 1, …, N}
› Rates, for 0 < i < N: r_i = i (N - i) / N
  › i → i-1 and i → i+1 each occur at rate r_i, so the diagonal entry for state i is -2 r_i
[Figure: states i-1, i, i+1 with jump rates r_{i-1}, r_i, r_{i+1} and diagonal entries -2 r_{i-1}, -2 r_i, -2 r_{i+1}]
[Figure: allele count in DTMC: path 3, 2, 4, 3, 4, 5, 5 with transition probabilities 0.23, 0.08, 0.20, 0.26, 0.33, 1. Allele count in CTMC: single-step jumps up or down, each with probability 1/2, separated by waiting times t1, t2, …]
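The rate structure of this CTMC can be written down directly. The following is a rough sketch with hypothetical function names: it builds the rate matrix from r_i = i (N - i) / N and simulates one trajectory by drawing an exponential waiting time and then a jump of +1 or -1 with probability 1/2 each. N = 5 and the time horizon are assumed values.

```python
import random


def rate(i, N):
    """r_i = i (N - i) / N; zero at the boundary states 0 and N."""
    return i * (N - i) / N


def rate_matrix(N):
    """Tridiagonal rate matrix Q: r_i off the diagonal, -2 r_i on it."""
    Q = [[0.0] * (N + 1) for _ in range(N + 1)]
    for i in range(1, N):
        r = rate(i, N)
        Q[i][i - 1] = r
        Q[i][i + 1] = r
        Q[i][i] = -2 * r
    return Q


def simulate_path(i, N, t_max, rng):
    """Simulate (time, state) pairs until absorption in 0 or N, or until t_max."""
    t, path = 0.0, [(0.0, i)]
    while 0 < i < N:
        t += rng.expovariate(2 * rate(i, N))  # waiting time in state i
        if t > t_max:
            break
        i += rng.choice((-1, 1))  # up or down with probability 1/2
        path.append((t, i))
    return path


Q = rate_matrix(5)  # assumed population size
for t, state in simulate_path(3, 5, 100.0, random.Random(1)):
    print(f"{t:.3f}\t{state}")
```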
2. Brief overview
Motif discovery for DTMCs using DFAs
› DTMC describes sequences
  › Allele count in a population
› DFA encodes pattern
  › (i)+ (i+1)+ (i+2)+
[Figure: DTMC with binomial transitions generating the sequence 3, 2, 4, 3, 4, 5, 5; DFA with states 0, 1, 2, 3 and transitions labeled i, i+1, i+2 and k]
M. M. Nielsen, P. Tataru, T. Madsen, A. Hobolth, and J. S. Pedersen. Motif discovery in ranked lists of sequences. In preparation.
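To make the pattern (i)+ (i+1)+ (i+2)+ concrete, here is a small hand-built DFA that scans a sequence of allele counts and counts how often a match of the pattern is completed. It is an illustrative sketch only: the state layout (0 = no progress, 1 = seen i, 2 = seen i then i+1, 3 = accepting) and the function names are assumptions, and the automaton in the cited work may be constructed differently.

```python
def make_pattern_dfa(i):
    """Transition function of a DFA for the pattern (i)+ (i+1)+ (i+2)+."""
    a, b, c = i, i + 1, i + 2
    table = {
        0: {a: 1},
        1: {a: 1, b: 2},
        2: {a: 1, b: 2, c: 3},
        3: {a: 1, c: 3},
    }

    def step(state, symbol):
        # Any symbol without an explicit transition resets the match.
        return table[state].get(symbol, 0)

    return step


def count_occurrences(seq, i):
    """Count how many times the DFA enters the accepting state 3."""
    step = make_pattern_dfa(i)
    state, count = 0, 0
    for x in seq:
        nxt = step(state, x)
        if nxt == 3 and state != 3:
            count += 1
        state = nxt
    return count


# The example sequence contains 3, 4, 5, 5, one match for i = 3.
print(count_occurrences([3, 2, 4, 3, 4, 5, 5], 3))  # 1
```

Counting only entries into the accepting state treats a run such as 3, 4, 5, 5 as a single occurrence; other counting conventions are possible.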
Motif discovery for DTMCs using DFAs: contribution
› Does the pattern (i)+ (i+1)+ (i+2)+ occur more frequently in specific environments?
› Contribution
  › DFA
  › New approach to significance (random walk)
[Figure: populations and their DTMC sequences (… 2 4 3 4 5 5) over generations (time), arranged by environment]
Restricted algorithms for HMMs using DFAs
› HMM describes problem
  › Allele count in a population
› DFA encodes pattern
  › (i)+ (i+1)+ (i+2)+
[Figure: HMM with binomial transitions and emissions; DFA for the pattern; observed sample counts 2, 1, 2, 2, 2, 3, 3 and hidden allele counts 3, 2, 4, 3, 4, 5, 5]
P. Tataru, A. Sand, A. Hobolth, T. Mailund, and C. N. Pedersen. Algorithms for hidden Markov models restricted to occurrences of regular expressions. Biology, 2(4):1282–1295, 2013.
Restricted algorithms for HMMs using DFAs: contribution
› Contribution: new algorithms
  › Calculate the distribution of the number of pattern occurrences
  › Adapt decoding algorithms to include the number of pattern occurrences
Conditional expectations for CTMCs
› Required expectations
  › Time spent in a state
  › Number of jumps
› Contribution: compare and extend existing methods
  › Accuracy
  › Speed
[Figure: CTMC on allele counts with rates r_i; endpoints 3 and 5 separated by time t]
P. Tataru and A. Hobolth. Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains. BMC Bioinformatics, 12(1):465, 2011.
Beta with spikes: approximate Wright-Fisher
› What is the distribution of allele count in the current generation?
› Use the beta distribution
› Contribution:
  › Add spikes (better fit)
  › Include selection
[Figure: individuals across generations (time); allele count per generation: 3, 2, 4, 3, 4, 5, 5]
P. Tataru, T. Bataillon, and A. Hobolth. Modeling allele frequency data under a Wright-Fisher model of drift, mutation and selection: the Beta distribution approach. In preparation.
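One way to see how a beta distribution can stand in for the Wright-Fisher distribution is moment matching. The sketch below covers pure drift only (no mutation or selection, which is a simplifying assumption and not the full model of the cited work): after t generations the allele frequency has mean x0 and variance x0 (1 - x0) (1 - (1 - 1/N)^t), and the beta parameters are chosen to match both. The function names and the values N = 50, t = 10 are hypothetical.

```python
from math import comb


def beta_params_drift(x0, N, t):
    """Beta(a, b) matching the mean and variance of the frequency after t generations of drift."""
    F = 1 - (1 - 1 / N) ** t
    mean, var = x0, x0 * (1 - x0) * F
    scale = mean * (1 - mean) / var - 1  # equals a + b
    return mean * scale, (1 - mean) * scale


def exact_moments(i0, N, t):
    """Exact mean, variance and boundary mass by iterating the binomial DTMC."""
    dist = [0.0] * (N + 1)
    dist[i0] = 1.0
    for _ in range(t):
        new = [0.0] * (N + 1)
        for i, w in enumerate(dist):
            if w == 0.0:
                continue
            p = i / N
            for j in range(N + 1):
                new[j] += w * comb(N, j) * p**j * (1 - p) ** (N - j)
        dist = new
    mean = sum(j / N * w for j, w in enumerate(dist))
    var = sum((j / N - mean) ** 2 * w for j, w in enumerate(dist))
    return mean, var, dist[0] + dist[N]


a, b = beta_params_drift(0.3, 50, 10)
mean, var, boundary = exact_moments(15, 50, 10)
print("beta mean/var :", a / (a + b), a * b / ((a + b) ** 2 * (a + b + 1)))
print("exact mean/var:", mean, var)
print("probability of loss or fixation:", boundary)
```

The exact distribution puts positive probability on the boundary states 0 and N, which a continuous beta density cannot represent; this is presumably the kind of misfit that the spikes are meant to address.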
3. Overview: diCal-IBD
Population genetics: the coalescent model
› Trace the genealogy of sampled individuals backward in time
› Coalescent process terminates when reaching the most recent common ancestor (MRCA)
› Time to coalescent event: CTMC
[Figure: individuals across generations (time), with the genealogy of the sample traced back to the MRCA]
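The waiting times between coalescent events can be simulated directly. The sketch below assumes the standard Kingman coalescent with constant population size, which goes beyond what the slide states: with k lineages, the time to the next coalescence is exponential with rate k (k - 1) / 2 in coalescent time units, so the expected time to the MRCA of n samples is 2 (1 - 1/n). Function names are hypothetical.

```python
import random


def expected_tmrca(n):
    """Expected time to the MRCA of n lineages, in coalescent units."""
    return sum(2 / (k * (k - 1)) for k in range(2, n + 1))


def simulate_tmrca(n, rng):
    """Draw one time to the MRCA by summing the exponential waiting times."""
    t = 0.0
    for k in range(n, 1, -1):
        t += rng.expovariate(k * (k - 1) / 2)  # time while k lineages remain
    return t


rng = random.Random(0)
draws = [simulate_tmrca(10, rng) for _ in range(20000)]
print("simulated mean:", sum(draws) / len(draws))
print("expected      :", expected_tmrca(10))
```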
The coalescent model: adding recombination
› Multiple sequences and loci analysis
› HMM: hidden states = (possible) coalescent trees for each locus
› CTMC: probability of the alleles at the leaves
Identity by descent (IBD)
[Figure: an IBD tract]
diCal-IBD: CTMCs and HMMs
› IBD applications
  › Association mapping
  › Demography inference
  › Detection of selection
› First method to use the coalescent with recombination
› One of the first methods to use sequence data
› Outperforms SNP-based methods
› Comparable with sequence-based method
P. Tataru, J. A. Nirody, and Y. S. Song. diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals. Bioinformatics, 30(23):3430-3431, 2014.
Selection and IBD
› IBD tract length depends on MRCA
  › The more recent the MRCA, the longer the tract
› Recent MRCA can be an indication of positive selection
  › IBD can be used for detecting positive selection¹
  › Standing variation
¹ Albrechtsen A et al. Genetics 2010;186:295-308
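The link between MRCA age and tract length can be illustrated with a common textbook approximation, which is an assumption here and not taken from the slides: two haplotypes whose MRCA lived t generations ago are separated by 2t meioses, so with a constant recombination rate and no interference the distance from a focal site to the nearest crossover on each side is exponential with rate 2t (in Morgans). The expected tract length around the site is then 1/t Morgans, or 100/t cM. Function names are hypothetical.

```python
import random


def expected_ibd_length_cM(t_generations):
    """Expected IBD tract length (cM) around a site with MRCA t generations ago."""
    return 100.0 / t_generations


def simulate_ibd_length_cM(t_generations, rng):
    """Tract = distance to the nearest crossover on the left plus on the right."""
    left = rng.expovariate(2 * t_generations)   # Morgans
    right = rng.expovariate(2 * t_generations)  # Morgans
    return 100.0 * (left + right)


rng = random.Random(0)
for t in (10, 50, 250):
    mean = sum(simulate_ibd_length_cM(t, rng) for _ in range(20000)) / 20000
    print(t, expected_ibd_length_cM(t), round(mean, 2))
```

Under this approximation a fivefold older MRCA gives a fivefold shorter expected tract, which is the relationship the slide relies on.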
The SLC24A5 gene: diCal-IBD
› SLC24A5
  › Major influence on natural skin color variation
  › Under positive selection in Europeans¹
¹ Wilde S et al. PNAS 2014;111(13):4832-4837
Population genetics: Methods vs Data
[Figure: Beta with spikes, diCal-IBD, diCal, ARGWeaver, MSMC, ms, MaCS, PSMC, coalHMM and Kim Tree positioned against data; annotation: dimension reduction]
Thank you for your attention!

PDF
mgsa_poster
PDF
PaulaTataruOxford
write_thesis
Thiele
PhDretreat2014
PhDretreat2011
part A
birc-csd2012
TreeOfLife-jeopardy-2014
AB-RNA-Mfold&SCFGs-2011
AB-RNA-comparison-2011
AB-RNA-alignments-2011
AB-RNA-Nussinov-2011
AB-RNA-SCFGdesign=2010
AB-RNA-SCFG-2010
AB-RNA-alignments-2010
AB-RNA-Nus-2010
PaulaTataruVienna
PaulaTataruCSHL
PaulaTataruAarhus
mgsa_poster
PaulaTataruOxford

PaulaTataru_PhD_defense

  • 1. Deciphering the story in our DNA. Inference of population history and patterns from molecular data. PhD defense, Paula Tataru, Aarhus University, Bioinformatics Research Centre. Aarhus, January 23rd 2015. Supervisors: Christian N.S. Pedersen & Asger Hobolth
  • 2. My PhD studies: Math, Computer science, Bioinformatics
  • 3. My PhD studies: Mathematical modeling, Implementation, Bioinformatics
  • 4. My PhD studies: Mathematical modeling, Implementation, Bioinformatics › Population genetics › Sequence analysis › Pattern matching › Structural bioinformatics
  • 5. Mathematical modeling toolbox: SCFG, DFA, DTMC, HMM, CTMC
  • 6. Mathematical modeling toolbox: Stochastic Context Free Grammar (SCFG), Deterministic Finite Automaton (DFA), Discrete Time Markov Chain (DTMC), Hidden Markov Model (HMM), Continuous Time Markov Chain (CTMC)
  • 7. Research overview • 2012, BMC Bioinformatics: Evolving stochastic context-free grammars for RNA secondary structure prediction (SCFG) • 2013, Bioinformatics: Oxfold: kinetic folding of RNA using stochastic context-free grammars and evolutionary information (SCFG) • 2011, BMC Bioinformatics: Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains (CTMC) • 2013, Biology: Algorithms for hidden Markov models restricted to occurrences of regular expressions (HMM & DFA) • 2014, Bioinformatics: diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals (CTMC & HMM) • In preparation: Motif discovery in ranked lists of sequences (DTMC & DFA) • In preparation: Modeling allele frequency data under a Wright-Fisher model of drift, mutation and selection: the Beta distribution approach (DTMC) › Research areas: Population genetics, Sequence analysis, Pattern matching, Structural bioinformatics
  • 10. Outline • 2012, BMC Bioinformatics: Evolving stochastic context-free grammars for RNA secondary structure prediction (SCFG) • 2013, Bioinformatics: Oxfold: kinetic folding of RNA using stochastic context-free grammars and evolutionary information (SCFG) › Research areas: Population genetics, Sequence analysis, Pattern matching, Structural bioinformatics › Part 1, modeling toolbox: SCFG, DFA, DTMC, HMM, CTMC › Part 2: • 2011, BMC Bioinformatics: Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains (CTMC) • 2013, Biology: Algorithms for hidden Markov models restricted to occurrences of regular expressions (HMM & DFA) • In preparation: Motif discovery in ranked lists of sequences (DTMC & DFA) • In preparation: Modeling allele frequency data under a Wright-Fisher model of drift, mutation and selection: the Beta distribution approach (DTMC) › Part 3: • 2014, Bioinformatics: diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals (CTMC & HMM)
  • 12. Population genetics: the Wright-Fisher model › Evolution of a population forward in time › Follow the change of the allele count [Figure: individuals over generations (time); allele count per generation: 3, 2, 4, 3, 4, 5, 5] (1. Modeling toolbox)
  • 13. Discrete Time Markov Chain (DTMC) › Allele count per generation in a population of size N › States {0, 1, …, N}, 0 ≤ i, j ≤ N › Transitions: binomial, i → j with probability Bin(j | N, i/N), j → i with Bin(i | N, j/N), self-transitions Bin(i | N, i/N) and Bin(j | N, j/N) [Figure: allele counts 3, 2, 4, 3, 4, 5, 5 with transition probabilities 0.23, 0.08, 0.20, 0.26, 0.33, 1 along the path]
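The binomial transitions above can be sketched in a few lines. This is a minimal illustration, not code from the thesis: the population size N = 5 is an assumption (it is not stated on the slide, but it reproduces the probabilities shown in the figure), and the names `binom_pmf` and `wright_fisher_matrix` are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def wright_fisher_matrix(N):
    """Transition matrix of the allele-count chain: row i holds Bin(j | N, i/N), j = 0..N."""
    return [[binom_pmf(j, N, i / N) for j in range(N + 1)] for i in range(N + 1)]

N = 5                          # assumed population size
P = wright_fisher_matrix(N)
path = [3, 2, 4, 3, 4, 5, 5]   # allele counts from the slide
steps = [round(P[a][b], 2) for a, b in zip(path, path[1:])]
print(steps)                   # one transition probability per generation
```

States 0 and N are absorbing in this matrix: once the allele is lost or fixed, the count no longer changes.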
  • 14. Hidden Markov Model › Allele count per generation in a population of size N › Difficult to measure [Figure: individuals over generations (time); allele count 3, 2, 4, 3, 4, 5, 5]
  • 15. Hidden Markov Model › Allele count per generation in a population of size N › Difficult to measure › Directly affects allele count of a sample of size n [Figure: individuals over generations (time); allele count 3, 2, 4, 3, 4, 5, 5]
  • 16. Hidden Markov Model › Allele count per generation in a population of size N › Difficult to measure › Directly affects allele count of a sample of size n [Figure: individuals over generations (time); allele count 3, 2, 4, 3, 4, 5, 5; sample allele count 2, 1, 2, 2, 2, 3, 3]
  • 17. Hidden Markov Model › Allele count per generation in a population of size N › Difficult to measure › Directly affects allele count of a sample of size n [Figure: allele count 3, 2, 4, 3, 4, 5, 5, labelled "Hidden Markov chain"; sample allele count 2, 1, 2, 2, 2, 3, 3]
  • 18. Hidden Markov Model › Allele count per generation in a population of size N › Difficult to measure › Directly affects allele count of a sample of size n [Figure: allele count 3, 2, 4, 3, 4, 5, 5; sample allele count 2, 1, 2, 2, 2, 3, 3, labelled "Observable data"]
  • 19. Hidden Markov Model › Transitions (DTMC): binomial, i → j with Bin(j | N, i/N), j → i with Bin(i | N, j/N), self-transitions Bin(i | N, i/N) and Bin(j | N, j/N) › Emissions (data): binomial, state i emits k ∈ {0, 1, 2, 3} with Bin(k | n, i/N), state j emits k with Bin(k | n, j/N)
  • 20. Hidden Markov Model › Standard algorithms › Forward: likelihood of data › Viterbi: global decoding › Posterior decoding: local decoding [Diagram: hidden states i and j with binomial transitions and binomial emissions 0 to 3]
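Two of the standard algorithms named above, forward and Viterbi, can be sketched for this binomial HMM as follows. The values N = 5 and n = 3 and the uniform initial distribution are assumptions made for the example (the slides do not state them), all function names are hypothetical, and no scaling or log-space arithmetic is used, so this only suits short observation sequences.

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def build_hmm(N, n):
    """Binomial transitions Bin(j | N, i/N) and emissions Bin(k | n, i/N)."""
    states = range(N + 1)
    trans = [[binom_pmf(j, N, i / N) for j in states] for i in states]
    emit = [[binom_pmf(k, n, i / N) for k in range(n + 1)] for i in states]
    init = [1 / (N + 1)] * (N + 1)   # assumption: uniform start
    return init, trans, emit

def forward_likelihood(obs, init, trans, emit):
    """Likelihood of the observations, summed over all hidden paths."""
    S = range(len(init))
    f = [init[i] * emit[i][obs[0]] for i in S]
    for y in obs[1:]:
        f = [emit[j][y] * sum(f[i] * trans[i][j] for i in S) for j in S]
    return sum(f)

def viterbi(obs, init, trans, emit):
    """Most probable hidden path (global decoding) and its joint probability."""
    S = range(len(init))
    v = [init[i] * emit[i][obs[0]] for i in S]
    back = []
    for y in obs[1:]:
        ptr = [max(S, key=lambda i: v[i] * trans[i][j]) for j in S]
        v = [v[ptr[j]] * trans[ptr[j]][j] * emit[j][y] for j in S]
        back.append(ptr)
    state = max(S, key=lambda i: v[i])
    best, path = v[state], [state]
    for ptr in reversed(back):       # follow the back-pointers to time 0
        state = ptr[state]
        path.append(state)
    return path[::-1], best

init, trans, emit = build_hmm(N=5, n=3)
obs = [2, 1, 2, 2, 2, 3, 3]          # sample allele counts from the slide
print(forward_likelihood(obs, init, trans, emit))
print(viterbi(obs, init, trans, emit))
```

Posterior (local) decoding would additionally need a backward pass; it is left out to keep the sketch short.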
  • 21. Continuous Time Markov Chain (CTMC) › Allele count in time in a population of size N › States {0, 1, …, N}, 0 < i < N › Rates r_i = i (N - i) / N [Diagram: states i-1, i, i+1; each state k jumps to k-1 and to k+1 at rate r_k, with diagonal entry -2 r_k]
  • 22. Continuous Time Markov Chain (CTMC) › Allele count in time in a population of size N › States {0, 1, …, N}, 0 < i < N › Rates r_i = i (N - i) / N [Figure: allele count in the DTMC (3, 2, 4, 3, 4, 5, 5; transition probabilities 0.23, 0.08, 0.20, 0.26, 0.33, 1) compared with allele count in the CTMC (jumps of one allele up or down, each with probability 1/2, separated by waiting times t)]
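A small sketch of the CTMC described above: the rate matrix built from r_i = i (N - i) / N, and a simulation that draws exponential waiting times and then moves up or down with probability 1/2. N = 5 and the function names are assumptions for illustration; because r_0 = r_N = 0, states 0 and N are treated as absorbing.

```python
import random

def rates(N):
    """r_i = i (N - i) / N for i = 0..N (zero at the boundaries)."""
    return [i * (N - i) / N for i in range(N + 1)]

def rate_matrix(N):
    """Rate matrix Q: off-diagonal r_i to each neighbour, diagonal -2 r_i."""
    r = rates(N)
    Q = [[0.0] * (N + 1) for _ in range(N + 1)]
    for i in range(1, N):
        Q[i][i - 1] = r[i]
        Q[i][i + 1] = r[i]
        Q[i][i] = -2 * r[i]
    return Q

def simulate(start, N, rng):
    """Return (state, waiting time) pairs until the chain is absorbed at 0 or N."""
    r = rates(N)
    state, trajectory = start, []
    while 0 < state < N:
        wait = rng.expovariate(2 * r[state])   # total exit rate is 2 r_i
        trajectory.append((state, wait))
        state += rng.choice((-1, 1))           # each neighbour with probability 1/2
    trajectory.append((state, float("inf")))   # absorbed: the chain stays here
    return trajectory

print(simulate(3, 5, random.Random(0)))
```

Unlike the DTMC, which can move from 2 to 4 in one generation, this chain changes the count by one allele per jump.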
  • 23. Outline • 2012, BMC Bioinformatics: Evolving stochastic context-free grammars for RNA secondary structure prediction (SCFG) • 2013, Bioinformatics: Oxfold: kinetic folding of RNA using stochastic context-free grammars and evolutionary information (SCFG) › Research areas: Population genetics, Sequence analysis, Pattern matching, Structural bioinformatics › Part 1, modeling toolbox: SCFG, DFA, DTMC, HMM, CTMC › Part 2: • 2011, BMC Bioinformatics: Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains (CTMC) • 2013, Biology: Algorithms for hidden Markov models restricted to occurrences of regular expressions (HMM & DFA) • In preparation: Motif discovery in ranked lists of sequences (DTMC & DFA) • In preparation: Modeling allele frequency data under a Wright-Fisher model of drift, mutation and selection: the Beta distribution approach (DTMC) › Part 3: • 2014, Bioinformatics: diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals (CTMC & HMM)
  • 25. Motif discovery for DTMCs using DFAs › DTMC describes sequences › Allele count in a population › DFA encodes pattern › (i)+ (i+1)+ (i+2)+ [Diagrams: DTMC with binomial transitions between allele counts i and j; DFA with states 0, 1, 2, 3 and transitions labelled i, i+1, i+2 and k; example allele counts 3, 2, 4, 3, 4, 5, 5] M. M. Nielsen, P. Tataru, T. Madsen, A. Hobolth, and J. S. Pedersen. Motif discovery in ranked lists of sequences. In preparation. (2. Brief overview)
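To make the pattern concrete, here is a sketch of a four-state DFA for (i)+ (i+1)+ (i+2)+ run over a sequence of allele counts. The transition table is a reconstruction under the assumption that the label k on the slide stands for any other symbol; `make_dfa` and `count_occurrences` are hypothetical names, and counting entries into the accepting state is one possible way to define an occurrence.

```python
def make_dfa(i):
    """Transition function of a DFA with states 0..3; state 3 is accepting."""
    def step(state, symbol):
        if symbol == i:
            return 1                         # a run of i starts or continues
        if symbol == i + 1 and state in (1, 2):
            return 2                         # inside the run of i+1
        if symbol == i + 2 and state in (2, 3):
            return 3                         # inside the run of i+2: pattern matched
        return 0                             # any other symbol resets the match
    return step

def count_occurrences(sequence, i):
    """Count how often the automaton enters its accepting state."""
    step, state, count = make_dfa(i), 0, 0
    for symbol in sequence:
        new_state = step(state, symbol)
        if new_state == 3 and state != 3:
            count += 1
        state = new_state
    return count

print(count_occurrences([3, 2, 4, 3, 4, 5, 5], i=3))
```

On the example sequence from the slide, the match is the stretch 3, 4, 5, 5 at the end.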
  • 26. Motif discovery for DTMCs using DFAs › Does the pattern (i)+ (i+1)+ (i+2)+ occur more frequently in specific environments? [Figure: populations in different environments, each with a DTMC sequence of allele counts over generations (time), e.g. … 2 4 3 4 5 5]
  • 27. Motif discovery for DTMCs using DFAs › Does the pattern (i)+ (i+1)+ (i+2)+ occur more frequently in specific environments? › Contribution › DFA › New approach to significance (random walk) [Figure: populations in different environments, each with a DTMC sequence of allele counts over generations (time), e.g. … 2 4 3 4 5 5]
  • 28. Restricted algorithms for HMMs using DFAs › HMM describes problem › Allele count in a population › DFA encodes pattern › (i)+ (i+1)+ (i+2)+ [Diagrams: HMM with binomial transitions and emissions; DFA with states 0, 1, 2, 3; sample allele counts 2, 1, 2, 2, 2, 3, 3 and population allele counts 3, 2, 4, 3, 4, 5, 5] P. Tataru, A. Sand, A. Hobolth, T. Mailund, and C. N. Pedersen. Algorithms for hidden Markov models restricted to occurrences of regular expressions. Biology, 2(4):1282–1295, 2013
  • 29. Restricted algorithms for HMMs using DFAs › Contribution: new algorithms › Calculate distribution of #pattern occurrences › Adapt decoding algorithms to include #pattern occurrences [Diagrams: HMM with binomial transitions and emissions; DFA with states 0, 1, 2, 3]
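One way to obtain the distribution of the number of pattern occurrences is to run the forward recursion over triples (hidden state, DFA state, occurrence count). The sketch below illustrates that idea only; it is not the algorithm from the cited paper. N = 5, n = 3, the uniform start, the DFA transitions and all names are assumptions.

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def build_hmm(N, n):
    states = range(N + 1)
    trans = [[binom_pmf(j, N, i / N) for j in states] for i in states]
    emit = [[binom_pmf(k, n, i / N) for k in range(n + 1)] for i in states]
    init = [1 / (N + 1)] * (N + 1)   # assumption: uniform start
    return init, trans, emit

def dfa_step(state, symbol, i):
    """Assumed DFA for (i)+ (i+1)+ (i+2)+; state 3 is accepting."""
    if symbol == i:
        return 1
    if symbol == i + 1 and state in (1, 2):
        return 2
    if symbol == i + 2 and state in (2, 3):
        return 3
    return 0

def occurrence_distribution(obs, init, trans, emit, i):
    """Return {k: P(observations, pattern occurs k times in the hidden path)}."""
    S = range(len(init))
    # table maps (hidden state, DFA state, count) to a joint probability
    table = {(s, dfa_step(0, s, i), 0): init[s] * emit[s][obs[0]] for s in S}
    for y in obs[1:]:
        new = {}
        for (s, q, k), p in table.items():
            if p == 0.0:
                continue
            for s2 in S:
                q2 = dfa_step(q, s2, i)
                k2 = k + (q2 == 3 and q != 3)   # count entries into state 3
                key = (s2, q2, k2)
                new[key] = new.get(key, 0.0) + p * trans[s][s2] * emit[s2][y]
        table = new
    dist = {}
    for (_, _, k), p in table.items():
        dist[k] = dist.get(k, 0.0) + p
    return dist

init, trans, emit = build_hmm(N=5, n=3)
print(occurrence_distribution([2, 1, 2, 2, 2, 3, 3], init, trans, emit, i=3))
```

Summing the returned values over k gives back the ordinary forward likelihood, and dividing each value by that sum gives the distribution conditional on the data.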
  • 30. Restricted algorithms for HMMs using DFAs. P. Tataru, A. Sand, A. Hobolth, T. Mailund, and C. N. Pedersen. Algorithms for hidden Markov models restricted to occurrences of regular expressions. Biology, 2(4):1282–1295, 2013
  • 31. Conditional expectations for CTMCs [Figure: CTMC allele-count path with jumps of probability 1/2 and waiting times t; diagram of states i-1, i, i+1 with rates r_{i-1}, r_i, r_{i+1}] P. Tataru and A. Hobolth. Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains. BMC Bioinformatics, 12(1):465, 2011
  • 32. Conditional expectations for CTMCs › Required expectations › Time › Jumps [Figure: chain observed in state 3 at the start and in state 5 after time t; diagram of states i-1, i, i+1 with rates r_{i-1}, r_i, r_{i+1}]
  • 33. Conditional expectations for CTMCs › Required expectations › Time › Jumps › Contribution: compare and extend existing methods › Accuracy › Speed [Figure: chain observed in state 3 at the start and in state 5 after time t; diagram of states i-1, i, i+1 with rates r_{i-1}, r_i, r_{i+1}]
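The two expectations listed above (time spent in a state, number of jumps between two states, both conditional on the start and end state) reduce to integrals of the form ∫ P_ac(s) P_db(t - s) ds over [0, t], divided by P_ab(t). The sketch below evaluates them by uniformization, which is swapped in here as one standard technique; the slides do not say which methods were compared. N = 5, t = 1, the truncation at 60 terms and all names are assumptions.

```python
from math import exp

def rate_matrix(N):
    """Q with rates r_i = i (N - i) / N to each neighbour and diagonal -2 r_i."""
    Q = [[0.0] * (N + 1) for _ in range(N + 1)]
    for i in range(1, N):
        r = i * (N - i) / N
        Q[i][i - 1], Q[i][i + 1], Q[i][i] = r, r, -2 * r
    return Q

def matmul(A, B):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]

def conditional_integral(Q, a, b, c, d, t, terms=60):
    """(1 / P_ab(t)) * integral over [0, t] of P_ac(s) P_db(t - s) ds."""
    size = len(Q)
    mu = max(-Q[i][i] for i in range(size))          # uniformization rate
    R = [[(i == j) + Q[i][j] / mu for j in range(size)] for i in range(size)]
    powers = [[[float(i == j) for j in range(size)] for i in range(size)]]
    for _ in range(terms):
        powers.append(matmul(powers[-1], R))         # powers[m] = R^m
    weight = exp(-mu * t)                            # Poisson(0; mu t)
    prob, integral = 0.0, 0.0
    for m in range(terms):
        prob += weight * powers[m][a][b]
        weight *= mu * t / (m + 1)                   # now Poisson(m + 1; mu t)
        inner = sum(powers[l][a][c] * powers[m - l][d][b] for l in range(m + 1))
        integral += weight / mu * inner
    return integral / prob

Q, t = rate_matrix(5), 1.0
# Expected time spent in state 4, given a start in 3 and an end in 5 after time t.
print(conditional_integral(Q, 3, 5, 4, 4, t))
# Expected number of jumps 4 -> 5 under the same conditioning.
print(Q[4][5] * conditional_integral(Q, 3, 5, 4, 5, t))
```

Two sanity checks follow from the definitions: the expected times over all states add up to t, and the expected number of upward jumps minus downward jumps equals the net change, here 2.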
  • 34. Beta with spikes: approximate Wright-Fisher [Figure: individuals over generations (time); allele count 3, 2, 4, 3, 4, 5, 5] P. Tataru, T. Bataillon, and A. Hobolth. Modeling allele frequency data under a Wright-Fisher model of drift, mutation and selection: the Beta distribution approach. In preparation
  • 35. Beta with spikes: approximate Wright-Fisher › What is the distribution of allele count in the current generation? [Figure: individuals over generations (time); allele count 3, 2, 4, 3, 4, 5, 5]
  • 36. Beta with spikes: approximate Wright-Fisher › What is the distribution of allele count in the current generation? › Use the beta distribution › Contribution: › Add spikes (better fit) › Include selection [Figure: individuals over generations (time); allele count 3, 2, 4, 3, 4, 5, 5]
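As a rough illustration of the beta idea above, the sketch below compares the exact Wright-Fisher distribution after a number of generations with a moment-matched Beta(alpha, beta) for the allele frequency. It covers pure drift only (no mutation or selection) and is not the method from the cited manuscript; N = 20, the starting count 6, the 10 generations and the function names are assumptions.

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def exact_distribution(N, start, generations):
    """Distribution of the allele count after the given number of generations of drift."""
    P = [[binom_pmf(j, N, i / N) for j in range(N + 1)] for i in range(N + 1)]
    dist = [0.0] * (N + 1)
    dist[start] = 1.0
    for _ in range(generations):
        dist = [sum(dist[i] * P[i][j] for i in range(N + 1)) for j in range(N + 1)]
    return dist

def beta_parameters(N, start, generations):
    """Beta(alpha, beta) whose mean and variance match the frequency under pure drift."""
    x0 = start / N
    var = x0 * (1 - x0) * (1 - (1 - 1 / N) ** generations)
    total = x0 * (1 - x0) / var - 1          # alpha + beta
    return x0 * total, (1 - x0) * total

N, start, generations = 20, 6, 10
dist = exact_distribution(N, start, generations)
alpha, beta = beta_parameters(N, start, generations)
print(alpha, beta)
print("mass at loss and fixation:", dist[0], dist[N])
```

The exact chain puts positive probability on loss (count 0) and fixation (count N), whereas a plain beta density has no point mass at frequencies 0 and 1, which is one reading of why spikes at the boundaries give a better fit.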
  • 37. Beta with spikes: approximate Wright-Fisher. P. Tataru, T. Bataillon, and A. Hobolth. Modeling allele frequency data under a Wright-Fisher model of drift, mutation and selection: the Beta distribution approach. In preparation
  • 38. Outline • 2012, BMC Bioinformatics: Evolving stochastic context-free grammars for RNA secondary structure prediction (SCFG) • 2013, Bioinformatics: Oxfold: kinetic folding of RNA using stochastic context-free grammars and evolutionary information (SCFG) › Research areas: Population genetics, Sequence analysis, Pattern matching, Structural bioinformatics › Part 1, modeling toolbox: SCFG, DFA, DTMC, HMM, CTMC › Part 2: • 2011, BMC Bioinformatics: Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains (CTMC) • 2013, Biology: Algorithms for hidden Markov models restricted to occurrences of regular expressions (HMM & DFA) • In preparation: Motif discovery in ranked lists of sequences (DTMC & DFA) • In preparation: Modeling allele frequency data under a Wright-Fisher model of drift, mutation and selection: the Beta distribution approach (DTMC) › Part 3: • 2014, Bioinformatics: diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals (CTMC & HMM)
  • 40. Population genetics: the coalescent model › Trace the genealogy of sampled individuals backward in time [Figure: individuals over generations (time)] (3. Overview: diCal-IBD)
  • 42. Population genetics: the coalescent model › Trace the genealogy of sampled individuals backward in time › Coalescent process terminates when reaching the most recent common ancestor (MRCA) › Time to coalescent event: CTMC [Figure: individuals over generations (time), with the MRCA marked]
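The waiting times of the coalescent can be sketched as a pure death chain on the number of lineages. The example assumes the standard Kingman coalescent with constant population size and time measured in coalescent units, so that k lineages coalesce at rate k (k - 1) / 2; these assumptions and the function names are not taken from the slides.

```python
import random

def expected_tmrca(n):
    """Expected time to the MRCA of n sampled lineages, in coalescent units."""
    return sum(2 / (k * (k - 1)) for k in range(2, n + 1))

def simulate_tmrca(n, rng):
    """Add up exponential waiting times while the lineages coalesce from n down to 1."""
    total = 0.0
    for k in range(n, 1, -1):
        total += rng.expovariate(k * (k - 1) / 2)   # rate = number of lineage pairs
    return total

rng = random.Random(1)
draws = [simulate_tmrca(5, rng) for _ in range(20000)]
print(expected_tmrca(5), sum(draws) / len(draws))
```

The sum telescopes to 2 (1 - 1/n), so the expected time to the MRCA stays below 2 coalescent units however many individuals are sampled.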
  • 43. Recombination
  • 44. The coalescent model: adding recombination
  • 50. The coalescent model: adding recombination › Multiple sequences and loci analysis › HMM: hidden states = (possible) coalescent trees for each locus › CTMC: probability of the alleles at the leaves
  • 51. Identity by descent [Figure: IBD tract]
  • 52. diCal-IBD: CTMCs and HMMs › IBD applications › Association mapping › Demography inference › Detection of selection. P. Tataru, J. A. Nirody, and Y. S. Song. diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals. Bioinformatics, 30(23):3430-3431, 2014
  • 53. diCal-IBD: CTMCs and HMMs › IBD applications › Association mapping › Demography inference › Detection of selection › First method to use the coalescent with recombination › One of the first methods to use sequence data
  • 54. diCal-IBD: CTMCs and HMMs › IBD applications › Association mapping › Demography inference › Detection of selection › First method to use the coalescent with recombination › One of the first methods to use sequence data › Outperforms SNP-based methods › Comparable with sequence-based method
  • 55. diCal-IBD: CTMCs and HMMs. P. Tataru, J. A. Nirody, and Y. S. Song. diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals. Bioinformatics, 30(23):3430-3431, 2014
  • 56. Selection and IBD › IBD tract length depends on MRCA › The more recent the MRCA, the longer the tract › Recent MRCA can be an indication of positive selection › IBD can be used for detecting positive selection [1] › Standing variation. [1] Albrechtsen A et al. Genetics 2010;186:295-308
  • 57. Selection and IBD › IBD segment length depends on MRCA › The more recent the MRCA, the longer the segment › Recent MRCA can be an indication of recent selection › IBD can be used for detecting selection › Standing variation. [1] Albrechtsen A et al. Genetics 2010;186:295-308
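A back-of-the-envelope sketch of why a more recent MRCA gives longer tracts: two haplotypes whose MRCA lived t generations ago are separated by 2 t meioses, so with a recombination rate r per base pair per generation, breakpoints arrive along the sequence at a rate of roughly 2 t r per base pair. The exponential approximation, the rate 1e-8 and the function name are assumptions for illustration, not values from the slides.

```python
def expected_distance_to_breakpoint(tmrca_generations, recomb_rate_per_bp):
    """Mean distance (bp) from a focal site to the nearest breakpoint on one side."""
    return 1 / (2 * tmrca_generations * recomb_rate_per_bp)

r = 1e-8   # assumed recombination rate per bp per generation
for t in (100, 1000, 10000):
    print(t, expected_distance_to_breakpoint(t, r))
```

Under this approximation, a tenfold more recent MRCA gives a tenfold longer expected stretch of shared sequence.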
  • 58. The SLC24A5 gene: diCal-IBD › SLC24A5 › Major influence on natural skin color variation › Under positive selection in Europeans [1]. P. Tataru, J. A. Nirody, and Y. S. Song. diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals. Bioinformatics, 30(23):3430-3431, 2014. [1] Wilde S et al. PNAS 2014;111(13):4832-4837
  • 59. Outline • 2012, BMC Bioinformatics: Evolving stochastic context-free grammars for RNA secondary structure prediction (SCFG) • 2013, Bioinformatics: Oxfold: kinetic folding of RNA using stochastic context-free grammars and evolutionary information (SCFG) › Research areas: Population genetics, Sequence analysis, Pattern matching, Structural bioinformatics › Part 1, modeling toolbox: SCFG, DFA, DTMC, HMM, CTMC › Part 2: • 2011, BMC Bioinformatics: Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains (CTMC) • 2013, Biology: Algorithms for hidden Markov models restricted to occurrences of regular expressions (HMM & DFA) • In preparation: Motif discovery in ranked lists of sequences (DTMC & DFA) • In preparation: Modeling allele frequency data under a Wright-Fisher model of drift, mutation and selection: the Beta distribution approach (DTMC) › Part 3: • 2014, Bioinformatics: diCal-IBD: demography-aware inference of identity-by-descent tracts in unrelated individuals (CTMC & HMM)
  • 60. Population genetics: Methods vs Data
  • 61. Population genetics: Methods vs Data [Figure labels: diCal, ARGWeaver, MSMC, ms, MaCS, PSMC, coalHMM, Kim Tree]
  • 62. Population genetics: Methods vs Data [Figure labels: Beta with spikes, diCal-IBD, diCal, ARGWeaver, MSMC, ms, MaCS, PSMC, coalHMM, Kim Tree]
  • 63. Population genetics: Methods vs Data [Figure labels: Beta with spikes, diCal-IBD, diCal, ARGWeaver, MSMC, ms, MaCS, PSMC, coalHMM, Kim Tree; Data]
  • 65. Population genetics: Methods vs Data [Figure labels: Beta with spikes, diCal-IBD, diCal, ARGWeaver, MSMC, ms, MaCS, PSMC, coalHMM, Kim Tree; Data; dimension reduction]