On the identifiability of phylogenetic networks under a pseudolikelihood model

Arrigo Coen
University of Wisconsin-Madison
Madison, WI 53705
coencoria@wisc.edu
Cécile Ané
Madison, WI 53705
cecile.ane@wisc.edu
Claudia Solís-Lemus
Madison, WI 53705
solislemus@wisc.edu
Meeting of Systematics, Biogeography and Evolution (SBE) 2020

Phylogenetic analysis is the means of estimating the evolutionary relationships of
species. This type of analysis allows us to estimate ancestral traits of extinct
populations and to identify characteristics of fast-adapting organisms. For instance,
Forster et al. (2020) study the SARS-CoV-2 phylogeny to trace its routes of infection.
In general, the development of statistical and machine-learning theory for phylogenetics
is paramount in evolutionary biology.
A phylogenetic network is a useful tool to model evolution. This structure is a
generalization of a phylogenetic tree. In mathematical terms, a phylogenetic tree is a
fully bifurcating graph in which internal nodes represent ancestral species that over
time differentiate into two separate species, as depicted in Figure 1. Recently, scientists
have challenged the notion that evolution can be represented by a fully bifurcating
process, as such a process cannot capture important biological realities like hybridization,
introgression, or horizontal gene transfer, which require two fully separated branches to
join back. A phylogenetic network is the generalization that accommodates these
phenomena, and recent years have seen the development of methods to reconstruct
phylogenetic networks (Degnan, 2018).
Why care about phylogenetic network estimation?
The complexity of network space makes it hard to find the “best” network. With
each additional species, the problem of finding this network becomes exponentially more
computationally demanding. Moreover, many algorithms lack identifiability guarantees,
meaning that it is hard to know whether their output network is the correct one.
In this work, we present a possible solution to this identifiability problem. We
describe the method of Solís-Lemus and Ané (2016) and show under which conditions
this fast algorithm correctly identifies the underlying network topology.
Figure 1. Phylogenetic network of the bear species from Kumar et al. (2017). Here,
the hybrid node (green) connects the ancestral population of the polar
bear/brown bear/American black bear clade with the Asiatic black bear.
One standard input to estimate a phylogenetic network is a set of phylogenetic gene
trees. These trees represent the evolution of each gene under study. By imposing
some criteria, we can use the gene tree topologies to find a phylogenetic network
that explains their variability. An example is shown in Figure 2.
Figure 2. Example of a theoretical phylogenetic network and its input gene
trees. The blue and red trees are displayed within branches of the gray
network for the species A-E. This network gathers both topologies of its
input trees by having a hybrid branch.

The direct calculation of the likelihood of a network is computationally very expensive.
One reason is the massive number of possible topologies. For instance, the number
of unrooted $n$-taxon binary trees is $(2n-5)!/(2^{n-3}(n-3)!)$, and each has
$\binom{2n-2}{2k}$ possible selections of $k$ hybrid nodes (see Huson et al., 2010). For
each possible topology, it is also expensive to calculate the likelihood as a function of
the branch lengths. Consequently, scalable techniques are required.
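To give a sense of scale, the tree count quoted above can be evaluated directly. The following is a minimal sketch; the function name `num_unrooted_binary_trees` is an assumption made for illustration and is not part of any cited software. It covers only the number of tree topologies, not the placement of hybrid nodes.

```python
from math import factorial


def num_unrooted_binary_trees(n: int) -> int:
    """Number of unrooted binary tree topologies on n labeled taxa (n >= 3).

    Evaluates (2n - 5)! / (2^(n - 3) * (n - 3)!) with exact integer arithmetic.
    """
    if n < 3:
        raise ValueError("the formula applies to n >= 3 taxa")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


# The count grows very quickly with the number of taxa.
for n in (4, 5, 6, 10):
    print(n, num_unrooted_binary_trees(n))
```

Even before hybrid nodes are considered, ten taxa already give about two million tree topologies, which is why exhaustive likelihood evaluation does not scale.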
An alternative to the likelihood of a phylogenetic network is presented in Solís-Lemus
and Ané (2016). This pseudolikelihood methodology uses the quartet subnetworks to
find the optimum topology; a quartet is a set of four taxa. Let 𝐺 be the set of all possible
quartets of a set of taxa; then we define the pseudolikelihood for a fixed topology 𝑁 as
\[ \tilde{L}(N) = \prod_{g \in G} L(g), \]
where 𝐿(𝑔) is the likelihood of the quartet 𝑔. Since the quartets are not independent, the
product of likelihoods is not a true likelihood for the network.
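The product over quartets can be sketched as follows, working on the log scale so that the product becomes a sum. This is only an illustration under stated assumptions: the helper names `all_quartets` and `log_pseudolikelihood` are hypothetical, and the per-quartet likelihood values are placeholders rather than values derived from any network.

```python
from itertools import combinations
from math import log


def all_quartets(taxa):
    """Every 4-taxon subset of the taxa, each returned as a sorted tuple."""
    return list(combinations(sorted(taxa), 4))


def log_pseudolikelihood(quartet_likelihoods):
    """Log of the product of L(g) over all quartets g, computed as a sum of logs."""
    return sum(log(value) for value in quartet_likelihoods.values())


taxa = ["A", "B", "C", "D", "E"]
quartets = all_quartets(taxa)                  # 5 choose 4 = 5 quartets
toy_likelihoods = {g: 0.5 for g in quartets}   # placeholder values for L(g)
print(len(quartets), log_pseudolikelihood(toy_likelihoods))
```

Because each quartet contributes one independent term to the sum, the per-quartet computations can be distributed across workers.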
To find the quartet likelihood, Solís-Lemus and Ané use the concordance factors (CFs) of
the quartets. The CF of a quartet is the proportion of input gene trees that display
it (Baum, 2007). As an example, consider the input gene trees shown in Figure 3.
Material and Methods: Pseudolikelihood Estimation Using CFs
To calculate the likelihood 𝐿(𝑔) of a given quartet 𝑔 = {𝐴, 𝐵, 𝐶, 𝐷}, let
$Y = (Y_{q_1}, Y_{q_2}, Y_{q_3})$ denote the numbers of gene trees that match each of the
three possible quartet resolutions. Then $Y$ follows a multinomial distribution with
probabilities $(\mathrm{CF}_{q_1}, \mathrm{CF}_{q_2}, \mathrm{CF}_{q_3})$ given by the
theoretical CFs (see Yu et al., 2012). This allows us to write
\[ L(g) \propto \mathrm{CF}_{q_1}^{Y_{q_1}} \, \mathrm{CF}_{q_2}^{Y_{q_2}} \, \mathrm{CF}_{q_3}^{Y_{q_3}}. \]
Thus, given a set of input gene trees, we can use their CFs to calculate the
pseudolikelihood of a network.
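The multinomial expression for a single quartet can be sketched in a few lines. This is an illustrative sketch only: the function name `quartet_log_likelihood` is hypothetical, the counts and CF values in the usage line are made up, and the constant multinomial coefficient is dropped because the likelihood is only stated up to proportionality.

```python
from math import isclose, log


def quartet_log_likelihood(counts, cfs):
    """log L(g) up to an additive constant: the sum of Y_i * log(CF_i) over i.

    counts: numbers of gene trees matching each of the three resolutions.
    cfs: theoretical concordance factors of the three resolutions (sum to 1).
    """
    if len(counts) != 3 or len(cfs) != 3:
        raise ValueError("a quartet has exactly three resolutions")
    if not isclose(sum(cfs), 1.0, abs_tol=1e-9):
        raise ValueError("concordance factors must sum to 1")
    total = 0.0
    for y, cf in zip(counts, cfs):
        if y == 0:
            continue  # a resolution that was never observed contributes nothing
        if cf <= 0.0:
            return float("-inf")  # observed data are impossible under these CFs
        total += y * log(cf)
    return total


# Hypothetical example: 10 gene trees split 7/2/1 across the three resolutions.
print(quartet_log_likelihood((7, 2, 1), (0.7, 0.2, 0.1)))
```

Summing this quantity over all quartets gives the log pseudolikelihood of the candidate network.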
A nice property of this methodology is the robustness of its pseudolikelihood. Since
the input gene trees are unrooted and without branch lengths, rooting errors and
molecular clock errors do not affect the estimation. In addition, to account for gene
tree estimation error, the CFs can be estimated using BUCKy (Ané et al., 2007).
The natural question that arises is whether multiple phylogenetic networks can have
the same set of concordance factors. Here, we characterize which types of
phylogenetic networks can be identified from genomic data using the methodology of
Solís-Lemus and Ané (2016).
Figure 3. Example of input gene trees. The colors indicate the quartet
resolution that each gene tree displays for the 4-taxon subset {A,B,C,D} (see Figure 4).
Since for each set of four taxa there are only three possible unrooted trees (quartets),
we have that:
Figure 4. The three concordance factors for the quartet {A,B,C,D}, with
respect to the input gene trees of Figure 3.
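Tabulating observed concordance factors from a set of gene trees can be sketched as below. The sketch makes several assumptions: each unrooted gene tree is encoded by one side of each of its nontrivial splits, every gene tree contains all four taxa of the quartet, and the four toy gene trees are hypothetical (they are not the trees of Figure 3). The function names are illustrative and do not come from any cited software.

```python
from collections import Counter


def quartet_resolution(splits, quartet):
    """Resolution of the quartet displayed by a tree, or None if unresolved.

    splits: iterable of taxon sets, one side of each nontrivial split.
    A split with exactly two of the four taxa on one side separates them
    from the other two, which fixes the resolution.
    """
    q = frozenset(quartet)
    for side in splits:
        inside = q & frozenset(side)
        if len(inside) == 2:
            return frozenset({inside, q - inside})
    return None


def concordance_factors(gene_trees, quartet):
    """Proportion of gene trees displaying each resolution of the quartet."""
    counts = Counter()
    for splits in gene_trees:
        resolution = quartet_resolution(splits, quartet)
        if resolution is not None:
            counts[resolution] += 1
    total = sum(counts.values())
    return {res: n / total for res, n in counts.items()} if total else {}


# Hypothetical gene trees on taxa A-E, written as their nontrivial splits.
toy_gene_trees = [
    [{"A", "B"}, {"D", "E"}],  # ((A,B),C,(D,E))
    [{"A", "B"}, {"C", "E"}],  # ((A,B),D,(C,E))
    [{"A", "C"}, {"D", "E"}],  # ((A,C),B,(D,E))
    [{"A", "D"}, {"B", "C"}],  # ((A,D),E,(B,C))
]
cfs = concordance_factors(toy_gene_trees, "ABCD")
for resolution, value in cfs.items():
    print(sorted(sorted(pair) for pair in resolution), value)
```

In this toy input, two of the four trees display AB|CD and one each displays AC|BD and AD|BC, so the three values sum to one.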

Here, we present our main results on the identifiability of the pseudolikelihood model.
These results concern topology identifiability and numerical parameter
identifiability. The proofs of these results can be found in Coen et al. (submitted).
To demonstrate identifiability, we assume that the phylogenetic networks are of
level-1. A level-1 network does not have intersecting hybridization cycles. A
hybridization cycle is the minimum cycle, in the graph theory sense, that contains a
hybrid node. Figure 5 illustrates these concepts.
Theorem 1:
Let 𝑁 be a level-1 network, then:
• A hybridization with a 2-node cycle is not detectable.
• A hybridization with a k-node cycle, for k>2, is detectable if all its
subnetworks have at least two taxa.
This theorem establishes that in most cases, except when the hybridization
is trivial, the concordance factors identify the hybridization correctly. A nice
characteristic of this theorem is that it can be applied to each hybridization of a
network to determine which ones are detectable and which ones are not.
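Theorem 1 can be read as a per-hybridization decision rule, sketched below. The function name and the encoding of a hybridization (its cycle size plus the number of taxa in each attached subnetwork) are assumptions made for illustration. The sketch returns `None` when the stated conditions of the theorem do not decide the case, rather than guessing.

```python
def hybridization_detectable(cycle_size, subnetwork_taxa):
    """Apply Theorem 1 to one hybridization of a level-1 network.

    cycle_size: number of nodes k in the hybridization cycle.
    subnetwork_taxa: number of taxa in each subnetwork attached to the cycle.
    Returns False (not detectable), True (detectable), or None when the
    conditions stated in the theorem do not settle the case.
    """
    if cycle_size < 2:
        raise ValueError("a hybridization cycle has at least 2 nodes")
    if not subnetwork_taxa:
        raise ValueError("at least one subnetwork size is required")
    if cycle_size == 2:
        return False
    if all(size >= 2 for size in subnetwork_taxa):
        return True
    return None


print(hybridization_detectable(2, [3, 4]))        # 2-node cycle
print(hybridization_detectable(4, [2, 2, 3, 2]))  # every subnetwork has >= 2 taxa
print(hybridization_detectable(3, [1, 1, 2]))     # not settled by the theorem
```

Calling the function once per hybridization gives the per-network summary described above.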
The definition of detectable hybridizations and Theorem 1 concern the topological
identifiability of a network; the definition of finite identifiability and Theorem 2
concern parameter identifiability.
Results: Identifiability of Pseudolikelihood
Figure 5. Example of level-1 (left) and non level-1 (right) networks. The
difference between them is that in a level-1 network the cycles do not
have nodes/edges in common.
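The level-1 condition, as described in this caption, can be checked with a short sketch. It assumes that the hybridization cycles have already been extracted and are given as collections of node labels; the function name `is_level1` and the node labels are hypothetical. Since two cycles that share an edge also share its end nodes, checking for shared nodes is sufficient under this description.

```python
from itertools import combinations


def is_level1(cycles):
    """True if no two hybridization cycles have a node in common.

    cycles: iterable of collections of node labels, one per hybridization cycle.
    """
    node_sets = [set(cycle) for cycle in cycles]
    return all(a.isdisjoint(b) for a, b in combinations(node_sets, 2))


# Two disjoint cycles versus two cycles that share node 3.
print(is_level1([{1, 2, 3}, {4, 5, 6, 7}]))
print(is_level1([{1, 2, 3}, {3, 4, 5}]))
```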
The following definition identifies the cases in which the system of CFs is unique, in
the sense that no other topology can share it.
Detectable hybridizations:
Let 𝑁 be a level-1 network with a hybridization ℎ, and let 𝑁′ denote a generic copy of 𝑁
with the hybridization ℎ erased, where 𝑁 and 𝑁′ are allowed to have different branch
lengths on the shared edges. We say that the hybridization ℎ is detectable if the system
of CFs of 𝑁 does not match the system of CFs of 𝑁′.
Figure 6. Example of a network 𝑁 (left) and its copy 𝑁′ (right), which does
not contain the hybridization (denoted by green). The small blue
triangles represent subnetworks and in this case the cycle has eight
nodes (8-node cycle).

Reproducibility
All Macaulay2 and Mathematica scripts are available in the GitHub repository
https://github.com/solislemuslab/snaq-identifiability
Parameter identifiability means that the CF equations have a unique solution with
respect to their parameter values; that is, no two distinct sets of parameter values
produce the same set of CFs. The following definition identifies the cases in which the
set of CF equations has finitely many solutions; see Pimentel-Alarcón et al. (2016).
Finitely identifiable:
We say that a hybridization is finitely identifiable if the set of CF equations defined by
its cycle has finitely many parameter value solutions.
Theorem 2:
Let 𝑁 be a level-1 network and h be a hybridization such that all its subnetworks have
at least two taxa, then:
• If h has a 3-node cycle, then it is not finitely identifiable.
• If h has a k-node cycle, with k>3, then it is finitely identifiable.
Theorem 2 identifies the cases in which the parameters of a hybridization are finitely
determined by the system of CF equations. This means that, for a given set of CF
values, there are finitely many solutions for the parameters. This characterization allows
us to identify the parameter values in a computationally inexpensive way.
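Theorem 2 can likewise be written as a small decision rule. As before, this is a sketch under assumptions: the function name and the input encoding are hypothetical, and the function returns `None` whenever the hypothesis of the theorem (every subnetwork has at least two taxa) is not met or the cycle size is not covered by the statement.

```python
def finitely_identifiable(cycle_size, subnetwork_taxa):
    """Apply Theorem 2 to one hybridization of a level-1 network.

    Returns False for a 3-node cycle, True for a k-node cycle with k > 3,
    and None when the theorem as stated does not apply.
    """
    if not subnetwork_taxa or any(size < 2 for size in subnetwork_taxa):
        return None  # hypothesis of the theorem is not satisfied
    if cycle_size == 3:
        return False
    if cycle_size > 3:
        return True
    return None  # smaller cycles are not covered by Theorem 2


print(finitely_identifiable(3, [2, 2, 2]))
print(finitely_identifiable(5, [2, 3, 2, 2, 4]))
print(finitely_identifiable(4, [1, 2, 2, 2]))
```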
Discussion
We show that hybridization cycles of different sizes differ in their detectability.
Cycles of 4 or more nodes are easily detected from concordance factors under a
pseudolikelihood model, while cycles of 2 nodes are totally undetectable. 3-node
cycles can be detected only under certain sampling schemes; in particular, we found
that gene flow between sister species, which is common in real-life biological data,
cannot be detected at all. We also show that numerical parameters on the network
(branch lengths and inheritance probabilities) can be estimated for hybridization
cycles of 4 or more nodes. These results provide theoretical guarantees for the
pseudolikelihood estimation of larger hybridization cycles, while drawing attention to
the need for novel models and methods to estimate gene flow between closely related
species (small cycles).
Hence, our results support the quartet-based pseudolikelihood methodology of
Solís-Lemus and Ané (2016) as a theoretically sound method, which is also highly
scalable and parallelizable to meet the ever-growing needs of big genomic data.
References
• Ané C., Larget B., Baum D., Smith S., and Rokas A. Bayesian estimation of concordance
among gene trees. Molecular Biology and Evolution, 24(2):412–426, 2007.
• Baum D. Concordance trees, concordance factors, and the exploration of reticulate
genealogy. Taxon, 56(2):417–426, 2007.
• Coen A., Solís-Lemus C., and Ané C. On the identifiability of phylogenetic networks
under a pseudolikelihood model. Currently submitted.
• Degnan J. Modeling hybridization under the network multispecies coalescent.
Systematic Biology, 67(5):786–799, 2018.
• Forster P., Forster L., Renfrew C., and Forster M. Phylogenetic network analysis of
SARS-CoV-2 genomes. Proceedings of the National Academy of Sciences,
117(17):9241–9243, 2020.
• Huson D., Rupp R., and Scornavacca C. Phylogenetic networks: concepts, algorithms
and applications. New York: Cambridge University Press, 2010.
• Kumar V., Lammers F., Bidon T., Pfenninger M., Kolter L., Nilsson M., and Janke A. The
evolutionary history of bears is characterized by gene flow across species. Scientific
Reports, 7(1):46487, 2017.
• Pimentel-Alarcón D., Boston N., and Nowak R. A characterization of deterministic
sampling patterns for low-rank matrix completion. IEEE Journal of Selected Topics in
Signal Processing, 10(4):623–636, 2016.
• Solís-Lemus C. and Ané C. Inferring phylogenetic networks with maximum
pseudolikelihood under incomplete lineage sorting. PLOS Genetics, 12(3):e1005896,
2016.
• Yu Y., Degnan J., and Nakhleh L. The probability of a gene tree topology within a
phylogenetic network with applications to hybridization detection. PLOS Genetics,
8(4):e1002660, 2012.
