Arrigo Coen
University of Wisconsin-Madison
Madison, WI 53705
coencoria@wisc.edu
Cécile Ané
University of Wisconsin-Madison
Madison, WI 53705
cecile.ane@wisc.edu
Claudia Solís-Lemus
University of Wisconsin-Madison
Madison, WI 53705
solislemus@wisc.edu
On the identifiability of phylogenetic networks under
a pseudolikelihood model
Meeting of Systematics, Biogeography and Evolution 2020
(SBE)
Phylogenetic analysis is the means of estimating the evolutionary relationships of
species. This type of analysis allows us to estimate ancestral traits of extinct
populations and to identify characteristics of fast-adapting organisms. For instance,
Forster et al. (2020) study the phylogeny of SARS-CoV-2 to trace its routes of infection. In
general, the development of statistical and machine-learning theory for phylogenetics is
paramount in evolutionary biology.
A phylogenetic network is a useful tool to model evolution. This structure is a
generalization of a phylogenetic tree. In mathematical terms, a phylogenetic tree is a
fully bifurcating graph in which internal nodes represent ancestral species that over
time differentiate into two separate species, as depicted in Figure 1. Recently, scientists
have challenged the notion that evolution can be represented with a fully bifurcating
process, as this process cannot capture important biological realities such as hybridization,
introgression or horizontal gene transfer, which require two fully separated branches to
rejoin. A phylogenetic network is the generalization that accommodates these
phenomena. Thus, recent years have seen the development of methods to reconstruct
phylogenetic networks (Degnan, 2018).
Why care about phylogenetic network estimation?
The complexity of the space of networks makes it hard to find the “best” network. With
each additional species, the problem of finding this network becomes exponentially more
computationally demanding. Moreover, many algorithms lack identifiability guarantees,
meaning that it is hard to know whether the network they output is the correct one.
In this work we present a possible solution to this identifiability problem. We
present the method of Solís-Lemus and Ané (2016) and show under which conditions
this fast algorithm correctly identifies the underlying network topology.
Figure 1. Phylogenetic network of the bear species from Kumar et al. (2017). Here,
the hybrid node (green) connects the ancestral population of the polar
bear/brown bear/American black bear clade with the Asiatic black bear.
One standard input to estimate a phylogenetic network is a set of phylogenetic gene
trees. These trees represent the evolution of each gene under study. By imposing
some criteria, we can use the gene tree topologies to find a phylogenetic network
that explains their variability. An example is presented in the next figure.
Figure 2. Example of a theoretical phylogenetic network and its input gene
trees. The blue and red trees are displayed within branches of the gray
network for the species A-E. This network gathers both topologies of its
input trees by having a hybrid branch.
The direct calculation of the likelihood of a network is computationally very expensive. One
reason for this is the massive number of possible topologies. For instance, the number
of unrooted 𝑛-taxa binary trees is $(2n-5)!/(2^{n-3}(n-3)!)$ and each has $\binom{2n-2}{2k}$ possible
selections of 𝑘 hybrid nodes (see Huson et al., 2010). For each of the possible topologies it
is also expensive to calculate its likelihood based on branch length values.
Consequently, scalable techniques are required.
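To give a feel for how quickly the tree count alone grows, the following minimal sketch evaluates the formula for the number of unrooted binary trees. The function name `num_unrooted_binary_trees` is introduced here only for illustration; the sketch does not attempt the count of hybrid-node selections.

```python
from math import factorial


def num_unrooted_binary_trees(n: int) -> int:
    """Number of unrooted binary tree topologies on n labelled taxa.

    Implements (2n-5)! / (2^(n-3) * (n-3)!), valid for n >= 3.
    """
    if n < 3:
        raise ValueError("the formula requires at least 3 taxa")
    # The division is exact, so integer division keeps the result an int.
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


if __name__ == "__main__":
    for n in (4, 5, 6, 10, 20):
        print(f"n = {n:2d}: {num_unrooted_binary_trees(n)} unrooted binary trees")
```

Already for 10 taxa there are more than two million tree topologies before any hybridization is added, which is the motivation for scalable alternatives to the full likelihood.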
An alternative to the likelihood of a phylogenetic network is presented in Solís-Lemus
and Ané (2016). This pseudolikelihood methodology uses the quartet subnetworks to
find the optimal topology; a quartet is a set of four taxa. Let 𝐺 be the set of all possible
quartets of a set of taxa; then we define the pseudolikelihood for a fixed topology 𝑁 as
$$\tilde{L}(N) = \prod_{g \in G} L(g),$$
where 𝐿(𝑔) is the likelihood of the quartet 𝑔. Since the quartets are not independent, the
product of likelihoods is not a true likelihood for the network.
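As a small illustration of the index set 𝐺, the sketch below enumerates all quartets of a taxon set. The helper name `all_quartets` and the taxon labels are assumptions made for this example. The number of quartets is the binomial coefficient "n choose 4", so the product above has polynomially many factors even though the number of topologies grows much faster.

```python
from itertools import combinations
from math import comb


def all_quartets(taxa):
    """Return every 4-taxon subset of `taxa` as a sorted tuple."""
    return list(combinations(sorted(taxa), 4))


if __name__ == "__main__":
    taxa = ["A", "B", "C", "D", "E"]  # hypothetical taxon labels
    G = all_quartets(taxa)
    for g in G:
        print(g)
    # The size of G matches the binomial coefficient C(n, 4).
    print(len(G), comb(len(taxa), 4))
```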
To find the quartet likelihood, Solís-Lemus and Ané use the concordance factors (CFs) of
the quartets. A CF is the proportion of input gene trees that display a given quartet
resolution (Baum, 2007). As an example, let us assume that we have the gene trees:
Material and Methods: Pseudolikelihood Estimation Using CFs
To calculate the likelihood 𝐿(𝑔) of a given quartet 𝑔 = {𝐴, 𝐵, 𝐶, 𝐷}, let $Y = (Y_{q_1}, Y_{q_2}, Y_{q_3})$
denote the number of gene trees that match each of the three possible quartet
resolutions. Then 𝑌 follows a multinomial distribution with probabilities
$(CF_{q_1}, CF_{q_2}, CF_{q_3})$ given by the theoretical CFs (see Yu et al., 2012). This gives
$$L(g) \propto CF_{q_1}^{Y_{q_1}} \, CF_{q_2}^{Y_{q_2}} \, CF_{q_3}^{Y_{q_3}}.$$
Thus, given a set of input gene trees, we can use their observed CFs to calculate the
pseudolikelihood of a candidate network.
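The multinomial form of the quartet likelihood can be sketched in a few lines. This is an illustrative sketch, not the implementation of Solís-Lemus and Ané (2016): the function names, the input layout (one pair of counts and theoretical CFs per quartet) and the numbers in the usage example are all assumptions, and the theoretical CFs are taken as given rather than derived from a network.

```python
from math import log


def quartet_loglik(counts, cfs, tol=1e-8):
    """Log of L(g) up to an additive constant: sum_i Y_qi * log(CF_qi).

    counts: the three gene-tree counts (Y_q1, Y_q2, Y_q3).
    cfs:    the three theoretical concordance factors, summing to 1.
    """
    if len(counts) != 3 or len(cfs) != 3:
        raise ValueError("a quartet has exactly three resolutions")
    if abs(sum(cfs) - 1.0) > tol:
        raise ValueError("concordance factors must sum to 1")
    total = 0.0
    for y, cf in zip(counts, cfs):
        if y == 0:
            continue  # a zero count contributes nothing, even if cf is 0
        if cf <= 0.0:
            return float("-inf")  # observed a resolution with probability 0
        total += y * log(cf)
    return total


def log_pseudolikelihood(quartet_data):
    """Sum of quartet log-likelihoods, i.e. the log of the product over G."""
    return sum(quartet_loglik(counts, cfs) for counts, cfs in quartet_data)


if __name__ == "__main__":
    # Hypothetical data for two quartets: (counts, theoretical CFs).
    data = [((70, 15, 15), (0.7, 0.15, 0.15)),
            ((40, 35, 25), (0.5, 0.3, 0.2))]
    print(log_pseudolikelihood(data))
```

Working on the log scale turns the product over quartets into a sum, which avoids numerical underflow and makes the per-quartet terms easy to compute in parallel.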
A nice property of this methodology is the robustness of its pseudolikelihood. Since
the input gene trees are unrooted and without branch lengths, the method is not affected by
rooting errors or molecular clock errors. In addition, to account for gene tree estimation
error, we can estimate the CFs using BUCKy (Ané et al., 2007).
The logical question that now arises is whether there are multiple phylogenetic networks
with the same set of concordance factors. Here, we characterize which types of
phylogenetic networks can be identified from genomic data using the methodology of
Solís-Lemus and Ané.
Figure 3. Example of input gene trees. The different colors highlight the
quartet resolution of each gene tree on the 4-taxon subset {A,B,C,D} (see Figure 4).
Since for each set of four taxa there are only three possible unrooted trees (quartets), we have
that:
Figure 4. The three concordance factors for the quartet {A,B,C,D}, with
respect to the input gene trees of Figure 3.
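Computing observed CFs from a sample of gene trees amounts to counting how often each of the three resolutions occurs. The sketch below assumes that each gene tree has already been reduced to its resolution on {A,B,C,D}, written as a split string such as "AB|CD"; this encoding, the function name and the sample data are hypothetical and are not the trees of Figure 3.

```python
from collections import Counter

# The three possible unrooted resolutions of the quartet {A, B, C, D}.
RESOLUTIONS = ("AB|CD", "AC|BD", "AD|BC")


def observed_cfs(gene_tree_resolutions):
    """Proportion of gene trees displaying each quartet resolution."""
    counts = Counter(gene_tree_resolutions)
    unknown = set(counts) - set(RESOLUTIONS)
    if unknown:
        raise ValueError(f"unrecognized resolutions: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one gene tree is required")
    # Counter returns 0 for resolutions that never occur.
    return {r: counts[r] / total for r in RESOLUTIONS}


if __name__ == "__main__":
    # Hypothetical sample of five gene trees.
    sample = ["AB|CD", "AB|CD", "AC|BD", "AB|CD", "AD|BC"]
    for resolution, cf in observed_cfs(sample).items():
        print(resolution, cf)
```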
Here, we present our main results on the identifiability of the pseudolikelihood model. These
results concern topology identifiability and numerical parameter
identifiability. The proofs of these results can be found in Coen et al. (submitted).
To demonstrate identifiability, we assume that the phylogenetic networks are of level-1.
A level-1 network does not have intersecting hybridization cycles. A hybridization
cycle is the minimum cycle, in the graph theory sense, that contains a hybrid node. The
next figure illustrates all these concepts.
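The level-1 condition can be checked directly once the hybridization cycles are known. The sketch below assumes that each cycle has already been extracted and is supplied as a collection of node labels (the function name and the labels are made up for this example); it then verifies that no two cycles share a node, which also rules out shared edges.

```python
from itertools import combinations


def is_level1(cycles):
    """True if no two hybridization cycles have a node in common.

    cycles: iterable of collections of node labels, one per hybridization.
    """
    node_sets = [frozenset(c) for c in cycles]
    return all(a.isdisjoint(b) for a, b in combinations(node_sets, 2))


if __name__ == "__main__":
    # Hypothetical node labels for two networks with two hybridizations each.
    disjoint_cycles = [{"u1", "u2", "u3"}, {"v1", "v2", "v3", "v4"}]
    overlapping_cycles = [{"u1", "u2", "u3"}, {"u3", "v1", "v2"}]
    print(is_level1(disjoint_cycles))     # cycles are disjoint
    print(is_level1(overlapping_cycles))  # cycles share node "u3"
```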
Theorem 1:
Let 𝑁 be a level-1 network, then:
• A hybridization with a 2-node cycle is not detectable.
• A hybridization with a k-node cycle, for k>2, is detectable if all its
subnetworks have at least two taxa.
This theorem establishes that in most cases, except when the hybridization
is trivial, the concordance factors identify the hybridization correctly. A nice
characteristic of this theorem is that it can be applied to each hybridization of a
network separately to determine which ones are detectable and which ones are not.
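Applying Theorem 1 to each hybridization in turn can be sketched as a simple lookup. The function name and the input encoding (cycle size plus the number of taxa in each subnetwork attached to the cycle) are assumptions for illustration. The function returns `None` for configurations on which the theorem, as stated here, makes no claim.

```python
from typing import Optional, Sequence


def detectable_by_theorem1(cycle_size: int,
                           taxa_per_subnetwork: Sequence[int]) -> Optional[bool]:
    """Detectability of one hybridization according to Theorem 1.

    Returns False for a 2-node cycle, True for a k-node cycle (k > 2) whose
    subnetworks all have at least two taxa, and None when Theorem 1 as
    stated does not decide the case.
    """
    if cycle_size < 2:
        raise ValueError("a hybridization cycle has at least 2 nodes")
    if cycle_size == 2:
        return False
    if taxa_per_subnetwork and min(taxa_per_subnetwork) >= 2:
        return True
    return None  # some subnetwork has a single taxon: not covered


if __name__ == "__main__":
    # Hypothetical hybridizations: (cycle size, taxa in each subnetwork).
    hybridizations = [(2, [3, 4]), (4, [2, 2, 3, 5]), (3, [1, 2, 2])]
    for k, taxa in hybridizations:
        print(k, taxa, detectable_by_theorem1(k, taxa))
```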
This last definition and theorem focus on the topological identifiability of a network;
the next definition and theorem focus on the parameter identifiability.
Results: Identifiability of Pseudolikelihood
Figure 5. Example of level-1 (left) and non level-1 (right) networks. The
difference between them is that in a level-1 network the cycles do not
have nodes/edges in common.
The next definition distinguishes the cases in which the system of CFs is unique, in the sense
that no other topology could share it.
Detectable hybridizations:
Let 𝑁 be a level-1 network and denote by 𝑁′ a generic copy of 𝑁 with the hybridization ℎ
erased, where 𝑁 and 𝑁′ are allowed to have different branch lengths on the
shared edges. We say that the hybridization ℎ is detectable if the system of
CFs of 𝑁 does not match the system of CFs of 𝑁′.
Figure 6. Example of a network 𝑁 (left) and its copy 𝑁′ (right), which does
not contain the hybridization (denoted by green). The small blue
triangles represent subnetworks and in this case the cycle has eight
nodes (8-node cycle).
Reproducibility
All Macaulay2 and Mathematica scripts are available in the GitHub repository
https://github.com/solislemuslab/snaq-identifiability
Parameter identifiability means that the equations of the CFs have a unique solution with
respect to their parameter values; that is, there are not multiple parameter values that
can produce the same set of CFs. The next definition distinguishes the cases in which the set of
CF equations has finitely many solutions; see Pimentel-Alarcón et al. (2016).
Finitely identifiable:
We say that a hybridization is finitely identifiable if the set of CF equations defined by
its cycle has finitely many parameter value solutions.
Theorem 2:
Let 𝑁 be a level-1 network and h be a hybridization such that all its subnetworks have
at least two taxa, then:
• If h has a 3-node cycle, then it is not finitely identifiable.
• If h has a k-node cycle, with k>3, then it is finitely identifiable.
Theorem 2 identifies the cases in which the parameters of a hybridization are finitely
determined by the system of CF equations. This means that for a given set of CF
values there are finitely many solutions for the parameters. This characterization allows
us to identify the parameter values in a computationally inexpensive way.
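Theorem 2 can be encoded in the same spirit as a small decision function. This sketch only restates the theorem; the function name and the input encoding are hypothetical, and `None` marks cases outside the theorem's hypotheses (a 2-node cycle, or a subnetwork with fewer than two taxa).

```python
from typing import Optional, Sequence


def finitely_identifiable_by_theorem2(cycle_size: int,
                                      taxa_per_subnetwork: Sequence[int]) -> Optional[bool]:
    """Finite identifiability of a hybridization's parameters per Theorem 2."""
    if cycle_size < 2:
        raise ValueError("a hybridization cycle has at least 2 nodes")
    # Theorem 2 assumes every subnetwork of the cycle has at least two taxa.
    if not taxa_per_subnetwork or min(taxa_per_subnetwork) < 2:
        return None
    if cycle_size == 3:
        return False
    if cycle_size > 3:
        return True
    return None  # 2-node cycles are not addressed by Theorem 2


if __name__ == "__main__":
    for k in (2, 3, 4, 5):
        taxa = [2] * k  # hypothetical: two taxa in each subnetwork
        print(k, finitely_identifiable_by_theorem2(k, taxa))
```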
Discussion
We show that hybridization cycles of different sizes vary in their detectability.
Cycles of 4 or more nodes are easily detected from concordance factors under a
pseudolikelihood model, while cycles of 2 nodes are totally undetectable. 3-cycle
hybridizations can be detected under certain sampling schemes; in particular, we found
that gene flow between sister species, which is common in real-life biological data, cannot be
detected at all. We also show that we can estimate numerical parameters on the network
(branch lengths and inheritance probabilities) for hybridization cycles of 4 or more nodes.
These results provide theoretical guarantees for the pseudolikelihood estimation of larger
hybridization cycles, while drawing attention to the need for novel models and
methods to estimate gene flow between closely related species (small cycles).
Hence, our results support the quartet-based pseudolikelihood methodology of
Solís-Lemus and Ané (2016) as a theoretically sound method, which is also highly
scalable and parallelizable to meet the ever-growing needs of big genomic data.
References
• Ané C., Larget B., Baum D., Smith S., and Rokas A. Bayesian estimation of concordance
among gene trees. Molecular Biology and Evolution, 24(2):412–426, 2007.
• Baum D. Concordance trees, concordance factors, and the exploration of reticulate
genealogy. Taxon, 56(May):417–426, 2007.
• Coen A., Solís-Lemus C., and Ané C. On the Identifiability of Phylogenetic Networks
under a Pseudolikelihood model. Currently submitted.
• Degnan J. Modeling Hybridization Under the Network Multispecies Coalescent.
Systematic Biology, 67(5):786–799, 2018.
• Forster P., Forster L., Renfrew C., and Forster M. Phylogenetic network analysis of
SARS-CoV-2 genomes. Proceedings of the National Academy of Sciences,
117(17):9241–9243, 2020.
• Huson D., Rupp R., and Scornavacca C. Phylogenetic networks: concepts, algorithms
and applications. New York: Cambridge University Press, 2010.
• Kumar V., Lammers F., Bidon T., Pfenninger M., Kolter L., Nilsson M., and Janke A. The
evolutionary history of bears is characterized by gene flow across species. Scientific
Reports, 7(1):46487, 2017.
• Pimentel-Alarcón D., Boston N., and Nowak R. A characterization of deterministic
sampling patterns for low-rank matrix completion. IEEE Journal of Selected Topics in
Signal Processing, 10(4):623–636, 2016.
• Solís-Lemus C. and Ané C. Inferring phylogenetic networks with maximum
pseudolikelihood under incomplete lineage sorting. PLOS Genetics, 12(3):e1005896,
2016.
• Yu Y., Degnan J., and Nakhleh L. The probability of a gene tree topology within a
phylogenetic network with applications to hybridization detection. PLOS Genetics,
8(4):e1002660, 2012.
