Added: Mylan Medford - Date: 22.04.2022 22:59
Phylogenies provide a useful way to understand the evolutionary history of genetic samples, and data sets with more than a thousand taxa are becoming increasingly common, notably with viruses. Dating ancestral events is one of the first, essential goals with such data.
However, current sophisticated probabilistic approaches struggle to handle data sets of this size. Here, we present very fast dating algorithms, based on a Gaussian model closely related to the Langley-Fitch molecular-clock model.
We show that this model is robust to uncorrelated violations of the molecular clock. Our algorithms apply to serial data, where the tips of the tree have been sampled through time. They estimate the substitution rate and the dates of all ancestral nodes. When the input tree is unrooted, they can provide an estimate for the root position, thus representing a new, practical alternative to the standard rooting methods. Our algorithms exploit the recursive tree structure of the problem at hand, and the close relationships between least-squares and linear algebra.
We distinguish between an unconstrained setting and the case where the temporal precedence constraint (an ancestral node must be older than its descendants) is imposed. With rooted trees, the former is solved using linear algebra in linear computing time. With unrooted trees, the computing time becomes nearly quadratic. Using simulated data, we show that the estimation accuracy of our algorithms is similar to that of the most sophisticated methods, while they run much faster. We apply these algorithms to a large data set comprising strains of Influenza virus from the pdm09 H1N1 human pandemic.
Again, the results show that these algorithms provide a very fast alternative, with results similar to those of other computer programs.
The explosion of genetic data and progress in phylogenetic reconstruction algorithms have resulted in increasing utility and popularity of phylogenetic analyses. Data sets with thousands of taxa are becoming more and more common, especially in virus evolution studies. Moreover, a number of studies have used molecular-dating techniques to tackle a wide range of biological questions, for example, in systematics for timing the tree of life (Hedges and Kumar; Jetz et al.).
Currently, the most popular dating approaches are based on sophisticated probabilistic models, most often implemented in the Bayesian framework and able to account for complex priors (Thorne and Kishino; Rannala and Yang; Drummond and Rambaut; Guindon et al.).
Maximum-likelihood methods have also been designed to deal with simpler models (Rambaut). Corresponding computer programs take a sequence alignment and a set of known dates as input and return a time-scaled tree, with estimates of the substitution rate(s) and of the dates of all tree nodes.
Some programs e. These programs typically contain several submodels, which describe the substitution process e. We distinguish the strict molecular clock (SMC) model, where the substitution rate is assumed to be constant across all tree branches, from uncorrelated and correlated relaxed-clock models. With uncorrelated models, the rate associated with each branch is drawn independently from a common underlying distribution; these models are commonly used with fast-evolving species over short time periods, typically with viruses, for which there is no strong evidence of rate correlation among branches (Drummond et al.).
With correlated (also called autocorrelated) models, the rate distribution for a particular branch depends on the rate values of the neighboring branches; correlated models seem to be the preferred choice with large groups of slowly evolving species, for example mammals, where it has been demonstrated that some subgroups evolve faster than others.
However, the advantages and limitations of this large variety of models are still a matter of debate (Drummond et al.). All these models and methods have proven useful in a number of studies, but they are computationally intensive, making it virtually impossible to deal with the larger data sets available today, even when using sophisticated implementations and powerful computers (Ayres et al.).
Typically, days of computation are required to analyze a few hundred taxa, although faster approaches based on complex algorithms are available (Akerborg et al.). Here we are interested in dating very large phylogenies, typically with a thousand tips or more, a need that is becoming increasingly common, for example, in molecular epidemiology. We propose distance-based algorithms to estimate rates and dates, a mathematical and computational framework that has proven to produce fast and fairly accurate tools in phylogenetics. Several distance-based (as opposed to sequence-based, see above) dating methods have already been proposed.
Most of these methods deal with time calibration points, where the dates of certain ancestral nodes in the tree are known, possibly with uncertainty. These methods take as input a rooted tree with time calibration points, and return a time-scaled, ultrametric tree. PATHd8 Britton et al. Xia and Yang's method assumes a SMC or two different local clocks, and achieves least-squares estimations under these assumptions. Sanderson's approach is based on a penalized-likelihood criterion to account for the autocorrelation of rates, combined with standard optimization techniques (see also TreePL, Smith and O'Meara). Based on computer simulations, these fast methods were shown to be accurate by their authors, producing time-scaled trees similar to those obtained using sequence-based approaches.
The focus of the present study is on serial phylogenies, where the tips of the tree have been sampled through time. Such phylogenies are common with fast-evolving organisms. Serial phylogenies are also used with ancient DNA (Lambert et al.). Moreover, close relationships exist between the calibration-points and dated-tips approaches (Ronquist et al.).
Several methods have been proposed in this framework. One of the very first is root-to-tip regression (RTT; Shankarappa et al.). This method is very fast and can be extended to unrooted trees by searching among all tree branches for the best root position, according to some numerical criterion. However, this method does not provide estimates for the dates of internal nodes, and thus does not output time-scaled trees. To obtain date estimates of the internal nodes, sUPGMA (Drummond and Rodrigo) first uses a regression method to estimate the substitution rate, then corrects the non-contemporaneous tips into contemporaneous tips, and finally uses UPGMA (Sokal and Michener) to compute the tree.
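As a rough illustration of root-to-tip regression on a rooted tree, the sketch below fits an ordinary least-squares line to root-to-tip distances against sampling dates. The function name `rtt_regression` and the toy values are assumptions made for this example and are not taken from any of the cited programs; the slope is read as the substitution rate and the date at which the fitted line crosses zero as the root date.

```python
def rtt_regression(dates, distances):
    """Ordinary least-squares fit of root-to-tip distance on sampling date.

    Returns (rate, root_date): the slope of the fitted line and the date at
    which the fitted distance reaches zero.
    """
    n = len(dates)
    mean_t = sum(dates) / n
    mean_d = sum(distances) / n
    # Centering the data keeps the sums well conditioned for calendar dates.
    sxx = sum((t - mean_t) ** 2 for t in dates)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, distances))
    rate = sxy / sxx
    intercept = mean_d - rate * mean_t
    root_date = -intercept / rate
    return rate, root_date


# Toy data (hypothetical): four tips generated with rate 0.005 and root date 1990.
tip_dates = [2000.0, 2002.0, 2004.0, 2006.0]
tip_distances = [0.05, 0.06, 0.07, 0.08]
rate, root_date = rtt_regression(tip_dates, tip_distances)
print(f"rate = {rate:.4f}, root date = {root_date:.1f}")
```

As noted in the text, such a fit yields a rate and a root date only; it says nothing about the dates of the other internal nodes.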
Unlike the former approaches, Langley and Fitch's (LF) method uses an explicit model. The LF method assumes a SMC with a constant substitution rate, and models the number of substitutions along each branch of the tree by a Poisson distribution. The estimates of the global substitution rate and of the internal node dates are then obtained by maximizing the likelihood of the input, rooted tree. LF is implemented in r8s (Sanderson). In this article, we study a model analogous to LF's, but using a normal approximation that allows for a least-squares approach, and show that this model is robust to uncorrelated violations of the molecular clock.
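The step from a Poisson model to a least-squares one can be motivated by the standard Gaussian approximation of a Poisson variable. The symbols below are assumptions introduced for this sketch ($N$ for the substitution count on a branch, $s$ for the sequence length, $\omega$ for the substitution rate, $\Delta t$ for the time elapsed along the branch, $b$ for the branch length in substitutions per site); this is not the article's own equation.

```latex
N \sim \mathrm{Poisson}(\lambda), \qquad \lambda = \omega\, s\, \Delta t
\quad\Longrightarrow\quad
N \approx \mathcal{N}(\lambda, \lambda) \ \text{for large } \lambda,
\qquad
b = \frac{N}{s} \approx \mathcal{N}\!\left(\omega\,\Delta t,\ \frac{\omega\,\Delta t}{s}\right).
```

Under this approximation, maximizing the likelihood amounts to minimizing a weighted sum of squared differences between observed and expected branch lengths.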
Using the recursive tree structure of the problem at hand, and the close relationships between least-squares and linear algebra, we propose very fast algorithms to estimate the substitution rate and the dates of all internal tree nodes. With rooted trees, the time complexity is nearly linear. The article is organized as follows: we first define the model and show its ability to handle uncorrelated rate variations among tree branches, as is commonly assumed with virus data.
We then present our two main algorithms, distinguishing the unconstrained setting and the case where the temporal precedence constraints are imposed. Last, we compare these algorithms to standard approaches using simulated data and a large influenza data set. Our algorithms take as input a binary phylogenetic tree with branch lengths, inferred by any tree-building program, and sampling dates associated with the taxa.
As our algorithms are very fast, it is consistent to combine them with fast tree-building methods, for example distance-based methods. However, we shall see that the results obtained with both approaches are close. The algorithms accept a rooted or unrooted tree, and for unrooted trees we propose a method to estimate the root position, though simulations show that the use of an outgroup is generally preferable. Given a set of n serially dated sequences, let R be the input rooted binary phylogenetic tree on these sequences with known branch lengths.
Node 1 corresponds to the root. The date of node i is denoted by t_i. For every internal node i, let s_1(i) and s_2(i) be the two direct descendants of i. Let b_i be the length of the branch (i, a(i)), where a(i) denotes the parent of i; b_i is an estimate of the number of substitutions per site that occurred along the branch from time t_a(i) to t_i.
With a SMC, the substitution rate i. The higher c is, the closer we are to equal variances, that is, ordinary least squares (OLS). To summarize, our model Eq. This corresponds to the default option in several programs e. We certainly do not claim that this model depicts all the complexity of sequence evolution, but it makes very efficient calculations possible with little loss of estimation accuracy, as described later.
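One way to write a Gaussian strict-clock model in this notation is sketched below. It is offered as an illustration consistent with the surrounding description rather than as the article's exact equations: $a(i)$ is taken to be the parent of node $i$, and $\omega$ (the substitution rate), $\varepsilon_i$ (the noise on branch $i$), $\sigma_i^2$ (its variance) and the weights $w_i = 1/\sigma_i^2$ are symbols assumed here.

```latex
b_i = \omega\,\bigl(t_i - t_{a(i)}\bigr) + \varepsilon_i,
\qquad \varepsilon_i \sim \mathcal{N}\bigl(0, \sigma_i^2\bigr),
\qquad
\min_{\omega,\; t_{\text{internal}}}\ \sum_{i \neq \text{root}} w_i\,\bigl(b_i - \omega\,(t_i - t_{a(i)})\bigr)^2,
\quad w_i = 1/\sigma_i^2 .
```

With all $w_i$ equal, the criterion reduces to ordinary least squares; unequal variances give a weighted version.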
This is an obvious requirement, analogous to the positivity of branch lengths in phylogenetic trees. However, not all dating methods comply with this requirement. The reasons for this are mostly computational: imposing positivity constraints has a computational cost, as we shall see below in our dating context.
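Reading the temporal precedence requirement as "no node is dated before its parent", a set of date estimates can be checked with a few lines of code. The function name `precedence_violations`, the parent-array encoding of the tree and the toy dates are all assumptions made for this sketch.

```python
def precedence_violations(parent, dates):
    """Return the nodes i whose date is earlier than that of their parent.

    parent[i] is the index of the parent of node i (None for the root);
    dates[i] is the (estimated or sampled) date of node i.
    """
    return [
        i
        for i, p in enumerate(parent)
        if p is not None and dates[i] < dates[p]
    ]


# Hypothetical 5-node tree: node 0 is the root, node 1 is internal, 2-4 are tips.
parent = [None, 0, 1, 1, 0]
consistent = [1.0, 3.0, 6.0, 8.0, 5.0]
inconsistent = [1.0, 7.0, 6.0, 8.0, 5.0]  # node 1 is dated after its child 2
print(precedence_violations(parent, consistent))    # []
print(precedence_violations(parent, inconsistent))  # [2]
```

A check of this kind only detects violations; enforcing the constraints during estimation is a different and more costly problem.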
This function is a convex quadratic form (O'Meara) and has a unique minimum (see Proof in the Online Appendix). Therefore, Equation 2 also has a unique minimum. We propose two different algorithms. One takes into account the temporal precedence constraints, while the other does not. We present the weighted versions in the following, as the unweighted versions are simply obtained by fixing the w_i to 1. This algorithm can be extended to non-binary trees. However, nothing guarantees that the date estimates satisfy the temporal precedence constraints.
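To see why an unconstrained criterion of this kind reduces to linear algebra, the sketch below assumes the objective is the weighted sum over branches of (b_i - omega * (t_i - t_parent(i)))^2, with tip dates known. Substituting x_i = omega * t_i for each internal node makes every residual linear in (omega, x), so the minimum solves a linear system. This is a naive dense solve written for illustration, not the fast recursive algorithm described in the article; the function names and the toy tree are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def least_squares_dates(parent, branch, tip_dates, weights=None):
    """Minimise sum_i w_i * (b_i - omega * (t_i - t_parent(i)))**2.

    parent[i]: parent index (None for the root); branch[i]: length of the
    branch above node i; tip_dates: {tip index: sampling date}.
    Returns (omega, dates) with dates for all nodes.
    """
    n = len(parent)
    internal = [i for i in range(n) if i not in tip_dates]
    col = {i: k + 1 for k, i in enumerate(internal)}  # column 0 holds omega
    k = len(internal) + 1
    ata = [[0.0] * k for _ in range(k)]
    atb = [0.0] * k
    for i in range(n):
        if parent[i] is None:
            continue
        w = 1.0 if weights is None else weights[i]
        row = [0.0] * k
        # Predicted length omega*t_i - omega*t_parent is linear in (omega, x).
        if i in tip_dates:
            row[0] += tip_dates[i]
        else:
            row[col[i]] += 1.0
        row[col[parent[i]]] -= 1.0
        # Accumulate the weighted normal equations.
        for r in range(k):
            atb[r] += w * row[r] * branch[i]
            for c in range(k):
                ata[r][c] += w * row[r] * row[c]
    sol = solve_linear(ata, atb)
    omega = sol[0]
    dates = [tip_dates[i] if i in tip_dates else sol[col[i]] / omega
             for i in range(n)]
    return omega, dates


# Toy tree (hypothetical): root 0, internal node 1, tips 2-4, generated with
# omega = 0.01, root date 1.0 and node-1 date 3.0 (dates in arbitrary units).
parent = [None, 0, 1, 1, 0]
branch = [None, 0.02, 0.03, 0.05, 0.04]
tip_dates = {2: 6.0, 3: 8.0, 4: 5.0}
omega, dates = least_squares_dates(parent, branch, tip_dates)
print(round(omega, 6), [round(t, 6) for t in dates])
```

Passing `weights` gives the weighted version and leaving it out gives the unweighted one. Nothing in this solve prevents an internal node from being dated after one of its descendants when the branch lengths are noisy.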
This is why we designed the QPD (quadratic programming dating) algorithm, which we describe now.