Phylogenetics and Metagenomics

Large-scale multiple sequence alignment and phylogeny estimation

  • We design new methods for simultaneous estimation of alignments and trees that remain highly accurate on very large datasets. Our early work developed the SATe method (Liu et al., Science 2009 and Systematic Biology 2012). We then improved the algorithmic design to produce PASTA (Mirarab, Nguyen, and Warnow, RECOMB 2014), an even more accurate co-estimation method that can analyze datasets with up to 200,000 sequences. More recent work (not yet published) is developing a new approach to large-scale multiple sequence alignment estimation, called UPP (ultra-large alignment using SEPP; Mirarab et al., in preparation). UPP uses a novel technique we call "HMM Families" to represent a seed alignment (either computed on the fly for a subset of the input dataset, or given as input) with a collection of HMMs (Hidden Markov Models). This HMM Family is then used to align the remaining sequences, producing an alignment of the entire dataset. Our preliminary data show that UPP produces much more accurate alignments and trees than other methods and is highly robust to fragmentary sequences. Furthermore, UPP can scale to very large datasets of up to 1,000,000 sequences. This research program establishes that markers that evolve very quickly and seem very difficult to align can be aligned well using these new methods, and hence used to advantage in large-scale phylogenetic analyses. This research is funded by an NSF grant under the ATOL (Assembling the Tree of Life) program; see our ATOL project webpage for more information.


  • We are working on improving methods for taxon identification of short reads found during metagenomic analyses. Our first work in this area appeared in PSB 2012 and provided a new method for phylogenetic placement of short reads. We call this method SEPP, for SATe-enabled phylogenetic placement. SEPP produces more accurate placements than the leading methods (PaPaRa, pplacer, and EPA). We extended SEPP using statistical support considerations to produce TIPP (Taxon Identification and Phylogenetic Profiling), a marker-based method for taxonomic profiling. Our preliminary results show that TIPP provides improved accuracy compared to existing methods. TIPP is a collaboration with Mihai Pop.

Estimating species trees from gene trees

Supertree methods

  • The main goal of this project is the design of fast and scalable supertree methods, capable of producing highly accurate trees on very large datasets (with tens of thousands of taxa). The secondary goal is to understand which taxon sampling strategies for assembling supertree datasets yield the most accurate supertrees. The outcome of this project will include distribution of usable open source software to the research community. We have developed a very fast method, SuperFine (see the Systematic Biology 2012 paper), which produces accurate supertrees. SuperFine is a meta-method that estimates the supertree in two steps: first a partially resolved tree is estimated, and then each high degree node (polytomy) in that tree is refined using a base supertree method. Our initial studies used MRP, based upon heuristics in PAUP* for maximum parsimony, for this refinement step. Improvements to SuperFine in terms of accuracy and speed have been obtained using parallelism (see the ACM-SAC 2012 paper) or alternative base supertree methods (see the Algorithms for Molecular Biology 2012 paper). This research was supported by the NSF through a large ITR grant to the CIPRES project, and also through the ATOL grant for large-scale simultaneous multiple sequence alignment and phylogeny estimation.
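
The refinement step described above can be illustrated with a minimal sketch. It is not the SuperFine implementation: the nested-tuple tree encoding, the function names, and the `caterpillar` stand-in for the base supertree method are all assumptions made for illustration. In SuperFine the base method (for example MRP) uses the source trees to decide how each polytomy is resolved, whereas the placeholder here resolves it arbitrarily.

```python
from typing import Callable, List, Union

# Hypothetical encoding: a tree is a leaf name (str) or a tuple of child subtrees.
Tree = Union[str, tuple]


def caterpillar(children: List[Tree]) -> Tree:
    """Placeholder base method (an assumption, NOT MRP): resolve a polytomy
    into a ladder of binary nodes, taking the children in the given order."""
    resolved = children[0]
    for child in children[1:]:
        resolved = (resolved, child)
    return resolved


def refine_polytomies(
    tree: Tree, base_method: Callable[[List[Tree]], Tree] = caterpillar
) -> Tree:
    """Step two of the two-step scheme: visit every node of a partially
    resolved tree and hand each polytomy (more than two children) to the
    base method for refinement."""
    if isinstance(tree, str):
        return tree
    children = [refine_polytomies(child, base_method) for child in tree]
    if len(children) > 2:
        return base_method(children)
    return tuple(children)


def max_degree(tree: Tree) -> int:
    """Largest number of children at any internal node (0 for a single leaf)."""
    if isinstance(tree, str):
        return 0
    return max([len(tree)] + [max_degree(child) for child in tree])


if __name__ == "__main__":
    # A partially resolved tree with two polytomies: the root and (C, D, E).
    partially_resolved = (("A", "B"), ("C", "D", "E"), "F")
    print(max_degree(partially_resolved))         # 3
    fully_resolved = refine_polytomies(partially_resolved)
    print(fully_resolved)
    print(max_degree(fully_resolved))             # 2
```

Passing the base method as a parameter mirrors the meta-method idea: the traversal stays the same while the routine that resolves each polytomy can be swapped.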

Fast techniques for ultra-large phylogeny estimation

  • We design new methods for estimating trees from ultra-large datasets, containing upwards of 10,000 taxa. Our early work produced the Rec-I-DCM3 software that is part of the CIPRES project software distribution. Rec-I-DCM3 speeds up maximum parsimony (PAUP*) and maximum likelihood (RAxML) software for very large datasets. Our current work is developing a new method, DACTAL, for producing trees for ultra-large datasets without ever requiring that a multiple sequence alignment of the entire dataset be estimated.

Estimating phylogenies from genome rearrangements

  • Whole genomes evolve under many processes that change the order and copy number of genes, as well as the number of chromosomes. Events such as inversions, transpositions, and inverted transpositions change the gene order and strandedness, while duplications, deletions, and insertions change the number of copies of each gene within each chromosome. Finally, events such as fissions and fusions change the number of chromosomes within the genome. Estimating phylogenies from gene order and content data presents very interesting mathematical and computational challenges. We work with Bernard Moret at EPFL (Switzerland) to develop scalable methods for estimating histories from whole genomes.
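
As a rough illustration of the events listed above, the sketch below applies each one to a chromosome encoded as a signed gene order. The encoding (a list of integers whose sign gives the strand), the function names, and the index conventions are assumptions chosen for this example, not a description of the lab's software.

```python
from typing import List, Tuple

# Assumed encoding: the absolute value names the gene, the sign gives its strand.
Chromosome = List[int]


def inversion(g: Chromosome, i: int, j: int) -> Chromosome:
    """Reverse the segment g[i:j] and flip the strand of every gene in it."""
    return g[:i] + [-x for x in reversed(g[i:j])] + g[j:]


def transposition(g: Chromosome, i: int, j: int, k: int) -> Chromosome:
    """Move the segment g[i:j] to just before index k (requires i <= j <= k)."""
    return g[:i] + g[j:k] + g[i:j] + g[k:]


def inverted_transposition(g: Chromosome, i: int, j: int, k: int) -> Chromosome:
    """Like transposition, but the moved segment is also inverted."""
    moved = [-x for x in reversed(g[i:j])]
    return g[:i] + g[j:k] + moved + g[k:]


def duplication(g: Chromosome, i: int, j: int, k: int) -> Chromosome:
    """Insert a copy of the segment g[i:j] before index k."""
    return g[:k] + g[i:j] + g[k:]


def deletion(g: Chromosome, i: int, j: int) -> Chromosome:
    """Remove the segment g[i:j]."""
    return g[:i] + g[j:]


def fission(g: Chromosome, i: int) -> Tuple[Chromosome, Chromosome]:
    """Split one chromosome into two at index i."""
    return g[:i], g[i:]


def fusion(a: Chromosome, b: Chromosome) -> Chromosome:
    """Join two chromosomes end to end."""
    return a + b


if __name__ == "__main__":
    ancestor = [1, 2, 3, 4, 5]
    print(inversion(ancestor, 1, 4))                  # [1, -4, -3, -2, 5]
    print(transposition(ancestor, 0, 2, 4))           # [3, 4, 1, 2, 5]
    print(inverted_transposition(ancestor, 0, 2, 4))  # [3, 4, -2, -1, 5]
    print(duplication(ancestor, 1, 3, 5))             # [1, 2, 3, 4, 5, 2, 3]
    print(deletion(ancestor, 1, 3))                   # [1, 4, 5]
    print(fission(ancestor, 2))                       # ([1, 2], [3, 4, 5])
```

The first three operations preserve gene content and change only order and strand, while duplication and deletion change copy number, and fission and fusion change the number of chromosomes.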

Computational Historical Linguistics

  • We design methods to estimate evolutionary histories for languages, with a particular focus on Indo-European. We also model language evolution, including "borrowing" between languages, as a stochastic process. This research is a collaboration with linguist Donald Ringe at the University of Pennsylvania, probabilist Steve Evans at UC Berkeley, and Luay Nakhleh at Rice University. See the Computational Phylogenetics in Historical Linguistics webpage for more information.
Online list of papers.
Copyright © 2009-2010 Computational Phylogenetics Lab | ACES 3.304 | University of Texas | Austin, TX 78712
Site help/questions/feedback/requests: e-mail Tandy Warnow