sequence alignment and phylogeny estimation
We design new methods for simultaneous estimation of alignments and trees, capable of producing highly accurate trees
and alignments on very large datasets.
Our early work developed the SATe method (Liu et al., Science 2009
and Systematic Biology 2012),
which can produce highly accurate trees and alignments
for large datasets. We recently improved the algorithmic
design to produce PASTA (Mirarab, Nguyen, and Warnow,
RECOMB 2014), an even more accurate co-estimation
method that can analyze datasets with up to 200,000 sequences.
More recent work (not yet published) is developing
a new approach to large-scale multiple sequence
alignment estimation, called UPP (ultra-large alignment using
SEPP, Mirarab et al., in preparation). UPP uses
a novel technique we call "HMM Families"
to represent a seed alignment (either computed
on the fly
for a subset of the input dataset, or given as input)
with a collection of HMMs (Hidden Markov Models).
This HMM Family is then used to align the remaining
sequences, thus producing an alignment of the entire dataset.
Our preliminary data show that UPP produces much more accurate
trees than other methods and is highly robust to fragmentary
sequences. Furthermore, UPP can scale to very large
datasets, even up to 1,000,000 (one million) sequences.
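The idea of representing one seed alignment with several models, and letting each remaining sequence pick the model it fits best, can be illustrated with a toy sketch. This is not the UPP implementation. As simplifying assumptions, column frequency profiles stand in for profile HMMs, the seed set is split by halving the input order rather than by any tree-based decomposition, and queries are matched without gaps. All function names (`build_profile`, `build_family`, `best_hit`) are hypothetical.

```python
import math

ALPHABET = "ACGT"

def build_profile(rows, pseudocount=1.0):
    """Column-wise log-probabilities for an aligned subset (gap characters are ignored)."""
    profile = []
    for col in zip(*rows):
        counts = {ch: pseudocount for ch in ALPHABET}
        for ch in col:
            if ch in counts:
                counts[ch] += 1
        total = sum(counts.values())
        profile.append({ch: math.log(n / total) for ch, n in counts.items()})
    return profile

def build_family(rows, min_size=2):
    """One profile for the given rows, plus profiles for recursively halved subsets."""
    family = [(tuple(rows), build_profile(rows))]
    if len(rows) >= 2 * min_size:
        mid = len(rows) // 2
        family += build_family(rows[:mid], min_size)
        family += build_family(rows[mid:], min_size)
    return family

def best_hit(query, family):
    """Return (score, family_index, column_offset) of the best ungapped match.

    Assumes the query contains only characters from ALPHABET.
    """
    best = None
    for idx, (_, profile) in enumerate(family):
        for offset in range(len(profile) - len(query) + 1):
            score = sum(profile[offset + i][ch] for i, ch in enumerate(query))
            if best is None or score > best[0]:
                best = (score, idx, offset)
    return best

# Toy seed alignment: two similar pairs of sequences.
seed = ["ACGTACGT", "ACGTACGA", "TTGTACCC", "TTGTACCA"]
family = build_family(seed)          # full set, first half, second half
score, idx, offset = best_hit("TTGT", family)
```

On this toy seed the fragment "TTGT" scores highest against the profile built from the last two seed sequences, at column offset 0, which is the sense in which a more specific model can fit a query better than a single model of the whole seed alignment.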
This research program establishes that markers that evolve very quickly and
seem very difficult to align can
be aligned well using these new
methods, and hence used to advantage in large-scale phylogenetic analyses.
This research is
funded by an NSF grant under the ATOL (Assembling the
Tree of Life) program; see
our ATOL project webpage for more information.
We are working on improving methods for taxon identification of
short reads found during metagenomic analyses. Our
first work in this area appeared in
PSB 2012 and introduced SEPP (SATe-enabled phylogenetic placement),
a new method for phylogenetic placement of short reads.
SEPP produces more accurate placements than the leading
methods PaPaRa, pplacer, and EPA.
We extended SEPP using statistical support considerations to
produce TIPP (Taxon Identification and Phylogenetic Profiling),
a marker-based method for
taxonomic profiling. Our preliminary results show
that TIPP provides improved accuracy compared to
TIPP is a collaboration with
Estimating species trees from gene trees
The main goal of this project is the design of fast and scalable supertree methods, capable of producing highly
accurate trees on very large datasets (with tens of thousands of taxa).
The secondary goal is to understand the taxon sampling strategies for
assembling supertree datasets that yield the most accurate supertrees.
The outcome of this project will include distribution of usable open source
software to the research community.
We have developed SuperFine (see the
Systematic Biology 2012 paper), a method that produces
accurate supertrees very quickly.
SuperFine is a meta-method that estimates the supertree
in two steps: first, a partially resolved
tree is estimated; then each high-degree node (polytomy)
in that tree is refined using a base supertree method.
Our initial studies used MRP, based upon heuristics in PAUP*
for maximum parsimony, for this refinement step.
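The second step, refining every polytomy of a partially resolved tree with a pluggable base method, can be sketched as follows. This is a minimal illustration and not the SuperFine code. As assumptions, trees are rooted and written as nested tuples of leaf names, a polytomy is any node with more than two children, and `caterpillar` is a hypothetical placeholder that resolves a polytomy arbitrarily; it stands in for a real base supertree method such as MRP, which would use the source trees to decide the resolution.

```python
def leaves(tree):
    """List the leaf names of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]

def refine(tree, resolve):
    """Bottom-up: replace each node with more than two children by resolve(children)."""
    if isinstance(tree, str):
        return tree
    children = tuple(refine(child, resolve) for child in tree)
    if len(children) > 2:
        return resolve(children)
    return children

def caterpillar(children):
    """Placeholder base method (assumption): resolve as a ladder (((a, b), c), ...)."""
    resolved = children[0]
    for child in children[1:]:
        resolved = (resolved, child)
    return resolved

# Partially resolved tree with two polytomies: the root and the clade (C, D, E).
partial = (("A", "B"), ("C", "D", "E"), "F")
binary = refine(partial, caterpillar)
```

Passing the base method as the `resolve` argument mirrors the meta-method design: the first step fixes the well-supported structure, and any supertree method can be swapped in for the refinement.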
Improvements to SuperFine in terms of accuracy and speed
have been obtained using parallelism (see
ACM-SAC 2012 paper)
or alternative base supertree methods (see Algorithms
for Molecular Biology 2012 paper).
This research was supported
by the NSF through a large ITR grant to the
and also through the ATOL grant for large-scale simultaneous multiple sequence
alignment and phylogeny estimation.
Fast techniques for ultra-large phylogeny estimation
We design new methods for estimating
trees from ultra-large datasets, containing upwards of 10,000 taxa.
Our early work produced the Rec-I-DCM3 method
that is part of the
CIPRES project software distribution.
Rec-I-DCM3 speeds up maximum parsimony (PAUP*) and maximum
likelihood software (RAxML) for very large datasets. Our current work
is developing a new method, DACTAL, for producing trees for ultra-large datasets without
ever requiring that a multiple sequence alignment of the entire dataset
be estimated. DACTAL is under development.
Estimating phylogenies from genome rearrangements
Whole genomes evolve under many processes that change the
order and copy number of genes, as well as the number
of chromosomes. Events such as inversions,
transpositions, and inverted transpositions
change the gene order and strandedness, while duplications,
deletions, and insertions change the number of copies of
each gene within each chromosome. Finally,
events such as fissions and fusions change the number of
chromosomes within the genome. Estimating phylogenies from
gene order and content data presents very interesting
mathematical and computational
challenges. We work with
Bernard Moret at EPFL (Switzerland)
to develop scalable methods for estimating histories
from whole genomes.
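To make the rearrangement events concrete, a genome with a single linear chromosome and one copy of each gene can be written as a signed permutation, where the sign records the strand. The sketch below is an illustration under those assumptions and not one of our methods: it applies an inversion and a transposition, then counts breakpoints against a reference gene order. The function names are hypothetical, and genes are assumed to be labeled 1..n.

```python
def inversion(genome, i, j):
    """Reverse the segment genome[i:j] and flip the strand (sign) of each gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]

def transposition(genome, i, j, k):
    """Move the segment genome[i:j] to just before original index k (requires k >= j)."""
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]

def breakpoints(genome, reference):
    """Count adjacencies of `genome` that do not occur in `reference`.

    Both are linear signed gene orders on genes 1..n; an adjacency (a, b)
    is treated as identical to its reverse-strand reading (-b, -a).
    """
    def framed(g):
        # Add fixed end markers 0 and n + 1 so chromosome ends count too.
        return [0] + g + [len(g) + 1]

    ref_pairs = set()
    ref = framed(reference)
    for a, b in zip(ref, ref[1:]):
        ref_pairs.add((a, b))
        ref_pairs.add((-b, -a))

    cur = framed(genome)
    return sum((a, b) not in ref_pairs for a, b in zip(cur, cur[1:]))

identity = [1, 2, 3, 4, 5]
inverted = inversion(identity, 1, 4)         # [1, -4, -3, -2, 5]
moved = transposition(identity, 0, 2, 4)     # [3, 4, 1, 2, 5]
```

Here the inversion creates two breakpoints relative to the identity order and the transposition creates three. Duplications, insertions, deletions, fissions, and fusions fall outside this representation, which is part of what makes the general problem hard.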
Computational Historical Linguistics
We design methods to estimate evolutionary histories
for languages, with a particular focus on Indo-European.
We also model language evolution, including "borrowing" between
languages, as a stochastic process.
This research is a collaboration with
linguist Donald Ringe at the University of
Pennsylvania, probabilist Steve Evans at UC Berkeley, and
Luay Nakhleh at Rice University. See the
Computational Phylogenetics in Historical Linguistics webpage for more information.