K. Liu, S. Raghavan, S. Nelesen, C. R. Linder, T. Warnow, "Rapid and Accurate Large-Scale Coestimation of Sequence Alignments and Phylogenetic Trees," Science, vol. 324, no. 5934, pp. 1561-1564, 19 June 2009.

Data and Programs



SATe program

A user-friendly version of SATe with a graphical interface is
now available here for several popular operating systems.
Please note that testing and development of this program
are ongoing.



How to open a compressed file

All files are in compressed .tar.bz2 format. To extract a compressed <file>, use the command:

tar xjf <file>
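When many archives have been downloaded, extraction can also be scripted. The following is a minimal Python sketch, not part of the distribution; the function names extract_archive and extract_all and the directory arguments are hypothetical, and it assumes the archives sit directly in one download directory.

```python
import tarfile
from pathlib import Path


def extract_archive(archive, dest="."):
    """Extract one .tar.bz2 archive into dest, similar to `tar xjf <file>`."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:bz2" opens a bzip2-compressed tar file for reading.
    with tarfile.open(archive, "r:bz2") as tar:
        tar.extractall(dest)
    return dest


def extract_all(download_dir, dest="."):
    """Extract every *.tar.bz2 file found directly in download_dir.

    Returns the archive file names that were extracted, in sorted order.
    """
    archives = sorted(Path(download_dir).glob("*.tar.bz2"))
    for archive in archives:
        extract_archive(archive, dest)
    return [archive.name for archive in archives]
```

For example, extract_all("downloads", "data") would unpack every archive in a hypothetical downloads directory into data.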



Simulated nucleotide datasets

There are 37 model conditions, with either 1000, 500, or 100 taxa, and either long, medium, or short gap length types. Each model condition is referred to with the string "<number of taxa><gap length type: L for long, M for medium, S for short><id number>". Each model condition has 20 replicate datasets. For each model condition, an associated compressed file is provided, e.g. 1000L1.tar.bz2 corresponds to model 1000L1. The compressed file contains the 20 replicate datasets for that model condition.
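The naming scheme above can be decoded mechanically. The following Python sketch is an illustration only; the names parse_model_condition and archive_name are hypothetical, and the pattern assumes the string is exactly digits, one of L, M, or S, then digits.

```python
import re

# Gap length codes as described in the text above.
GAP_TYPES = {"L": "long", "M": "medium", "S": "short"}

# <number of taxa><gap length type><id number>, e.g. "1000L1".
MODEL_RE = re.compile(r"^(\d+)([LMS])(\d+)$")


def parse_model_condition(name):
    """Split a model condition string into (taxa, gap length type, id)."""
    match = MODEL_RE.match(name)
    if match is None:
        raise ValueError("not a model condition string: %r" % (name,))
    taxa, gap, ident = match.groups()
    return int(taxa), GAP_TYPES[gap], int(ident)


def archive_name(name):
    """Return the compressed file name for a model condition."""
    parse_model_condition(name)  # validate the string first
    return name + ".tar.bz2"
```

Under these assumptions, parse_model_condition("1000L1") gives (1000, "long", 1) and archive_name("1000L1") gives "1000L1.tar.bz2".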

After extracting a compressed file, the directory structure <model condition parameter string>/R<replicate number>/
will have the following files:

Model conditions, by number of taxa and gap length type:

1000 long:    1000L1  1000L2  1000L3  1000L4  1000L5
1000 medium:  1000M1  1000M2  1000M3  1000M4  1000M5
1000 short:   1000S1  1000S2  1000S3  1000S4  1000S5
500 long:     500L1   500L2   500L3   500L4   500L5
500 medium:   500M1   500M2   500M3   500M4   500M5
500 short:    500S1   500S2   500S3   500S4   500S5
100 long:     100L1   100L2
100 medium:   100M1   100M2   100M3
100 short:    100S1   100S2
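Once a model condition has been extracted, its replicate directories can be listed programmatically. This Python sketch is a hypothetical helper, not part of the distribution; it makes no assumption about where replicate numbering starts and simply collects every subdirectory named R followed by digits.

```python
import re
from pathlib import Path

# Replicate directories are named R<replicate number>.
REPLICATE_RE = re.compile(r"^R(\d+)$")


def replicate_dirs(model_dir):
    """Return the replicate directories of one extracted model condition,
    sorted by replicate number (so R2 comes before R10)."""
    found = []
    for child in Path(model_dir).iterdir():
        match = REPLICATE_RE.match(child.name)
        if match and child.is_dir():
            found.append((int(match.group(1)), child))
    found.sort(key=lambda pair: pair[0])
    return [path for _, path in found]
```

For example, replicate_dirs("1000L1") would return one path per replicate found under an extracted 1000L1 directory.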



Biological datasets

There are 6 biological datasets. Each dataset consists of either DNA or RNA sequences from a variety of markers and taxa.
For each dataset, an associated link to a tarball is listed below. The tarball contains a single directory <dataset name>/R0/ that contains the following files:


rRNA datasets
23S.M.aa_ag
23S.M
16S.M
16S.M.aa_ag
23S.E
23S.E.aa_ag

For reference purposes, the original Gutell lab CRW datasets can be accessed at the following links. Please note that these original datasets differ from the cleaned datasets above, which were the datasets actually used in the experiments. Furthermore, these original uncleaned curated alignments are not in FASTA format, and are instead in a more verbose GenBank format explained here.

Original uncleaned datasets
23S.M.aa_ag
23S.M
16S.M
16S.M.aa_ag
23S.E
23S.E.aa_ag




tds executable

libstdc++-3 is required to run this program.
This program has only been verified to work on
Ubuntu 8.04 LTS "Hardy Heron" for 32-bit processors.
tds was originally developed by Prof. Daniel Huson and
was modified by Prof. Usman Roshan.

tds is available here.

To obtain usage information, run:

./usmantds2 -h

To perform the ultrametricity deviation calculation with parameter <u>, run:

./usmantds2 -id tree -od tree -i <newick tree> -c stretch -cm <u>

To obtain the diameter of a tree, run:

./usmantds2 -id tree -od dist -i <newick tree> -o <output file>

To scale all branch lengths by a constant <factor>, run:

./usmantds2 -id tree -od tree -i <newick tree> -c scale -cm <factor>
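For batch use, the three invocations above could be wrapped in a small script. The following Python sketch only assembles the argument lists shown above and hands them to subprocess; the function names are hypothetical, the flags are copied verbatim from the commands above, and the behavior of the binary itself has not been verified here.

```python
import subprocess

# Path to the tds executable, as used in the commands above.
TDS = "./usmantds2"


def stretch_command(tree_file, u):
    """Arguments for the ultrametricity deviation calculation with parameter u."""
    return [TDS, "-id", "tree", "-od", "tree", "-i", str(tree_file),
            "-c", "stretch", "-cm", str(u)]


def diameter_command(tree_file, output_file):
    """Arguments for obtaining the diameter of a tree."""
    return [TDS, "-id", "tree", "-od", "dist", "-i", str(tree_file),
            "-o", str(output_file)]


def scale_command(tree_file, factor):
    """Arguments for scaling all branch lengths by a constant factor."""
    return [TDS, "-id", "tree", "-od", "tree", "-i", str(tree_file),
            "-c", "scale", "-cm", str(factor)]


def run(command):
    """Run one tds command; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(command, check=True, capture_output=True, text=True)
```

Passing the arguments as a list avoids shell quoting problems when tree file names contain spaces.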



Contact

If you have questions or comments, please contact tandy<at>cs<dot>utexas<dot>edu.



Old SATe program

If you are interested in running SATe on your own datasets,
we recommend that you use the newer graphical interface
version of SATe above, instead of this older program.

For archival purposes only, the original version 1.0 implementation of the
SATe algorithm is located here.
After decompressing that file, read the file
./sate_dist/README and follow its instructions to run SATe.

Most modern desktop machines should support SATe analysis on smaller
datasets of several hundred taxa and several hundred aligned sites. A
desktop machine with 4 GB main memory and at least 1 GB disk space for
the working directory is recommended for the largest datasets from the
paper and for datasets with similar numbers of taxa and sequence
lengths. These datasets generally have up to several thousand taxa and
several thousand aligned sites. If you encounter errors while
analyzing datasets with more taxa or longer sequences than that, you
can run SATe with the -s 2 option, which omits starting alignments
with high memory requirements.

SATe is distributed under the GPLv3 license. This distribution contains
other programs that are distributed under different licenses.
Please see ./sate_dist/LICENSE for more details.