Datasets and software for the paper "Multiple Sequence Alignment: a challenge for large-scale phylogenetics", Liu, Linder, and Warnow, 2010. PLoS Currents - Tree of Life.



How to open a compressed .tar.bz2 file

To extract a compressed <file> with suffix .tar.bz2, use the command:

tar xjf <file>
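
If the tar command is not available, the same extraction can be sketched with Python's standard tarfile module. The function name extract_tar_bz2 and the destination argument are illustrative assumptions and are not part of the distributed software. As with tar itself, only extract archives from a source you trust.

```python
import tarfile
from pathlib import Path


def extract_tar_bz2(archive_path, dest_dir="."):
    """Extract a .tar.bz2 archive into dest_dir and return its member names.

    Hypothetical helper: roughly equivalent to running `tar xjf <file>`
    from inside dest_dir.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:bz2" opens a bzip2-compressed tar archive for reading.
    with tarfile.open(archive_path, "r:bz2") as archive:
        names = archive.getnames()
        archive.extractall(path=dest)
    return names
```

For example, extract_tar_bz2("16S.T.tar.bz2", "data") would unpack a hypothetically named archive into ./data/ and return the list of paths it contained.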



Biological datasets

There are 9 rRNA datasets listed below. Each dataset's compressed file
contains a single directory ./<dataset name>/R0/ that contains the following files:

rRNA datasets
16S.B.ALL
16S.T
16S.3
16S.M.aa_ag
16S.M
23S.M
23S.M.aa_ag
23S.E.aa_ag
23S.E

For reference purposes, the original Gutell lab CRW datasets
can be accessed at the following links. Please note that
these original datasets differ from the cleaned datasets above,
which were the datasets actually used in the experiments.
Furthermore, these original uncleaned curated alignments are not
in FASTA format, and are instead in a more verbose GenBank format
explained here.

Original uncleaned datasets
16S.B.ALL
16S.T
16S.3
16S.M.aa_ag
16S.M
23S.M
23S.M.aa_ag
23S.E.aa_ag
23S.E




Simulated datasets

The first replicate of the nucleotide 16S simulation performed by Price et al. 2010 (doi:10.1371/journal.pone.0009490)
can be obtained at this link or from their publication's online supplementary website.



Reference trees, estimated trees, and estimated alignments

The reference trees, estimated trees, and estimated alignments for all datasets can be obtained here. The contents of the compressed file are organized in the following directory structure:
<dataset> is the name of a simulated or biological dataset from the study, and <method> is one of the following:
Note that not all <method> choices are available for all <dataset> choices. Please refer to the tables in the study for more details.



Missing branch rate calculation program

The program CompareTree.pl, which was used to calculate 1 - MBR
(where MBR is the missing branch rate between two trees),
can be obtained here.
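
To make the quantity concrete, here is a minimal sketch of a missing branch rate calculation. It is not CompareTree.pl and does not read Newick files. It assumes the usual definition of MBR as the fraction of the reference tree's internal branches (non-trivial bipartitions of the leaf set) that are absent from the estimated tree, and it assumes trees are written as nested Python tuples of leaf names. All function names are hypothetical.

```python
def bipartitions(tree):
    """Return the non-trivial bipartitions induced by a tree's internal branches.

    Assumption: a tree is a nested tuple whose leaves are strings,
    e.g. (("A", "B"), "C", ("D", "E")).
    """
    splits = set()

    def leaves_below(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves_below(child) for child in node))

    all_leaves = leaves_below(tree)

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - below
        # Keep only branches that separate at least two leaves on each side.
        if len(below) >= 2 and len(other) >= 2:
            # Store the split as an unordered pair so rooting does not matter.
            splits.add(frozenset([below, other]))
        return below

    visit(tree)
    return splits


def missing_branch_rate(reference, estimated):
    """Fraction of reference bipartitions that are missing from the estimate."""
    ref = bipartitions(reference)
    est = bipartitions(estimated)
    if not ref:
        return 0.0
    return len(ref - est) / len(ref)


reference = (("A", "B"), "C", ("D", "E"))
estimated = (("A", "B"), "D", ("C", "E"))
print(1 - missing_branch_rate(reference, estimated))
```

In this toy example the reference tree has two internal branches (AB|CDE and DE|ABC) and the estimated tree recovers only the first, so MBR is 0.5 and 1 - MBR is 0.5.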



Alignment SP-FN error calculation program

The code used to calculate the alignment SP-FN error of an estimated alignment
with respect to a true alignment is available here. After compiling with javac,
the program is run as follows:

java BigDataMatrix -v <FASTA true alignment file with full path> -f <FASTA estimated alignment file with full path> -sp
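
For readers who want to see what the measure computes, the following is a minimal sketch, not the BigDataMatrix implementation. It assumes SP-FN is the fraction of homologous residue pairs in the true alignment that are not aligned in the estimated alignment, and it assumes alignments are already loaded as dictionaries mapping sequence names to gapped strings with "-" as the gap character. The function names are hypothetical. Because it stores every pair in memory, it is only practical for small alignments.

```python
from itertools import combinations


def homologous_pairs(alignment, gap="-"):
    """Return the set of residue pairs that share a column in the alignment.

    Each residue is identified by (sequence name, index in the ungapped
    sequence), so pairs are comparable across different alignments of the
    same sequences.
    """
    names = sorted(alignment)
    lengths = {len(alignment[name]) for name in names}
    if len(lengths) > 1:
        raise ValueError("all aligned sequences must have the same length")
    n_cols = lengths.pop() if lengths else 0

    next_index = dict.fromkeys(names, 0)  # ungapped position per sequence
    pairs = set()
    for col in range(n_cols):
        present = []
        for name in names:
            if alignment[name][col] != gap:
                present.append((name, next_index[name]))
                next_index[name] += 1
        # Names are visited in sorted order, so each pair has a fixed order.
        pairs.update(combinations(present, 2))
    return pairs


def sp_fn_error(true_alignment, estimated_alignment):
    """Fraction of true homologous pairs missing from the estimated alignment."""
    true_pairs = homologous_pairs(true_alignment)
    if not true_pairs:
        return 0.0
    estimated_pairs = homologous_pairs(estimated_alignment)
    return len(true_pairs - estimated_pairs) / len(true_pairs)


true_aln = {"s1": "AC-G", "s2": "ACTG"}
est_aln = {"s1": "ACG-", "s2": "ACTG"}
print(sp_fn_error(true_aln, est_aln))
```

Here the true alignment contains three homologous pairs and the estimated alignment recovers two of them, so the SP-FN error is 1/3.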



SATé program

A user-friendly graphical-interface version of SATé for
several popular operating systems is now available here.
You can also obtain the software we used to run these experiments directly from us (contact kliu<at>cs<dot>utexas<dot>edu).



Modified SATé program for 64-bit processors on large datasets

A modified version 1.1 of SATé is located here. This version enables 64-bit computation
and adds features that permit phylogenetic analysis of very large datasets such as
those analyzed in the paper. After decompressing the compressed file, read the file
./sate_dist/README and follow its instructions to run SATé version 1.1.

SATé is distributed under the GPLv3 license. This distribution contains
other programs that are distributed under different licenses.
Please see ./sate_dist/LICENSE for more details.



Contact

If you have questions or comments,
please contact kliu<at>cs<dot>utexas<dot>edu.