
Alignment Estimation Datasets


Here we present datasets that are challenging for molecular sequence alignment estimation methods and can be used to test them. Each dataset comes with a reference alignment: the simulated datasets include the true alignment, and the empirical datasets have highly reliable curated alignments. If you would like to contribute benchmarks to this resource, please email tandy@cs.utexas.edu.
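As a rough illustration of how a reference alignment can be used to test a method, the sketch below computes a sum-of-pairs style score: the fraction of residue pairs aligned in the reference that are also aligned in the estimated alignment. The function names (residue_pairs, sp_score) and the input layout (a dict mapping sequence name to gapped row, with "-" as the gap character) are assumptions made for this example and are not taken from any of the packages listed on this page.

```python
from itertools import combinations


def residue_pairs(alignment, gap="-"):
    """Return the set of aligned residue pairs implied by an alignment.

    `alignment` maps a sequence name to its gapped row (hypothetical layout).
    Each pair is ((name_i, residue_index_i), (name_j, residue_index_j)).
    """
    names = sorted(alignment)  # fixed order so pairs compare consistently
    lengths = {len(alignment[n]) for n in names}
    if len(lengths) > 1:
        raise ValueError("alignment rows have different lengths")
    n_cols = lengths.pop() if lengths else 0

    next_index = {n: 0 for n in names}  # ungapped position per sequence
    pairs = set()
    for col in range(n_cols):
        present = []
        for n in names:
            if alignment[n][col] != gap:
                present.append((n, next_index[n]))
                next_index[n] += 1
        # Every two residues sharing a column are treated as homologous.
        pairs.update(combinations(present, 2))
    return pairs


def sp_score(reference, estimated, gap="-"):
    """Fraction of reference residue pairs recovered by the estimate."""
    ref = residue_pairs(reference, gap)
    est = residue_pairs(estimated, gap)
    if not ref:
        return 0.0
    return len(ref & est) / len(ref)
```

For example, with reference rows {"a": "AC-G", "b": "ACTG"} and estimated rows {"a": "ACG-", "b": "ACTG"}, two of the three reference pairs are recovered. Published evaluations may use other criteria as well (for example, column-based scores), so this should be read as one possible measure rather than the one used in the cited studies.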

Simulated Data

Nucleic Acid Data

RNASim simulated data: Junhyong Kim's group has simulated RNA data available for download on their webpage. The datasets were generated to reflect secondary structure dynamics and vary in size from 128 to 16,384 taxa. However, larger subsets of aligned sequences can be obtained through the use of their software.

SATé - Simulated: The companion page to our SATé paper contains simulated nucleotide datasets. These datasets include 20 replicates for each of 37 distinct model conditions; each model condition is defined by a distribution of gap lengths (short, medium, or long) and number of taxa (100, 500, or 1000).
Described in [1]
Studied in [1]

indel-Seq-Gen version 2.1.0 Simulated: Large nucleotide benchmark datasets were simulated with indel-Seq-Gen version 2.1.0 under conditions similar to those of the datasets from the SATé paper. These datasets include 20 replicates for each of 125 distinct model conditions for 5000 taxa and 68 model conditions for 10,000 taxa.

Amino Acid Data

Wang et al. - TCBB 09: Li-San Wang's companion webpage to [3] has simulated amino acid datasets ranging from 20 to 100 taxa, generated using Rose [4] under various model conditions.
Described in [3]
Studied in [3]

Capella-Gutierrez et al. - Bioinformatics 2009: The trimAl software package provides a method for identifying sites within multiple sequence alignments that are considered unreliable and should be trimmed (masked) before the alignment is given to a phylogenetic estimation method. The Bioinformatics paper [11] studied this method on simulated AA datasets [12] produced using Rose [4] under a range of model conditions: various numbers of sequences (up to 64), of various lengths, simulated from different tree topologies and divergence rates. The dataset includes the original "true" alignments (only for sets of 32 and 64 sequences) and trees [11].
Described in [12]
Studied in [11]
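To make the idea of trimming (masking) alignment columns concrete, here is a minimal sketch that drops columns whose gap fraction exceeds a threshold. This simple gap-fraction filter is an assumption chosen for illustration; it is not the trimAl method, and the function name mask_gappy_columns and its max_gap_fraction parameter are hypothetical.

```python
def mask_gappy_columns(rows, max_gap_fraction=0.5, gap="-"):
    """Remove columns whose fraction of gaps exceeds `max_gap_fraction`.

    `rows` is a list of equal-length gapped sequences (hypothetical layout).
    Returns (trimmed_rows, kept_column_indices).
    """
    if not rows:
        return [], []
    n_cols = len(rows[0])
    if any(len(r) != n_cols for r in rows):
        raise ValueError("alignment rows have different lengths")

    kept = []
    for col in range(n_cols):
        gaps = sum(1 for r in rows if r[col] == gap)
        if gaps / len(rows) <= max_gap_fraction:
            kept.append(col)

    # Rebuild each row from the retained columns only.
    trimmed = ["".join(r[c] for c in kept) for r in rows]
    return trimmed, kept
```

Returning the kept column indices alongside the trimmed rows makes it possible to map sites in the trimmed alignment back to the original "true" alignment when comparing the two.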

Empirical Data

Nucleic Acid Data

CRW data: Robin Gutell's Comparative RNA Website (CRW) has curated alignments on RNA datasets spanning the tree of life, which can be used as benchmarks for alignment estimation techniques.
Described in [5]
Studied in [1]

Amino Acid Data

BAliBASE: BAliBASE is a database of benchmark protein multiple sequence alignments.
Described in [7, 8]

BENCH: Robert Edgar has assembled a collection of several benchmark datasets (BAliBASE v3, PREFAB v4, OXBench, SABRE) for amino acid sequence alignment and standardized them to FASTA format for ease of use.
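Because these benchmarks are distributed in FASTA format, a reference alignment can be loaded with a few lines of code. The sketch below is a minimal reader under simple assumptions: header lines start with ">", the sequence name is the first whitespace-delimited token of the header, and sequence data may span several lines. The function name read_fasta is hypothetical, and the layout of individual benchmark files may differ from what is assumed here.

```python
def read_fasta(lines):
    """Parse FASTA text into a dict mapping sequence name to sequence.

    `lines` is any iterable of strings, such as an open file handle.
    Gap characters are kept, so aligned FASTA yields gapped rows.
    """
    records = {}
    name = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records
```

A file can be read with, for example, `with open(path) as handle: alignment = read_fasta(handle)`, where `path` points to a downloaded benchmark file.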

OXBench: OXBench contains benchmark datasets that can serve as inputs for protein sequence alignment methods.
Described in [10]

PREFAB: Robert Edgar's PREFAB contains benchmark amino acid datasets for multiple alignment methods.
Described in [9]

SABmark: This page has downloads for SABmark, which is a benchmark database for protein sequence alignment.



Links to external web sites are for datasets and software available through other laboratories and organizations. The respective labs and organizations are responsible for these datasets and software; please contact them if you have any problems or questions regarding their material. If you experience any problems with our datasets or software, please feel free to contact us at tandy@cs.utexas.edu.
Copyright © 2009-2010 Computational Phylogenetics Lab | ACES 3.304 | University of Texas | Austin, TX 78712