utcs Phylogenetics
Alignment Estimation Datasets
Overview
Here we present datasets that are challenging for molecular
sequence alignment estimation methods. These can be used to
test alignment methods. Each dataset is given with a reference
alignment. In the case of the simulated datasets, the true
alignment is provided. The empirical datasets have highly
reliable curated alignments. If you would like to contribute
benchmarks to this resource, please email tandy@cs.utexas.edu.
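As an illustration of how a reference alignment can be used to test an alignment method, the sketch below compares an estimated alignment against a reference by counting shared homologous residue pairs (a sum-of-pairs style score). This page does not prescribe an evaluation measure, so the choice of score is an assumption, and the function names (parse_fasta, homologous_pairs, sp_score) are hypothetical helpers written for this example. It assumes both alignments are in FASTA format, contain the same sequences, and use "-" as the gap character.

```python
from itertools import combinations


def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of name -> aligned sequence."""
    parts = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the sequence name.
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def homologous_pairs(alignment, gap="-"):
    """Return the set of residue pairs placed in the same column.

    Each residue is identified by (sequence name, index in the unaligned
    sequence), so the pairs are comparable across different alignments.
    """
    names = sorted(alignment)
    lengths = {len(alignment[n]) for n in names}
    if len(lengths) > 1:
        raise ValueError("all rows of an alignment must have equal length")
    ncols = lengths.pop() if lengths else 0
    next_index = {n: 0 for n in names}
    pairs = set()
    for col in range(ncols):
        present = []
        for n in names:
            if alignment[n][col] != gap:
                present.append((n, next_index[n]))
                next_index[n] += 1
        # Names are visited in sorted order, so each pair has one canonical form.
        pairs.update(combinations(present, 2))
    return pairs


def sp_score(estimated, reference):
    """Fraction of reference homologous pairs recovered by the estimate."""
    ref_pairs = homologous_pairs(reference)
    if not ref_pairs:
        return 0.0
    est_pairs = homologous_pairs(estimated)
    return len(est_pairs & ref_pairs) / len(ref_pairs)


if __name__ == "__main__":
    reference = parse_fasta(">a\nAC-GT\n>b\nACTGT\n>c\nA--GT\n")
    estimated = parse_fasta(">a\nACG-T\n>b\nACTGT\n>c\nA-G-T\n")
    print(sp_score(estimated, reference))
```

A score of 1.0 means every homologous pair in the reference alignment also appears in the estimated alignment. Building the full pair set uses memory quadratic in the number of sequences per column, so for the larger datasets listed here a column-wise counting approach would be needed instead.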
Simulated Data
Nucleic Acid Data
RNASim simulated data:
Junhyong Kim's group has simulated RNA data available for download on
their webpage. The
datasets were generated to reflect secondary structure dynamics and vary
in size from 128 to 16,384 taxa.
However, larger subsets of aligned sequences can be obtained through the use of
their software.
SATé - Simulated: The companion page to our SATé paper contains simulated nucleotide datasets. These datasets include 20 replicates for each of 37 distinct model conditions; each model condition is defined by a distribution of gap lengths (short, medium, or long) and number of taxa (100, 500, or 1000).
Described in [1]
Studied in [1]
indel-Seq-Gen version 2.1.0 Simulated: Large nucleotide benchmark datasets were simulated using indel-Seq-Gen version 2.1.0 under conditions similar to those of the datasets from the SATé paper. These datasets include 20 replicates for each of 125 distinct model conditions for 5000 taxa and 68 model conditions for 10,000 taxa.
Amino Acid Data
Wang et al. - TCBB 09: Li-San Wang's
companion webpage
to [3] has amino acid data on sets ranging from 20 to 100 taxa, which were simulated using
Rose [4] under various model conditions.
Described in [3]
Studied in [3]
Capella-Gutierrez et al. - Bioinformatics 2009: The trimAl software package provides a method for identifying sites within multiple sequence alignments that are considered unreliable and should be trimmed (masked) before the alignment is given to a phylogenetic estimation method. The Bioinformatics paper [11] studied this method on simulated AA datasets [12] produced using Rose [4] under a range of model conditions: various numbers of sequences (up to 64), of various lengths, simulated from different tree topologies and divergence rates. The dataset includes the original "true" alignments (only for sets of 32 and 64 sequences) and trees [11].
Described in [12]
Studied in [11]
Empirical Data
Nucleic Acid Data
CRW data:
Robin Gutell's
Comparative RNA Website (CRW) has curated alignments on RNA datasets spanning the
tree of life, which can be used as benchmarks for alignment estimation techniques.
Described in [5]
Studied in [1]
Amino Acid Data
BAliBASE:
BAliBASE is a database of benchmark protein multiple sequence alignments.
Described in [7, 8]
BENCH: Robert Edgar has assembled a collection of several benchmark datasets (BALIBASE v3, PREFAB v4, OXBENCH, SABRE) for amino acid sequence alignment and standardized them to FASTA format for ease of use.
OXBench: OXBench contains benchmark datasets that can serve as inputs for protein sequence alignment methods.
Described in [10]
PREFAB: Robert Edgar's PREFAB contains benchmark amino acid datasets for multiple alignment methods.
Described in [9]
SABmark: This page has downloads for SABmark, which is a benchmark database for protein sequence alignment.
References
Disclaimer
Links to external web sites are for datasets and software available through
other laboratories and organizations. The respective labs and organizations
are responsible for these datasets and software; please contact them if you
have any problems or questions regarding their material. If you experience
any problems with our datasets or software, please feel free to
contact us at tandy@cs.utexas.edu.
Copyright © 2009-2010 Computational Phylogenetics Lab
ACES 3.304
University of Texas
Austin, TX 78712
Site help/questions/feedback/requests: e-mail Tandy Warnow