Simulated Datasets for Phylogeny Estimation


Here we present datasets that are challenging for phylogeny estimation methods. These can be used to test maximum likelihood (ML) and maximum parsimony (MP) methods with respect to their optimization criteria, or to test any method with respect to tree accuracy. Some of these datasets are given with a reference tree. For the simulated datasets, the true (model) tree and alignment are provided. Reference trees for empirical datasets with highly reliable (curated) alignments are typically produced by running RAxML with bootstrapping and retaining only the highly supported edges. If you would like to contribute benchmarks to this resource, please email tandy@cs.utexas.edu.
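The step of retaining only highly supported edges can be illustrated with a minimal Python sketch. It is not the RAxML workflow itself: the nested-tuple tree representation, the function names, and the support threshold of 75 are all assumptions made for illustration, since no specific cutoff is stated here.

```python
# Hypothetical tree representation (an assumption for this sketch):
#   a leaf is a taxon name (str);
#   an internal node is a tuple (support, children), where support is the
#   bootstrap support of the edge above the node (None for the root).

def collapse_low_support(node, threshold):
    """Return a copy of the tree with every internal edge whose support
    is below `threshold` contracted into its parent."""
    if isinstance(node, str):
        return node
    support, children = node
    kept = []
    for child in children:
        collapsed = collapse_low_support(child, threshold)
        if isinstance(collapsed, str) or collapsed[0] >= threshold:
            kept.append(collapsed)
        else:
            # Contract the weak edge: attach the grandchildren directly.
            kept.extend(collapsed[1])
    return (support, kept)


def to_newick(node):
    """Serialize the tree to a Newick string without the trailing ';'."""
    if isinstance(node, str):
        return node
    support, children = node
    label = "" if support is None else str(support)
    return "(" + ",".join(to_newick(c) for c in children) + ")" + label


# Illustrative five-taxon tree; the support values are made up.
tree = (None, [(95, ["A", "B"]), (40, ["C", (80, ["D", "E"])])])
collapsed = collapse_low_support(tree, 75)  # 75 is an assumed cutoff
print(to_newick(collapsed) + ";")
```

In this example the edge with support 40 is contracted, leaving a polytomy at the root, while the edges with support 95 and 80 are kept. Comparing an estimated tree against such a partially resolved reference tree penalizes only disagreements with well-supported edges.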

Simulated Data

FastTree - simulated: The companion page for Price et al.'s FastTree publications [1, 2] contains downloads for the simulated alignments used to infer phylogenies. These include protein alignments on 250, 1250, and 5000 taxa and a nucleotide alignment on 78,132 taxa. The simulated data were created using Rose [4].
SATé - Simulated: The companion page to our SATé paper contains simulated nucleotide datasets. These datasets include 20 replicates for each of 37 distinct model conditions; each model condition is defined by a distribution of gap lengths (short, medium, or long) and number of taxa (100, 500, or 1000). The downloads include both true alignments and potentially inferrable model trees (PIMT), making them useful for testing tree estimation techniques.
RNASim simulated data: Junhyong Kim's group has made simulated RNA data available for download on their webpage. The datasets were generated to reflect secondary structure dynamics and range in size from 128 to 16384 taxa. Larger subsets of aligned sequences can be obtained using their software.

indel-Seq-Gen version 2.1.0 Simulated: Large nucleotide benchmark datasets were simulated with indel-Seq-Gen version 2.1.0 under conditions similar to those of the datasets from the SATé paper. These datasets include 20 replicates for each of 125 distinct model conditions for 5000 taxa and 68 model conditions for 10000 taxa.
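As a quick check on how many alignments to expect from the indel-Seq-Gen collection, the counts above can be multiplied out. The variable names in this short sketch are illustrative only.

```python
# Counts taken from the description above.
REPLICATES_PER_CONDITION = 20
MODEL_CONDITIONS = {5000: 125, 10000: 68}  # number of taxa -> model conditions

# Number of simulated datasets for each taxon count.
datasets_per_size = {
    taxa: conditions * REPLICATES_PER_CONDITION
    for taxa, conditions in MODEL_CONDITIONS.items()
}
total_datasets = sum(datasets_per_size.values())

for taxa, count in datasets_per_size.items():
    print(f"{taxa} taxa: {count} datasets")
print(f"total: {total_datasets} datasets")
```

This gives 2500 datasets on 5000 taxa and 1360 datasets on 10000 taxa, 3860 in total.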



Links to external web sites are for datasets and software available through other laboratories and organizations. The respective labs and organizations are responsible for these datasets and software; please contact them if you have any problems or questions regarding their material. If you experience any problems with our datasets or software, please feel free to contact us at tandy@cs.utexas.edu.
