
Empirical Datasets for Testing Optimization-Based Methods

Overview

Here we present datasets that are challenging for maximum likelihood- and maximum parsimony-based phylogeny estimation methods. To contribute benchmarks to this resource, please email tandy@cs.utexas.edu.

Nucleic Acid Data

16S.B.ALL (27,643 sequences)

This alignment contains 27,643 ribosomal RNA sequences (each with 6,857 sites) of the 16S gene taken from bacteria.

Source: Gutell Lab CRW [1] ("Primary Alignments" section, Table 1)

Alignment Statistics
Number of Taxa: 27,643
Number of Sites: 6,857
Percent Indels: 80.0
Average Gap Length: 4.9
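
The page does not define how Percent Indels and Average Gap Length were computed. The sketch below shows one plausible reading, under assumptions that are not taken from the source: the indel character is "-", Percent Indels is the share of gap characters among all alignment cells, and Average Gap Length is the mean length of maximal runs of gaps within each sequence. The function name and the dictionary input format are hypothetical.

```python
import re

GAP = "-"  # assumed indel character; the page does not specify it


def alignment_stats(alignment):
    """Summarize an alignment given as a dict {taxon name: aligned sequence}."""
    seqs = list(alignment.values())
    n_taxa = len(seqs)
    n_sites = len(seqs[0])
    if any(len(s) != n_sites for s in seqs):
        raise ValueError("all sequences must have the same aligned length")

    # Lengths of maximal runs of consecutive gap characters, over all sequences.
    gap_runs = [len(run) for s in seqs
                for run in re.findall(re.escape(GAP) + "+", s)]
    total_gaps = sum(gap_runs)

    return {
        "taxa": n_taxa,
        "sites": n_sites,
        "percent_indels": 100.0 * total_gaps / (n_taxa * n_sites),
        "average_gap_length": total_gaps / len(gap_runs) if gap_runs else 0.0,
    }


# Toy alignment, not one of the datasets on this page.
toy = {"t1": "AC--GT", "t2": "A---GT", "t3": "ACGTGT"}
stats = alignment_stats(toy)
print(stats)
```

On the toy input there are two gap runs (lengths 2 and 3) in 18 cells, so the average gap length is 2.5 and the gap share is 5/18. Figures computed this way may differ from the reported ones if the authors used another definition.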

Reference Alignment: 16S.B.ALL.tar.bz2
Alignment method: based on secondary structure
Alignment modifications: removed sites consisting solely of indels; removed taxa consisting of 50% or more indel characters
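
The two alignment modifications can be sketched as simple filters. This is an illustration only: the source does not state the order of the steps, the indel character, or how the 50% threshold was measured, so the choices below ("-" as the indel character, gap fraction over the full aligned length, a second column pass after taxa are dropped) are assumptions, and all function names are hypothetical.

```python
GAP = "-"  # assumed indel character


def drop_all_gap_sites(alignment, gap=GAP):
    """Remove columns in which every sequence has a gap."""
    if not alignment:
        return {}
    seqs = list(alignment.values())
    keep = [i for i in range(len(seqs[0]))
            if any(s[i] != gap for s in seqs)]
    return {name: "".join(seq[i] for i in keep)
            for name, seq in alignment.items()}


def drop_gappy_taxa(alignment, threshold=0.5, gap=GAP):
    """Remove sequences whose gap fraction is at or above the threshold."""
    return {name: seq for name, seq in alignment.items()
            if seq and seq.count(gap) / len(seq) < threshold}


def clean_alignment(alignment):
    """Apply the filters in the order listed, then repeat the column pass,
    because dropping taxa can leave new all-gap columns behind."""
    step1 = drop_all_gap_sites(alignment)
    step2 = drop_gappy_taxa(step1)
    return drop_all_gap_sites(step2)


# Toy alignment, not one of the datasets on this page.
toy = {"a": "A-C-", "b": "A-G-", "c": "---T"}
cleaned = clean_alignment(toy)
print(cleaned)
```

In the toy case, taxon "c" is dropped for being two-thirds gaps after the first column pass, which in turn leaves the last column empty and removes it in the final pass.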

ML Solutions
Tree: 16S.B.ALL.raxml.nwk
ML score: -1589345.388
Estimation Time: 647.32 hours on a machine with access to 256 GB RAM
Estimated Transition Matrix: [ac ag at cg ct gt] = [0.824176 1.974111 1.198831 0.797367 3.046562 1.000000]
Estimated Rates-across-sites GAMMA distribution shape parameter: 1.000000
RAxML [2] version 7.2.6 was used to estimate an ML tree using the following command: "raxmlHPC -m GTRCAT -n identifier -s input_alignment -j"
RAxML version 7.2.6 was used to compute an ML score using the following command: "raxmlHPC -m GTRGAMMA -n identifier -s input_alignment -f e -t input_tree"
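
The two RAxML invocations form a search-then-score workflow: a tree search under GTRCAT, followed by a re-evaluation of the resulting tree under GTRGAMMA. A minimal sketch that assembles both command lines follows. It uses only the flags quoted above and does not run RAxML. The helper names are hypothetical, and "identifier", "input_alignment", and "input_tree" are the page's own placeholders.

```python
import shlex


def tree_search_command(alignment, run_name, binary="raxmlHPC"):
    """Argument list for the ML tree search (GTRCAT, with -j as quoted)."""
    return [binary, "-m", "GTRCAT", "-n", run_name, "-s", alignment, "-j"]


def scoring_command(alignment, tree, run_name, binary="raxmlHPC"):
    """Argument list for scoring a fixed tree under GTRGAMMA (-f e -t tree)."""
    return [binary, "-m", "GTRGAMMA", "-n", run_name,
            "-s", alignment, "-f", "e", "-t", tree]


search = tree_search_command("input_alignment", "identifier")
score = scoring_command("input_alignment", "input_tree", "identifier")

# Print the commands in shell form; pass the lists to subprocess.run to execute.
print(shlex.join(search))
print(shlex.join(score))
```

Building argument lists rather than shell strings avoids quoting problems if file names contain spaces. When both steps run in the same directory, giving each step its own run name is probably advisable so that output files do not collide.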

MP Solutions
Tree:
Parsimony score:
Estimation Time:
Estimation Method:
Parsimony score computed by:


16S.T (7,350 sequences)

This alignment contains 7,350 ribosomal RNA sequences (each with 11,856 sites) of the 16S gene taken from three phylogenetic domains.

Source: Gutell Lab CRW [1] ("Primary Alignments" section, Table 1)

Alignment Statistics
Number of Taxa: 7,350
Number of Sites: 11,856
Percent Indels: 87.4
Average Gap Length: 12.1

Reference Alignment: 16S.T.tar.bz2
Alignment method: based on secondary structure
Alignment modifications: removed sites consisting solely of indels; removed taxa consisting of 50% or more indel characters

ML Solutions
Tree: 16S.T.raxml.nwk
ML score: -1727410.940
Estimation Time: 305.30 hours on a machine with access to 32 GB RAM
Estimated Transition Matrix: [ac ag at cg ct gt] = [0.890623 1.923443 1.397081 0.849050 3.169051 1.000000]
Estimated Rates-across-sites GAMMA distribution shape parameter: 1.000000
RAxML [2] version 7.2.6 was used to estimate an ML tree using the following command: "raxmlHPC -m GTRCAT -n identifier -s input_alignment -j"
RAxML version 7.2.6 was used to compute an ML score using the following command: "raxmlHPC -m GTRGAMMA -n identifier -s input_alignment -f e -t input_tree"
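
The six values listed as the estimated transition matrix correspond to the pairwise exchange rates of a GTR model, with the gt rate fixed at 1. A sketch of how such rates combine with base frequencies into a 4x4 instantaneous rate matrix follows. The base frequencies are not reported on this page, so uniform frequencies are used purely as a placeholder assumption, and the function name is hypothetical.

```python
BASES = "ACGT"


def gtr_rate_matrix(rates, freqs):
    """Build a GTR rate matrix Q with Q[x][y] = rate(xy) * freq(y) for x != y,
    diagonal entries making each row sum to zero, scaled so that the
    expected substitution rate at stationarity equals 1."""
    q = {x: {y: 0.0 for y in BASES} for x in BASES}
    for x in BASES:
        for y in BASES:
            if x != y:
                key = "".join(sorted((x, y))).lower()  # e.g. ("T", "C") -> "ct"
                q[x][y] = rates[key] * freqs[y]
        q[x][x] = -sum(q[x][y] for y in BASES if y != x)
    mu = -sum(freqs[x] * q[x][x] for x in BASES)
    return {x: {y: q[x][y] / mu for y in BASES} for x in BASES}


# Exchange rates reported above for 16S.T.
rates_16s_t = {"ac": 0.890623, "ag": 1.923443, "at": 1.397081,
               "cg": 0.849050, "ct": 3.169051, "gt": 1.000000}
# Placeholder assumption: the estimated base frequencies are not given here.
uniform = {b: 0.25 for b in BASES}

Q = gtr_rate_matrix(rates_16s_t, uniform)
for x in BASES:
    print(x, [round(Q[x][y], 4) for y in BASES])
```

With uniform frequencies the off-diagonal entries keep the reported ratios, so the C to T rate is 3.169051 times the G to T rate. With the actual estimated frequencies the entries would differ.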

MP Solutions
Tree:
Parsimony score:
Estimation Time:
Estimation Method:
Parsimony score computed by:


16S.3 (6,323 sequences)

This alignment contains 6,323 ribosomal RNA sequences (each with 8,716 sites) of the 16S gene taken from three phylogenetic domains.

Source: Gutell Lab CRW [1] ("Primary Alignments" section, Table 1)

Alignment Statistics
Number of Taxa: 6,323
Number of Sites: 8,716
Percent Indels: 82.1
Average Gap Length: 9.4

Reference Alignment: 16S.3.tar.bz2
Alignment method: based on secondary structure
Alignment modifications: removed sites consisting solely of indels; removed taxa consisting of 50% or more indel characters

ML Solutions
Tree: 16S.3.raxml.nwk
ML score: -1376371.670
Estimation Time: 322.09 hours on a machine with access to 32 GB RAM
Estimated Transition Matrix: [ac ag at cg ct gt] = [0.857074 2.036266 1.333605 0.922316 3.214373 1.000000]
Estimated Rates-across-sites GAMMA distribution shape parameter: 1.000000
RAxML [2] version 7.2.6 was used to estimate an ML tree using the following command: "raxmlHPC -m GTRCAT -n identifier -s input_alignment -j"
RAxML version 7.2.6 was used to compute an ML score using the following command: "raxmlHPC -m GTRGAMMA -n identifier -s input_alignment -f e -t input_tree"

MP Solutions
Tree:
Parsimony score:
Estimation Time:
Estimation Method:
Parsimony score computed by:


16S.M.aa_ag (1,028 sequences)

This alignment contains 1,028 ribosomal RNA sequences (each with 4,907 sites) of the 16S gene sampled from eucarya mitochondria. [3]

Source: Gutell Lab CRW [1] ("Alignments Used in Specific Analyses" section)

Alignment Statistics
Number of Taxa: 1,028
Number of Sites: 4,907
Percent Indels: 82.6
Average Gap Length: 22.0

Reference Alignment: 16S.M.aa_ag.tar.bz2
Alignment method: based on secondary structure
Alignment modifications: removed sites consisting solely of indels; removed taxa consisting of 50% or more indel characters

ML Solutions
Tree: 16S.M.aa_ag.raxml.nwk
ML score: -279439.630
Estimation Time: 7.54 hours on a machine with access to 4 GB RAM
Estimated Transition Matrix: [ac ag at cg ct gt] = [1.733083 3.610794 2.690970 0.552364 6.884599 1.000000]
Estimated Rates-across-sites GAMMA distribution shape parameter: 1.000000
RAxML [2] version 7.2.6 was used to estimate an ML tree using the following command: "raxmlHPC -m GTRCAT -n identifier -s input_alignment -j"
RAxML version 7.2.6 was used to compute an ML score using the following command: "raxmlHPC -m GTRGAMMA -n identifier -s input_alignment -f e -t input_tree"

MP Solutions
Tree:
Parsimony score:
Estimation Time:
Estimation Method:
Parsimony score computed by:


16S.M (901 sequences)

This alignment contains 901 ribosomal RNA sequences (each with 4,722 sites) of the 16S gene sampled from mitochondria.

Source: Gutell Lab CRW [1] ("Primary Alignments" section, Table 1)

Alignment Statistics
Number of Taxa: 901
Number of Sites: 4,722
Percent Indels: 78.1
Average Gap Length: 17.2

Reference Alignment: 16S.M.tar.bz2
Alignment method: based on secondary structure
Alignment modifications: removed sites consisting solely of indels; removed taxa consisting of 50% or more indel characters

ML Solutions
Tree: 16S.M.raxml.nwk
ML score: -288262.556
Estimation Time: 5.90 hours on a machine with access to 4 GB RAM
Estimated Transition Matrix: [ac ag at cg ct gt] = [1.473491 2.906494 2.165754 0.532959 5.374066 1.000000]
Estimated Rates-across-sites GAMMA distribution shape parameter: 1.000000
RAxML [2] version 7.2.6 was used to estimate an ML tree using the following command: "raxmlHPC -m GTRCAT -n identifier -s input_alignment -j"
RAxML version 7.2.6 was used to compute an ML score using the following command: "raxmlHPC -m GTRGAMMA -n identifier -s input_alignment -f e -t input_tree"
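
Before scoring a tree against an alignment, it can be useful to confirm that the tree's leaf set matches the alignment's taxa, for example that 16S.M.raxml.nwk has the 901 taxa of the reference alignment. The sketch below does this for simple Newick strings. It assumes unquoted leaf labels without parentheses, commas, colons, or whitespace, and the function names are hypothetical.

```python
import re

# A leaf label directly follows "(" or ","; labels that follow ")" are
# internal node labels or support values and are ignored.
_LEAF = re.compile(r"[(,]\s*([^():,;\s]+)")


def newick_leaf_names(newick):
    """Return leaf labels in order of appearance (unquoted labels only)."""
    return _LEAF.findall(newick)


def compare_tree_to_taxa(newick, taxa):
    """Report taxa missing from the tree and leaves absent from the taxa."""
    leaves = newick_leaf_names(newick)
    return {
        "n_leaves": len(leaves),
        "missing_from_tree": sorted(set(taxa) - set(leaves)),
        "extra_in_tree": sorted(set(leaves) - set(taxa)),
    }


# Toy tree, not one of the trees on this page.
toy_tree = "((A:0.1,B:0.2)0.95:0.3,C:0.4);"
report = compare_tree_to_taxa(toy_tree, ["A", "B", "C", "D"])
print(newick_leaf_names(toy_tree))
print(report)
```

For trees with quoted labels or embedded comments, a full Newick parser from a phylogenetics library would be a safer choice than this regular expression.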

MP Solutions
Tree:
Parsimony score:
Estimation Time:
Estimation Method:
Parsimony score computed by:


23S.M (278 sequences)

This alignment contains 278 ribosomal RNA sequences (each with 10,738 sites) of the 23S gene sampled from mitochondria.

Source: Gutell Lab CRW [1] ("Primary Alignments" section, Table 1)

Alignment Statistics
Number of Taxa: 278
Number of Sites: 10,738
Percent Indels: 83.7
Average Gap Length: 31.9

Reference Alignment: 23S.M.tar.bz2
Alignment method: based on secondary structure
Alignment modifications: removed sites consisting solely of indels; removed taxa consisting of 50% or more indel characters

ML Solutions
Tree: 23S.M.raxml.nwk
ML score: -241189.477
Estimation Time: 2.33 hours on a machine with access to 4 GB RAM
Estimated Transition Matrix: [ac ag at cg ct gt] = [1.316435 2.243019 2.474016 0.411570 3.588729 1.000000]
Estimated Rates-across-sites GAMMA distribution shape parameter: 1.000000
RAxML [2] version 7.2.6 was used to estimate an ML tree using the following command: "raxmlHPC -m GTRCAT -n identifier -s input_alignment -j"
RAxML version 7.2.6 was used to compute an ML score using the following command: "raxmlHPC -m GTRGAMMA -n identifier -s input_alignment -f e -t input_tree"

MP Solutions
Tree:
Parsimony score:
Estimation Time:
Estimation Method:
Parsimony score computed by:


23S.M.aa_ag (263 sequences)

This alignment contains 263 ribosomal RNA sequences (each with 10,305 sites) of the 23S gene sampled from eucarya mitochondria. [3]

Source: Gutell Lab CRW [1] ("Alignments Used in Specific Analyses" section)

Alignment Statistics
Number of Taxa: 263
Number of Sites: 10,305
Percent Indels: 83.5
Average Gap Length: 34.2

Reference Alignment: 23S.M.aa_ag.tar.bz2
Alignment method: based on secondary structure
Alignment modifications: removed sites consisting solely of indels; removed taxa consisting of 50% or more indel characters

ML Solutions
Tree: 23S.M.aa_ag.raxml.nwk
ML score: -226619.213
Estimation Time: 2.25 hours on a machine with access to 4 GB RAM
Estimated Transition Matrix: [ac ag at cg ct gt] = [1.368771 2.189628 2.382343 0.412065 3.613153 1.000000]
Estimated Rates-across-sites GAMMA distribution shape parameter: 1.000000
RAxML [2] version 7.2.6 was used to estimate an ML tree using the following command: "raxmlHPC -m GTRCAT -n identifier -s input_alignment -j"
RAxML version 7.2.6 was used to compute an ML score using the following command: "raxmlHPC -m GTRGAMMA -n identifier -s input_alignment -f e -t input_tree"

MP Solutions
Tree:
Parsimony score:
Estimation Time:
Estimation Method:
Parsimony score computed by:


23S.E.aa_ag (144 sequences)

This alignment contains 144 ribosomal RNA sequences (each with 8,619 sites) of the 23S gene sampled from eucarya nuclei. [3]

Source: Gutell Lab CRW [1] ("Alignments Used in Specific Analyses" section)

Alignment Statistics
Number of Taxa: 144
Number of Sites: 8,619
Percent Indels: 61.1
Average Gap Length: 13.5

Reference Alignment: 23S.E.aa_ag.tar.bz2
Alignment method: based on secondary structure
Alignment modifications: removed sites consisting solely of indels; removed taxa consisting of 50% or more indel characters

ML Solutions
Tree: 23S.E.aa_ag.raxml.nwk
ML score: -194710.628
Estimation Time: 0.99 hours on a machine with access to 4 GB RAM
Estimated Transition Matrix: [ac ag at cg ct gt] = [0.881197 1.846993 1.255713 1.351159 3.331208 1.000000]
Estimated Rates-across-sites GAMMA distribution shape parameter: 1.000000
RAxML [2] version 7.2.6 was used to estimate an ML tree using the following command: "raxmlHPC -m GTRCAT -n identifier -s input_alignment -j"
RAxML version 7.2.6 was used to compute an ML score using the following command: "raxmlHPC -m GTRGAMMA -n identifier -s input_alignment -f e -t input_tree"

MP Solutions
Tree:
Parsimony score:
Estimation Time:
Estimation Method:
Parsimony score computed by:


23S.E (117 sequences)

This alignment contains 117 ribosomal RNA sequences (each with 9,079 sites) of the 23S gene sampled from eukaryotes.

Source: Gutell Lab CRW [1] ("Primary Alignments" section, Table 1)

Alignment Statistics
Number of Taxa: 117
Number of Sites: 9,079
Percent Indels: 59.7
Average Gap Length: 12.6

Reference Alignment: 23S.E.tar.bz2
Alignment method: based on secondary structure
Alignment modifications: removed sites consisting solely of indels; removed taxa consisting of 50% or more indel characters

ML Solutions
Tree: 23S.E.raxml.nwk
ML score: -190752.258
Estimation Time: 0.78 hours on a machine with access to 4 GB RAM
Estimated Transition Matrix: [ac ag at cg ct gt] = [0.873682 1.922131 1.306106 1.299669 3.566717 1.000000]
Estimated Rates-across-sites GAMMA distribution shape parameter: 1.000000
RAxML [2] version 7.2.6 was used to estimate an ML tree using the following command: "raxmlHPC -m GTRCAT -n identifier -s input_alignment -j"
RAxML version 7.2.6 was used to compute an ML score using the following command: "raxmlHPC -m GTRGAMMA -n identifier -s input_alignment -f e -t input_tree"

MP Solutions
Tree:
Parsimony score:
Estimation Time:
Estimation Method:
Parsimony score computed by:

References

Disclaimer

Links to external web sites are for datasets and software available through other laboratories and organizations. The respective labs and organizations are responsible for these datasets and software; please contact them if you have any problems or questions regarding their material. If you experience any problems with our datasets or software, please feel free to contact us at tandy@cs.utexas.edu.
Copyright © 2009-2010 Computational Phylogenetics Lab | ACES 3.304 | University of Texas | Austin, TX 78712