Datasets


SATe experiment simulated nucleotide datasets README

There are 37 model conditions, with either 1000, 500, or 100 taxa, and either long, medium, or short gap length types. Each model condition is named by the string "<number of taxa><gap length type: L for long, M for medium, S for short><id number>"; for example, "1000L1" is a 1000-taxon condition with long gaps and id number 1.
Each model condition has 20 replicate datasets.
For each model condition, an associated link to a tarball is listed below. The tarball contains the 20 replicate datasets for that model condition.
After decompressing a tarball, the directory structure ./R<replicate number>/ will have the following files:

1000 long: 1000L1, 1000L2, 1000L3, 1000L4, 1000L5
1000 medium: 1000M1, 1000M2, 1000M3, 1000M4, 1000M5
1000 short: 1000S1, 1000S2, 1000S3, 1000S4, 1000S5
500 long: 500L1, 500L2, 500L3, 500L4, 500L5
500 medium: 500M1, 500M2, 500M3, 500M4, 500M5
500 short: 500S1, 500S2, 500S3, 500S4, 500S5
100 long: 100L1, 100L2
100 medium: 100M1, 100M2, 100M3
100 short: 100S1, 100S2
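
The naming scheme described above can be split back into its three parts mechanically. The following is a minimal sketch in Python; the names `ModelCondition` and `parse_model_condition` are illustrative and are not part of the datasets or of any tool distributed with them.

```python
import re
from typing import NamedTuple

# Gap length codes as defined in the naming scheme above.
GAP_LENGTH_TYPES = {"L": "long", "M": "medium", "S": "short"}

# <number of taxa><gap length type><id number>, e.g. "1000L1".
_NAME_RE = re.compile(r"(\d+)([LMS])(\d+)")


class ModelCondition(NamedTuple):
    """Hypothetical record type for a parsed model condition name."""
    num_taxa: int
    gap_length_type: str
    id_number: int


def parse_model_condition(name: str) -> ModelCondition:
    """Split a model condition name such as "500M3" into its parts."""
    match = _NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a model condition name: {name!r}")
    taxa, gap_code, id_number = match.groups()
    return ModelCondition(int(taxa), GAP_LENGTH_TYPES[gap_code], int(id_number))


print(parse_model_condition("1000L1"))
```

Names that do not follow the scheme raise `ValueError`, which makes typos in a list of conditions easy to catch.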

These model conditions have the following empirical statistics.


Column groups: True Alignment Statistics, (Set-wise) PIMT Statistics, (Branch-wise) PIMT Statistics

Model  Avg p-dist  Max p-dist  % indels  No. of cols  Avg gap len  Median gap len  e-ANHD  e-gap  True tree percent reso
1000L1 69.5% 76.9% 73.2% 3817.5 13.6 10.7 25.3% 0.2% 99.7%
1000L2 69.6% 76.9% 57.7% 2406.9 11.6 9.3 25.1% 0.1% 99.6%
1000L3 68.7% 76.3% 85.2% 7042.8 20.0 16.1 22.8% 0.3% 99.7%
1000L4 50.0% 60.8% 58.6% 2446.2 11.4 9.2 6.5% 0.1% 97.6%
1000L5 49.6% 60.6% 42.6% 1764.8 10.4 8.0 6.5% 0.0% 97.4%
1000M1 69.5% 76.9% 74.4% 3965.0 10.1 8.0 24.8% 0.2% 99.6%
1000M2 68.4% 76.2% 74.2% 3972.3 10.3 7.9 22.1% 0.2% 99.5%
1000M3 66.0% 74.1% 62.8% 2722.6 7.6 5.6 18.0% 0.1% 99.4%
1000M4 49.5% 60.6% 60.5% 2570.6 7.6 5.8 6.6% 0.1% 97.6%
1000M5 49.9% 60.2% 44.2% 1810.0 6.2 4.4 6.8% 0.1% 97.8%
1000S1 69.4% 76.8% 53.0% 2141.2 4.0 3.4 24.5% 0.1% 99.7%
1000S2 69.3% 76.8% 35.0% 1546.0 2.9 2.4 23.8% 0.0% 99.7%
1000S3 68.6% 76.3% 37.0% 1595.2 2.9 2.4 22.3% 0.0% 99.6%
1000S4 50.1% 60.8% 24.6% 1328.1 2.5 2.0 6.7% 0.0% 97.9%
1000S5 49.8% 61.1% 14.1% 1165.2 2.3 2.0 6.5% 0.0% 97.8%
500L1 67.0% 74.9% 80.5% 5419.3 17.0 13.0 21.2% 0.5% 99.5%
500L2 65.7% 73.9% 80.9% 5475.9 16.9 12.7 19.3% 0.5% 99.4%
500L3 65.8% 74.1% 68.6% 3306.9 12.9 10.3 19.4% 0.3% 99.4%
500L4 49.9% 60.7% 69.6% 3390.2 13.3 10.0 7.5% 0.3% 98.3%
500L5 49.7% 60.6% 51.0% 2075.3 10.9 8.6 7.1% 0.1% 97.8%
500M1 67.4% 74.8% 70.9% 3522.3 9.1 6.8 22.7% 0.3% 99.5%
500M2 65.8% 74.0% 69.7% 3394.5 9.0 6.5 19.4% 0.3% 99.5%
500M3 65.7% 73.7% 53.5% 2185.2 6.8 4.9 19.4% 0.2% 99.4%
500M4 49.1% 60.5% 52.5% 2154.6 6.7 4.9 7.0% 0.1% 98.0%
500M5 49.5% 60.5% 35.6% 1568.2 5.7 4.2 7.3% 0.1% 98.1%
500S1 67.3% 75.2% 48.7% 1962.2 3.6 3.0 22.2% 0.2% 99.5%
500S2 65.5% 73.7% 48.0% 1935.8 3.6 2.9 19.1% 0.2% 99.5%
500S3 65.6% 73.6% 31.7% 1468.2 2.7 2.0 19.4% 0.1% 99.5%
500S4 49.2% 60.6% 31.3% 1459.8 2.7 2.0 7.2% 0.1% 98.2%
500S5 49.8% 60.4% 18.7% 1231.0 2.3 1.9 7.3% 0.0% 98.0%
100L1 62.3% 71.0% 56.4% 2459.7 11.2 8.1 20.5% 0.8% 99.7%
100L2 32.2% 44.4% 53.6% 2281.9 11.4 8.5 4.3% 0.7% 95.9%
100M1 62.6% 71.0% 55.1% 2316.8 7.1 4.8 20.4% 0.8% 99.7%
100M2 53.1% 63.3% 53.6% 2262.9 7.0 4.8 12.7% 0.8% 99.3%
100M3 45.0% 56.4% 38.9% 1681.9 5.9 4.2 8.1% 0.4% 98.4%
100S1 58.0% 67.5% 40.4% 1698.2 3.1 2.6 16.1% 0.6% 99.0%
100S2 32.0% 43.6% 57.2% 2418.3 4.6 3.6 4.6% 1.1% 96.8%
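
Each row of the table above has ten whitespace-separated fields, so it can be read into a typed record. This is a sketch only: the `ModelStats` class and its field names are assumptions derived from the column headers, and percentages are stored as plain numbers (69.5 for 69.5%).

```python
from dataclasses import dataclass


@dataclass
class ModelStats:
    """Hypothetical record for one row of the statistics table."""
    model: str
    avg_p_dist: float              # percent
    max_p_dist: float              # percent
    pct_indels: float              # percent
    num_cols: float
    avg_gap_len: float
    median_gap_len: float
    e_anhd: float                  # percent
    e_gap: float                   # percent
    true_tree_percent_reso: float  # percent


def parse_stats_row(line: str) -> ModelStats:
    """Parse one table row; a trailing '%' on a value is dropped."""
    fields = line.split()
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}: {line!r}")
    model, *values = fields
    numbers = [float(value.rstrip("%")) for value in values]
    return ModelStats(model, *numbers)


row = parse_stats_row("1000L1 69.5% 76.9% 73.2% 3817.5 13.6 10.7 25.3% 0.2% 99.7%")
print(row.model, row.num_cols, row.avg_gap_len)
```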





Main biological DNA/RNA datasets README
There are 171 biological datasets. Each dataset consists of either DNA or RNA sequences from a variety of markers and taxa.
For each dataset, an associated link to a tarball is listed below. The tarball contains a single directory ./R0/ that contains the following files:


rRNA datasets
tRNA datasets
Intronic datasets
Avian datasets
Miscellaneous
16S.A.ALL
16S.A.aa_ag
16S.A.crenarchaeota
16S.A.euryarchaeota
16S.B.aquificae
16S.B.bacteroidetes-chlorobi
16S.B.chlamydiae-verrucomicrobia
16S.B.chloroflexi
16S.B.clostridia
16S.B.cyanobacteria
16S.B.deinococcus-thermus
16S.B.delta-epsilon-proteobacteria
16S.B.fibrobacteres-acidiobacteria
16S.B.fusobacteria
16S.B.mollicutes
16S.B.nitrospirae
16S.B.planctomycetes
16S.B.spirochaetes
16S.B.thermotogae
16S.C.aa_ag
16S.C
16S.E.aa_ag
16S.E.acanthamoeba
16S.E.bacillariophyta
16S.E.blastocystis
16S.E.coccidia
16S.E.entamoebidae
16S.E.giardiinae
16S.E.haemosporida
16S.E.haplosporida
16S.E.heterolobosea
16S.E.litostomatea
16S.E.lobosea
16S.E.oomycetes
16S.E.parabasalidea
16S.E.perkinsea
16S.E.piroplasmida
16S.E.trypanosomatidae
16S.M.aa_ag
16S.M
16S.fungint.all
16S.fungint.nint
16S.fungint.wint
23S.3
23S.A.aa_ag
23S.A
23S.B.aa_ag
23S.B.gammaproteobacteria
23S.B.proteobacteria
23S.C.aa_ag
23S.C
23S.E.aa_ag
23S.E
23S.M.aa_ag
23S.M
23S.fungint.all
23S.fungint.nint
5S.3
5S.A
5S.B.ALL
5S.B.actinobacteria
5S.B.alphaproteobacteria
5S.B.bacillusclostridium
5S.B.betaproteobacteria
5S.B.firmicutes
5S.B.gammaproteobacteria
5S.B.proteobacteria
5S.C
5S.E
5S.M
5S.T
seed.16S.A
seed.16S.B
seed.23S.A
trna.A.nm
trna.C.nm
trna.D.nm
trna.E.nm
trna.F.nm
trna.G.nm
trna.H.nm
trna.I.nm
trna.K.nm
trna.M.nm
trna.N.nm
trna.P.nm
trna.Q.nm
trna.W.nm
trna.Y1.nm
I1.A
I1.B
I1.C1
I1.C2
I1.C3
I1.D
I1.E
I2.A
I2.B
avian_ACA_all
avian_ACA_intron1
avian_ALD_all
avian_ALD_intron3
avian_ALD_intron4
avian_ALD_intron5
avian_ALD_intron6
avian_ALD_intron7
avian_BDNF_all
avian_CHC_all
avian_CHC_intron4
avian_CHC_intron5
avian_CMYC_UTR
avian_CMYC_all
avian_CMYC_intronB
avian_DCOH_all
avian_DCOH_intron2
avian_DCOH_intron3
avian_EEF_all
avian_EEF_intron4
avian_EEF_intron5
avian_EEF_intron6
avian_EEF_intron7
avian_EGR1_UTR
avian_EGR1_all
avian_EXONS_ALL
avian_FIB4_all
avian_FIB4_intron4
avian_FIB5_all
avian_FIB5_intron5
avian_FIB67_all
avian_FIB67_intron6
avian_FIB67_intron7
avian_FIB_all
avian_HMG_all
avian_HMG_intron2
avian_HMG_intron3
avian_HMG_intron4
avian_HMG_intron5
avian_IRF2_all
avian_IRF2_intron2
avian_MUSK_all
avian_MUSK_intron4
avian_MYO_all
avian_MYO_intron2
avian_NGF_UTR
avian_NGF_all
avian_NT3_all
avian_RHOD_all
avian_RHOD_intron1
avian_RHOD_intron2
avian_RHOD_intron3
avian_SOMA_all
avian_SOMA_intron2
avian_SOMA_intron3
avian_TGF_all
avian_TGF_intron5
avian_TROP_all
avian_TROP_intron6
avian_UTRS_ALL
Aster_ITS
E_hildenbrandia
diatom
linder_asteraceae
linder_helianthus_104
linder_helianthus_actin
mollusk_mito
moody_halorag
morrison_coccidia
rhodstram.18S
rhodstram.intron
theriot_diatom_ssu
theriot_heterokont_ssu
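
To check what a downloaded tarball contains without unpacking it, the archive members can be grouped by their ./R<number>/ directory. The sketch below uses only the Python standard library; the function name `files_by_replicate` is illustrative, and the tarball file names themselves are not specified here, so any path passed in is the caller's assumption.

```python
import re
import tarfile
from collections import defaultdict
from pathlib import PurePosixPath

# Top-level directories are expected to look like R0, R1, ...
_REPLICATE_DIR = re.compile(r"R\d+")


def files_by_replicate(tarball_path):
    """Map each R<number> directory in the tarball to its sorted file names."""
    listing = defaultdict(list)
    # The default mode detects gzip/bzip2/xz compression automatically.
    with tarfile.open(tarball_path) as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            # PurePosixPath drops a leading "./", so "./R0/x" and "R0/x" agree.
            parts = PurePosixPath(member.name).parts
            if len(parts) >= 2 and _REPLICATE_DIR.fullmatch(parts[0]):
                listing[parts[0]].append("/".join(parts[1:]))
    return {name: sorted(files) for name, files in sorted(listing.items())}
```

For a biological dataset the result should have the single key "R0"; for a simulated model condition it should have one key per replicate.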




Largest biological RNA datasets README

There are 16 biological datasets. Each dataset consists of RNA sequences from a variety of markers and taxa.
For each dataset, an associated link to a tarball is listed below. The tarball contains a single directory ./R0/ that contains the following files:

16S.3
16S.B.ALL
16S.B.aa_ag
16S.B.actinobacteria
16S.B.alphaproteobacteria
16S.B.bacilli
16S.B.betaproteobacteria
16S.B.firmicutes
16S.B.gammaproteobacteria
16S.B.proteobacteria
16S.E.ALL
16S.E.streptophyta
16S.T
23S.B.ALL
23S.T
seed.23S.B



Two-phase experiment simulated DNA datasets README

There are 84 model conditions, with either 1000, 500, or 100 taxa, and either long, medium, or short gap length types. Each model condition is named by the string "<number of taxa><gap length type: L for long, M for medium, S for short>-<id number>"; for example, "1000L-1" is a 1000-taxon condition with long gaps and id number 1.
Each model condition has 20 replicate datasets.
For each model condition, an associated link to a tarball is listed below. The tarball contains the 20 replicate datasets for that model condition.
After decompressing a tarball, the directory structure ./R<replicate number>/ will have the following files:

TODO

1000 long: 1000L-1, 1000L-2, 1000L-3, 1000L-4, 1000L-5, 1000L-6, 1000L-7, 1000L-8, 1000L-9, 1000L-10
1000 medium: 1000M-1, 1000M-2, 1000M-3, 1000M-4, 1000M-5, 1000M-6, 1000M-7, 1000M-8
1000 short: 1000S-1, 1000S-2, 1000S-3, 1000S-4, 1000S-5, 1000S-6, 1000S-7, 1000S-8, 1000S-9, 1000S-10
500 long: 500L-1, 500L-2, 500L-3, 500L-4, 500L-5, 500L-6, 500L-7, 500L-8, 500L-9, 500L-10
500 medium: 500M-1, 500M-2, 500M-3, 500M-4, 500M-5, 500M-6, 500M-7, 500M-8
500 short: 500S-1, 500S-2, 500S-3, 500S-4, 500S-5, 500S-6, 500S-7, 500S-8, 500S-9, 500S-10
100 long: 100L-1, 100L-2, 100L-3, 100L-4, 100L-5, 100L-6, 100L-7, 100L-8, 100L-9, 100L-10
100 medium: 100M-1, 100M-2, 100M-3, 100M-4, 100M-5, 100M-6, 100M-7, 100M-8
100 short: 100S-1, 100S-2, 100S-3, 100S-4, 100S-5, 100S-6, 100S-7, 100S-8, 100S-9, 100S-10
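
The count of 84 model conditions follows from the list above: each of the three taxon counts has 10 long, 8 medium, and 10 short conditions, and 3 * (10 + 8 + 10) = 84. As a sketch, the full set of names can be generated and counted in Python; the constant and function names below are illustrative only.

```python
# Number of id numbers per gap length type, read off the list above.
IDS_PER_GAP_TYPE = {"L": 10, "M": 8, "S": 10}
TAXA_COUNTS = (1000, 500, 100)


def two_phase_condition_names():
    """Return every two-phase model condition name, e.g. "1000L-1"."""
    return [
        f"{taxa}{gap}-{id_number}"
        for taxa in TAXA_COUNTS
        for gap, count in IDS_PER_GAP_TYPE.items()
        for id_number in range(1, count + 1)
    ]


names = two_phase_condition_names()
print(len(names))  # 3 * (10 + 8 + 10) = 84
```

A generated list like this can be compared against the set of downloaded tarballs to spot a missing model condition.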