PhD Proposal: Siavash Mir Arabbaygi, GDC 4.516

Contact Name: Lydia Griffith
May 22, 2014 10:00am - 12:00pm

Date: May 22, 2014
Time: 10:00 am
Place: GDC 4.516
Research supervisor:  Tandy Warnow

Novel scalable approaches for multiple sequence alignment and phylogenetic reconstruction


The amount of biological data is exploding, and many biological questions are best addressed by analyzing these large volumes of data. In this dissertation, we focus on two important and related biological problems, Multiple Sequence Alignment (MSA) and phylogenomics, with the goal of analyzing very large datasets. In the context of MSA, we develop new methods for estimating alignments and for efficiently comparing two alignments. Our alignment tool, PASTA, can build very accurate MSAs on tens to hundreds of thousands of sequences within reasonable time frames. Our alignment comparison tool, FastSP, runs in time linear in the number of cells in the alignments being compared, and has been used successfully on alignments of one million sequences. In the context of phylogenomics, we focus on the problem of species tree reconstruction in the presence of discordance between gene trees and species trees due to incomplete lineage sorting. We show in comprehensive simulation studies that existing methods for species tree reconstruction have reduced accuracy under conditions that are likely to arise with genome-scale data. We then propose a new statistical binning approach to address this issue. We show that by binning genes into larger units using a graph-theoretic formulation of their statistical compatibility, we can improve the estimation of gene trees, which leads to dramatic improvements in the estimation of species trees. We apply this method to a dataset of 48 full avian genomes (arguably the first genome-scale study in vertebrates) and present a highly resolved species tree estimated from thousands of genes. Finally, we develop ASTRAL, a new method for estimating species trees by summarizing the input gene trees. We show that ASTRAL is more accurate than MP-EST, the best of the existing methods, and is computationally more efficient.
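
To make the idea of comparing two alignments in time linear in the number of cells concrete, here is a minimal Python sketch. It is not FastSP's implementation. It only illustrates one way to count the residue pairs that a reference alignment and an estimated alignment both place in the same column, without enumerating pairs explicitly. The function names (`column_of_each_residue`, `sp_score`) and the dictionary-of-strings input format are assumptions made for illustration, and the sketch assumes both alignments contain the same sequences with identical ungapped residues.

```python
from collections import Counter


def column_of_each_residue(alignment, gap="-"):
    """Map (sequence name, residue index) to the column holding that residue."""
    position = {}
    for name, row in alignment.items():
        residue = 0
        for col, char in enumerate(row):
            if char != gap:
                position[(name, residue)] = col
                residue += 1
    return position


def sp_score(reference, estimated, gap="-"):
    """Fraction of aligned residue pairs in `reference` also aligned in `estimated`.

    Each cell of each alignment is visited a constant number of times,
    so the running time is linear in the number of cells (hypothetical sketch).
    """
    est_col = column_of_each_residue(estimated, gap)
    n_cols = len(next(iter(reference.values())))
    next_residue = {name: 0 for name in reference}
    shared = 0
    total = 0
    for col in range(n_cols):
        # Group the residues of this reference column by their estimated column.
        groups = Counter()
        k = 0
        for name, row in reference.items():
            if row[col] != gap:
                groups[est_col[(name, next_residue[name])]] += 1
                next_residue[name] += 1
                k += 1
        total += k * (k - 1) // 2  # pairs aligned in the reference column
        # Residues sharing an estimated column form the pairs both alignments agree on.
        shared += sum(g * (g - 1) // 2 for g in groups.values())
    return shared / total if total else 1.0


reference = {"a": "AC-G", "b": "A-TG", "c": "ACTG"}
estimated = {"a": "ACG-", "b": "A-TG", "c": "ACTG"}
score = sp_score(reference, estimated)
```

In this toy example the reference contains 8 aligned residue pairs and the estimated alignment recovers 6 of them, so `score` is 0.75. Grouping by estimated column replaces the quadratic pair enumeration with a single counting pass per column.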