Topics: This graduate course covers computational issues in molecular biology and historical linguistics. The biological topics we will cover include genome assembly, multiple sequence alignment, the estimation of evolutionary trees and networks, and metagenomics. In biology, we will computational problems for estimating evolutionary histories of families of natural languages (Indo-European, Dravidian, Finno-Ugric, etc.). These problems are individually enormously computationally intensive - computational approaches generally involve attempts to solve NP-hard optimization problems. Algorithmic development is needed for improved scientific analyses. We will cover statistical models, primary optimization problems, and the mathematical theory needed to predict the performance of estimation methods under the models. The course will also present a large spectrum of algorithms (both deterministic with established theory, and heuristics) that have been developed for these problems. In addition, we will have lectures from guest biologists.
Background Expected for Students: The course is designed for students in computer science or mathematics, and no background in biology or linguistics is assumed or even needed. However, if you are a student from another disciplines (such as biology) and are interested in taking the course, please see me.
Research Opportunities: The goal of the course is to enable the students to do high quality research (both mathematical and algorithmic) for these problems. Furthermore, abundant research opportunities are available, including development of new heuristics for (typically NP-hard) discrete optimization problems, analyses of biological or linguistic data using new methods, simulation studies of existing methods, etc. Phylogeny and multiple sequence alignment estimation has several easily formulated and accessible research challenges, which new students could make progress on - without needing to know any biology or linguistics. Therefore, much of the course will be focused there.
Grading Scheme:
Homework assignments will be of three types: pen and paper (doing calculations, proving theorems, etc.), programming (developing, implementing, and/or testing methods for computational biology or computational historical linguistics problems), and discussing published papers.
The course requires a final project of each student. You are strongly encouraged to do a research project, but you can also do a survey paper on some topic relevant to the course material. In both cases, your project should be a paper (of about 15 pages) in a format and style appropriate for submission to a journal. Research projects can involve two students, but survey papers must be done by yourself.
Grades on the final project depend upon the kind of project you do. For a research paper, your grade will be 30% writing, 40% scientific/algorithmic rigor, and 30% impact. If you do a survey paper, the grade will be 30% writing, 30% summary of the literature you discuss, and 40% commentary (i.e., insight, critical and thoughtful discussion of the issues that come up).
Schedule:
Note: all too often, I have found that some students do not understand that copying text (with, perhaps, slight modifications) amounts to plagiarism, unless the copied text is put in quotes, and the original source correctly attributed. Please do not plagiarize, even unintentionally.
Click here for more about the final project.
The final exam will be comprehensive. If the entire class agrees to it, the final exam could be a take-home exam.