CS 394C: Algorithms for Computational Biology
-
TuTh 3:30-5:00 in Taylor 3.144
-
Instructor: Tandy Warnow
- Office hours TuTh 1:00-2:00 in ACES 3.428.
-
TA: Kevin Liu, kliu@cs.utexas.edu. Office hours by appointment only.
The course topic is algorithm design in computational
molecular biology. This year we will focus our study on
multiple sequence alignment, phylogeny
(evolutionary history) reconstruction,
metagenomics, and historical linguistics.
These problems are
individually enormously computationally intensive -
computational approaches generally involve attempts to
solve NP-hard optimization problems, and months of
analysis can be used to estimate the phylogeny, without
any guarantee of optimality. Thus, algorithmic
development, firmly grounded in mathematical theory,
is needed by the biological research community.
The goal of the course is to enable the students to do high
quality research (both mathematical and algorithmic) for these
problems.
The mathematical foundations of phylogeny reconstruction
are quite elegant and deep, and so most of our discussion
will be focused there.
We will cover the statistical models used to describe evolution of
molecular sequences, the primary optimization problems, and the
mathematical theory needed to predict the performance of
reconstruction methods under the models. The course will also
present a large spectrum of algorithms (both deterministic
with established theory, and heuristics) that have been developed
for these problems. In addition, we will have several lectures from
guest biologists with whom I collaborate.
No background in biology is assumed for this course.
The mathematics used in developing algorithms for both phylogeny
reconstruction and multiple sequence alignment combines combinatorics
and graph theory (including the theory of chordal graphs), complexity
theory, and probability and statistics. Students with good preparation
in at least one of these areas will be able to begin doing algorithm
development early on. However, abundant research opportunities exist
for those students without significant mathematical training: simulation
studies of existing methods is an important methodology within the field,
and has resulted in many highly influential and important publications
in major scientific journals. Therefore, students of all backgrounds
(including from biology) are welcome.
The grading scheme is:
- HW: 50%
- Class participation: 10%
- Project: 15%
-
Final: 25%
Class presentations
-
August 27, 2009 (PPT)
-
September 1, 2009 (PPT).
-
September 3, 2009 (PPT).
Today we introduce statistical issues of phylogeny estimation, and
begin a discussion about maximum likelihood and Bayesian methods.
-
September 8, 2009 (PDF, part 1)
and
(PPT, part 2).
We continue with more about statistical issues, covering
sequence length requirements of distance-based methods.
-
September 10, 2009. Theoretical results about consensus and
supertree methods.
(PPT)
-
September 15, 2009.
We will read and discuss papers about the
performance of phylogeny reconstruction methods on simulated data.
Read this paper (about
quartet-based phylogeny estimation methods)
and this paper
(about sequence length requirements)
-
September 17, 2009.
SupreFine - a new supertree method
(Guest speaker: Dr. Shel Swenson.)
-
September 22, 2009.
Divide-and-conquer methods in phylogenetics: introduction.
(PPT)
-
September 24, 2009.
Intro to multiple sequence alignments.
(PPT)
-
September 29, 2009.
SATe: Simultaneous estimation of trees and alignments
(PDF).
(Guest speaker: Kevin Liu.)
-
October 1, and 6, 2009.
The "Next SATe" algorithm.
-
October 6, 2009.
Optimizing treelength: a good approach
to phylogeny estimation?
(PPT),
based upon the 2009 TCBB paper "Barking up the
wrong treelength" by Liu et al.
-
October 8, 2009.
Guest speaker: Shel Swenson.
Topic: Using Quartets MaxCut to improve supertree estimation.
-
October 13, 2009.
Discussion regarding proposed projects.
-
October 15, 2009.
Introduction to chordal graphs (Shel Swenson).
-
October 20, 2009.
Disk-covering methods, absolute convergence,
and speeding up heuristics for hard optimization problems.
- October 22, 2009.
Reticulate evolution, and gene tree/species tree
discord (preparation for Luay Nakhleh's presentation).
-
October 26, 2009.
9-10 AM. Luay Nakhleh (Rice University)
will give a talk in
ACES 2.402,
"Computational techniques for inferring
phylogentic relationships using multiple loci."
- November 10, 2009. 3:30-5.
Practice talk for
Serita Nelesen's PhD defense, in ACES 5.444.
-
November 12, 2009. Discussion of Serita's work.
-
November 17, 2009.
Young-suk Lee will present his undergraduate thesis work.
-
November 19, 2009.
Introduction to whole genome phylogeny reconstruction (PPT).
Discussion of term projects.
-
November 24, 2009. No class!
-
November 26-28, holidays.
-
December 1, 2009
Review.
-
December 3, 2009.
Review.
Final Project
The course requires a final project of each student:
click here for a list of suggested topics.
Your final project should be a paper of about 15 pages, in a format
and style
appropriate for submission to a journal.
The grade on the final project will be 30% writing, 30% summary of the
literature you discuss, and 40% commentary (i.e., insight,
critical and thoughtful discussion of the issues that come up).
The final project is due at the beginning of the final exam.
Note: all too often, I have found that some students do not understand
that copying text (with, perhaps, slight modifications) amounts to
plagiarism, unless the copied text is put in quotes, and the
original source correctly
attributed.
Please do not plagiarize, even unintentionally.
Final Exam
The final exam will cover all the material discussed in the course,
and will be closed book.
Homework
-
The majority of the homework problems for the course can be found
here; these
are based upon the textbook, available
here.
-
Homework #1
(due September 10) is
on the last slide of the powerpoint presentation for
September 1, 2009.
-
Homework #2 (due September 15):
Read this paper (about
quartet-based phylogeny estimation methods)
and this paper
(about sequence length requirements). Hand in a one page
summary of one of these papers, and include questions for the class (in your write-up).
Your questions can address the methodology in the simulation study,
the conclusions (you are free to disagree!), or suggestions for
further research.
-
Homework #3 (due September 29):
- Read the Liu et al. paper from Science 2009. This will be presented
by the first author (Kevin Liu, your T.A.) on Tuesday, September 29. Read the
paper in advance, and come with a list of questions. (Hand in the questions after
class.)
The paper can be downloaded by a link (see paper #89) at
this page.
- Do problems 5.2(3), 5.2(4), 5.2(5). Extra credit: 5.2(6)
- Homework #4 (due Oct 1):
Read any one of the following papers. Some are
theoretical, some are experimental (using simulation to
evaluate methods), and some do both.
Pick just one, and write up a one page summary, and provide
questions (just like the previous homework).
-
Huson et al.,
"Disk-Covering, a fast converging method for phylogenetic tree reconstruction."
Journal of Computational Biology (special issue for RECOMB 1999)
(PDF).
-
T. Warnow, B.M.E. Moret, and K. St. John. 2001. "Absolute Convergence: True Trees From Short Sequences."
ACM-SIAM Symposium on Discrete Algorithms (SODA) 2001.
(PS).
-
Nakhleh et al., 2001(a). "Designing fast converging phylogenetic methods."
9th Int'l Conf. on Intelligent Systems for Molecular Biology (ISMB 2001), Copenhagen,
in Bioinformatics 17, Suppl. 1, (2001), pp. S190-S198.
(PDF).
-
Nakhleh et al., 2001(b).
"The performance of phylogenetic methods on trees of bounded diameter."
Proceedings, The First International Workshop on Algorithms in Bioinformatics (WABI).
(PDF).
- Homework #5 (due Oct 6): Do problems 6.1(1), 6.1(2), and 6.1(3).
Skim the chapter on designing disk-covering
methods available here.
- Homework #6 (due Oct 8): Read
"Improving supertree methods using Quartets MaxCut" by Swenson
et al.
(PDF)
Write a summary and questions.
- Homework #7 (due Oct 13):
Bring in a first draft of your planned final project will
cover.
You are all encouraged to discuss this with me before the
homework is due. The
point of this homework is to get you started, and to ensure
that you have identified a topic that is feasible.
Whether you choose to write a paper surveying some
papers (summarizing, critiquing, comparing) or doing
a research project involving programming,
you will be asked to
identify one or more papers, read them quickly and
briefly summarize them. The topic and project you propose
will be reviewed by me, and you will get feedback. It is
likely that what you propose here will not be what you
end up doing, so don't be too worried about getting this
"right" the first time. (Also, whatever topic you pick,
the paper(s) you pick should not all be based upon
papers we've already read, and should include at least
one paper that doesn't include me as a co-author!)
-
If you wish
to read and discuss one or more papers on a topic: suggest a topic,
and identify two to five papers that might be relevant.
Download all the papers, and skim through them. Give very
brief (one paragraph) summaries, and
roughly say what your project will
discuss.
-
If you prefer to do a project that
involves programming:
suggest a topic, identify a relevant paper,
download it, and write up a summary.
Specify what your
project will do (at a minimum).
- Homework #8 (due October 22): Read the first of
the following three papers, and skim through the other two:
- A paper by Luay Nakhleh at
this
location
- A 1997 paper by Wayne Maddison
that set the ground for Luay's paper,
at this
location,
and
-
A paper by Degnan and
Rosenberg, available
at here.
Write a summary of each paper (detailed for the first, and brief for the other two),
and bring questions for the first paper. Luay will visit us and talk on Monday,
October 26, so your questions will be addressed to him on that day.
- Homework #9 (due November 5).
Read papers [1]-[3] given at
Dannie Durand's Notung page.
Read also the paper on DupTree, given at
DupTree webpage.
Summarize each paper, and
compare to the problems studied by Luay Nakhleh.
Prepare questions, focusing on
possible research directions.
Readings
-
The draft of the textbook is available
here.
This draft is being updated regularly;
please do not circulate.
The material in this textbook is very basic -- please
start with this!
-
Tutorial on Phylogenetic Tree Estimation
by J. Kim and T. Warnow. ISMB 1999.
(This is a very statistically oriented tutorial, covering
Markov models of evolution and the performance of
phylogeny estimation methods under these models.)
-
An overview of phylogeny
reconstruction, by C. Randall Linder
and T. Warnow.
Handbook of Computational Molecular Biology, Chapman and Hall, 2005, Chapter 19-1.
This overview contains much more about the biological data and the issues
involved, from a biologist's perspective.
-
Disk Covering Methods: improving
the accuracy and speed of large-scale phylogenetic analyses,
by T. Warnow,
Handbook of Computational Molecular Biology, Chapman and Hall, 2005,
Chapter 19-2.
-
Network (Reticulate) Evolution: Biology, Models, and Algorithms,
(PDF),
by C. Randal Linder, Bernard M.E. Moret, Luay Nakhleh, and Tand Warnow.
Tutorial presented at the Pacific Symposium on Biocomputing 2004.
-
For an updated introduction to reticulate evolution and the
computational problems related to this area, see
this book
chapter, written by Luay Nakhleh (former PhD student, now at Rice).
-
Steven Salzberg (University of Maryland) is
teaching a course on metagenomics this semester. I have
copied his reading list, and made it available
here.
News articles
- The New York Times has some
interesting blogs that concern evolution and biotechnology.
Here's one from September 1, 2009.
Additional resources
-
Bioinformatics: Sequence
Analysis, a course taught by Luay Nakhleh at Rice University. His
course website has slides from his course, a reading list,
and a list of software tools (some for alignment and some for
phylogeny reconstruction), all of which are likely to be helpful
to you.
-
See Evolution
101, a website provided by UC Berkeley for teaching about evolution.