CS 394C: Algorithms for Computational Biology
Course meets MW 3:30-5:00, BUR 208.
Office hours Tu 3-5 PM or by appointment, in ACES 3.428.
The course topic is algorithm design in computational
molecular biology. This year we will focus our study on
genome assembly,
multiple sequence alignment, phylogeny
(evolutionary history) reconstruction,
and
metagenomics.
These problems are
individually enormously computationally intensive -
computational approaches generally involve attempts to
solve NP-hard optimization problems.
Algorithmic
development, firmly grounded in mathematical theory,
is needed by the biological research community.
The goal of the course is to enable the students to do high
quality research (both mathematical and algorithmic) for these
problems.
The mathematical foundations of phylogeny reconstruction
are quite elegant and deep, and so most of our discussion
will be focused there.
We will cover statistical models,
primary optimization problems, and the
mathematical theory needed to predict the performance of
estimation methods under the models. The course will also
present a large spectrum of algorithms (both deterministic
with established theory, and heuristics) that have been developed
for these problems. In addition, we will have several lectures from
guest biologists with whom I collaborate.
No background in biology is assumed for this course.
The mathematics used in developing algorithms for these problems
combines combinatorics
and graph theory (including the theory of chordal graphs), complexity
theory, and probability and statistics. Students with good preparation
in at least one of these areas will be able to begin doing algorithm
development early on. However, abundant research opportunities exist
for those students without significant mathematical training: simulation
studies of existing methods is an important methodology within the field,
and has resulted in many highly influential and important publications
in major scientific journals. Therefore, students of all backgrounds
(including from biology) are welcome.
Undergraduates registered for the course will be given assignments
at a level different from that given to the graduate students. However,
research opportunities exist for both undergraduate and graduate students.
The grading scheme is:
- HW: 20%
- Class participation: 20% (presentation of a paper)
- Project: 30% (research paper or survey article)
-
Final: 30%
Final Project
The course requires a final project of each student.
You are strongly encouraged to do a research project, but
you can also do
a survey paper on some topic relevant to the
course material. In both cases, your project should
be a paper (of about 15 pages) in a format
and style
appropriate for submission to a journal.
Research projects can involve two students, but
survey papers must be done by yourself.
Grades on the final project depend upon the kind of
project you do. For a research paper, your grade will be
30% writing, 40% scientific/algorithmic rigor, and 30%
impact.
If you do a survey paper, the grade will be
30% writing, 30% summary of the
literature you discuss, and 40% commentary (i.e., insight,
critical and thoughtful discussion of the issues that come up).
Schedule:
-
The proposal for your final project is due Monday March 19, 2012.
-
One to three page draft, due April 9.
Give details about what you will do!
Provide a proper bibliography.
Submit as a PDF via email, and also hand in hardcopy.
-
The first draft is due April 25, 2012.
This should be fairly complete -- and well written.
Submit as a PDF via email, and also hand in hardcopy.
-
Final version,
due Friday May 11, 2012 (emailed as a PDF).
Note: all too often, I have found that some students do not understand
that copying text (with, perhaps, slight modifications) amounts to
plagiarism, unless the copied text is put in quotes, and the
original source correctly
attributed.
Please do not plagiarize, even unintentionally.
Click here for more about the final project.
Final Exam
The final exam is
here.
This will be a take home exam, and you are free to consult your notes.
Please return this by email
as a PDF by noon on Monday, May 7, and
also deliver it in hardcopy to Laurie Alvarez
in PAT 141.
Homework
The majority of the homework problems for the course can be found
here; these
are based upon the textbook, available
here.
-
Homework Assignment #1 (due Monday, Jan 30):
Read Chapters 1-5
from the textbook.
Do problems
2.3(2), 2.5(1), 2.5(2), 2.5(3), 2.5(5) from
this.
Undergrads and biology students: you have the option of replacing these five problems with any 5 problems from
Chapter 2.
-
Monday Feb 6: Do problems 5.2(3), 5.2(4), and 5.2(5). Extra credit: 5.2(6) and 5.2(7).
Undergrads and biology students: you have the options of replacing these three problems with
any 3 problems from Chapters 3-5.
Read Chapter 6 from the textbook.
Read papers 44, 45, 48, 61 and 85 from
here, and
this paper (note addition of paper 44).
Read this paper, in preparation for Warren Hunt's lecture on Feb 8.
-
Monday Feb 13:
Do problems 6.1(1)-6.1(3).
Write up a discussion of at least two of the papers
you read for Feb 6, pointing out what they
contribute, how their conclusions are different or the same, and what
questions are left unanswered. Do not copy text from any paper!
Submit this as a PDF by email before class on Monday Feb 13.
Read papers 31, 40, and 44 from here.
Find at least one paper from the last 5 years that is relevant to
these papers, and read it.
Solutions are available here.
-
Monday Feb 20:
Write up a discussion of at least two of the papers
you read for Feb 13, as above.
Submit your write-up as a PDF by
email before class on Feb 20.
-
Monday Feb 27:
(What follows is aimed at helping you find a
research topic for the course project.)
Pick one topic of interest to you, and
select 3-6 papers on this topic.
Read these papers.
Write up a brief (at most one page)
description of the topic that interests you,
and how these papers relate to this topic.
Submit this write-up along with hardcopy
of each paper.
If you are working with someone else in
the class on this course project, you may both
write this up together.
If you have no idea what you'd like to do,
come see me in office hours.
I will respond to this write-up with comments and
suggestions, in some cases directing you to
a different topic or to different papers for the
same project.
-
Wednesday, March 21.
- Read the primer on genome sequence assembly
here.
- This problem is only for biologists: find a paper
on an assembly technique, read it, and prepare a short
(15 minute) presentation to the class about the assembly
method.
-
Consider an 8x8 chessboard.
Recall that a knight can move two steps horizontally followed
by one step vertically, or two steps vertically followed by
one step horizontally.
- Can a knight placed on the top left hand corner visit
every square in the chessboard exactly once, and end up on
the bottom right hand corner?
- Can a knight placed in an arbitrary square travel
around the chessboard,
and visit every square exactly once, and end up back where
it started? (Does that depend upon where it starts?)
-
Consider the same questions for 9x9 and for 10x10 chessboards.
-
Consider the following set of eight 3-mers:
AGT,AAA,ACT,AAC,CTT,GTA,TTT, and TAA.
-
Find the shortest common superstring for these 3-mers.
-
Construct the digraph with 8 vertices, one for each 3-mer (the
Hamiltonian path approach). Find a Hamiltonian
path in the graph. Is this path also Eulerian?
Write the superstring corresponding to this Hamiltonian path.
-
Construct the digraph with one vertex for every
2-mer prefix or suffix (the Eulerian path approach).
Find an Eulerian path in the graph.
Is this path also Hamiltonian? Write
the superstring corresponding to the path.
-
Is there a string that has all possible
2-digit numbers appearing exactly once, with digits
taken from {0,1,2}? If so, find it. Explain your analysis.
-
Is there a string that has all possible 2-digit numbers
appearing exactly once, with digits taken from any
finite alphabet? Explain your reasoning.
-
Find a shortest common superstring of the 8 3-digit
binary
numbers (i.e., the set {000,001,010,011,100,101,110,111}.
-
This problem is only for graduate students
in CS, Math, or ECE.
Consider the SBH (Sequencing by Hybridization) problem, where the input
is the l-mer spectrum of a sequence, and we wish to find the sequence.
Suppose what we have is the union of the k-mer spectra of two sequences, X and Y.
Consider the problem of obtaining two sequences X' and Y' given the
union Z of their spectra (i.e. Spectrum(X',k) union Spectrum(Y',k) = Z).
-
Give an approach to this problem by modifying the Hamiltonian Path
formulation.
-
Give an approach to this problem by modifying the Eulerian Path formulation.
Suppose we have n DNA fragments and we know n and also
how many times each k-mer appears in each string, and hence
the total number of times a given k-mer appears over all
the fragments.
Give an algorithm to find n fragments that are
consistent with this information.
Readings
-
The draft of the textbook is available
here.
This draft is being updated regularly;
please do not circulate.
The material in this textbook is very basic -- please
start with this!
- Many of my papers are available at
this webpage. These might
serve as a good place to enter the literature.
Feel free to browse.
-
Tutorial on Phylogenetic Tree Estimation
by J. Kim and T. Warnow. ISMB 1999.
(This is a very statistically oriented tutorial, covering
Markov models of evolution and the performance of
phylogeny estimation methods under these models.)
-
An overview of phylogeny
reconstruction, by C. Randal Linder
and T. Warnow.
Handbook of Computational Molecular Biology, Chapman and Hall, 2005, Chapter 19-1.
This overview contains much more about the biological data and the issues
involved, from a biologist's perspective.
-
Disk Covering Methods: improving
the accuracy and speed of large-scale phylogenetic analyses,
by T. Warnow,
Handbook of Computational Molecular Biology, Chapman and Hall, 2005,
Chapter 19-2.
-
Network (Reticulate) Evolution: Biology, Models, and Algorithms,
(PDF),
by C. Randal Linder, Bernard M.E. Moret, Luay Nakhleh, and Tandy Warnow.
Tutorial presented at the Pacific Symposium on Biocomputing 2004.
-
For an updated introduction to reticulate evolution and the
computational problems related to this area, see
this book
chapter, written by Luay Nakhleh (former PhD student, now at Rice).
-
Steven Salzberg
taught a course on metagenomics; his
reading list is available
here.
- Some papers that overview challenges in
species tree estimation include the following two:
Bushes in the Tree of Life, by Rokas and Carroll, and
Resolving Difficult Phylogenetic Questions: Why more sequences are not
enough" by Herve Phillippe et al., both published in PLoS Biology.
-
For
gene tree discord due to incomplete
lineage sorting, see
the papers by Degnan and Rosenberg
(PDF)
and by Liu et al.
(PDF).
You might also want to look at
the paper by Than and Nakhleh (PDF), which
shows how to solve a related discrete math problem, MDC,
and how that relates to species tree
estimation under ILS.
(Finally, some other papers on this topic are
available here.
-
For gene duplication and loss,
you should look at
the paper on iGTP, software for
estimating species trees by trying to minimize
the number of duplications and losses
(PDF).
News articles
- The New York Times has some
interesting blogs that concern evolution and biotechnology.
Here's one from September 1, 2009.
Additional resources
-
Bioinformatics: Sequence
Analysis, a course taught by Luay Nakhleh at Rice University. His
course website has slides from his course, a reading list,
and a list of software tools (some for alignment and some for
phylogeny reconstruction), all of which are likely to be helpful
to you.
-
See Evolution
101, a website provided by UC Berkeley for teaching about evolution.
-
For material on Genome Sequence Assembly, see this
Genome
Assembly Primer, provided by the Center for Bioinformatics and
Computational Biology at the University of Maryland.
Introductory papers on assemblers include:
- Genome assembly reborn: recent computational challenges, by
Mihai Pop (webpage)
- Paper by Nagarajan and Pop (PDF)
- Assembly algorithms for next-generation sequencing
data, by Miller, Koren, and Sutton
(PDF).
- Another paper by Steven Salzberg
(PDF)
-
A new assembler called SGA (String Graph Assembler),
here by
Simpson and Durbin, and discussed
on this page.
For background material,
read this paper
by Gene Myers on string graphs,
and this paper
by Simpson and Durbin.
-
See also this review
by Salzberg.
-
For a course on metagenomics, see
Bioinformatics
for Metagenomics, taught by Mihai Pop at the University of Maryland.
Lectures
- Day 1 (Jan 18, 2012)
(PPT)
(PDF)
- Day 2 (Jan 23, 2012)
(PPT)
(PDF)
- Day 3 (Jan 25, 2012)
(PPT)
(PDF)
- Day 4 (Jan 30, 2012)
(PPT)
(PDF)
- Day 5 (Feb 1, 2012)
Siavash Mirarab will give a
talk about
taxon identification of
metagenomic data, and the
phylogenetic placement problem.
See the PSB paper on SEPP for background.
And here is
a talk on the same topic that I gave (slightly modified).
- Day 6 (Feb 6, 2012)
(PPT)
(PDF)
- Feb 8, 2012: Guest Lecturer, Prof. Warren
Hunt, discussing this paper.
- Feb 15, 2012: Introduction to multiple
sequence alignment
(PPT)
(PDF)
- Feb 20, 2012:
Review of homework.
Dicussion of papers.
- Feb 25, 2012:
Chordal graphs.
- Feb 27, 2012 and Feb 27, 2012:
Discussion of proposed projects
- March 5, 2012:
Introduction to genome assembly
(PDF)
(PPT),
(with material taken from this page).
- March 7, 2012:
Species tree estimation from gene trees,
introduction (PDF)
- March 19, 2012:
More on genome assembly
(PDF)
(PPT)
- March 21, 2012:
Phylogeny estimation in historical linguistics
(PDF)
(PPT),
in preparation for Annemarie Verkerk's presentation on March 26.
- March 26, 2012.
Annemarie Verkerk will talk about a
simulation study evaluating phylogeny
estimation methods for historical
linguistics:
Nakhleh et al. 2005; A comparison of phylogenetic reconstruction methods on an IE
dataset
(PDF).
Her talk is here.
- March 28, 2012: Alex Hu and Matthew Tien will give talks (one each)
on assemblers. Alex will talk
about Arachne
and Matthew will talk about Velvet.
- April 2, 2012: Jacob Menashe will talk
about a phylogeny estimation
method, available here.
- April 4, 2012: Bayzid will talk about
a new method he has developed for estimating a species
tree that minimizes the total number
of duplications with respect to a set of
gene trees.
- April 9, 2012:
Nam Nguyen will talk
about techniques for masking alignments
- April 11, Vini will talk
about
this paper,
which compares
genome assemblers.
- April 16,
Andrei Margea will
talk (PDF)
(PPT)
about
NAST, a method for large-scale alignment.
- April 18,
Siavash will
talk
about Fast-SP.
Laura will talk about a new method for
supertree estimation, called MRL.
- April 23, 2012: Hector and Luis will
talk about the
alignment method POA,
available here.
See also
D.S. Parker, C. Lee,
"Pawise Partial Order Alignment as a Graph Problem — Aligning Alignments
Revisited", Journal of Bioinformatics and Computational Biology,
2003
RECOMB 2003,
and C. Lee, D.S. Parker,
"Multiple Partial Order Alignment as a Graph Problem", Journal of
Bioinformatics and Computational Biology,
2003
JBCB 2003.
- April 25, Eric will talk
about this
paper, which gives a way to evaluate
the reliability of a multiple sequence alignment.
Sangman Kim will talk about
compression in the context of phylogeny estimation, in
this paper.