CS 395T: Algorithms for Computational Biology

Instructor: Tandy Warnow
 Location: CBA 4.340
 Time: TuTh 2:003:30
 Office hours for April 28  May 13:
 Wednesday May 30, 45
 Wednesday May 7, 35 (sign up for 15 minute slots). Please
bring your homeworks and exam (and project proposal) to this time slot.
The course topic is algorithm design in computational
molecular biology, but we will focus our study on two
related topics: multiple sequence alignment and phylogeny
(evolutionary history) reconstruction. These problems
are intimately related because the inference of an
evolutionary history generally involves first obtaining
a multiple alignment of the sequences, and then reconstructing
a tree on the aligned sequences (though a simultaneous
approach can also be taken). Both these problems are
individually enormously computationally intensive 
computational approaches generally involve attempts to
solve NPhard optimization problems, and months of
analysis can be used to estimate the phylogeny, without
any guarantee of optimality. Thus, algorithmic
development, firmly grounded in mathematical theory,
is needed by the biological research community.
The goal of the course is to enable the students to do high
quality research (both mathematical and algorithmic) for both
multiple sequence alignment and phylogeny reconstruction.
The mathematical foundations of phylogeny reconstruction
are quite elegant and deep, and so most of our discussion
will be focused there.
We will cover the statistical models used to describe evolution of
molecular sequences, the primary optimization problems, and the
mathematical theory needed to predict the performance of
reconstruction methods under the models. The course will also
present a large spectrum of algorithms (both deterministic
with established theory, and heuristics) that have been developed
for these problems. In addition, we will have several lectures from
guest biologists with whom I collaborate.
No background in biology is assumed for this course.
The mathematics used in developing algorithms for both phylogeny
reconstruction and multiple sequence alignment combines combinatorics
and graph theory (including the theory of chordal graphs), complexity
theory, and probability and statistics. Students with good preparation
in at least one of these areas will be able to begin doing algorithm
development early on. However, abundant research opportunities exist
for those students without significant mathematical training: simulation
studies of existing methods is an important methodology within the field,
and has resulted in many highly influential and important publications
in major scientific journals. Therefore, students of all backgrounds
(including from biology) are welcome.
The grading scheme is:
 Class participation: 5%

Class notes: 10%

HW: 10%

Project: 25%

Midterm: 20%

Final: 30%
Final Project
The course requires a final project of each student:
click here for a list of suggested topics.
Your final project should be a paper of about 15 pages, in a format
and style
appropriate for submission to a journal.
The grade on the final project will be 30% writing, 30% summary of the
literature you discuss, and 40% commentary (i.e., insight,
critical and thoughtful discussion of the issues that come up).
The final project is due at the beginning of the final exam.
Final Exam
The final exam will cover all the material discussed in the course,
and will be closed book.
You are also responsible for four of the papers that were presented
by students:
the paper by Wayne Maddison (presented on March 19)
(PDF),
the paper
on phylogenetic networks by Jin et al. (presented
on March 25)
(PDF),
the paper by Nakhleh et al. (presented
on April 8)
(PDF),
and the paper by
Gusfield et al. (presented on April 24)
(PPT).
You are not responsible for the proofs but are responsible for
knowing the results (but only the ones that
appear in the talks  not everything).
Recommended (not required) reading

An overview of phylogeny
reconstruction, by C. Randall Linder
and T. Warnow.
Handbook of Computational Molecular Biology, Chapman and Hall, 2005, Chapter 191.

Disk Covering Methods: improving
the accuracy and speed of largescale phylogenetic analyses,
by T. Warnow,
Handbook of Computational Molecular Biology, Chapman and Hall, 2005,
Chapter 192.

Network (Reticulate) Evolution: Biology, Models, and Algorithms,
(PDF),
by C. Randal Linder, Bernard M.E. Moret, Luay Nakhleh, and Tand Warnow.
Tutorial presented at the Pacific Symposium on Biocomputing 2004.
Additional resources
Bioinformatics: Sequence
Analysis, a course taught by Luay Nakhleh at Rice University. His
course website has slides from his course, a reading list,
and a list of software tools (some for alignment and some for
phylogeny reconstruction), all of which are likely to be helpful
to you.
See Evolution
101, a website provided by UC Berkeley for teaching about evolution.
Lectures:

Jan 15, 2008: (PPT) (PDF)

Jan 17, 2008: (PPT)
(PDF)

Jan 22, 2008. Part 1:
(PPT) (PDF) and
Part 2:
(PDF).

February 5, 2008. Analyzing sets of trees: consensus methods

February 7, 2008. Serita Nelesen will talk about
her research on efficient techniques for storing
sets of trees compactly.

February 12, 2008. Recap of earlier work.

February 14, 2008. Dynamic programming and global pairwise alignment.

February 19, 2008. Species phylogenies and gene trees: conflict and resolution.

February 21, 2008. Multiple sequence alignment: NPhardness.
Scribe notes by Rahul Suri
(PDF).

February 25, 2008. Beginnings of reticulate evolution.

February 28, 2008. Scribe notes by Sindhu Raghavan
(PDF)

March 4, 2008.
Scribe notes by Bakhtiyar Uddin
(PDF).

March 19, 2008. Rajan presented the paper by Wayne Maddison
(PDF)

March 25, 2008. Sindhu Raghavan presented a paper by
Jin, Nakhleh, Snir, and Tuller,
Inferring phylogenetic networks by the parsimony criterion: a
case study.
(PDF).
Scribe notes by Badri Narayanan Champakesan
(PDF).

March 27, 2008. T.D. Luckett and Rahul Suri
presented
the RIATAHGT paper by Nakhleh et al.
Click here for
the PPT, and
here for the PDF
of their presentation.
Also see (PDF) for the scribe
notes.
 April 1. Jeffrey Matthew presented the paper
by AddarioBerry, Hallet, and Lagergren,
"Towards identifying lateral gene transfer events", PSB 2008.
(PPT)
 April 8. Mahesh Prabhu presented
"Reconstructing reticulate evolution in species  theory and
practice", by Nakhleh et al (JCB 2005)
(PDF)
 April 10, review of previous papers.
 April 15. Bakhtiyar Uddin presents the paper
"Constructing Splits Graphs" by Andreas Dress and Daniel Huson
(PPT).
Click here for Rajan's
scribe notes.
 April 17 and 22. Vikas Taliwal, discussing
paper by Gusfield, Bansal, Bafna, and Song,
"A decomposition theory for phylogenetic networks and
incompatible characters." (PPT)
See here for Mahesh's
scribe notes.
 April 24. Two talks: Badri Narayanan, talking about
"Optimal reconstruction of rootunknown phylogenetic networks with
constrained and structured recombination" by Dan Gusfield
(PPT), and
John Leavitt, talking about
"A regulator of G protein signaling interaction surface linked to effector specificity"
(PPT).
See (PDF) for Peggy Wang's scribe notes.
 April 29. Two talks: Peggy Wang (PPT), discussing
D. Huson and D. Bryant: Application of phylogenetic networks in evolution. MBE 23(2):254267, 2006,
and
Razieh Nakbeh Zaeem (PPT).

May 1, 2008. Guest lecturer: Professor Bill Press (UTCS).
(PPT).
Homework

Homework #2 (due Feb 21): Do
one of the following problems.
(1) Give a dynamic programming algorithm for computing the longest leaftoleaf path in a tree. (Here we define the length of a path to be the number of edges in the path.) This is also called the "topological diameter" of the tree. Analyze the running time.
(2)
Give a dynamic programming algorithm for the longest common subsequence
between two strings. This is the same problem as finding
the minimum number of deletions from the two strings so
that the result is two identical strings. Thus, the longest common
subsequence of AAAAAAAAAA and CATTAGAA is AAAA. Analyze the running time.
Exams

Take home midterm, due Thursday March 20 (in class).
Click here for the PDF file.
Important! There are some slight problems with the midterm.
Please click here for
a PDF file with the corrections, and some additional comments.

In class final exam (May 13), closed book.
Final project
Your final project is due on May 13, the day of the final exam. Please
provide me with a 2 page proposal for the project by
April 29 (list paper(s) you will
read and questions you will discuss). I will give this
proposal back to you with comments by May 1.
Papers on networks to read
Click here
for the "Who's Who of Phylogenetic Networks" and some of these papers.
 W. Maddison, Gene Trees in Species Trees, Systematic Biology 46(3): 523536,
1997.
 L. Nakhleh, T. Warnow, C. R. Linder, and K. St. John.
Reconstruction of phylogenetic networks: theory and practice.
Journal Computational Biology, Vol 12(6): 796811, 2005
(PDF).

L. Nakhleh, D. Ruths, and LS. Wang.
RIATAHGT: a fast and accurate heuristic for reconstructing horizontal gene
transfer. COCOON 2005, Vol. 3595:8493, 2005
(PDF).
 G. Jin, L. Nakhleh, S. Snir, and
T. Tuller. Maximum likelihood of phylogenetic networks. Bioinformatics,
Vol 22(21):26042611, 2006.
 L. Nakhleh and LS. Wang. Phylogenetic networks: properties and
relationship to trees and clusters. TCSB2, Vol. 3680:8299, LNCS, 2005.
 D. Bryant, V. Moulton, and A. Spiller, Consistency of the NeighborNet
Algorithm. AMB, Vol 2(8), 2007.
 G. Jin, L. Nakhleh, S. Snir, T. Tuller. Inferring phylogenetic networks
by the maximum parsimony criterion: a case study. MBE, Vol. 24(1):324337, 2007.
 B. Holland, G. Conner, K. Huber, and V. Moulton. Imputing
supertrees and supernetworks from quartets. Systematic Biology, Vol. 56(1): 5767,
2007.
 O. Gauthier, FJ. Lapointe. Seeing the trees for the network: consensus,
information content, and superphylogenies.
Systematic Biology, Vol 56(2):345355, 2007.
 M. Bordewich and C. Semple. Computing the hybridization number of two
phylogenetic trees is fixedparameter tractable. TCBB, Vol 4:458466, 2007.
 D. Gusfield, V. Bansal, V. Bafna, and Y. Song.
A decomposition theory for phylogenetic networks and incompatible characters.
Journal Computational Biology, Vol 14(10):12471272, 2007.
 D. Gusfield. Optimal, efficient reconstruction of rootunknown
phylogenetic networks with constrained and structured recombination.
JCSS, 2005 Special Issue on Computational Biology, Vol. 70, p. 381398.
(PDF)
 Y. Song, Z. Ding, D. Gusfield, C. Langley, Y. Wu.
Algorithms to distinguish the role of geneconversion from singlecrossover
recombination in the derivation of SNP sequences in populations.
J. Computational Biology, Dec. 2007, Vol. 14, No. 10: 12731286.
 Y. Song, Y. Wu and D. Gusfield.
Efficient computation of close lower and upper bounds on the minimum number
of recombinations in biological sequence evolution.
Proceedings of ISMB 2005.
 M. Hallet, J. Lagergren, and A. Tofigh.
Simultaneous identification of duplications and lateral transfers.
in RECOMB 2004, pp. 347356.
(PDF)
 M. Hallett and J. Lagergren. Efficient algorithms for
laeral gene transfers problems.
Submitted to SICOMP.
(PDF)
 M. Hallett and J. Lagergren. Efficient algorithms for
lateral gene transfers problems.
RECOMB 2001, pp. 141148.
 L. AddarioBerry, M. Hallet and J. Lagergren.
Towards identifying lateral gene transfer events.
PSB 2008.
(PDF)
 D. Huson. Split networks and reticulate networks.
In Reconstructing Evolution, new mathematical and computational advances,
Edited by O. Gascuel and M. Steel,
pp. 247276, Oxford Univ. Press. 2007.

D. Huson and T. Kloepper. Beyond Galled Trees  decomposition
and computation of galled networks.
RECOMB 2007, Vol. 4453:211227.

D. Huson and D. Bryant. Application of phylogenetic networks
in evolution. MBE 23(2):254267, 2006.
 A. Dress and D. Huson. Constructing splits graphs.
TCBB Vol 1(3):109115, 2004.