CS 394C: Algorithms for Computational Biology
Instructor: Tandy Warnow, tandy@cs.utexas.edu
Professor Warnow's office hours: M 12:301:30 PM, GDC 4.510,
and Wednesday December 11 from 35.
Teaching Assistant: Eshan Chattopadhyay
 Email: eshan.c@gmail.com.
 Office hours Friday 1112, GDC 4.440.
Course meets MW 23:30 PM, GDC 2.502
Final exam Thursday December 12, 9 AM to noon, GDC 2.502.
Conference: Tuesday December 17, 14 PM in
ACES 2.402.
Topics: This graduate course covers computational
issues in
molecular biology and historical
linguistics. The biological topics we will
cover include
genome assembly,
multiple sequence alignment, the estimation of
evolutionary trees and networks,
and
metagenomics.
In biology, we will computational problems
for estimating evolutionary histories of
families of
natural languages (IndoEuropean, Dravidian, FinnoUgric, etc.).
These problems are
individually enormously computationally intensive 
computational approaches generally involve attempts to
solve NPhard optimization problems.
Algorithmic development
is needed for improved scientific analyses.
We will cover statistical models,
primary optimization problems, and the
mathematical theory needed to predict the performance of
estimation methods under the models. The course will also
present a large spectrum of algorithms (both deterministic
with established theory, and heuristics) that have been developed
for these problems. In addition, we will have lectures from
guest biologists.
Background Expected for Students: The course is
designed for students in
computer science or mathematics, and no
background in biology or linguistics is assumed
or even needed.
However, if you are a student from another disciplines (such as
biology) and are interested in taking the course, please
see me.
Research Opportunities:
The goal of the course is to enable the students to do high
quality research (both mathematical and algorithmic) for these
problems.
Furthermore, abundant research opportunities are available, including
development of new heuristics for (typically NPhard) discrete
optimization problems,
analyses of biological or linguistic data using
new methods, simulation
studies of existing methods, etc.
Phylogeny and multiple
sequence alignment estimation has several easily formulated and accessible
research challenges, which new students could make progress
on  without needing to know any biology or linguistics.
Therefore, much of the course
will be focused there.
Syllabus
 Basics: Hidden Markov models, statistical inference,
and computational complexity: 1 week
 Phylogeny estimation methods and models: 3 weeks
 Multiple sequence alignment methods
and models: 2 weeks
 Phylogenomics: 2 weeks
 Genome Assembly: 1 week
 Metagenomics: 1 week
 Historical Linguistics: 1.5 weeks
Grading Scheme:
 HW: 40%
 Class participation: 10% (presentation of a paper)
 Project: 30% (research paper or survey article)

Final: 20%
Homework
All University policies relating to plagiarism (e.g., copying
text with minor modifications and/or
paraphrasing) apply to homework.
In addition,
you must submit your own solutions to homework.
While it is fine to discuss homework assignments with
other students in the class, you must write up your
solutions yourself, and so there should
not be any sharing of any written
solutions between you. Furthermore, if you do
discuss a homework assignment with someone else,
you must
indicate this (and identify the other student)
on your submitted homework.
See below for more information about plagiarism.
Homework assignments
will be of three types: pen and paper (doing calculations,
proving theorems, etc.), programming (developing, implementing,
and/or testing methods for computational biology
or computational historical linguistics
problems), and discussing
published papers.
The majority of the homeworks related to phylogeny
estimation are available
here,
and are based on the 394C
course notes.
Biology students: if you find a problem requires
computational background that you do not have, please contact
me. If I agree, you will be welcome
to substitute reading and critiquing a recent publication
related to the same subject matter; these critiques will need
to be 2 pages long, and will be graded for content and writing.
Final Project
The course requires a final project of each student.
You are strongly encouraged to do a research project, but
you can also do
a survey paper on some topic relevant to the
course material. In both cases, your project should
be a paper (of about 15 pages) in a format
and style
appropriate for submission to a journal.
Research projects can involve two students, but
survey papers must be done by yourself.
Grades on the final project depend upon the kind of
project you do. For a research paper, your grade will be
30% writing, 40% scientific/algorithmic rigor, and 30%
impact.
If you do a survey paper, the grade will be
30% writing, 30% summary of the
literature you discuss, and 40% commentary (i.e., insight,
critical and thoughtful discussion of the issues that come up).
Schedule:
Plagiarism
Note: all too often, I have found that some students do not understand
that copying text (with, perhaps, slight modifications) amounts to
plagiarism, unless the copied text is put in quotes, and the
original source is correctly
attributed.
Paraphrasing is also potentially a problem and.
Please do not copy text (with minor
editing) or paraphrase without
attribution; these
both amount to plagiarism.
The University of Texas imposes very serious
penalties.
Furthermore, plagiarism in
publications is even more serious.
See this webpage for more information on plagiarism.
Final Exam
The final exam will be comprehensive, and will take place
on Thursday December 12 from 9 AM to noon.
This will be a closed book exam, and you
will not need a calculator.
Class Presentations
 August 28, 2013.
Introduction to phylogeny estimation. (PDF)
 September 4, 2013. Representations of
trees using subtrees, distances, clades, and bipartitions.
(PDF)
(PPT)
 September 911, 2013.
Basic tree construction methods (maximum parsimony,
maximum compatibility, and supertree methods).
(PDF)
(PPT)
 September 16, 2013. Phylogenetic
inference under stochastic models of
evolution
(PDF)
(PPT)
 September 18, 2013. Introduction
to computational historical linguistics.
(PDF)
(PPT)
(Time permitting, I will cover some of the following
lecture as well.)
Evaluating linguistic phylogeny estimation methods on
IndoEuropean data.
(PDF)
(PPT)
 September 23, 2013. Guest Lecture by
Professor Danny Law of UTAustin Linguistics;
talk, and
paper.
 September 25, 2013.
Siavash Mirarab will give a talk about three methods
(SEPP, TIPP, and UPP) that use multiple HMMs in order
to improve phylogenetic analysis and/or alignment estimation.
(PDF).
 September 30, 2013.
Determining whether maximum parsimony
is statistically consistent on a given model tree,
and similar statistical issues.
Homework review.
Review of class material so far.
If time permits, we will
have a discussion of possible research topics for
course projects.
 October 2, 2013.
Multiple sequence alignments
and estimating species trees
from gene trees.
(PPTX)
(PDF)
Then, discussion of possible research topics.
 October 7, 2013.
Discussion of Splids paper.
 October 9, 2013.
Review of the phylogenetic estimation pipeline, and
discussion of "posttree analyses".
 October 14, 2013.
Discussion of two papers:
"Quartet maxcut" by Snir and Rao, Molecular
Phylogenetics and Evolution 2012, 62(1): 18, and
"Quartetbased phylogenetic inference: improvements and limits"
by Ranwez and Gascuel, Molecular Biology and Evolution 2001,
18(6):11031016.
 October 16, 2013.
Introduction to genome assembly.
(PDF)
(PPT)
 October 21, 2013. Discussion about
genome assembly challenges.
 October 23, 2013. Absolute
fast converging phylogeny estimation methods.
(PDF)
(PPT)
 October 28, 2013
Discussion of genome assembly papers.
 October 30, 2013.
Dakota will give a talk on
sequencing machines.
See David's PDF about
Velvet.
 November 4, 2013.
Ben Selfridge will talk about
the ALLPATHS assembler; please see
this paper and its
supplementary materials,
and also this paper.
 November 6, 2013.
Discussion of homework problems.
 November 11, 2013. More about assemblers.
Theo's talk about CABOG
 November 13, 2013. Comparison of assemblers  the
GAGE study by Steven Salzberg (discussion led by David)
(PDF)
 November 18, 2013. SuperFine and divideandconquer methods
(PPTX)
 November 20, 2013. Lineage sorting and estimating
trees from incongruent gene trees.
(PDF)
(PPTX)
 November 25, 2013. Bayzid will talk about
"statistical binning" for improving coalescentbased
species tree estimation methods.
 November 27, 2013. Nam Nguyen will discuss
TIPP, taxonomic identification using phylogenetic placement.
 December 2, 2013. Kevin Liu (guest lecturer
from Rice University) will give a talk on phylogenomics.
 December 4, 2013. Review of the course (or student
presentations).
Reading Materials
The following are potentially useful to you
for background material.
 Course notes: The course does not have
any published textbook; however, I have prepared
course notes for the mathematical and
computational basics
of phylogeny estimation.
(No biology required at all!)
These are incomplete, and I'll be adding to them over the
semester. Please let me know if you find any errors, or have
suggestions for improvements!

Survey Article on largescale estimation problems:
I recently wrote a survey on largescale estimation of alignments
and phylogenies, which will appear in a book published
by Springer, celebrating David
Sankoff. You can get the preprint (but it's not in final form)
here.

Disk Covering Methods: improving the accuracy and speed of
largescale
phylogenetic
analyses:
This book chapter written several years ago,
but it provides graphtheoretic techniques for doing
divideandconquer to "boost" phylogenetic estimation
methods. Subsequent algorithmic developments along
these lines include DACTAL, SuperFine, and
SATe, and papers on these techniques
are available on my
webpage.
 Tutorial on Phylogenetic Tree Estimation
by Junhyong Kim and Tandy Warnow, given at ISMB (Intelligent Systems
for Molecular Biology) in 1999. This provides material about
estimating phylogenies under stochastic models of evolution,
covering material such as statistical consistency and
sequence length requirements for estimation methods.
You can get the paper here.

Undergraduate bioinformatics algorithms textbook:
Neil Jones and Pavel Pevzner wrote a nice textbook for
undergraduates, basically introducing algorithm design
in the context of some bioinformatics problems (genome
assembly, genome rearrangements, etc.). This is not
all that close to material we'll focus on in this
course, but it's still a lovely text.

Classic book on hidden Markov models:
Durbin, Eddy, Krogh, and Mitchison
wrote a book which has become a classic
in the bioinformatics field, focusing on
hidden Markov models. It's called
"Biological sequence analysis: probabilistic
models of proteins and nucleic acids".

For material on Genome Sequence Assembly, see this
Genome
Assembly Primer, provided by the Center for Bioinformatics and
Computational Biology at the University of Maryland.
Introductory papers on assemblers include:
 Genome assembly reborn: recent computational challenges, by
Mihai Pop (webpage)
 Paper by Nagarajan and Pop (PDF)
 Assembly algorithms for nextgeneration sequencing
data, by Miller, Koren, and Sutton
(PDF).
 Another paper by Steven Salzberg
(PDF)

A new assembler called SGA (String Graph Assembler),
here by
Simpson and Durbin, and discussed
on this page.
For background material,
read this paper
by Gene Myers on string graphs.

See also this review
by Salzberg.
 A comparison of Illumina and Roche 454
technologies is available
here.
 For material on Metagenomics, see
A Primer on Metagenomics by Wooley et al,
A
Bioinformatician's Guide to Metagenomics",
and From genomics to metagenomics by Desai et al.
Homework Assignments
Homework #1 (due September 4, 2013). Read
Chapter 2 in the
course notes.
Do all homework problems that begin with
2.1 or 2.2.
Homework #2 (due September 11, 2013).
See the last slide for the class presentation on
September 4, 2013.
Read Chapter 3 in the course notes.
Homework #3 (due September 18, 2013).
 See the last slide for the class presentation
on September 4, 2013 for problems assigned from
the text.

Read Chapters 4 and 5 in the course notes.

Pick a recent paper from the scientific literature
(since 2009) that is about supertree methods, read it,
critique it, and write a discussion of the paper.
The paper can be about a new method, a comparison of
existing methods on data, or the use of a method on
a dataset (e.g., a biological data analysis). Your
critique should be 12 pages, and should describe what
is in the paper, and provide some comments of your own.
For example, do you agree with the conclusions in the
paper? Do you think the methodology is sound? Do you
have any concerns?
(Note: you may not have enough background to critique the
paper
particularly deeply  don't worry about this. I'm
not going to grade this critique  but I will
give you feedback about it.)
Make sure to use your own words for this 
do not copy anything from any other paper, unless you
specifically quote it and give proper attribution.
Homework #4, due September 23, 2013.
 Do the following problems from the text: 5.2(8), 6.1(1)6.1(3),
6.1(5).
 Read paper #110 in my list of papers, and submit 12 page
summary. Prepare at least one question about the paper, which
could be a critique, a suggestion for a followup study, or
something else.
Homework #5, due September 30, 2013.
 Do the following problems
from the text: 3.1(5), 4.2(1), 4.3(2), 4.4(4), 4.6(3), 4.7(1),
4.8(1), 4.8(2), 4.8(5).
(Biologists: you are welcome to replace up to 3 problems
in this list with a critical review of some published paper
that is relevant to phylogeny or alignment estimation.
If you wish to do this, please check with me
about the choice of paper.)
Homework #6, due October 2, 2013.
 Revise your critique of the supertree paper you read,
and resubmit  with complete bibliographic information, and also
provide a hardcopy of the paper.
I will distribute your critique to the rest of the class, and we will
then have a vigorous and interesting group discussion about
what these papers showed.
(If you prefer to submit a new critique of a different super paper,
that's fine too  just make sure to provide full bibliographic
information and hardcopy.)
 Read ``Splitinducing indels in phylogenomic
analysis" by Donath and Stadler,
available at
http://www.bioinformatik.unileipzig.de/Publications/PREPRINTS/11001.pdf.
(This seems to not have been published.)
Critique it.
 Find another paper about phylogeny or alignment
estimation and critique it.
(Be careful to include all the bibliometric information
about the paper in your writeup, and also provide a hardcopy
of the paper with your critique.)
Homework #7, due October 8 (Tuesday) by noon, via email.
Submit a PDF of a critique of any paper (other than
the splids paper  which is the next homework assignment) you have read, including
full bibliographical information. Make sure to suggest some kind
of followup research that could be done. Your critique will
be posted on the class webpage for download by students in the
course, so that we can have a discussion of these papers.
These will be graded, and
the grade will count towards your homework grade.
See below for the critiques to download.
Homework #8, due October 14.
These critiques will be graded, and the grade will count
towards your homework grade.
You are welcome to discuss your critiques with other
students, but you must write your critique yourself.
NOTE: Be careful about
your writing  do not copy text without complete
attribution (put the copied text in quotes). Avoid
anything that could be considered plagiarism.

Revise your critique of the splids paper. Propose an
explicit experiment, addressing a question that you
consider insufficiently answered in the paper.

Read two papers (related to phylogeny
or alignment estimation) that address a similar question, and
compare and contrast them.
One of the papers you select
can be a paper you have already read and
critiqued, but at least one must be new.
Identify a question that is unanswered (or insufficiently
answered) in these papers, and suggest an experiment
that might help resolve the question.
(Provide hardcopies of both papers.)
 Read all posted critiques for all papers. Note what
you think makes a good critique!

Also read two papers: Quartet maxcut by Snir and Rao, MPE
62(1): 18, 2012,
and Quartetbased phylogenetic inference: improvements and
limits, MBE 2001, 18(6):11031016.
Come prepared to discuss these two papers in depth. I will
call on everyone in the class!
Homework #9, due October 21.
Read this paper,
and submit a 12 page summary of it.
Homework #10, due October 28.
Read the following papers on genome assembly (from
the list above):
Based on these papers, pick one of the
following assemblers 
SGA, Velvet, MSRCA, AllpathsLG, SOAPDenovo,
Newbler, and CABOG 
and
prepare a
15 minute presentation of the
method, as well
as a 23 page discussion of what you understand
about the assembler.
You should make sure to read at least one paper about
the assembler (written by the authors of the method).
I may ask you to present the paper
and your discussion of the paper
in class.
Homework #11, due November 4.
(Biology students, you are welcome to
read and critique, in depth, a paper published
in the last 3 years about genome assembly instead
of doing these problems.)

Consider the following set of eight 3mers:
AGT,AAA,ACT,AAC,CTT,GTA,TTT, and TAA.

Find the shortest common superstring for these 3mers.

Construct the digraph with 8 vertices, one for each 3mer (the
Hamiltonian path approach). Find a Hamiltonian
path in the graph. Is this path also Eulerian?
Write the superstring corresponding to this Hamiltonian path.

Construct the digraph with one vertex for every
2mer prefix or suffix (the Eulerian path approach).
Find an Eulerian path in the graph.
Is this path also Hamiltonian? Write
the superstring corresponding to the path.

Is there a string that has all possible
2digit numbers appearing exactly once, with digits
taken from {0,1,2}? If so, find it. Explain your analysis.

Is there a string that has all possible 2digit numbers
appearing exactly once, with digits taken from any
finite alphabet? Explain your reasoning.

Find a shortest common superstring of the 8 3digit
binary
numbers (i.e., the set {000,001,010,011,100,101,110,111}.

This problem is only for graduate students
in CS, Math, or ECE.
Consider the SBH (Sequencing by Hybridization) problem, where the input
is the lmer spectrum of a sequence, and we wish to find the sequence.
Suppose what we have is the union of the kmer spectra of two sequences, X and Y.
Consider the problem of obtaining two sequences X' and Y' given the
union Z of their spectra (i.e. Spectrum(X',k) union Spectrum(Y',k) = Z).

Give an approach to this problem by modifying the Hamiltonian Path
formulation.

Give an approach to this problem by modifying the Eulerian Path formulation.
Suppose we have n DNA fragments and we know n and also
how many times each kmer appears in each string, and hence
the total number of times a given kmer appears over all
the fragments.
Give an algorithm to find n fragments that are
consistent with this information.
Homework #12. Due November 6.
Briefly answer all questions in the first set
of questions (140), and provide a longer
answer to at least one of the second
set of questions, from
here.
Note  to answer these questions, you may well have to do a
literature search!
Homework #13.
Due November 11.
Read the following papers, and
then follow up by finding another paper
related to genome assemby, and reading it. Write up
a discussion about the other paper you read.
Be prepared for a class discussion.
Homework #14. Due November 25. Final projects due
(as hardcopy in class, not emailed as PDF).
Homework #15. Due November 27. Read
A Primer on Metagenomics by Wooley et al,
A
Bioinformatician's Guide to Metagenomics",
and From genomics to metagenomics by Desai et al.
Write a 23 page
critique, including a
summary of these papers and a discussion.
Also provide a question that you
think is either unanswered by these
papers, or where the authors disagree.