CS 394C: Algorithms for Computational Biology

Instructor: Tandy Warnow, tandy@cs.utexas.edu

Professor Warnow's office hours: M 12:30-1:30 PM, GDC 4.510, and Wednesday December 11 from 3-5.

Teaching Assistant: Eshan Chattopadhyay

Course meets MW 2-3:30 PM, GDC 2.502

Final exam Thursday December 12, 9 AM to noon, GDC 2.502.

Conference: Tuesday December 17, 1-4 PM in ACES 2.402.

Topics: This graduate course covers computational issues in molecular biology and historical linguistics. The biological topics we will cover include genome assembly, multiple sequence alignment, the estimation of evolutionary trees and networks, and metagenomics. In biology, we will computational problems for estimating evolutionary histories of families of natural languages (Indo-European, Dravidian, Finno-Ugric, etc.). These problems are individually enormously computationally intensive - computational approaches generally involve attempts to solve NP-hard optimization problems. Algorithmic development is needed for improved scientific analyses. We will cover statistical models, primary optimization problems, and the mathematical theory needed to predict the performance of estimation methods under the models. The course will also present a large spectrum of algorithms (both deterministic with established theory, and heuristics) that have been developed for these problems. In addition, we will have lectures from guest biologists.

Background Expected for Students: The course is designed for students in computer science or mathematics, and no background in biology or linguistics is assumed or even needed. However, if you are a student from another disciplines (such as biology) and are interested in taking the course, please see me.

Research Opportunities: The goal of the course is to enable the students to do high quality research (both mathematical and algorithmic) for these problems. Furthermore, abundant research opportunities are available, including development of new heuristics for (typically NP-hard) discrete optimization problems, analyses of biological or linguistic data using new methods, simulation studies of existing methods, etc. Phylogeny and multiple sequence alignment estimation has several easily formulated and accessible research challenges, which new students could make progress on - without needing to know any biology or linguistics. Therefore, much of the course will be focused there.

Syllabus

Grading Scheme:

Homework

All University policies relating to plagiarism (e.g., copying text with minor modifications and/or paraphrasing) apply to homework. In addition, you must submit your own solutions to homework. While it is fine to discuss homework assignments with other students in the class, you must write up your solutions yourself, and so there should not be any sharing of any written solutions between you. Furthermore, if you do discuss a homework assignment with someone else, you must indicate this (and identify the other student) on your submitted homework. See below for more information about plagiarism.

Homework assignments will be of three types: pen and paper (doing calculations, proving theorems, etc.), programming (developing, implementing, and/or testing methods for computational biology or computational historical linguistics problems), and discussing published papers. The majority of the homeworks related to phylogeny estimation are available here, and are based on the 394C course notes.

Biology students: if you find a problem requires computational background that you do not have, please contact me. If I agree, you will be welcome to substitute reading and critiquing a recent publication related to the same subject matter; these critiques will need to be 2 pages long, and will be graded for content and writing.

Final Project

The course requires a final project of each student. You are strongly encouraged to do a research project, but you can also do a survey paper on some topic relevant to the course material. In both cases, your project should be a paper (of about 15 pages) in a format and style appropriate for submission to a journal. Research projects can involve two students, but survey papers must be done by yourself.

Grades on the final project depend upon the kind of project you do. For a research paper, your grade will be 30% writing, 40% scientific/algorithmic rigor, and 30% impact. If you do a survey paper, the grade will be 30% writing, 30% summary of the literature you discuss, and 40% commentary (i.e., insight, critical and thoughtful discussion of the issues that come up).

Schedule:

Plagiarism

Note: all too often, I have found that some students do not understand that copying text (with, perhaps, slight modifications) amounts to plagiarism, unless the copied text is put in quotes, and the original source is correctly attributed. Paraphrasing is also potentially a problem and. Please do not copy text (with minor editing) or paraphrase without attribution; these both amount to plagiarism. The University of Texas imposes very serious penalties. Furthermore, plagiarism in publications is even more serious. See this webpage for more information on plagiarism.

Final Exam

The final exam will be comprehensive, and will take place on Thursday December 12 from 9 AM to noon. This will be a closed book exam, and you will not need a calculator.

Class Presentations

Reading Materials

The following are potentially useful to you for background material.

Homework Assignments

  • Homework #1 (due September 4, 2013). Read Chapter 2 in the course notes. Do all homework problems that begin with 2.1 or 2.2.
  • Homework #2 (due September 11, 2013). See the last slide for the class presentation on September 4, 2013. Read Chapter 3 in the course notes.
  • Homework #3 (due September 18, 2013).
  • Homework #4, due September 23, 2013.
  • Homework #5, due September 30, 2013.
  • Homework #6, due October 2, 2013.
  • Homework #7, due October 8 (Tuesday) by noon, via email. Submit a PDF of a critique of any paper (other than the splids paper - which is the next homework assignment) you have read, including full bibliographical information. Make sure to suggest some kind of follow-up research that could be done. Your critique will be posted on the class webpage for download by students in the course, so that we can have a discussion of these papers. These will be graded, and the grade will count towards your homework grade. See below for the critiques to download.
  • Homework #8, due October 14. These critiques will be graded, and the grade will count towards your homework grade. You are welcome to discuss your critiques with other students, but you must write your critique yourself. NOTE: Be careful about your writing - do not copy text without complete attribution (put the copied text in quotes). Avoid anything that could be considered plagiarism.
  • Homework #9, due October 21. Read this paper, and submit a 1-2 page summary of it.
  • Homework #10, due October 28. Read the following papers on genome assembly (from the list above): Based on these papers, pick one of the following assemblers - SGA, Velvet, MSR-CA, Allpaths-LG, SOAPDenovo, Newbler, and CABOG -- and prepare a 15 minute presentation of the method, as well as a 2-3 page discussion of what you understand about the assembler. You should make sure to read at least one paper about the assembler (written by the authors of the method). I may ask you to present the paper and your discussion of the paper in class.
  • Homework #11, due November 4. (Biology students, you are welcome to read and critique, in depth, a paper published in the last 3 years about genome assembly instead of doing these problems.)
  • Homework #12. Due November 6. Briefly answer all questions in the first set of questions (1-40), and provide a longer answer to at least one of the second set of questions, from here. Note - to answer these questions, you may well have to do a literature search!
  • Homework #13. Due November 11. Read the following papers, and then follow up by finding another paper related to genome assemby, and reading it. Write up a discussion about the other paper you read. Be prepared for a class discussion.
  • Homework #14. Due November 25. Final projects due (as hardcopy in class, not emailed as PDF).
  • Homework #15. Due November 27. Read A Primer on Metagenomics by Wooley et al, A Bioinformatician's Guide to Metagenomics", and From genomics to metagenomics by Desai et al. Write a 2-3 page critique, including a summary of these papers and a discussion. Also provide a question that you think is either unanswered by these papers, or where the authors disagree.