CS 307 - Fall 2001 - Assignment 3
Mapping Genomes

Handed Out:    September 24, 2001
Due:    Before 11:59 p.m. October 10, 2001
Purpose: To learn about creating classes, working with complex algorithms, and array manipulation
Provided classes:    DNAMain.java, Sequencer.java, FileHandler.java, Settings.java, simple.txt, complex2.txt

This program uses a simplification of the process involved in the Human Genome Project and other similar projects for mapping DNA. For example, see

(or type human genome into your favorite search engine).

This assignment is modeled after an assignment first constructed by Richard Pattis at Carnegie Mellon.

Vocabulary

Base (plural Bases) is one of A, T, C, G, the nucleotides that make up DNA. These letters stand for adenine, thymine, cytosine, and guanine.
Strand is part of a DNA molecule and consists of a sequence of bases, e.g., atcgcgcgaatt.
Clone is a collection of strands representing a DNA fragment that your program is trying to reconstruct.

For this assignment, you'll be reading a file that identifies several strands in a DNA molecule and trying to reconstruct the DNA molecule from the strands. The data you'll be processing is artificially generated (by a program) but the process is similar to what's done by computer programs in the human genome project. In a laboratory, a fragment of DNA is duplicated, then all the fragments are broken into smaller, overlapping strands by methods like thermal dissociation and agitation. The bases of the smaller strands are then determined and entered into a file. This is where you step in: your program processes the file and reconstructs the original DNA sequence of bases. The data for this assignment is perfect; it contains no noise and no mistakes. In this respect the data is unlike that processed by the Human Genome Project.

Basic Operation

The basic operation that's essential in doing this assignment is to process all the strands in a clone by matching two strands and merging them into a new strand. This process decreases the number of strands by one since two strands are merged into one strand. The match/merge operation is repeated until there is only one strand left in the clone---this strand will be the original DNA; or until no matches are possible. Every program that generates a single DNA strand should generate the same strand. When no single strand can be produced, e.g., because the matching threshold (defined below) is too low, programs may generate different results depending on the order in which matches are attempted. So, either one strand is left, which will be the same in every program, or several strands are left and the leftover strands may be different.

Match and Merge

Two strands X and Y can match by overlapping, or Y can be completely contained in X (or vice versa, X can be contained in Y).

Overlapped Match

For example, if X is the strand acggtcac and Y is gtcacatta then X and Y overlap as diagrammed below.

0	1	2	3	4	5	6	7
a	c	g	g	t	c	a	c
			g	t	c	a	c	a	t	t	a
			0	1	2	3	4	5	6	7	8

This example shows how a match between X (called the first strand of the match) and Y (called the second strand) is formed by finding an index in X (3 in the example above) so that all the bases from that index to the end of X match the corresponding bases at the beginning of Y. The number of bases forming the match, five in the example above, is the size of the match. The size of a match must exceed some threshold to count as a match. For example, if the threshold were six, the match between X and Y above would not be considered a match.

When two strands match, a merged strand is formed by copying the first strand, then appending the bases of the second strand that come after the matched bases. In the example above, the merged strand is formed by first copying X to get acggtcac then appending atta (the bases in the second strand Y from indexes 5--8) to yield the merged strand acggtcacatta. A merged strand also joins the labels of the two strands that were merged. See the section on file formats below for label information.
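The overlap example above can be worked through in a few lines of Java. This is only a sketch: the class name OverlapSketch and its two helper methods are hypothetical, and plain Strings stand in for strands (the assignment's Strand class would also carry a label).

```java
// Hypothetical helper names; plain Strings stand in for strands.
final class OverlapSketch {
    // True if the bases of x from index to the end of x equal
    // the first (x.length() - index) bases of y.
    static boolean overlapsAt(String x, String y, int index) {
        int size = x.length() - index; // size of the candidate match
        return size <= y.length() && x.regionMatches(index, y, 0, size);
    }

    // Copies x, then appends the bases of y that come after the matched bases.
    static String merge(String x, String y, int index) {
        int size = x.length() - index;
        return x + y.substring(size);
    }

    public static void main(String[] args) {
        String x = "acggtcac";
        String y = "gtcacatta";
        int index = 3;
        System.out.println(overlapsAt(x, y, index)); // true
        System.out.println(x.length() - index);      // 5, the size of the match
        System.out.println(merge(x, y, index));      // acggtcacatta
    }
}
```

Note that this sketch does not apply the threshold; it only shows how the index, the match size, and the merged strand relate to each other.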

Containing Match

If strand X acggtcac completely contains a strand Y, for example cggt as shown below, then a containing match is found.

0	1	2	3	4	5	6	7
a	c	g	g	t	c	a	c
	c	g	g	t
	0	1	2	3

The merge of this match is simply the strand X. For a containing match no threshold is involved: all containing matches are good matches, and each one decreases the number of strands by one when the merge is formed.
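As a small illustration of the containing case, the sketch below uses String.indexOf to find where Y sits inside X. The class and method names are hypothetical, and Strings again stand in for strands.

```java
// Hypothetical names; a containing match ignores the threshold entirely.
final class ContainSketch {
    // Returns the index in x at which y occurs in full, or -1 if x does not contain y.
    static int containedAt(String x, String y) {
        return x.indexOf(y);
    }

    public static void main(String[] args) {
        String x = "acggtcac";
        String y = "cggt";
        int index = containedAt(x, y);
        System.out.println(index); // 1
        // The merge of a containing match is just the containing strand.
        String merged = x;
        System.out.println(merged); // acggtcac
    }
}
```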

Finding Matches

For most strand pairs, a match won't be found, or if a match is found it may not exceed the desired threshold. The program you write will begin with all the strands read from a file and stored in a clone, then try to find matches by considering every possible pair of strands in the clone. Whenever a match is found, the two strands are removed from future consideration in matching, but their merge is inserted into the collection of strands being considered. As noted previously, this process continues until there is a single strand or until no more matches can be found. When you create a merge you must use all labels associated with the merged strands so that the new strand can be identified.
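The repeat-until-done process described above might be sketched as follows. Everything here is an assumption made for illustration: the class name MatchLoopSketch, the THRESHOLD value of 3, the use of Strings without labels, and the use of a List (the assignment itself requires an array inside Clone). The threshold test follows the wording above, so an overlap counts only if its size exceeds the threshold.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MatchLoopSketch {
    static final int THRESHOLD = 3; // assumed value; the real one lives in Settings

    // Returns the index in first at which second matches, or -1 if there is no match.
    static int match(String first, String second) {
        for (int index = 0; index < first.length(); index++) {
            int remaining = first.length() - index;
            if (second.length() <= remaining) {
                // second fits inside first from here: containing match, no threshold
                if (first.regionMatches(index, second, 0, second.length())) return index;
            } else if (remaining > THRESHOLD
                    && first.regionMatches(index, second, 0, remaining)) {
                return index; // overlapping match whose size exceeds the threshold
            }
        }
        return -1;
    }

    static String merge(String first, String second, int index) {
        int matched = Math.min(first.length() - index, second.length());
        return first + second.substring(matched);
    }

    // Merges pairs until one strand is left or no pair matches.
    static List<String> process(List<String> input) {
        List<String> strands = new ArrayList<>(input);
        boolean merged = true;
        while (merged && strands.size() > 1) {
            merged = false;
            for (int j = 0; j < strands.size() && !merged; j++) {
                for (int k = 0; k < strands.size() && !merged; k++) {
                    if (j == k) continue; // never match a strand with itself
                    int index = match(strands.get(j), strands.get(k));
                    if (index >= 0) {
                        String m = merge(strands.get(j), strands.get(k), index);
                        strands.remove(Math.max(j, k)); // remove higher index first
                        strands.remove(Math.min(j, k));
                        strands.add(m);
                        merged = true; // restart the search on the new collection
                    }
                }
            }
        }
        return strands;
    }

    public static void main(String[] args) {
        List<String> result = process(Arrays.asList("acggtcac", "gtcacatta", "cggt"));
        System.out.println(result); // [acggtcacatta]
    }
}
```

Restarting the pair search after every merge is the simplest choice, not the fastest one; it keeps the indexes valid after strands are removed.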

Classes For Programming

You're given six files for this assignment; these can be copied via the links on the web page. The files are:

DNAMain: simply creates a Sequence object and tells it to go. You must specify the file where the DNA data is contained as a command line argument. This would be a String with the full path to the file with the DNA data. It is easiest to store the DNA file right off your root directory or right on your hard drive, for example C:\simple.txt. Remember that when you type the path as a String, any backslashes in the String must be doubled; otherwise it looks like you are trying to specify an escape character. For example: "C:\\simple.txt"
Sequence: creates a FileHandler object and a Clone object. Reads all the Strands from the specified file and adds them to the Clone object. When finished, it prints out the labels of the final DNA strand or strands, that is, the merged ones.
FileHandler: a class to shield you from the complexities of doing the file input. I want you to focus on the Clone and Strand classes and the algorithms involved so I am giving you both FileHandler and Sequence.
Settings: a class that holds various constants and settings such as the threshold level and the debugging level. It is often helpful in complicated programs to provide built in debugging code that can be turned on or off based on the DEBUG_LEVEL. For example

if(Settings.DEBUG_LEVEL > Settings.MINOR_DEBUG) System.out.println("info to help debug program");
simple.txt: a file with a simple set of strands for your program to practice on. You can crank up the debug level on this one.
complex.txt: a file with a complex set of strands for your program to practice on. Be careful not to crank up the debug level when processing this file. It is very large. I would only test your program on this file after you know it works on simple.txt. You may want to simply output the final label, certainly not the final strand or strands.

You must implement the Strand and Clone classes.

A sequence of steps is outlined that you may find useful in writing the program. The format of the input file follows the description of how you can develop your program although you need not deal with the files other than providing the name to the main method.

Steps in Developing the Program

Write the single int Constructor and addStrand method for the Clone class. The single int Constructor is told how many strands will be in the Clone initially. The addStrand method is used to add a single strand to the Clone. Note that these methods are only called by Sequence. You should not need to call them anywhere else; you only need to implement and test them.
Write a stub version of the match method in Strand that always returns 1, for example. Then use this to write the process method in Clone so that the sequencing operation of matching and merging is simulated. The results won't be correct, but you'll have a start on things. You'll want to consider every possible pair of strands, but make sure you don't try to match a strand with itself.
If this works, change the implementation of the match method in Strand to always return -1. This will help you verify that all possible pairs of strands are tried for matching.
You'll want to add debugging statements like the one below to help develop your program. The statement below is from my version of Clone.java.

if (Settings.DEBUG_LEVEL > Settings.MINOR_DEBUG) { System.out.println( "trying to match strands " + j + " " + k); }
Write the match method in Strand so that it works as it should, returning the index at which the parameter strand matches (or -1 if no match at the given threshold is found).
I strongly suggest that you do this by implementing a private helper method isMatch in Strand whose method signature is
private boolean isMatch(Strand other, int index)
// pre: 0 <= index < length()
// post: returns true if strand other matches this strand
//       starting at location index of this strand;
//       returns false if no match
For example, if the value of index is zero then the strand other is tried for a match at the beginning of the strand being tested (remember, this is a member of the Strand class). If the value of index is length()-1 then other is tried as a match of one character: the last character of this strand against the first character of other.
The method isMatch is one for loop that compares characters in the two strands, starting at location 0 in the parameter Strand other and starting at location index in the Strand whose method is being called (the calling object).
If isMatch works properly, the match method can be written with one loop that calls isMatch with all values of parameter index from zero to length()-1. The calls to isMatch shouldn't be made unless the match could be a containing match or exceed the threshold for an overlapping match.
Once this matching function works you'll need to implement a Strand constructor that creates a strand from two other strands starting at an index in the first strand. You could call this as follows (this is the call from my program).
Strand newStrand = new Strand(myStrands[j],myStrands[k], index);

This is called when a match of two strands is found. Don't forget that you must merge the bases from both strands and also merge the labels. Each Strand object will have private instance variables to track its string of bases and its label. When merging two strands, the new strand should have a label equal to the first label concatenated with the second label.

When a match is found, the two matching strands must be removed from myStrands in the Clone object. I recommend writing a private helper method to perform this merge. The new Strand could be created here and myStrands could be updated by adding the new Strand and deleting the two old strands. Be sure to update myStrands correctly so that your algorithm checks all possibilities for the new list of strands, and only those.
You may find you need other methods to assist in your completion of the assignment, but the ones mentioned above are the major ones. You may want a toString method and a getLabel method.
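The isMatch, match, and merging-constructor steps above could fit together roughly as in the sketch below. It is not my solution and not the required design: the class is named StrandSketch to make clear it is hypothetical, the bases are held in a String (one of several allowed choices), THRESHOLD is an assumed stand-in for the value kept in Settings, and getBases is an invented accessor.

```java
final class StrandSketch {
    private static final int THRESHOLD = 3; // assumed; stands in for the Settings value
    private final String label;
    private final String bases;

    StrandSketch(String label, String bases) {
        this.label = label;
        this.bases = bases;
    }

    // Merging constructor: first's bases, then the bases of second that come
    // after the matched region; the labels are concatenated first then second.
    StrandSketch(StrandSketch first, StrandSketch second, int index) {
        int matched = Math.min(first.length() - index, second.length());
        this.label = first.label + second.label;
        this.bases = first.bases + second.bases.substring(matched);
    }

    int length() { return bases.length(); }
    String getLabel() { return label; }
    String getBases() { return bases; }

    // pre: 0 <= index < length()
    // One loop comparing this strand from index against other from 0,
    // stopping at whichever strand runs out first.
    private boolean isMatch(StrandSketch other, int index) {
        int limit = Math.min(length() - index, other.length());
        for (int i = 0; i < limit; i++) {
            if (bases.charAt(index + i) != other.bases.charAt(i)) return false;
        }
        return true;
    }

    // Returns the index at which other matches this strand, or -1.
    int match(StrandSketch other) {
        for (int index = 0; index < length(); index++) {
            int remaining = length() - index;
            boolean couldContain = other.length() <= remaining; // no threshold needed
            boolean couldOverlap = remaining > THRESHOLD;       // size must exceed threshold
            if ((couldContain || couldOverlap) && isMatch(other, index)) return index;
        }
        return -1;
    }

    public static void main(String[] args) {
        StrandSketch x = new StrandSketch("X", "acggtcac");
        StrandSketch y = new StrandSketch("Y", "gtcacatta");
        int index = x.match(y);
        StrandSketch merged = new StrandSketch(x, y, index);
        System.out.println(index);                                        // 3
        System.out.println(merged.getLabel() + " " + merged.getBases()); // XY acggtcacatta
    }
}
```

The guard in match is what keeps isMatch from being called when the result could not count anyway: either the other strand fits entirely (containing match) or enough bases remain to exceed the threshold.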

Input File Format

An input file is a sequence of strands. The file begins with the number of strands in the file. Each strand consists of a label on one line followed by the bases that make up the strand. The bases are stored 60 bases per line except for the last line which may have fewer than 60 bases. The sample data file simple.txt is reproduced below.

7
B0
tgaaaattcctttctattttaggccc

C0
tgaaaattcctttctattttaggcccatgcaat

C1
ggcattagggcggttaa

B1
atgcaatggcattagggcggttaa

A2
ggttaa

A0
tgaaaattcctttctattt

A1
taggcccatgcaatggcattagggc

Again, you shouldn't have to deal with any files directly. File input is already taken care of in Sequence and FileHandler as long as you provide a valid file name.
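Purely to illustrate the layout described above, and not as a picture of how FileHandler actually works, here is a sketch that reads the format from a String. The class name FormatSketch is hypothetical, and the sketch assumes that a blank line separates one strand from the next, as in the sample.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

final class FormatSketch {
    // Returns one {label, bases} pair per strand. Assumes a blank line
    // (or the end of input) ends each strand's lines of bases.
    static List<String[]> parse(String text) {
        Scanner in = new Scanner(text);
        int expected = Integer.parseInt(in.nextLine().trim()); // first line: strand count
        List<String[]> strands = new ArrayList<>();
        while (strands.size() < expected && in.hasNextLine()) {
            String label = in.nextLine().trim();
            if (label.isEmpty()) continue; // skip extra separator lines
            StringBuilder bases = new StringBuilder();
            while (in.hasNextLine()) {
                String line = in.nextLine().trim();
                if (line.isEmpty()) break; // blank line closes this strand
                bases.append(line);        // long strands span several lines
            }
            strands.add(new String[] {label, bases.toString()});
        }
        return strands;
    }

    public static void main(String[] args) {
        String text = "2\nA2\nggttaa\n\nA0\ntgaaaattcctttctattt\n";
        for (String[] strand : parse(text)) {
            System.out.println(strand[0] + " " + strand[1]);
        }
    }
}
```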

Given the above file my program obtained the solution:

B0C0B1C1A2A0A1 tgaaaattcctttctattttaggcccatgcaatggcattagggcggttaa

Note your final label may be different, but your sequence of bases should be the same.

A note on the instance variables of Clone and Strand: One of your key decisions will be what you use to store the data in Strand and Clone. In Strand there are no restrictions on how you store the bases internally, although the most likely candidates are a String, a StringBuffer, or an array of chars. For the Clone class, however, you must have and use an array of Strands as your storage.
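Because Clone must use an array, replacing two matched strands with their merge takes a little bookkeeping. One possible approach, sketched below, keeps a count of live entries and fills the freed slot with the last live strand. The class ArrayCloneSketch and its method names are hypothetical, and Strings stand in for Strand objects.

```java
final class ArrayCloneSketch {
    private final String[] strands; // stand-in for an array of Strand objects
    private int count;              // number of live entries at the front of the array

    ArrayCloneSketch(int capacity) {
        strands = new String[capacity];
    }

    void add(String s) { strands[count++] = s; }
    int size() { return count; }
    String get(int i) { return strands[i]; }

    // Replaces the strands at positions j and k (j != k) with their merge,
    // shrinking the live portion of the array by one.
    void replacePair(int j, int k, String merged) {
        int low = Math.min(j, k);
        int high = Math.max(j, k);
        strands[low] = merged;              // merged strand takes the lower slot
        strands[high] = strands[count - 1]; // last live strand fills the higher slot
        strands[count - 1] = null;          // clear the now-unused slot
        count--;
    }

    public static void main(String[] args) {
        ArrayCloneSketch clone = new ArrayCloneSketch(4);
        clone.add("a");
        clone.add("b");
        clone.add("c");
        clone.add("d");
        clone.replacePair(1, 3, "m");
        for (int i = 0; i < clone.size(); i++) {
            System.out.print(clone.get(i) + " "); // a m c
        }
        System.out.println();
    }
}
```

This approach does not preserve the original order of the strands, which is fine here because every pair is reconsidered after each merge.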

What to turn in: Turn in your Clone.java and Strand.java files by the due date to me via email. Note I will test your program against a DNA sequence file you don't have access to, but if you can handle simple.txt and complex.txt you should be fine. One word of caution if you are using BlueJ: when you add a file to a BlueJ project, the file is actually copied into the folder where the project exists. Most likely this is C:\bluej\projectName. When you turn in a file, be sure you are turning in the file from the BlueJ project folder, not the one where you initially downloaded the file. The changes you make via BlueJ are to the file in the project folder, not the file you used to add the class to the project.