CS 307 - Fall 2001 - Assignment 3
Mapping Genomes
Handed Out: September 24, 2001
Due: Before 11:59 p.m. October 10, 2001
Purpose: To learn about creating classes, working with complex algorithms,
and array manipulation
Provided classes: DNAMain.java, Sequencer.java, FileHandler.java, Settings.java, simple.txt, complex2.txt
This assignment is modeled after an assignment first constructed by Richard Pattis at Carnegie Mellon.
For this assignment, you'll read a file that identifies several strands of a DNA molecule and try to reconstruct the molecule from those strands. The data you'll process is artificially generated by a program, but the process is similar to what computer programs do in the Human Genome Project. In a laboratory, a fragment of DNA is duplicated, and the copies are then broken into smaller, overlapping strands by methods such as thermal dissociation and agitation. The bases of the smaller strands are determined and entered into a file. This is where you step in: your program processes the file and reconstructs the original sequence of bases. The data for this assignment is perfect: it contains no noise and no mistakes. In this respect it is unlike the data processed by the Human Genome Project.
X index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |   |   |   |
X       | a | c | g | g | t | c | a | c |   |   |   |
Y       |   |   |   | g | t | c | a | c | a | t | t | a
Y index |   |   |   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
This example shows how a match between X (called the first strand of the match) and Y (called the second strand) is formed by finding an index in X (3 in the example above) so that all the bases from that index to the end of X match the corresponding bases at the beginning of Y. The number of bases forming the match, five in the example above, is the size of the match. The size of a match must exceed some threshold to count as a match. For example, if the threshold were six, the match between X and Y above would not be considered a match.
When two strands match, a merged strand is formed by copying the first strand, then appending the bases of the second strand that come after the matched bases. In the example above, the merged strand is formed by first copying X to get acggtcac, then appending atta (the bases in the second strand Y at indexes 5-8) to yield the merged strand acggtcacatta. A merged strand also joins the labels of the two strands; see the section on file formats below for label information.
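The merge rule above can be sketched on plain strings. This is only an illustration: the class MergeSketch and the method merge are hypothetical names, not part of the provided classes, and the sketch assumes the match has already been verified and that the second strand extends past the end of the first.

```java
// Hypothetical helper for illustration; not one of the provided classes.
class MergeSketch {
    // Merges two strands of bases, given the index in first where the
    // overlap with the beginning of second starts. Assumes the caller has
    // already verified the match and that second extends past the end of first.
    static String merge(String first, String second, int index) {
        int matchSize = first.length() - index;      // bases shared by both strands
        return first + second.substring(matchSize);  // copy first, append the rest of second
    }

    public static void main(String[] args) {
        // X and Y from the example above, matched at index 3 of X
        System.out.println(merge("acggtcac", "gtcacatta", 3)); // prints acggtcacatta
    }
}
```

Here the match size is 8 - 3 = 5, so the bases of Y from index 5 onward are appended to X.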
X index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
X       | a | c | g | g | t | c | a | c
Y       |   | c | g | g | t |   |   |
Y index |   | 0 | 1 | 2 | 3 |   |   |
The merge of this match is simply the strand X. For a containing match no threshold is involved: every containing match is a good match, and forming the merge decreases the number of strands by one.
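A containing match can be illustrated on plain strings as follows. This sketch uses the library method String.indexOf in place of the hand-written comparison loop the assignment asks for, and the names ContainingSketch and containingIndex are hypothetical.

```java
// Hypothetical helper for illustration; not one of the provided classes.
class ContainingSketch {
    // Returns the index in x where y begins if y is contained in x, else -1.
    // Uses String.indexOf rather than an explicit loop.
    static int containingIndex(String x, String y) {
        return x.indexOf(y);
    }

    public static void main(String[] args) {
        String x = "acggtcac";
        String y = "cggt";
        int index = containingIndex(x, y);
        System.out.println(index);  // prints 1
        // For a containing match the merged bases are just x,
        // no matter how short y is.
        if (index >= 0) {
            System.out.println(x);  // prints acggtcac
        }
    }
}
```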
You must implement the Strand and Clone classes.
A sequence of steps that you may find useful in writing the program is outlined here. The format of the input file follows the description of how you can develop your program, although you need not deal with the files other than providing the file name to the main method.
If this works, change the implementation of the match method in Strand to always return -1. This will help you verify that all possible pairs of strands are tried for matching.
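As a rough sketch of what "all possible pairs" could mean, assume that order matters (strand j plays the first strand, strand k the second) and that a strand is never matched against itself. The class PairSketch and the method orderedPairs are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not one of the provided classes.
class PairSketch {
    // Lists every ordered pair (j, k) of distinct indexes below n.
    static List<int[]> orderedPairs(int n) {
        List<int[]> pairs = new ArrayList<>();
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < n; k++) {
                if (j != k) {                    // never match a strand with itself
                    pairs.add(new int[] {j, k});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // With 3 strands there are 3 * 2 = 6 ordered pairs to try.
        System.out.println(orderedPairs(3).size()); // prints 6
    }
}
```

If match always returns -1, a loop of this shape should attempt n * (n - 1) pairs for n strands before giving up.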
You'll want to add debugging statements like the one below to help develop your program. The statement below is from my version of Clone.java:
if (Settings.DEBUG_LEVEL > Settings.MINOR_DEBUG) {
    System.out.println("trying to match strands " + j + " " + k);
}
I strongly suggest that you do this by implementing a private helper method isMatch in Strand whose method signature is
private boolean isMatch(Strand other, int index)
// pre: 0 <= index < length()
// post: returns true if strand other matches this strand
//       starting at location index of this strand,
//       returns false if no match
For example, if the value of index is zero, then the strand other is tried for a match at the beginning of the strand being tested (remember, isMatch is a member of the Strand class). If the value of index is length()-1, then other is tried as a match of one character: the last character of this strand against the first character of other.
The method isMatch is a single for loop that compares characters in the two strands, starting at location 0 in the parameter Strand other and at location index in the Strand whose method is being called (the calling object).
If isMatch works properly, the match method can be written with one loop that calls isMatch with all values of parameter index from zero to length()-1. The calls to isMatch shouldn't be made unless the match could be a containing match or exceed the threshold for an overlapping match.
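One possible shape for isMatch and match is sketched below. It rests on several assumptions that the handout does not fix: the bases are stored in a String, the threshold is passed to match as a parameter, a match size must be strictly greater than the threshold, and match returns the matching index or -1. The class name StrandSketch is hypothetical.

```java
// Hypothetical sketch; the real Strand class may store its data differently.
class StrandSketch {
    private final String bases;   // assumption: bases kept in a String

    StrandSketch(String bases) {
        this.bases = bases;
    }

    int length() {
        return bases.length();
    }

    // pre: 0 <= index < length()
    // Compares other, starting at its location 0, with this strand starting
    // at location index, stopping when either strand runs out of bases.
    private boolean isMatch(StrandSketch other, int index) {
        int limit = Math.min(length() - index, other.length());
        for (int k = 0; k < limit; k++) {
            if (bases.charAt(index + k) != other.bases.charAt(k)) {
                return false;
            }
        }
        return true;
    }

    // Returns the first index at which other matches this strand, or -1.
    // Assumption: an overlapping match must be strictly larger than threshold.
    int match(StrandSketch other, int threshold) {
        for (int index = 0; index < length(); index++) {
            int remaining = length() - index;                // bases left in this strand
            boolean couldContain = other.length() <= remaining;
            boolean couldOverlap = remaining > threshold;
            if ((couldContain || couldOverlap) && isMatch(other, index)) {
                return index;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        StrandSketch x = new StrandSketch("acggtcac");
        StrandSketch y = new StrandSketch("gtcacatta");
        System.out.println(x.match(y, 4)); // prints 3 (match of size 5 exceeds 4)
        System.out.println(x.match(y, 6)); // prints -1 (size 5 does not exceed 6)
    }
}
```

The two boolean checks keep isMatch from being called at indexes where neither a containing match nor a large enough overlap is possible.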
Strand newStrand = new Strand(myStrands[j],myStrands[k], index);
This is called when a match of two strands is found. Don't forget that you must merge not only the bases from both strands but also the labels. Each Strand object will have a private instance variable to track its string of bases and another for its label. When merging two strands, the new strand should have a label equal to the first label concatenated with the second label.
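A merging constructor of this kind might look like the sketch below. The class name LabeledStrand, the String fields, and the accessor names are assumptions made for illustration, and the labels "X", "Y", and "Z" are made up. The sketch treats a second strand that fits entirely inside the first as a containing match and appends nothing.

```java
// Hypothetical sketch of a strand with a label and a merging constructor.
class LabeledStrand {
    private final String label;
    private final String bases;

    LabeledStrand(String label, String bases) {
        this.label = label;
        this.bases = bases;
    }

    // Builds the merge of first and second, where index is the location in
    // first at which second begins to match. Assumes the match was verified.
    LabeledStrand(LabeledStrand first, LabeledStrand second, int index) {
        int matchSize = first.bases.length() - index;
        // Containing match: second ends inside first, so there is no tail.
        String tail = second.bases.length() > matchSize
                ? second.bases.substring(matchSize)
                : "";
        this.bases = first.bases + tail;
        this.label = first.label + second.label;  // first label, then second
    }

    String getLabel() {
        return label;
    }

    String getBases() {
        return bases;
    }

    public static void main(String[] args) {
        LabeledStrand x = new LabeledStrand("X", "acggtcac");
        LabeledStrand y = new LabeledStrand("Y", "gtcacatta");
        LabeledStrand merged = new LabeledStrand(x, y, 3);
        System.out.println(merged.getLabel()); // prints XY
        System.out.println(merged.getBases()); // prints acggtcacatta
    }
}
```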
When a match is found, the two matching strands must be removed from myStrands in the Clone object and replaced by the merged strand. I recommend writing a private helper method to perform this merge. The new Strand could be created there, and myStrands could be updated by adding the new Strand and deleting the two old strands. Be sure to update myStrands correctly, so that your algorithm then checks all possibilities for the new list of strands and nothing else.
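One way to update an array after a merge is to build a new array that is one element shorter. The sketch below uses String elements so that it is self-contained; in Clone the elements would be Strand objects. The names ReplaceSketch and replacePair are hypothetical, and whether to allocate a new array or keep a separate count of the strands in use is a design choice left open by the handout.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; Clone would hold Strand objects.
class ReplaceSketch {
    // Returns a new array holding every element of strands except those at
    // indexes j and k, followed by merged. pre: j != k, both are valid indexes.
    static String[] replacePair(String[] strands, int j, int k, String merged) {
        String[] result = new String[strands.length - 1];
        int next = 0;
        for (int i = 0; i < strands.length; i++) {
            if (i != j && i != k) {
                result[next++] = strands[i];   // keep the untouched strands
            }
        }
        result[next] = merged;                 // merged strand goes last
        return result;
    }

    public static void main(String[] args) {
        String[] strands = {"a", "b", "c", "d"};
        String[] updated = replacePair(strands, 1, 3, "bd");
        System.out.println(Arrays.toString(updated)); // prints [a, c, bd]
    }
}
```

Because the indexes of the remaining strands shift after such an update, restarting the search over pairs from the beginning is the simplest way to be sure every pair in the new list is checked.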
You may find you need other methods to assist in your completion of the assignment, but the ones mentioned above are the major ones. You may want a toString method and a getLabel method.
7
B0 tgaaaattcctttctattttaggccc
C0 tgaaaattcctttctattttaggcccatgcaat
C1 ggcattagggcggttaa
B1 atgcaatggcattagggcggttaa
A2 ggttaa
A0 tgaaaattcctttctattt
A1 taggcccatgcaatggcattagggc
Again, you shouldn't have to deal with any files directly. That is already taken care of in Sequencer and FileHandler, as long as you provide a valid file name.
Given the above file my program obtained the solution:
B0C0B1C1A2A0A1
tgaaaattcctttctattttaggcccatgcaatggcattagggcggttaa
Note that your final label may be different, but your sequence of bases should be the same.
A note on the instance variables of Clone and Strand: One of your key decisions will be how to store the data in Strand and Clone. For the bases in each Strand object there are no restrictions, although the most likely candidates are a String, a StringBuffer, or an array of chars. For the Clone class, however, you must have and use an array of Strands as your storage.
What to turn in: Turn in your Clone.java and Strand.java files to me via email by the due date. Note that I will test your program against a DNA sequence file you don't have access to, but if you can handle simple.txt and complex.txt you should be fine. One word of caution if you are using BlueJ: when you add a file to a BlueJ project, the file is actually copied into the folder where the project exists, most likely C:\bluej\projectName. When you turn in a file, be sure you are turning in the file from the BlueJ project folder, not from where you initially downloaded it. The changes you make via BlueJ are made to the file in the project folder, not to the file you used to add the class to the project.