Gamma: Parallelizing Sequential Streaming Applications

updates in red

In this assignment, you will implement the Gamma HJOIN architecture within a single Java program using multiple threads and Java pipes exactly as in the Gamma lecture notes -- no short cuts!  Short cuts will be deducted from your grade.

Warning: this is a substantial assignment.  It will take you many days to finish.  Do not delay in getting started.  This assignment will count 2x times a typical programming assignment, so take it seriously.  Brownie points will be given to those groups who finish early.


Gamma Netbeans Project Preliminaries

I am giving you a shell of a Gamma Netbeans Project that uses MDELite to read and write tables.  In the project, I supply 3 directories:
I have 5 components/boxes (ReadTable, Print, Sort, HSplit, OMerge) for you to use as exemplars and three runnable regression tests:

Study carefully how these examples work.  Only when you understand how each box and each regression test works, should you proceed onto the main assignment.


More Studying

A major part of this assignment is for you to read and use existing code.   Read the last sentence again for effect. Not doing so will cost you a lot of time. Study carefully the gammaSupport package. It contains definition of tuples and common operations and other utilities that you will need, such as connectors:
Connector c = new Connector("A_to_B");
ReadEnd re = c.getReadEnd();
WriteEnd we = c.getWriteEnd();

Please note: there is (may still be) a bug in Java Pipes.  If you serialized tuples and pushed them through pipes, your program will fail.  (I have no idea why).  If tuples are serialized as Java Strings, there is no problem.  This level of detail has been hidden from you -- you should only invoke getNextTuple or getNextMap methods on ReadEnds, and putNextTuple or putNextMap on WriteEnds.  Study the boxes whose code I give you.  You'll see the level of abstraction at which you need to program.

Another point on Java pipes.  When a  ReadEnd is finished (meaning there is nothing more to read) -- close it!  When a WriteEnd is finished (meaning there is nothing more to write) -- close it!  Typical coding idioms are:
ReadEnd
WriteEnd
while(true) {
Tuple t = re.getNextTuple();
if (t == null) {
re.close();
break;
}
// do something with t
}
while(true) {
// produce tuple t
if (t == null) {
wr.close();
break;
}
wr.putNextTuple(t);
}

Study the code of the boxes I give you, and you'll see the use of these idioms.  They are important: if you don't close when your done, your boxes/threads may hang or not perform correctly.  It is a horror to be stuck in this position. So follow my advice!


Another note: Carefully read and understand the BMap class.  It implements tuple hashing and manipulates bitmaps. There are many ways in which MMerge and MSplit can be realized. The simplest is this: M is a n×k bitmap. The join key of an A tuple is hashed twice: once to determine the row of M, the second to determine the column within the selected row. Thus, all tuples of substream A[i] hash to row i of M. MMerge combines M[1]...M[n] into M by boolean disjunction. For each i, MSplit extracts row i from M and zeros out the rest of M[i].


Databases Provided

In the zip file, I provide in directory RelationData a set of individual tables, and the entire clientView database from which these tables were extracted.  Further, in the RegTest directory, I provide a simple program, getAnswers, to produce tables that are the results of joins.  The files that are produced by this program are to be used as "Correct" answers for writing your regression tests.


Implementation Hints

I suggest two main class hierarchies.  Here's the first -- every terminal class is a primitive, whose object is a thread.  

 

The meaning of these classes is below:

The following boxes/classes implement pipe-and-filter graphs that implement the functionality of the primitive boxes above (and thus have the same iconic "shape").  They are not part of the Thread hierarchy as only the building blocks above are implemented as threads.

These boxes implement various non-primitive implementations of the HJoin box/primitive.  In  the notes, one implementation map-reduced HJoin, another implementation implemented HJoin using BFilter and Bloom boxes wired together, and finally the entire Gamma graph.  The terminal boxes are subclasses of Gamma.ArrayConnectors, which provides useful ways of creating arrays of pipes in one call.  Not all methods of Gamma.ArrayConnectors may be needed for this project; all are fairly simple.


What To Submit To Canvas

All of the below in a zip file (including your Netbeans or Eclipse Project). The zip file must unzip into <yourName>/<YourFilesAndDirectories>.
  1. Your program needs to run correctly on Linux machines, even though you may have developed them on Macs and Windoze.  The TA will grade your program running on Linux.

  2. Your program's source.

  3. A short description that the Grader needs to know to run your program, other than the above. 

  4. Use JUnit tests that are provided to confirm you are producing the correct output for Gamma joins of:
You may add any additional tests you believe are necessary. Additional tests are optional.
  1. A short writeup explaining any additional tests that you have added to your program to verify that it works.

  2. A PDF file (in the required format) that the Grader should read to provide any information that is not obvious.  The contents of the PDF file can be minimal.

A critical part of any design is clarity and understandability.   Hence, you will be graded on the clarity of your project and its ability to work correctly.  Sloppy code, documentation, or anything that makes grading or understanding your program difficult will cost you points.  Beware, some of these "beauty" points are subjective. 

Remember: No late assignments/submissions will be accepted.