Program P9: Gamma Joins

Due Wednesday, December 4th, 12noon

In this assignment, you will implement the Gamma Join architecture within a single Java program using multiple threads and Java pipes exactly as in the Gamma lecture notes -- no short cuts!  (Short cuts will be deducted from your grade).

Warning: this is a substantial assignment.  It will take you many days to finish.  Do not delay in getting started.  This assignment will count three (3) times a typical programming assignment, so take it seriously.  Brownie points will be given to those groups who finish early.


Before you begin this project, I am supplying you the source code of a parallel sort. The refinement from sequential to parallel is:

Here is its Netbeans source code that uses Java pipes.  Study carefully how it works.  Only when you understand how each box works, should you proceed onto the main assignment.


I am giving you the source of two packages:

Connector c = new Connector("A_to_B");
ReadEnd re = c.getReadEnd();
WriteEnd we = c.getWriteEnd();

Carefully read and understand the BMap class.  It implements tuple hashing and manipulates bitmaps. There are many ways in which MMERGE and MSPLIT can be realized. The simplest is this: M is a nxk bitmap. The join key of an A tuple is hashed twice: once to determine the row of M, the second to determine the column within the selected row. Thus, all tuples of substream Ai hash to row i of M. MMERGE combines M1...Mn into M by boolean disjunction. For each i, MSPLIT extracts row i from M and zeros out the rest of Mi.

Now, carefully re-read the lecture notes on Gamma and how its design was incrementally developed.  I suggest that you:

  1. start by implementing a ReadRelation component that reads a text file and produces a stream of Strings that represent records.  You'll reuse this component all the time in your tests.  You will need to use Java's StringTokenizer to parse lines of input files (given below)

  2. next, implement a Print component that prints a stream of tuples to standard out.

  3. now, begin building components like HJOIN, BFILTER, etc. one at a time.  Below I give you the core methods of a test for the HJOIN component:

public void join(String r1name, String r2name, int jk1, int jk2) throws Exception {
System.out.println( "Joining " + r1name + " with " + r2name );

ThreadList.init();
Connector c1 = new Connector("input1");
ReadRelation r1 = new ReadRelation(r1name, c1);
Connector c2 = new Connector("input2");
ReadRelation r2 = new ReadRelation(r2name, c2);
Connector o = new Connector("output");
HJoin hj = new HJoin(c1, c2, jk1, jk2, o);
Print p = new Print(o);
ThreadList.run(p);
}
// run 3 joins to verify that the HJOIN implementation is correct. Utility is from RegTest.jar
public void jointest() throws Exception {
Utility.redirectStdOut("out.txt");
join("parts", "odetails", 0, 1);
join("client", "viewing", 0, 0);
join("orders", "odetails", 0, 0);
Utility.validate("out.txt", "CorrectOutput/jointest.txt",true);
}

Note: below is a pipe and filter diagram that the above code implements.  Note that the diagram doesn't show the details of what file to read or the join key, but this information is conveyed directly in the calls above.

Note: The last line in each constructor (ReadRelation, HJoin, Print, etc) that represents a primitive computation that is implemented as a thread, the last of its constructor is  ThreadList.add(this); --- that is, each created thread puts itself on the thread list, so that when ThreadList.run(p) is executed, all threads that are created will be run.  The reason I did this is that you always have to place each created thread on the ThreadList.  And since it is easy to forget to do this, put this action into the constructor so that you never will forget as it is done for you.

You will need tables to join.  I provide several tables: client, viewing, orders, parts, odetails.  I also provide tables that are produced by particular joins:

so that you can compare your output with correct output.   These solutions are there for you to confirm that you are producing the correct output, not that this is the *format* that your output should be in.  (Ex. joinkey columns may be duplicated). The order in which tuples are listed is irrelevant.  The order in which columns are listed is irrelevant.  Your implementation will determine the order in which tuples are listed. Your implementation will determine the order in which columns are listed.  But all tuples and all columns should be present.

Note: the joins I provide and that you will compute are called "natural joins" -- that is, the join key for two tables is the column(s) that both tables have in common.  You should assume that there is only one column in common for every join you will encounter.

Note: I suggest that you use these "correct" files to get started.  When you use the RegTest code, you may discover that you are outputting the correct result, except for some strange invisible characters at the end of the file.  When you see that you are in this situation, simply copy your generated output (which you know to be correct, except for the end-of-line or end-of-file strangeness) and simply make it the "correct" output from this point on.

You must clearly explain and demonstrate that you have implemented the final parallel hash-join architecture in the lecture notes.  You should have substantial tests for each component that you write.  And you should have a substantial write-up of what you have done, how you have proceeded, and your organization.


Implementation Hints.

I suggest two main class hierarchies.  Here's the first -- every terminal class is a primitive, whose object is a thread.  See note above on thread constructors.

 

The meaning of these classes is below:

The following boxes/classes implement pipe-and-filter graphs that implement the functionality of the primitive boxes above (and thus have the same iconic "shape").  They are not part of the Thread hierarchy as only the building blocks above are implemented as threads.

These boxes implement various non-primitive implementations of the HJoin box/primitive.  In  the notes, one implementation map-reduced HJoin, another implementation implemented HJoin using BFilter and Bloom boxes wired together, and finally the entire Gamma graph.  The terminal boxes are subclasses of Gamma.ArrayConnectors, which provides useful ways of creating arrays of pipes in one call.  Not all methods of Gamma.ArrayConnectors may be needed for this project; all are fairly simple.