Program P8: Gamma Joins

Due Friday, December 2th, 10pm

In this assignment, you will implement the Gamma Join architecture within a single Java program using multiple threads and Java pipes. Hint -- read this entire assignment before you proceed! 

Warning: this is a substantial programming assignment.  It will take you several days to finish.  Do not delay in getting started.  This programming assignment will count 3 times a typical programming assignment, so take it seriously.  You may work alone or in groups of two.  Brownie points will be given to those who submit their assignments early.

I am giving you the source of two packages:

Connector c = new Connector("A_to_B");
ReadEnd  re = c.getReadEnd();
WriteEnd we = c.getWriteEnd();

Carefully read and understand the BMap class.  It implements tuple hashing and manipulates bitmaps. There are many ways in which MMERGE and MSPLIT can be realized. The simplest is this: M is a nxk bitmap. The join key of an A tuple is hashed twice: once to determine the row of M, the second to determine the column within the selected row. Thus, all tuples of substream Ai hash to row i of M. MMERGE combines M1...Mn into M by boolean disjunction. For each i, MSPLIT extracts row i from M and zeros out the rest of Mi.

Now, carefully re-read the lecture notes on Gamma and how its design was incrementally developed.  I suggest that you:

  1. start by implementing a ReadRelation component that reads a text file and produces a stream of Strings that represent records.  You'll reuse this component all the time in your tests.  You will need to use Java's StringTokenizer to parse lines of input files (given below)

  2. next, implement a Print component that prints a stream of tuples to standard out.

  3. now, begin building components like HJOIN, BFILTER, etc. one at a time.  Below I give you the core methods of a test for the HJOIN component:

public void join(String r1name, String r2name, int joinkey1, int joinkey2) throws Exception {
   System.out.println( "Joining " + r1name + " with " + r2name );

   ThreadList.init();
   Connector c1 = new Connector("input1");
   ReadRelation r1 = new ReadRelation(r1name, c1);
   Connector c2 = new Connector("input2");
   ReadRelation r2 = new ReadRelation(r2name, c2);
   Connector o = new Connector("output");
   HJoin hj = new HJoin(c1, c2, joinkey1, joinkey2, o);
   Print p = new Print(o);
   ThreadList.run(p);
}
// run 3 joins to verify that the HJOIN implementation is correct.
public void jointest() throws Exception {
    Diff.initSortedTest("out.txt", "jointest.txt");
    join("parts", "odetails", 0, 1);
    join("client", "viewing", 0, 0);
    join("orders", "odetails", 0, 0);
    if (Diff.finishTest()) { fail("HJ test fails");}
}

Note: below is a pipe and filter diagram that the above code implements.  Note that the diagram doesn't show the details of what file to read or the join key, but this information is conveyed directly in the calls above.

You will need tables to join. Here is a zip file containing them all. I provide several: client, viewing, orders, parts, odetails.  I also provide tables that are produced by particular joins:

so that you can compare your output with correct output.   These solutions are there for you to confirm that you are producing the correct output, not that this is the *format* that your output should be in.  (Ex. joinkey columns may be duplicated). The order in which tuples are listed is irrelevant.  The order in which columns are listed is irrelevant.  Your implementation will determine the order in which tuples are listed. Your implementation will determine the order in which columns are listed.  But all tuples and all columns should be present.

Note that the joins I provide and that you will compute are called "natural joins" -- that is, the join key for two tables is the column(s) that both tables have in common.  You should assume that there is only one column in common for every join you will encounter.

You must clearly explain and demonstrate that you have implemented the final parallel hash-join architecture in the lecture notes.  You should have substantial tests for each component that you write.  And you should have a substantial write-up of what you have done, how you have proceeded, and your organization.