In this assignment, you will implement the Gamma Join architecture within a single Java program using multiple threads and Java pipes. Hint -- read this entire assignment before you proceed!
Warning: this is a substantial programming assignment. It will take you several days to finish. Do not delay in getting started. This programming assignment will count 3 times a typical programming assignment, so take it seriously. You may work alone or in groups of two. Brownie points will be given to those who submit their assignments early.
I am giving you the source of two packages:
basicConnector -- contains files that will return piped-connectors to you, and from which you can extract read and write ends of pipes. Study this source code to understand how and why it works.
Connector c = new Connector("A_to_B");
ReadEnd re = c.getReadEnd();
WriteEnd we = c.getWriteEnd();
gammaSupport -- contains definition of tuples and common operations and other utilities that you will need. Study this source code to understand what it provides you. Please note: there is a bug in Java Pipes. If you serialized tuples and pushed them through pipes, your program will fail. (I have no idea why). If tuples are serialized as Java Strings, there is no problem. The Tuple class of gammaSupport provides an appropriate String serialization method for you to use.
Carefully read and understand the BMap class. It implements tuple hashing and manipulates bitmaps. There are many ways in which MMERGE and MSPLIT can be realized. The simplest is this: M is a nxk bitmap. The join key of an A tuple is hashed twice: once to determine the row of M, the second to determine the column within the selected row. Thus, all tuples of substream Ai hash to row i of M. MMERGE combines M1...Mn into M by boolean disjunction. For each i, MSPLIT extracts row i from M and zeros out the rest of Mi.
Now, carefully re-read the lecture notes on Gamma and how its design was incrementally developed. I suggest that you:
start by implementing a ReadRelation component that reads a text file and produces a stream of Strings that represent records. You'll reuse this component all the time in your tests. You will need to use Java's StringTokenizer to parse lines of input files (given below)
next, implement a Print component that prints a stream of tuples to standard out.
now, begin building components like HJOIN, BFILTER, etc. one at a time. Below I give you the core methods of a test for the HJOIN component:
public void join(String r1name, String r2name, int joinkey1, int joinkey2) throws Exception {
System.out.println( "Joining " + r1name + " with " + r2name );
ThreadList.init();
Connector c1 = new Connector("input1");
ReadRelation r1 = new ReadRelation(r1name, c1);
Connector c2 = new Connector("input2");
ReadRelation r2 = new ReadRelation(r2name, c2);
Connector o = new Connector("output");
HJoin hj = new HJoin(c1, c2, joinkey1, joinkey2, o);
Print p = new Print(o);
ThreadList.run(p);
}
// run 3 joins to verify that the HJOIN implementation is correct.
public void jointest() throws Exception {
Diff.initSortedTest("out.txt", "jointest.txt");
join("parts", "odetails", 0, 1);
join("client", "viewing", 0, 0);
join("orders", "odetails", 0, 0);
if (Diff.finishTest()) { fail("HJ test fails");}
}
Note: below is a pipe and filter diagram that the above code implements. Note that the diagram doesn't show the details of what file to read or the join key, but this information is conveyed directly in the calls above.

You will need tables to join. Here is a zip file containing them all. I provide several: client, viewing, orders, parts, odetails. I also provide tables that are produced by particular joins:
join client table with viewing table over key (CLIENTNO)
join orders table with odetails table over key (ONO)
join parts table with odetails table over key (PNO)
so that you can compare your output with correct output. These solutions are there for you to confirm that you are producing the correct output, not that this is the *format* that your output should be in. (Ex. joinkey columns may be duplicated). The order in which tuples are listed is irrelevant. The order in which columns are listed is irrelevant. Your implementation will determine the order in which tuples are listed. Your implementation will determine the order in which columns are listed. But all tuples and all columns should be present.
Note that the joins I provide and that you will compute are called "natural joins" -- that is, the join key for two tables is the column(s) that both tables have in common. You should assume that there is only one column in common for every join you will encounter.
You must clearly explain and demonstrate that you have implemented the final parallel hash-join architecture in the lecture notes. You should have substantial tests for each component that you write. And you should have a substantial write-up of what you have done, how you have proceeded, and your organization.