Gamma Joins
updates in red

In this assignment, you are to implement the Gamma Join architecture as a single Java program using multiple threads and Java pipes exactly as in the Gamma lecture notes – no short cuts! (Short cuts will be deduct from your grade).

Warning: this is a substantial assignment. It will take you several days to finish. Do not delay in getting started. This assignment will count two (2) times a typical programming assignment, so take it seriously.

Here is the Netbeans Shell for this project to get you started.

Part 1

Before you begin this project, I am supplying you the source code of a parallel sort. The refinement from sequential to parallel is:

Here is its Netbeans source code that uses Java pipes. Run the regression tests, see how the regression tests are set up and how each box above is coded. Only when you understand how this program works should you proceed any further with this assignment.

Note because of the way Java pipes work, one has to "hack" the OMerge operation by round-robin polling from each pipe. Ideally, one should transmit the next output received from any pipe, but this is not possible the way Java pipes work now. You will need to use this "hack" in other situations in this assignment.

Part 2

I am giving you the source of two packages:

basicConnector -- contains files that will return piped-connectors to you, and from which you can extract read and write ends of pipes. Study the this source code to understand how and why it works. Do not proceed with the next package until you do so.

Connector c = new Connector("A_to_B");

ReadEnd re = c.getReadEnd();

WriteEnd we = c.getWriteEnd();

gammaSupport -- contains definitions of tuples and common operations and other utilities that you will need. Study this source code to understand what it provides you; ask before you reimplement. Very little of MDELite8 is used in this assignment -- in checking I believe only the Tuple and TableSchema classes are used.

Note: there is a bug in Java Pipes. If you serialize tuples and pushed them through pipes, your program will fail. (I have no idea why). If tuples are serialized as Java Strings, there is no problem. The Tuple class of gammaSupport provides an appropriate String serialization method for you to use.

Note: Carefully read and understand the BMap class. It implements tuple hashing and manipulates bitmaps. There are many ways in which MMERGE and MSPLIT can be realized. The simplest is this: M is a n*k bitmap. The join key of an A tuple is hashed twice: once to determine the row of M, the second to determine the column within the selected row. Thus, all tuples of substream A_i hash to row i of M. MMERGE combines M₁...M_n into M by boolean disjunction. For each i, MSPLIT extracts row i from M and zeros out the rest of M_i.

Note:Use/import the ParallelSort RegTest package in your project along with basicConnector and gammaSupport. The reason is that if there is an error in your code, you can use a debugger to see what arguments you are using (or RegTest knows that you are using) to discover your problem.

Note: you'll waste a lot of time and write some ugly code if you do not read and understand all of the above code before you proceed. Part of your grade is the ability to read and leverage existing code rather than hacking your own.

Now, carefully re-read the lecture notes on Gamma and how its design was incrementally developed. I suggest that you:

1. Start by implementing a ReadTable component that reads a text file and produces a stream of Strings that represent records. You'll reuse this component all the time in your tests. You will need to use Java's StringTokenizer to parse lines of input files (given below).

2. Next, implement a Print component that prints a stream of tuples to standard out.

3. Now, begin building components like HJOIN, BFILTER, etc. one at a time. Below I give you the core methods of a test for the HJOIN component:

public void join(String r1name, String r2name, int jk1, int jk2) throws Exception {

System.out.println( "Joining " + r1name + " with " + r2name );

ThreadList.init();

Connector c1 = new Connector("input1");

ReadTable r1 = new ReadTable(r1name, c1);

Connector c2 = new Connector("input2");

ReadTable r2 = new ReadTable(r2name, c2);

Connector o = new Connector("output");

HJoin hj = new HJoin(c1, c2, jk1, jk2, o);

Print p = new Print(o);

ThreadList.run(p);

}

// run 3 joins to verify that the HJOIN implementation is correct.

// Utility is from RegTest.jar

@Test

public void jointest() throws Exception {

Utility.redirectStdOut("out.txt");

join("parts", "odetails", 0, 1);

join("client", "viewing", 0, 0);

join("orders", "odetails", 0, 0);

Utility.validate("out.txt", "CorrectOutput/jointest.txt",true);

}

Note: below is a pipe and filter diagram that the above code implements. Note that the diagram doesn't show the details of what file to read or the join key; this information is conveyed directly as argument sin the calls above.

Note on Thread Constructors: The last line in each constructor (ReadTable, HJoin, Print, etc) represents a primitive computation that is implemented as a thread, the last of its constructor is ThreadList.add(this); --- that is, each created thread puts itself on the thread list, so that when ThreadList.run(p) is executed, all threads that are created will be run. The reason I did this is that you always have to place each created thread on the ThreadList. And since it is easy to forget to do this, put this action into the constructor so that you never will forget as it is done for you. FYI, this is another example of a leaking constructor.

You will need tables to join. I provide a client view database which has several tables (which I have unpacked for you): client, viewing, orders, parts, odetails. I also provide tables that are produced by particular joins:

join client table with viewing table over key (CLIENTNO)
join orders table with odetails table over key (ONO)
join parts table with odetails table over key (PNO) joined with orders table over key ONO.

so that you can compare your output with correct output. These solutions are there for you to confirm that you are producing the correct output, not that this is the format that your output should be in. (Ex. joinkey columns may be duplicated). You are free to permute the order of rows and columns -- but in your regression tests, you must be able to verify that you are producing the correct output (whatever it is that you decide).

Note: the joins I provide and that you will compute are called "natural joins" -- that is, the join key for two tables is the column(s) that both tables have in common. You should assume that there is only one column in common for every join you will encounter.

Note: I suggest that you use these "correct" files to get started. When you use the RegTest code, you may discover that you are outputting the correct result, except for some strange invisible characters at the end of the file. When you see that you are in this situation, simply copy your generated output (which you know to be correct, except for the end-of-line or end-of-file strangeness) and simply make it the "correct" output from this point on.

You must clearly explain and demonstrate that you have implemented the final parallel hash-join architecture in the lecture notes. You should have substantial tests for each component that you write. And you should have a substantial write-up of what you have done, how you have proceeded, and your organization.

Implementation Hints

The first thing is that you don't write your own file read and file write methods for reading relational tables. You use MDELite. Further, I suggest two main class hierarchies. Here's the first -- every terminal class is a primitive, whose object is a thread. See note above on thread constructors.

The meaning of these classes is below:

Bloom -- computes a bloom filter, as in notes. Observe the icon in the middle of the class: the box has a record stream input (on the left) and a record stream output on the right, and also a bitmap output (red).
BFilter -- applies bloom filter, as in notes. The icon in the middle of the class says that this box takes a bitmap (red) and stream (black) as input and produced a record stream as output (on the right). This class was now added to GammaMDLShell.zip -- it was inadvertently not included before. Here's the BFilter.java replacement.
BloomSimulator -- to test a Bloom box, I needed to create a box that supplies a bloom filter; that's this box.
DoNothing -- relays input messages to output messages. Initially useful to see if pipelines are working.
HJoin -- this box encodes the hashjoin algorithm.
Sink -- this box is useful in testing; it simply reads its inputs and does nothing with them.
ReadTable -- this box has 2 parameters: name of file to read and the end of a pipe in which to send Tuples of the file. You should use the method Tuple.makeTupleFromFileData.
Print -- prints each record that it receives.
PrintMap -- reads a single bit map and prints it. Useful for debugging.
HSplit -- reads in a record stream and hash-splits its contents across 4 streams (numbered 0..3). The icon suggests only 3 streams, but you should use 4.
Merge -- reads 4 streams in round-robin order to merge their results into a single output stream.
SplitM -- splits a bitmap into 4 distinct bitmaps (inverse of MergeM).
MergeM -- merges 4 distinct bitmaps into a single bitmap.

Note: The boxes in green above I have given you (I think) the full source. So check them out!

The following boxes/classes implement pipe-and-filter graphs that implement the functionality of the primitive boxes above (and thus have the same iconic "shape"). They are not part of the Thread hierarchy as only the building blocks above are implemented as threads.

These boxes implement various non-primitive implementations of the HJoin box/primitive. In the notes, one implementation map-reduced HJoin, another implementation implemented HJoin using BFilter and Bloom boxes wired together, and finally the entire Gamma graph. The terminal boxes are subclasses of Gamma.ArrayConnectors, which provides useful ways of creating arrays of pipes in one call. Not all methods of Gamma.ArrayConnectors may be needed for this project; all are fairly simple.

Extra Credit

I have implemented some 3-way joins using array cross bars, array print, and array Read Table, which you will find in the project source and regression tests. You will get extra credit if you can show me the DxT derivation of this code. Here's a picture of a 3-way join (relation X joined with relation Y joined with relation Z). If you substitute the Gamma design for HJOIN into both boxes below, you'll see what I mean.

What to Submit via Canvas

Submit a single zip file to Canvas.

The zip file should unzip into <yourName>/<yourFilesAndDirectories> (I will return submissions that do not satisfy this constraint). It should contain the following:
Your Netbeans/Eclipse project
An idiot-proof bash-script called run.script that will run your implementation of Gamma and show me what you've done.
A .pdf document that explains your design, program, and anything I need to know to evaluate what you have done. If no PDF document is submitted, I will return your assignment with no grade
If you choose to try the extra credit, problem, put it in a separate PDF for me to look at.

A critical part of any design is clarity and understandability. Hence, you will be graded on the clarity of your project and its ability to work correctly. Sloppy code, documentation, or anything that makes grading or understanding your program difficult will cost you points. Beware, some of these "beauty" points are subjective.

Remember: No late assignments/submissions will be accepted.

Gamma Joins updates in red