CS 372 Operating Systems
Lab 3 Programming Assignment:
Map Reduce Search Engine
/$ bin/hadoop jar search.jar Search hadoop export
Search Complete
3 relevant files found
Terms: hadoop, export,
hadoop-env.sh (20)
...export HADOOP_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HADOOP_CONF_DIR"...
...export HADOOP_LOG_DIR=${HADOOP_HOME}/logs...
...export HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_BALANCER_OPTS"...
...export HADOOP_TASKTRACKER_OPTS=...
hadoop-default.xml (5)
...Indicates how many times hadoop should attempt to contact the...
...The filesystem for Hadoop archives...
...tasks and applications using Hadoop Pipes, Hadoop Streaming etc....
...Should native hadoop libraries, if present, be used...
log4j.properties (1)
...# Sends counts of logging messages at different severity levels to Hadoop Metrics....
"operating system"
http://en.wikipedia.org/wiki/Operating_system (offset 2355)
http://www.webopedia.com/TERM/O/operating_system.html (offset 12)
...
"semaphore"
http://en.wikipedia.org/wiki/Semaphore_(programming)
http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/Semaphore.html
...
...
Hence, when we want to search for the term "operating system", rather than reading all possible files looking for the term, we simply go to the index, see that it occurs in a number of indexed files at certain offsets from the beginning of the file, and we can return a result for the search very quickly. Of course, searching using this technique relies heavily on the ability to maintain an accurate index, so search engines have (at least) 2 main components: a component which is always indexing the web (or other searchable body) in the background, and a component which consults that index in response to search queries from users.
Another difficult problem for search engines is returning results that are actually relevant to the user's query. For example, if we search for "space pigs walking tall", we may be searching the exact phrase, or we may be searching for documents that contain some subset of these terms. In fact there may be no single document that contains all of them, so deciding whether documents about "space walking" are more relevant than documents about "tall pigs" is a difficult challenge. Perhaps we are interested in documents that contain "pig" and "walk". Consequently, search engines use elaborate heuristics (for example, Google's PageRank algorithm) to attempt to maximize the relevance of search results. These heuristics can directly influence the design of the search index.
In this lab you will develop 5 Map/Reduce algorithms that address the problems of indexing, relevance, searching, and summarizing results. The general structure of our search engine is shown below.
The input files represent the body of files that our search engine searches. The search engine works by creating two different kinds of indexes. Note that these indexes only need to be created once (although you will likely need to create them many times during debugging!). After the indexes are built the search query can operate directly on them.
term file-name_0|offset_0| summary_line_0^A term is a searchable word such as "operating", and is followed by a list of files, offsets, and copies of the line on which the term occurred. In this example, shown from our implementation,
file-name_1|offset_1| summary_line_1^
...
file-name_n|offset_n| summary_line_n^
volume hadoop-default.xml|16384| Reserved space in bytes per volume. Always leave this much spacethe term "volume" occurs just once in the file "hadoop-default.xml" at byte offset 16484, while the term "waits" occurs in two places in that same file. The summary index is used by the Presenter component of the search engine when writing the final search results to provide some summary context around the search terms found: this is a simplified version of how Internet search engines summarize the content of files when presenting search results.
waits hadoop-default.xml|24576| The time, in milliseconds, the tasktracker waits for sending a^
hadoop-default.xml|24576| The interval, in milliseconds, for which the tasktracker waits
term1 term2 ... termN|file-name countA group of N terms, separated by spaces is followed by a list of files in which that N-gram occurs, along with a count of the the number of times the n-gram occured in that file. In the example below:
waits |hadoop-default.xml 2The 1-gram "waits" occurs twice in the file "hadoop-default.xml" (which we know to be true in the summary index above as well), and the 3-gram "to sleep between" occurs once in "hadoop-default.xml" and once in "hadoop-env.sh". NOTE:To simplify the problem, we only count N-grams that do not cross line breaks in the input file. The N-gram index is used by the search component of the search engine to generate a list of candidate search matches, along with a relevance score.
to sleep between|hadoop-default.xml 1
to sleep between|hadoop-env.sh 1
This jar file contains the Map Reduce infrastructure, a skeleton search engine program, and a searchable corpus of files. The skeleton search engine contains a version of the n-gram indexer that works for single words (1-grams) only, which you may find helpful for pattern-matching to develop the rest of your map-reduce algorithms.
To extract the lab3.jar, and create the search corpus.
$> jar -xvf lab3.jar
$> gunzip corpus.tar.gz
$> tar -xvf corpus.tar
To build the lab from the command line:
First off, let us make sure you are running the bash shell. The shell is the name for the program you run that interprets command lines. Bash is the official GNU shell, so lots of people use it. To run it on the CS machines type /usr/local/bin/bash. When the shell starts it will read the commands in the .bashrc file located in your home directory. Your home directory is the path where the system places your shell when you log in. You can get there by typing cd with no argument. If you edit .bashrc after starting the shell, it won't reread the file. The environment variables you set for a shell only stay in effect for as long as the shell runs. If you exit a shell, all its environment variables go away.
First, make sure you've got the HADOOP_HOME environment variable defined. If you are working on a linux CS machine, you can either edit the .bashrc file in your home directory to contain the line:
HADOOP_HOME=/lusr/hadoop
Or, at the command line you can type
$> export HADOOP_HOME=/lusr/hadoop
This should get you to use the CS installation of hadoop, which we recommend. If you are working on windows or some other system, you need to download Hadoop and build it. Wherever you put it, you need to set HADOOP_HOME to that path name.
Once your shell can find HADOOP, then building the project is a simple matter of typing: $> make
You might need to set JAVA_HOME=/lusr/java6
Depending on the memory your machine has and how Hadoop is configured, you might need to set this environment variable _JAVA_OPTIONS="-Xms128m -Xmx1700m"
To invoke the search engine, navigate to the hadoop base directory and type the following at the command line:
make run
The maximum number of searchable terms is 10 by default, but this
can be set by changing Search.MAX_SEARCH_TERMS.
To use eclipse as your development environment, you need to create a project "from existing source" after you extract the jar file, the same way you did for the previous projects. However, an additional step is required to put the Hadoop libraries on your build path. To do this, you want to select "Properties" from the project menu, and choose the "Java Build Path" item in the left pane of the properties dialog. Next, select the "Add External JARS..." button, and add the following libraries. The paths shown are for CS machines:
If you are working with your own installation of hadoop, replace "/lusr/hadoop" above with whatever directory your $HADOOP_HOME variable maps to. Note that this will only get you to the point where you can build and edit your project in eclipse. You still need to run the project from the command line. To do that, read the next section!
To debug your search engine in eclipse, you need to jump through a few more hoops to make eclipse and hadoop work together. Under your Run menu, choose "Open Debug Dialog", and choose a new "Java Application" when creating a configuration. These steps should be exactly the same as for your previous projects. After this do the following:
Alternatively, you can work using eclipse with the IBM alphaWorks hadoop plug-in, which will set up these things for you, but which may introduce a number of other complications!
Search.MAX_SEARCH_TERMS
defined in Search.java. There will be a separate directory under your
HDFS directory / ngram_index for each N. For example, the
bigram index is stored in ngram_index/2-grams. The
skeleton
implementation we provide implements a correct indexer for
single word indexing (1-grams). You should expand this
implementation to make it possible to index N-grams. Additionally, you
can use this to pattern match and understand how to implement your
other mapper/reducer classes. You should implement this first,
since you have some
sample implementation to start from. summary_index.
Search.run() method,
and you are responsible for filling in some of this logic. See comments
in Search.java. Search.SHOW_RESULTS_MAX_FILES
in Search.java. The winnower also sorts the output files selected by
relevance score. RelevantFile object),
and uses the summary index to generate the final output of the search
which contains a list of relevant files, relevance scores, and line
summaries where search terms or N-grams occur. The number of summaries
shown should not exceed Presenter.MAX_RESULTS_PER_FILE.
An example of the search engine output format is shown above. Utils.termify. If you call this method on
your search terms and terms from your searchable text, you should not
have to worry about the details of what is a legal search term. In
general, the termify method removes punctuation, changes everything to
lower-case, and requires that a searchable term either start with a
letter (a-z) or be a number. We encourage those who are interested in
the nitty-gritty details to look at the Utils.termify() method, since
that is the defacto standard of what is a legal search term for this
project.
System.out.println become complicated by the fact
that such output may not be available at the command line where you
invoked the program. System.out.println
statements go to the console where you invoked the command line. The
hadoop server running the CS machines will start by default in this
mode, when you run your code, so System.out.println will work. You can
also use Utils.DEBUG(), and Utils.info(). Also, note that because the search wants to work directly on the
indexes produced by your indexer, it checks to see if they are present
in the HDFS, and if so, skips the step of building them (that is, after
all, the point of an index). However, as you debug you may find that
you need to rebuild an existing index. You can ensure that this happens
by either setting the rebuild_index property in system.properties
to true, or by using the Cleaner class described in the next section to
delete the index.
ls at the command line, you need to go
through hadoop by typing bin/hadoop fs -ls at the command
line. Similarly, to view the file of your search results, instead of
using cat search_results, you need to use bin/hadoop
fs
-cat search_results/part-00000 . This takes some getting used
to. In the final version of the program all input and output goes to
the console, but you may find it helpful for debugging to be able to
view files of intermediate results and for indexes.
However, When you are working on a CS machine with your hadoop server running in a standalone debug mode (the default), you can look directly at your output without going through HDFS. Your intermediate and temporary files will be in /tmp/cs372-username/ and your search results will be in your working directory in search_results/part-00000.
HDFS Cleaner.java . This
class can be used to remove temporary files and indexes.
To remove all temporary and results files used by the search
engine, type:
$HADOOP_HOME/bin/hadoop jar search.jar Cleaner .
To remove all temporary, results, and index files, type:
. $HADOOP_HOME/bin/hadoop jar search.jar Cleaner -a
You can also just use the makefile to do this by typing: 'make clean_index'.
We have provided sample correct output for the default Makefile (make run) settings in the files macbeth*.txt, which will get extracted when you unpack the jar file. The macbeth_winnow_results.txt and macbeth_search_results.txt files do not need to match exactly, and are provided to help you get a sense of what the output of intermediate search phases look like.
The following diff should yield no differences:
>
myresults.txt > make run
> diff macbeth_presentation_results myresults.txt
Here is the output for 1-grams.
As with the previous labs, we will be testing your solution on other inputs, so you will need to spend some time devising other test cases and verfying that your solution works for these cases!
A good test case is text that has a repeated word. A search for "sponge sponge" should not double the relevance of the word "sponge" in the results. However, a search for "sponge sponge" should find any files that have that 2-gram in it, while a search for sponge should not. The point here is that scoring for individual word searches should make sure not to search for the same word more than once, but it is legal for an n-gram to have a repeated word.
You might want to do SummarizeMapper after the indexer.
If you pair program, follow these additional guidelines.
In your README file, include the following:
Name
Slip days used (this project): _______ Slip days used (total): ______
On my honor, name, this programming assignment is my own work.
...if applicable...
Pair Name
Slip days used (this project): _______ Slip days used (total): ______
On our honor, name and name, this programming assignment is our own
work. We spent 80% of our time on this project together, we split the
keyboard time evenly, and we both participated equally in the solution
design.
........
Total hours spent on this project:
To turn in the project, go to the work directory, which should contain the java files and README file. In that directory type the following:
%> make turnin
or you can do it "by hand" and type
%> turnin --submit naga86 lab3 *.java README
Useful Links: