/u/mooney/ir-code/ir/vsr/. See the Javadoc for this system. Use the main method for InvertedIndex to index a set of documents and then process queries.
You can use the web pages in
/u/mooney/ir-code/corpora/curlie-science/ as a set of test documents.
This corpus contains 900 pages, 300 random samples each from the Curlie indices
and chemistry. You can
also use a corpus of UTCS department faculty webpages in
/u/mooney/ir-code/corpora/cs-faculty/. This corpus contains 800 pages
spidered from the department web site. See
the sample trace of using the system on these
For example, using the corpus of UTCS department faculty webpages, most of the results for the query "software design" do not contain the phrase, only the word "software". For the query "information technology", most of the results contain the two separate words a number of times but not the actual phrase. Using the corpus of science-related documents from Curlie, for the queries "life science" and "high energy", most of the top results likewise contain the two separate words a number of times but not the actual phrase.
In some situations, cosine similarity can prefer documents that contain a high density of some of the query words at the expense of completely ignoring other query words. In addition, cosine similarity never considers multi-word phrases or the proximity or ordering of words.
Appropriate retrieval for many such queries can be aided by noticing that certain phrases such as "information technology", "software design", "life science" and "high energy" are important as multi-word phrases and are not well represented by a bag of words.
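The effect described above can be reproduced with a small, self-contained sketch. It is an illustration only: the class BagOfWordsCosine and its methods are hypothetical names, it uses raw term counts with no IDF weighting or stop-word removal, and it is not the course's VSR code. It shows that a document dense in one query word can outscore a document containing the whole phrase, and that word order has no effect on the score.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical demo class: cosine similarity over raw term counts (no IDF). */
final class BagOfWordsCosine {

    /** Builds a term-count vector from whitespace-separated text. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Euclidean length of a term-count vector. */
    static double norm(Map<String, Integer> v) {
        double sumOfSquares = 0.0;
        for (int count : v.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Cosine of the angle between two term-count vectors; 0 if either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double lengths = norm(a) * norm(b);
        if (lengths == 0.0) {
            return 0.0;
        }
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        return dot / lengths;
    }

    public static void main(String[] args) {
        Map<String, Integer> query = termFrequencies("software design");
        // Contains the full phrase once, among many other words.
        String withPhrase = "software design is covered in week three lecture notes";
        // Never mentions "design", but is dense in "software".
        String dense = "software software software tools";
        System.out.println("phrase doc: " + cosine(query, termFrequencies(withPhrase)));
        System.out.println("dense doc:  " + cosine(query, termFrequencies(dense)));
        // Reordering the words of a document leaves its score unchanged.
        System.out.println("in order:   " + cosine(query, termFrequencies("software design notes")));
        System.out.println("reordered:  " + cosine(query, termFrequencies("design notes software")));
    }
}
```

In this toy setup the dense document scores higher than the document that actually contains the phrase, which is the behavior that phrase indexing is meant to correct.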
A simple statistical approach to discovering useful phrases is to look
for frequently occurring sequences of words. In a first pass through the
corpus, your system should find all two-word phrases (so-called
"bigrams") and determine the frequency of each bigram across the entire corpus.
Treat a bigram as two indexed tokens produced in sequence by the current
Document token generator; bigrams therefore do not include stop words. After
finding all bigrams, your program should determine the set of most frequent
bigrams and store them as known phrases. Your system should have a parameter,
maxPhrases, that determines the maximum number of phrases
to be remembered (which should default to a value of 1,000). You may find the
Java sorting methods
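
The first pass described above can be sketched as follows. This is a minimal illustration, not the reference solution: the class BigramPhraseFinder and its method names are hypothetical, each document is assumed to arrive as a list of tokens already filtered by the document tokenizer (so stop words are gone), and breaking frequency ties alphabetically is an arbitrary choice made only to keep the output deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: counts bigrams and keeps the most frequent ones. */
final class BigramPhraseFinder {

    /** Counts adjacent token pairs; pairs never span a document boundary. */
    static Map<String, Integer> countBigrams(List<List<String>> documents) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> tokens : documents) {
            for (int i = 0; i + 1 < tokens.size(); i++) {
                String bigram = tokens.get(i) + " " + tokens.get(i + 1);
                counts.merge(bigram, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns at most maxPhrases bigrams, most frequent first (ties: alphabetical). */
    static Set<String> mostFrequent(Map<String, Integer> counts, int maxPhrases) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Set<String> phrases = new LinkedHashSet<>();
        for (int i = 0; i < entries.size() && i < maxPhrases; i++) {
            phrases.add(entries.get(i).getKey());
        }
        return phrases;
    }

    public static void main(String[] args) {
        List<List<String>> documents = Arrays.asList(
                Arrays.asList("information", "technology", "service"),
                Arrays.asList("information", "technology", "research"),
                Arrays.asList("research", "information"));
        Map<String, Integer> counts = countBigrams(documents);
        int maxPhrases = 2; // the assignment's default would be 1,000
        for (String phrase : mostFrequent(counts, maxPhrases)) {
            System.out.println(phrase + " " + counts.get(phrase));
        }
    }
}
```

Sorting every bigram is the simplest approach. For a large corpus, a bounded priority queue of size maxPhrases would avoid sorting the full list, at the cost of slightly more code.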
Then, when producing the vector representations of documents and queries, your system should notice instances of the known phrases (two tokens generated in order) and create a single token for the entire phrase, with no tokens for the individual words. For example, the query "information technology" should result in a vector containing the single phrasal token "information technology" and neither of the individual tokens "information" and "technology".
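One way to picture this second step is the sketch below. It is a sketch under stated assumptions rather than the required design: the class PhraseTokenizer and the method mergePhrases are hypothetical names, tokens are assumed to arrive as an already filtered list, and the scan is greedy from left to right, so when two known phrases overlap (as in "high energy physics") the earlier one wins. The assignment text does not say how overlaps should be resolved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: replaces known two-token phrases with one phrasal token. */
final class PhraseTokenizer {

    /**
     * Scans tokens left to right. When a token and its successor form a known
     * phrase, emits the phrase as one token and skips both words; otherwise
     * emits the single token unchanged.
     */
    static List<String> mergePhrases(List<String> tokens, Set<String> knownPhrases) {
        List<String> merged = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (i + 1 < tokens.size()) {
                String candidate = tokens.get(i) + " " + tokens.get(i + 1);
                if (knownPhrases.contains(candidate)) {
                    merged.add(candidate);
                    i += 2; // both words are consumed by the phrase
                    continue;
                }
            }
            merged.add(tokens.get(i));
            i++;
        }
        return merged;
    }

    public static void main(String[] args) {
        Set<String> knownPhrases = new HashSet<>(
                Arrays.asList("information technology", "high energy"));
        System.out.println(mergePhrases(
                Arrays.asList("information", "technology", "department"), knownPhrases));
        System.out.println(mergePhrases(
                Arrays.asList("energy", "high", "physics"), knownPhrases));
    }
}
```

Because the individual words are dropped whenever a phrase matches, the same merging has to be applied to both documents and queries; otherwise a phrasal query token would never match anything in the index.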
Here is a sample trace produced by my solution to this problem. After the first pass through the corpus, the system prints out the 1,000 most common phrases with their frequencies. You can verify that all of the retrieved documents now contain the complete two-word query phrases. Replicating the minute details of this trace is not important, but the trace for your system should be similar and should retrieve only documents that contain these complete common phrases. Your solution should be a general-purpose phrase indexer, not a hack that works only with these specific queries.
Implement your new version as a specialized class of InvertedIndex called
InvertedPhraseIndex that accepts the same command-line options as
InvertedIndex. You may also need to add methods to other
classes. In particular, my solution added methods to at least
In submitting your solution, follow the general course instructions on submitting projects on the course homepage. Note especially what the prefix for a file name is and the command used to generate the zip file in a way that maintains the required directory structure.
Along with that, follow these specific instructions for Project 1. The following files should be submitted.
The grading breakdown for this assignment is: