Project 1
CS 371R: Information Retrieval and Web Search
Adding Common Phrases to Vector-Space Retrieval


Due: Sept. 26, 2013

Existing System

As discussed in class, a basic system for vector-space retrieval (VSR) is available in /u/mooney/ir-code/ir/vsr/. See the Javadoc for this system. Use the main method for InvertedIndex to index a set of documents and then process queries.

You can use the web pages in /u/mooney/ir-code/corpora/yahoo-science/ as a set of test documents. This corpus contains 900 pages, 300 random samples each from the Yahoo indices for biology, physics, and chemistry. See the sample trace of using the system.

Problem

One of the problems with VSR is that is does not handle multi-word phrases well. With a bag-of-words model, only individual words matter, regardless of their position and context. Therefore, queries that use specific phrases whose individual words do not adequately capture the meaning of the overall phrase, are not handled well.

For example, in the sample trace, for the query "cold fusion", the first two results contain the phrase but results 3 thru 5 do not, they do not even contain the word "cold". For the query "quantum mechanics", the 1st result contains the phrase but the 2nd result is about fluid mechanics and does not even contain the word "quantum," while the 3rd and 4th results are about quantum computing and do not contain the word "mechanics". For the query "chemical physics" the 1st result is relevant but the 2nd result only has "physics", the 3rd just "chemical", the 4th thru 6th just "physics". In some situations, cosine similarity can prefer documents that contain a high density of some of the query words at the expense of completely ignoring other query words. In addition, cosine similarity never considers multi-word phrases or the proximity or ordering of words.

Appropriate retrieval for many such queries can be aided by noticing that certain phrases such as "cold fusion," "quantum mechanics," and "chemical physics" are important as multi-word phrases and are not well represented by a bag of words.

You Fix It

Your task is to change the existing VSR code to first find common two-word phrases that appear in a document corpus, recognize occurrences of these phrases in documents, index these documents using the entire phrase, and finally recognize these phrases in queries and use them in retrieval. Your approach should be general-purpose and should produce better results for the examples in the sample trace.

A simple statistical approach to discovering useful phrases is to simply look for frequently occuring sequences of words. In a first pass through the corpus, your system should find all two-word phrases in the corpus (so called "bigrams") and determine the frequency of each bigram across the entire corpus. Consider bigrams as two indexed tokens produced in sequence by the current Document token generator, therefore, they do not include stop words. After finding all bigrams, your program should determine the set of most frequent bigrams and store them as known phrases. Your system should have a parameter, called maxPhrases, that determines the maximum number of phrases to be remembered (which should default to a value of 1,000). You may find the Java sorting methods Arrays.sort or Collections.sort useful.

Then, when producing the vector representations of documents and queries, it should notice instances of the known phrases (two tokens generated in order), and create a single token for the entire phrase but not tokens for the individual words. For example, the query "cold fusion" should result in a vector containing a single phrasal token "cold fusion" that does not include the individual tokens "cold" and "fusion".

Here is a sample solution trace produced by my solution to this problem. After the first pass through the corpus, the system prints out the 1,000 most-common phrases with their frequency. Notice that, although, HTMLFileDocument does its best to parse the HTML and remove formatting commands, a number of the most-common phrases are actually HTML commands that are still indexed. You can verify that all of the retrieved documents now contain the complete two-word query phrases. Replicating the minute details of this trace is not important, but the trace for your system should be similar and only retrieve documents that contain these complete common phrases. Your solution should be a general purpose phrase-indexer and not just a hack that works with these specific queries.

Implement your new version as a specialized class of InvertedIndex called InvertedPhraseIndex that accepts the same command line options as InvertedIndex. You may also need to add methods to other classes. In particular, my solution added methods to at least Document and HashMapVector.

Submission

In submitting your solution, follow the general course instructions on submitting projects on the course homepage.

Along with that, follow these specific instructions for Project 1: