Project 1
CS 371R: Information Retrieval and Web Search
Adding Common Phrases to Vector-Space Retrieval


Due: Sep. 23, 2020, 11:59pm.

Existing System

As discussed in class, a basic system for vector-space retrieval (VSR) is available in /u/mooney/ir-code/ir/vsr/. See the Javadoc for this system. Use the main method for InvertedIndex to index a set of documents and then process queries.

You can use the web pages in /u/mooney/ir-code/corpora/curlie-science/ as a set of test documents. This corpus contains 900 pages, 300 random samples each from the Curlie indices for biology, physics, and chemistry. You can also use a corpus of UTCS department faculty webpages in /u/mooney/ir-code/corpora/cs-faculty/. This corpus contains 800 pages spidered from the department web site. See the sample trace of using the system on these to corpora.

Problem

One of the problems with VSR is that is does not handle multi-word phrases well. With a bag-of-words model, only individual words matter, regardless of their position and context. Therefore, queries that use specific phrases whose individual words do not adequately capture the meaning of the overall phrase, are not handled well.

For example, using the corpus of UTCS department faculty webpages, for the query "software design" most of the results do not contain the phrase, but simply the word "software". For the query "information technology," most of the results contain the two separate words a number of times but not the actual relevant phrase. Using the corpus of science-related documents from Curlie, for the queries "life science" and "high energy", most of the top results contain the two separate words a number of times but not the actual relevant phrase.

In some situations, cosine similarity can prefer documents that contain a high density of some of the query words at the expense of completely ignoring other query words. In addition, cosine similarity never considers multi-word phrases or the proximity or ordering of words.

Appropriate retrieval for many such queries can be aided by noticing that certain phrases such as "information technology", "software design", "life science" and "high energy" are important as multi-word phrases and are not well represented by a bag of words.

You Fix It

Your task is to change the existing VSR code to first find common two-word phrases that appear in a document corpus, recognize occurrences of these phrases in documents, index these documents using the entire phrase, and finally recognize these phrases in queries and use them in retrieval. Your approach should be general-purpose and should produce better results for the examples in the sample trace.

A simple statistical approach to discovering useful phrases is to simply look for frequently occuring sequences of words. In a first pass through the corpus, your system should find all two-word phrases in the corpus (so called "bigrams") and determine the frequency of each bigram across the entire corpus. Consider bigrams as two indexed tokens produced in sequence by the current Document token generator, therefore, they do not include stop words. After finding all bigrams, your program should determine the set of most frequent bigrams and store them as known phrases. Your system should have a parameter, called maxPhrases, that determines the maximum number of phrases to be remembered (which should default to a value of 1,000). You may find the Java sorting methods Arrays.sort or Collections.sort useful.

Then, when producing the vector representations of documents and queries, it should notice instances of the known phrases (two tokens generated in order), and create a single token for the entire phrase but not tokens for the individual words. For example, the query "information technology" should result in a vector containing a single phrasal token "information technology" that does not include the individual tokens "information" and "technology".

Here is a sample solution trace produced by my solution to this problem. After the first pass through the corpus, the system prints out the 1,000 most-common phrases with their frequency. You can verify that all of the retrieved documents now contain the complete two-word query phrases. Replicating the minute details of this trace is not important, but the trace for your system should be similar and only retrieve documents that contain these complete common phrases. Your solution should obviously be a general purpose phrase-indexer and not just a hack that works with these specific queries.

Implement your new version as a specialized class of InvertedIndex called InvertedPhraseIndex that accepts the same command line options as InvertedIndex. You may also need to add methods to other classes. In particular, my solution added methods to at least Document and HashMapVector.

Submission

In submitting your solution, follow the general course instructions on submitting projects on the course homepage. Note especially, what the prefix for a file name is, and the command used to generate the zip file in a way that maintains the directory structure required.

Along with that, follow these specific instructions for Project 1. The following files should be submitted.

  1. [PREFIX]_code.zip zip file of the ir directory with /vsr subfolder that contains all the classes you have made changes to. Submit both .java and .class files.
  2. [PREFIX]_soln_trace.txt The trace file of program execution.
  3. [PREFIX]_report.pdf A short document (1-2 pages) clearly describing in well-written English prose, the approach taken to the assignment, the general algorithm employed, clear instructions on running the system, and insightful discussion of any experimental results obtained. In pdf format.

The grading breakdown for this assignment is: