Project 1 for CS 371R: Information Retrieval and Web Search
Adding Proximity Preference to Vector-Space Retrieval

Due: Feb. 14, 2008


Existing System

As discussed in class, a basic system for vector-space retrieval (VSR) is available in /u/mooney/ir-code/ir/vsr/. See the Javadoc for this system. Use the main method for InvertedIndex to index a set of documents and then process queries.

You can use the web pages in /u/mooney/ir-code/corpora/yahoo-science/ as a set of test documents. This corpus contains 900 pages, 300 random samples each from the Yahoo indices for biology, physics, and chemistry.

See the sample trace of using the system.

Problem

One of the problems with VSR is that it does not consider the proximity of the query words in the document. With a bag-of-words model, the location of words in the document is irrelevant, only their frequency matters. Therefore, multi-word queries where the specific combination of words indicates a specific concept only when occuring close together, are not handled very well.

For example, in the sample trace, for the first query "computer science", the top retrieved documents do not contain the phrase "computer science". The first very short document only contains the word "computer" and not "science". The second and third contain both words but far apart. The fourth only contains "science" (a lot) but not "computer". The fifth contains "computer" but not "science". Finally, the sixth result has the phrase "computer science" and discusses bioinformatics.

The next sample query "information theory" has similar problems. Not until the fifth result does the phrase "information theory" actually appear.

In the third query, "einstein rosen" I intended it as a query on a specific issue in quantum mechanics on which these two physicists wrote an article (actually the paper is by Einstein-Podolsky-Rosen and sometimes called the "EPR effect"). The first 11 documents are just "Einstein" documents and do not even contain the name "Rosen". Finally, the 12th and 13th documents are actually relevant, but "Einstein" and "Rosen" are separated by a few words.

In the fourth query, "human genome", the first result is relevant, but the second is about other genomes and does not even have the word "human".

The fifth query "computer information science" is an example of a three-word query that has similar problems to the first two examples.

You Fix It

Your task is to change the existing VSR code to add a proximity component to the final document scoring function. For multi-word queries, documents in which the words in the query appear closer together (although not necessarily directly adjacent) and in the same order as in the query, should be preferred. Your approach should be general-purpose and should produce better results for the examples in the sample trace.

Here is a sample solution trace produced by my solution to this problem. Note that the top documents now contain all of the query words, close together and in the correct order.

In addition to the normal cosine-similarity metric, I calculated a specific proximity score for each retrieved document that measured how far apart the query words appeared in the document. The final score was the ratio of the vector score and the proximity score (both components are shown in the trace). The proximity score was computed to be the closest distance in the document (measured in number of words, excluding stop words) that a query word appeared from another query word averaged across all pairs of words in the query. A multiplicative penalty factor was included in the distance metric when a pair of words appeared in the reverse order from that in the query. This is only a sketch of what I did, many details are omitted.

You do not have to adopt this exact approach. Feel free to be creative. However, your solution should be general-purpose (not hacked to the specific test queries), address the fundamental issue of proximity, and produce similarly improved results for the sample queries. Note that you may need to change many of the fundmental classes and methods in the code to extract and store information on the position of tokens in documents. When making changes, try to add new methods and classes rather than changing existing ones. The final system should support both the original approach and the new proximity-enhanced one (e.g. I created a specialization of InvertedIndex called InvertedProxIndex for the new verison). Hint: I found it useful to use the Java Arrays.binarySearch method to efficently find the closest position of a token to the occurence of another token given a sorted array of token positions.

In submitting your solution, follow the general course instructions on submitting projects on the course homepage.