Project 1 for CS 371R: Information Retrieval and Web Search
Adding Proximity Preference to Vector-Space Retrieval

Due: September 27, 2021 at 11:59 p.m.


Existing System

As discussed in class, a basic system for vector-space retrieval (VSR) is available in /u/mooney/ir-code/ir/vsr/. See the Javadoc for this system. Use the main method for InvertedIndex to index a set of documents and then process queries.

You can use the web pages in /u/mooney/ir-code/corpora/curlie-science-2021/ and /u/mooney/ir-code/corpora/cs-faculty/ as the sets of test documents. The Curlie Science dataset contains 900 pages, 300 random samples each from the Curlie indices for biology, physics, and chemistry. The UTCS dataset contains 1000 pages from the UT CS website.

See the sample traces of using the system on the Curlie Science dataset and on the UTCS dataset.

Problem

One of the problems with VSR is that it does not consider the proximity of the query words in the document. With a bag-of-words model, the location of words in the document is irrelevant; only their frequency matters. Therefore, multi-word queries in which the specific combination of words indicates a specific concept only when the words occur close together are not handled very well.

For example, in the sample trace for the Curlie Science dataset, for the first query "black hole", the top retrieved document does not contain the phrase "black hole". Instead, the webpage talks about a black beetle boring a hole. Though both words appear in the document, they are far apart from each other, and their meaning is completely different from what is desired.

The next sample query, "virtual reality", has a similar problem: the top retrieved documents do not contain the phrase. The first document includes both words, but they are very far apart; it talks about the virtual world and different ways to achieve augmented reality. The second document only contains the word "virtual", in the phrase "virtual library". Not until the seventh result does a relevant page that actually talks about "virtual reality" appear.

We can see similar examples in the sample trace for the UTCS dataset. For the first query "real world", the top three results are web pages about various "real time" projects, with nothing related to "real world". The fourth result is relevant and includes the phrase "real-world".

For the second query "academic calendar", the top results contain the words "academic" and "calendar" in separate parts of the web page but do not include relevant information about the academic calendar. Instead, they talk about various academic staff. The 32nd result is relevant and includes the phrase "academic calendar".

You Fix It

Your task is to change the existing VSR code to add a proximity component to the final document scoring function. For multi-word queries, documents in which the query words appear closer together (although not necessarily directly adjacent) and in the same order as in the query should be preferred. Your approach should be general-purpose and should produce better results for the examples in the sample traces for the Curlie Science dataset and the UTCS dataset.

Here are the sample solution traces produced by my solution to this problem for the Curlie Science dataset and the UTCS dataset. Note that the top documents now contain all of the query words, close together and in the correct order.

In addition to the normal cosine-similarity metric, I calculated a specific proximity score for each retrieved document that measured how far apart the query words appeared in the document. The final score was the ratio of the vector score to the proximity score (both components are shown in the trace). The proximity score was the distance (measured in number of words, excluding stop words) from each occurrence of a query word to the closest occurrence of another query word, averaged across all pairs of words in the query and all occurrences of those words in the document. A multiplicative penalty factor was included in the distance metric when a pair of words appeared in the reverse order from that in the query. This is only a sketch of what I did; many details are omitted.
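
To make the shape of such a computation concrete, here is a minimal illustrative sketch (not the sample solution; the class name, the penalty constant, and the fallback value are all assumptions) of an averaged pairwise proximity score, given the sorted token positions of each query word in a document. The final ranking score could then be the cosine similarity divided by this value, as described above.

    public class ProximityScorer {

        // Assumed multiplicative penalty when a pair occurs in reverse query order.
        static final double REVERSE_ORDER_PENALTY = 2.0;

        // positions[i] is assumed to hold the sorted token positions (stop words
        // excluded) of the i-th query word within a single retrieved document.
        public static double proximityScore(int[][] positions) {
            double sum = 0.0;
            int count = 0;
            for (int i = 0; i < positions.length; i++) {
                for (int j = i + 1; j < positions.length; j++) {
                    if (positions[i].length == 0 || positions[j].length == 0)
                        continue;  // one of the two words is missing from the document
                    for (int pos : positions[i]) {
                        int nearest = closestPosition(positions[j], pos);
                        double dist = Math.abs(nearest - pos);
                        if (nearest < pos)  // word j precedes word i: reverse of query order
                            dist *= REVERSE_ORDER_PENALTY;
                        sum += dist;
                        count++;
                    }
                }
            }
            // Neutral score when no pair of query words co-occurs in the document.
            return (count == 0) ? 1.0 : sum / count;
        }

        // Position in 'sorted' closest to 'pos'; linear scan for clarity
        // (Arrays.binarySearch could replace this, as suggested below).
        static int closestPosition(int[] sorted, int pos) {
            int best = sorted[0];
            for (int p : sorted)
                if (Math.abs(p - pos) < Math.abs(best - pos))
                    best = p;
            return best;
        }
    }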

You do not have to adopt this exact approach. Feel free to be creative. However, your solution should be general-purpose (not hacked to the specific test queries), address the fundamental issue of proximity, and produce similarly improved results for the sample queries. Note that you may need to change many of the fundamental classes and methods in the code to extract and store information on the position of tokens in documents. When making changes, try to add new methods and classes rather than changing existing ones. The final system should support both the original approach and the new proximity-enhanced one (e.g., I created a specialization of InvertedIndex called InvertedProxIndex for the new version). Hint: I found it useful to use the Java Arrays.binarySearch method to efficiently find the closest position of a token to an occurrence of another token, given a sorted array of token positions.
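
For instance (again only an illustrative sketch, building on the hypothetical helper above), Arrays.binarySearch returns the index of an exact match, or -(insertion point) - 1 when the value is absent, so the nearest occurrence of the other token is one of the two entries adjacent to the insertion point:

    // Drop-in replacement for the closestPosition helper in the sketch above.
    // Assumes 'sorted' is non-empty and sorted in increasing order.
    static int closestPosition(int[] sorted, int pos) {
        int idx = java.util.Arrays.binarySearch(sorted, pos);
        if (idx >= 0)
            return sorted[idx];                   // exact match: distance zero
        int insert = -idx - 1;                    // insertion point encoded by binarySearch
        if (insert == 0)
            return sorted[0];                     // pos precedes every occurrence
        if (insert == sorted.length)
            return sorted[sorted.length - 1];     // pos follows every occurrence
        int before = sorted[insert - 1];
        int after = sorted[insert];
        return (pos - before <= after - pos) ? before : after;
    }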

Report Instructions

Make sure to include the following in your report:

Submission Instructions

Please submit the following to Canvas:

  1. All code files that you modified in the code zip file
  2. Your report
  3. Trace files from running your proximity-preference-enhanced inverted index for the Curlie Science dataset ([PREFIX]_curlie_trace.txt) and the UTCS dataset ([PREFIX]_cs_trace.txt), where [PREFIX] is proj1_[your EID]. You don't need to include trace files from the original inverted index.

In submitting your solution, follow the general course instructions on submitting projects on the course homepage.

Grading

You will be graded on your code, documentation, and report. Please make sure that your code compiles and runs on the UTCS lab machines.

The grading breakdown for this assignment is: