Project 1
CS 371R: Information Retrieval and Web Search
Enhancing Vector-Space Retrieval


Due: September 28, 2017

Existing System

As discussed in class, a basic system for vector-space retrieval (VSR) is available in /u/mooney/ir-code/ir/vsr/. See the Javadoc for this system. Use the main method for InvertedIndex to index a set of documents and then process queries.

You can use the web pages in /u/mooney/ir-code/corpora/dmoz-science/ as a set of test documents. This corpus contains 900 pages, 300 random samples each from the DMOZ indices for biology, physics, and chemistry.

See the sample trace of running the system on this corpus. You can also use a corpus of UTCS department faculty webpages in /u/mooney/ir-code/corpora/cs-faculty/. This corpus contains pages spidered from the department web site.

Problem

One of the problems with VSR for multi-term queries is that cosine similarity can sometimes prefer documents that contain one query term with high frequency over documents that contain all of the query terms, each with lower frequency. The queries in the sample trace were specifically constructed to illustrate this problem.
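
To make the effect concrete, here is a minimal sketch using invented weights (they are not taken from the corpus or the sample trace, and the class name CosineBiasDemo is hypothetical). Each vector has three dimensions: the weight of "einstein", the weight of "podolsky", and a single lumped weight standing in for every other term in the document.

```java
// Toy illustration only: the weights below are invented, not taken from the corpus.
final class CosineBiasDemo {

    /** Cosine of the angle between two equal-length weight vectors. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Dimensions: [einstein, podolsky, all other terms lumped together].
    static final double[] QUERY = {1, 1, 0};
    static final double[] ONE_TERM_DOC = {4, 0, 2};   // heavy on "einstein", no "podolsky"
    static final double[] BOTH_TERMS_DOC = {1, 1, 2}; // mentions each query term once

    public static void main(String[] args) {
        System.out.println("one-term doc:   " + cosine(QUERY, ONE_TERM_DOC));
        System.out.println("both-terms doc: " + cosine(QUERY, BOTH_TERMS_DOC));
    }
}
```

With these weights the one-term document scores about 0.63 and the document containing both terms scores about 0.58, so the document that never mentions "podolsky" is ranked first.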

The first query in the trace, "einstein podolsky", was intended as a query on a specific issue in quantum mechanics on which these two physicists wrote an article (actually the paper is by Einstein-Podolsky-Rosen and the issue is now called the "EPR paradox"). Einstein was very critical of quantum theory and this paper presented an argument about why the theory must be "incomplete." Of the ten top-ranked documents, nine are just "Einstein" documents and do not reference "Podolsky". The only one that contains both words and is actually about EPR is ranked 4th rather than 1st.

For the second query in the trace, "gravity waves", most of the results contain "gravity" but not "waves." The 6th result contains "waves" but not "gravity." The 2nd, 4th, and 9th results contain both terms, but the terms do not appear together and none of these documents actually discusses gravity waves. Only the 10th result both contains the two terms and is relevant; however, it uses the phrase "gravitational waves" and never the exact phrase "gravity waves", illustrating the limitations of relying on exact phrase matching.

In the third query in the trace, "oxygen atom", the 2nd, 3rd, 7th, 8th, and 9th results contain only "atom", the 1st, 5th, and 6th contain only "oxygen", and only the 4th and 10th contain both.

Your Task

Your task is to change the existing VSR code to help fix these problems. For any retrieved document, the system should also compute the fraction of the distinct terms in the query that actually occur somewhere in the document. For example, for the query "Einstein Podolsky", documents that contain "Einstein" but not "Podolsky" should get 0.5 for this fraction, whereas documents that contain both terms should get a 1.0. This "query count" score should be combined with the existing cosine similarity to produce a final hybrid score that also considers the TF-IDF of the overlapping terms. How you combine "query count" and cosine similarity to produce a final score is up to you. Your approach should be general-purpose and should produce clearly better results for the examples in the sample trace.
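
The query-fraction computation described above can be sketched as follows. This is only an illustration under stated assumptions: the class QueryCountSketch and its methods are hypothetical and do not use the actual ir.vsr classes, the term sets are assumed to be already tokenized and normalized the same way for the query and the document, and multiplying the two scores is just one of many possible combinations, not a recommended or reference solution.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, independent of the actual ir.vsr classes.
final class QueryCountSketch {

    /** Fraction of the distinct query terms that occur somewhere in the document. */
    static double queryFraction(Set<String> queryTerms, Set<String> docTerms) {
        if (queryTerms.isEmpty()) {
            return 0.0; // avoid dividing by zero for an empty query
        }
        int matched = 0;
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                matched++;
            }
        }
        return (double) matched / queryTerms.size();
    }

    /** One possible combination (an assumption): scale cosine by the fraction. */
    static double hybridScore(double cosine, double fraction) {
        return cosine * fraction;
    }

    public static void main(String[] args) {
        Set<String> query = new HashSet<>(Arrays.asList("einstein", "podolsky"));
        Set<String> einsteinOnly = new HashSet<>(Arrays.asList("einstein", "relativity"));
        Set<String> epr = new HashSet<>(Arrays.asList("einstein", "podolsky", "rosen"));

        System.out.println(queryFraction(query, einsteinOnly)); // prints 0.5
        System.out.println(queryFraction(query, epr));          // prints 1.0
    }
}
```

How these values are obtained inside InvertedIndexWithQueryCount depends on the data structures InvertedIndex exposes, which this sketch deliberately does not assume.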

Here is a sample solution trace produced by my solution to this problem. Replicating the minute details of this trace is not important (it uses a very simple approach to combine cosine and query fraction), but the trace for your system should be similar; in particular, the top retrieved documents should contain all of the query terms. Your solution should be general-purpose and not just a hack that works with these specific queries.

Implement your new version as a specialized class that extends InvertedIndex called InvertedIndexWithQueryCount that accepts the same command line options as InvertedIndex.

Submission Instructions

Follow the general instructions for submitting files using Canvas as described in Project Submission Info. For this assignment, you need to submit the following files:
  1. [PREFIX]_code.zip - Your code, including the InvertedIndexWithQueryCount code, in a zip file (*.java and *.class files). Please try to put all modified code in a single file if at all possible.
  2. [PREFIX]_trace.txt - Your solution trace file.
  3. [PREFIX]_report.pdf - A PDF report of your experiment (see Project Submission Info for a description of the contents of the report).
For example, the files listed under "Turned In" on Canvas should be:

proj1_jd1234_code.zip
proj1_jd1234_trace.txt
proj1_jd1234_report.pdf

and the zip file should have the following contents:
$ unzip -l proj1_jd1234_code.zip
Archive:  proj1_jd1234_code.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    21067  2015-09-14 12:57   ir/vsr/InvertedIndexWithQueryCount.java
    10049  2015-09-14 17:26   ir/vsr/InvertedIndexWithQueryCount.class
---------                     -------
    31106                     2 files

Grading Criteria