Project 2
CS 371R: Information Retrieval and Web Search
Evaluating the Performance of Relevance-Rated Feedback


Due: October 12, 2017

Existing Framework for Evaluating Retrieval

As discussed in class, a basic system for evaluating vector-space retrieval (VSR) is available in /u/mooney/ir-code/ir/eval/. See the Javadoc for this system. Use the main method for Experiment to index a set of documents, then process queries, evaluate the results compared to known relevant documents, and finally generate a recall-precision curve using the interpolation method discussed in class.
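For reference, the interpolation step can be sketched roughly as follows. This is only an illustrative sketch, not the code in ir.eval: the class and method names (InterpolatedPrecisionSketch, interpolate) are hypothetical, and it assumes the common rule that the interpolated precision at a recall level is the highest precision observed at any recall greater than or equal to that level, reported at the 11 standard levels 0.0, 0.1, ..., 1.0.

```java
import java.util.Arrays;

// Hypothetical sketch; not part of the ir.eval package.
final class InterpolatedPrecisionSketch {

    /**
     * relevantAtRank[i] is true if the document at rank i+1 is relevant.
     * totalRelevant is the number of known relevant documents (assumed > 0).
     * Returns interpolated precision at recall 0.0, 0.1, ..., 1.0.
     */
    static double[] interpolate(boolean[] relevantAtRank, int totalRelevant) {
        int n = relevantAtRank.length;
        double[] recall = new double[n];
        double[] precision = new double[n];
        int hits = 0;
        for (int i = 0; i < n; i++) {
            if (relevantAtRank[i]) {
                hits++;
            }
            recall[i] = (double) hits / totalRelevant;
            precision[i] = (double) hits / (i + 1);
        }
        double[] result = new double[11];
        for (int level = 0; level <= 10; level++) {
            double r = level / 10.0;
            double best = 0.0;
            // Highest precision at any rank whose recall reaches this level.
            for (int i = 0; i < n; i++) {
                if (recall[i] >= r - 1e-9 && precision[i] > best) {
                    best = precision[i];
                }
            }
            result[level] = best;
        }
        return result;
    }

    public static void main(String[] args) {
        boolean[] ranking = {true, false, true, false};
        System.out.println(Arrays.toString(interpolate(ranking, 2)));
    }
}
```

In this toy ranking, the levels up to 0.5 recall take precision 1.0 from rank 1, and the higher levels take 2/3 from rank 3.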

You can use the documents in the Cystic-Fibrosis (CF) corpus (/u/mooney/ir-code/corpora/cf/) as a set of test documents. This corpus contains 1,239 "documents" (actually just medical article titles and abstracts). A set of 100 queries, together with the documents determined to be relevant to each query, is in /u/mooney/ir-code/queries/cf/queries.

As discussed in class, Experiment can be used to produce recall-precision curves for this document/query corpus. Here is a trace of running such an experiment. The program also generates as output a ".gplot" file that gnuplot can use to generate a recall-precision graph (plot), such as this graph. To create a pdf plot file execute the following command:

gnuplot filename.gplot | ps2pdf - filename.pdf

The gnuplot command produces PostScript (*.ps) output, which is piped directly into the ps2pdf command (note the "-", which tells ps2pdf to read from standard input); ps2pdf then writes the PDF file filename.pdf.

A set of sample results files that I generated for the CF data are in /u/mooney/ir-code/results/cf/.

You can also edit the ".gplot" files yourself to create graphs combining the results of multiple runs of Experiment (such as with this ".gplot" file and resulting pdf plot file) in order to compare different methods.

The existing Experiment assumes simple binary gold-standard relevance judgements. However, real-valued gold-standard ratings of relevance are more informative, and our CF data actually comes with ratings on a 3-level scale (0:not relevant, 1:marginally relevant, 2:very relevant) from 4 judges. In order to produce a single relevance rating, I averaged the scores of the 4 judges and scaled the result to produce a real-valued rating between 0 and 1. The rated query file in /u/mooney/ir-code/queries/cf/queries-rated has the results. For each query, each relevant document is followed by a 0-1 relevance rating.
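As a small worked example of this averaging, the sketch below combines one document's judge scores into a single rating. It is only a sketch: the class name JudgeRatingSketch is hypothetical, and it assumes the scaling step is simply division by the maximum score of 2, which the description above does not state explicitly.

```java
// Hypothetical sketch; the division by MAX_SCORE is an assumed scaling.
final class JudgeRatingSketch {
    static final int MAX_SCORE = 2; // top of the 3-level scale (0, 1, 2)

    /** Averages the judges' scores and scales the mean into [0, 1]. */
    static double combine(int[] judgeScores) {
        if (judgeScores.length == 0) {
            throw new IllegalArgumentException("no judge scores");
        }
        int sum = 0;
        for (int s : judgeScores) {
            if (s < 0 || s > MAX_SCORE) {
                throw new IllegalArgumentException("score out of range: " + s);
            }
            sum += s;
        }
        double mean = (double) sum / judgeScores.length;
        return mean / MAX_SCORE;
    }

    public static void main(String[] args) {
        // Two judges say "very relevant", two say "marginally relevant".
        System.out.println(combine(new int[] {2, 1, 2, 1})); // prints 0.75
    }
}
```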

The variant of Experiment in ExperimentRated evaluates on rated queries such as queries-rated. In addition to producing recall-precision curves (which should be the same as those from Experiment), it also produces NDCG results that utilize the continuous relevance ratings. Average NDCG values for all ranks 1-10 are printed at the end of the run and also written to a file with an ".ndcg" extension. In addition, a *.ndcg.gplot file is created that allows for the creation of an NDCG plot similar to the Precision-Recall plot (an example NDCG plot). Here is a trace of running an ExperimentRated experiment.
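To make the NDCG computation concrete, here is a minimal sketch for a single query. It is not the ExperimentRated implementation: the class NdcgSketch and its methods are hypothetical, and it assumes the DCG form in which the gain at rank 1 is undiscounted and the gain at rank i > 1 is divided by log2(i). If the version presented in class uses a different discount, substitute it.

```java
import java.util.Arrays;

// Hypothetical sketch; the log2(rank) discount is an assumption.
final class NdcgSketch {

    /** Cumulative DCG at ranks 1..maxRank; missing ranks contribute 0 gain. */
    static double[] dcg(double[] gains, int maxRank) {
        double[] out = new double[maxRank];
        double running = 0.0;
        for (int rank = 1; rank <= maxRank; rank++) {
            double gain = rank <= gains.length ? gains[rank - 1] : 0.0;
            double discount = rank == 1 ? 1.0 : Math.log(rank) / Math.log(2);
            running += gain / discount;
            out[rank - 1] = running;
        }
        return out;
    }

    /**
     * retrievedGains[i] is the gold rating of the document retrieved at
     * rank i+1 (0 if it is not relevant). goldRatings holds the ratings of
     * all relevant documents for the query, in any order.
     */
    static double[] ndcg(double[] retrievedGains, double[] goldRatings, int maxRank) {
        // The ideal ranking lists the gold ratings in descending order.
        double[] ideal = goldRatings.clone();
        Arrays.sort(ideal);
        for (int i = 0, j = ideal.length - 1; i < j; i++, j--) {
            double tmp = ideal[i];
            ideal[i] = ideal[j];
            ideal[j] = tmp;
        }
        double[] actual = dcg(retrievedGains, maxRank);
        double[] best = dcg(ideal, maxRank);
        double[] out = new double[maxRank];
        for (int i = 0; i < maxRank; i++) {
            out[i] = best[i] == 0.0 ? 0.0 : actual[i] / best[i];
        }
        return out;
    }

    public static void main(String[] args) {
        double[] retrieved = {0.5, 1.0};
        double[] gold = {1.0, 0.5};
        System.out.println(Arrays.toString(ndcg(retrieved, gold, 2)));
    }
}
```

Averaging these per-query values over all queries at each rank would give the kind of rank 1-10 summary described above.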

NOTE: The Java code and class file for ExperimentRated have been added to the code under /u/mooney/ir-code/. Thus, if you downloaded a local copy of the code, you need to download the new ExperimentRated *.java and *.class files. If you used the network code directly, you shouldn't have to change anything.

Your Task

Relevance-Rated Feedback

Code for performing binary relevance feedback is included in the VSR system. It is invoked by using the "-feedback" flag, in which case, after viewing a retrieved document, the user is asked to rate it as either relevant or irrelevant to the query. Then, by using the "r" (redo) command, this feedback will be used to revise the query vector (using the Ide_Regular method), which is then used to produce a new set of retrievals. You can see a trace of using relevance feedback in VSR.

Part of your task is to modify the existing code for relevance feedback to accept continuous real-valued feedback rather than just binary feedback. Create a new version of the Feedback class called FeedbackRated that allows continuous relevance rating. The new versions of the addGood and addBad methods for this class should also be given a real-valued rating of how good or bad the given document is. In order to modify the Ide Regular algorithm implemented in Feedback to handle real-valued ratings, just multiply a document vector by its corresponding rating value before adding it to or subtracting it from the query. Implement a specialization of InvertedIndex called InvertedIndexRated that uses continuous relevance-rated feedback. Allow the user to provide real-valued ratings between -1 (very irrelevant) and +1 (very relevant). You can see a trace of using my version of relevance-rated feedback.
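The rating-weighted update can be illustrated with plain maps standing in for term vectors. This is a sketch of the idea only, not the Feedback or FeedbackRated API: the class RatedIdeRegularSketch and its methods are hypothetical, vectors are represented as Map<String, Double> rather than the course's vector classes, and all weighting constants are assumed to be 1. A signed rating folds the add and subtract cases into one operation: a positive rating adds the scaled document vector, a negative one subtracts it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; plain maps stand in for the course's vector classes.
final class RatedIdeRegularSketch {
    private final Map<String, Double> revised;

    RatedIdeRegularSketch(Map<String, Double> originalQuery) {
        // Start from a copy of the original query vector.
        this.revised = new HashMap<>(originalQuery);
    }

    /**
     * Adds rating * docVector to the query. Ratings lie in [-1, +1]:
     * positive values pull the query toward the document, negative
     * values push it away.
     */
    void addRated(Map<String, Double> docVector, double rating) {
        if (!(rating >= -1.0 && rating <= 1.0)) {
            throw new IllegalArgumentException("rating must be in [-1, 1]: " + rating);
        }
        for (Map.Entry<String, Double> e : docVector.entrySet()) {
            revised.merge(e.getKey(), rating * e.getValue(), Double::sum);
        }
    }

    Map<String, Double> revisedQuery() {
        return revised;
    }

    public static void main(String[] args) {
        RatedIdeRegularSketch fb = new RatedIdeRegularSketch(Map.of("cystic", 1.0));
        fb.addRated(Map.of("cystic", 2.0, "lung", 1.0), 0.5);  // fairly relevant
        fb.addRated(Map.of("lung", 1.0, "diet", 2.0), -1.0);   // very irrelevant
        System.out.println(fb.revisedQuery());
    }
}
```

With binary feedback every rating is +1 or -1, so the same update reduces to the unweighted add and subtract of the existing method.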

Evaluating Relevance-Rated Feedback

An important question that can be addressed experimentally is: Does relevance feedback improve retrieval results? As discussed in class and the text, when evaluating relevance feedback, one must be sure that the final evaluated results exclude the documents for which feedback has been explicitly provided (in machine learning, the error of including them is called "testing on the training data"). Results must be evaluated only for the documents for which no feedback has been provided.

Your assignment is to produce a specialization (subclass) of the ExperimentRated object (ExperimentRelFeedbackRated) that supports fair evaluation of rated relevance feedback, and then use this code to produce recall-precision curves and NDCG results that evaluate the effect of different types of relevance feedback on retrieval performance on the CF corpus.

The main method for your new experiment class should accept the existing inputs (corpus directory, query file, and output file) and option flags accepted by ExperimentRated, plus an additional (4th) command-line argument: the number of documents, N, for which to simulate feedback. For each test query, after the initial retrieval, the system should use the information on the correct relevant documents from the query file to simulate user relevance feedback for the top N documents in the set of initial ranked retrievals. It should then use this feedback to revise and re-execute the query to produce a new set of ranked retrievals. These final retrieval results should then be evaluated, but first all documents for which feedback has been provided must be removed from both the retrieval array and the list of correct retrievals. The final output should be a recall-precision graph and NDCG results for the reduced ("residual") test corpus.
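The residual filtering step can be sketched as follows. The names here (ResidualSketch, feedbackDocs, residualRanking, residualGold) are hypothetical, and documents are identified by plain strings rather than by the retrieval objects used in ir.eval. The key point is that the same set of judged documents is removed from both the revised ranking and the gold-standard list.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; strings stand in for the real retrieval objects.
final class ResidualSketch {

    /** The top n documents of the initial ranking receive simulated feedback. */
    static Set<String> feedbackDocs(List<String> initialRanking, int n) {
        int end = Math.max(0, Math.min(n, initialRanking.size()));
        return new HashSet<>(initialRanking.subList(0, end));
    }

    /** Drops judged documents from the revised ranking, preserving order. */
    static List<String> residualRanking(List<String> revisedRanking, Set<String> judged) {
        List<String> out = new ArrayList<>();
        for (String doc : revisedRanking) {
            if (!judged.contains(doc)) {
                out.add(doc);
            }
        }
        return out;
    }

    /** Drops judged documents from the gold-standard relevant list as well. */
    static Map<String, Double> residualGold(Map<String, Double> gold, Set<String> judged) {
        Map<String, Double> out = new LinkedHashMap<>(gold);
        out.keySet().removeAll(judged);
        return out;
    }

    public static void main(String[] args) {
        List<String> initial = List.of("d1", "d2", "d3", "d4");
        Set<String> judged = feedbackDocs(initial, 2);
        System.out.println(residualRanking(List.of("d3", "d1", "d5", "d2"), judged));
        System.out.println(residualGold(Map.of("d1", 1.0, "d3", 0.5), judged));
    }
}
```

Note that the judged set comes from the initial ranking, while the filtering is applied to the ranking produced by the revised query.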

You should compare your rated-relevance feedback to normal "binary" relevance feedback and to no feedback. For rated-relevance feedback, since only relevant document ratings are included in queries-rated, you should use a consistent rating of -1 for all irrelevant documents (i.e. those that do not appear in the list of gold-standard relevant documents for that query), and the gold-standard 0-1 positive rating for relevant documents listed for that query in queries-rated. To produce binary feedback, ExperimentRelFeedbackRated should accept a flag "-binary" that reduces all ratings to +1 or -1. To compare to no feedback as a control condition, ExperimentRelFeedbackRated should accept a flag "-control" that prevents any revision of the query and simply evaluates the original results on the "residual" corpus (the corpus after removing the top N ranked retrievals). Example commands:

java ir.eval.ExperimentRelFeedbackRated [Optional FLAG] [CORPORA] [QUERIES] [OUTPUT_FILENAME] [NUM_SIMULATED_FEEDBACK]
java ir.eval.ExperimentRelFeedbackRated -control /u/mooney/ir-code/corpora/cf/ /u/mooney/ir-code/queries/cf/queries-rated control 3
java ir.eval.ExperimentRelFeedbackRated -binary /u/mooney/ir-code/corpora/cf/ /u/mooney/ir-code/queries/cf/queries-rated binary 3
java ir.eval.ExperimentRelFeedbackRated /u/mooney/ir-code/corpora/cf/ /u/mooney/ir-code/queries/cf/queries-rated rated 3
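The rule for choosing a simulated rating in the rated and binary conditions can be sketched as below. The class SimulatedRatingSketch and its Mode enum are hypothetical names for illustration, and the gold ratings are assumed to be available as a map from document name to its 0-1 rating for the current query. The control condition is not shown because it skips query revision entirely.

```java
import java.util.Map;

// Hypothetical sketch of the simulated-feedback rating rule.
final class SimulatedRatingSketch {
    enum Mode { RATED, BINARY }

    /**
     * Returns the simulated user rating for one of the top N documents.
     * Documents missing from the gold list are treated as irrelevant (-1).
     */
    static double rating(String doc, Map<String, Double> goldRatings, Mode mode) {
        Double gold = goldRatings.get(doc);
        if (gold == null) {
            return -1.0; // not listed as relevant for this query
        }
        // Binary mode collapses every listed document to +1.
        return mode == Mode.BINARY ? 1.0 : gold;
    }

    public static void main(String[] args) {
        Map<String, Double> gold = Map.of("d1", 0.25);
        System.out.println(rating("d1", gold, Mode.RATED));  // prints 0.25
        System.out.println(rating("d1", gold, Mode.BINARY)); // prints 1.0
        System.out.println(rating("d2", gold, Mode.RATED));  // prints -1.0
    }
}
```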

Produce recall-precision curves and NDCG plots for retrieval ranks 1 to 10 for these three approaches. For each approach, try N = 1, 3, and 5, i.e. provide simulated relevance feedback for the top 1, 3, and 5 initial ranked retrieval results. Note that you cannot fairly compare results across different values of N, since the residual test corpus is different for each value of N. The control condition therefore also differs with N and must use the residual corpus for that value of N. Consequently, you should produce three recall-precision graphs and three NDCG plots, one for each value of N (1, 3, 5), each comparing the three approaches on the same residual test corpus. That is, there should be 6 plots in total: Prec-Recall plots for N=1,3,5 and NDCG plots for N=1,3,5. Each graph should have 3 curves: rated-feedback, binary-feedback, and no-feedback.

Your report should summarize your approach, present the results in well organized graphs, and answer at least the following questions (You should explicitly answer these questions, i.e. in your report put "Q1. Does..?" and then give the answer underneath so that we do not have to search for your answers.):

  1. Does using feedback improve retrieval accuracy? Why or why not?
  2. Does rated-relevance feedback improve retrieval accuracy over binary relevance feedback? Why or why not?
  3. Do the recall-precision curves and the NDCG results show different results regarding the relative performance of these three different approaches? Why or why not?

Submission

In submitting your solution, follow the general course instructions on submitting projects on the course homepage. Along with that, follow these specific instructions for Project 2:

Grading Criteria