CS 371R Information Retrieval and Web Search: Project 2

Project 2
CS 371R: Information Retrieval and Web Search
Evaluating the Performance of Relevance Rated Feedback

Due: 11:59pm, October 6, 2025

Existing Framework for Evaluating Retrieval

As discussed in class, a basic system for evaluating vector-space retrieval (VSR) is available in /u/mooney/ir-code/ir/eval/. See the Javadoc for this system. Use the main method for Experiment to index a set of documents, then process queries, evaluate the results compared to known relevant documents, and finally generate a recall-precision curve using the interpolation method discussed in class.

You can use the documents in the Cystic-Fibrosis (CF) corpus (/u/mooney/ir-code/corpora/cf/) as a set of test documents. This corpus contains 1,239 "documents" (actually just medical article title and abstracts). A set of 100 queries with the correct documents determined to be relevant to these queries is in /u/mooney/ir-code/queries/cf/queries.

As discussed in class, Experiment can be used to produce recall-precision curves for this document/query corpus. Here is a trace of running such an experiment. The last argument in the command in the trace:

java ir.eval.Experiment /u/mooney/ir-code/corpora/cf /u/mooney/ir-code/queries/cf/queries /u/mooney/ir-code/results/cf/rp

is the output filename (outFile in the main method of ir/eval/Experiment.java). You have to change the output filename to a file path where you have write permissions. The program also generates as output a "[outFile].gplot" (substitute [outFile] with the value of the variable) file that gnuplot can use to generate a recall-precision graph (plot), such as this graph. To create a pdf plot file execute the following command:

gnuplot [outFile].gplot | ps2pdf - your_filename.pdf

The gnuplot command creates a postscript (*.ps) file and this output is directly piped into the ps2pdf command (Note the "-") which then produces a pdf filename.pdf. The gnuplot command only works from the directory that outFile is stored in.

A set of sample results files that I generated for the CF data are in /u/mooney/ir-code/results/cf/. These files are accessible on the lab machines in GDC.

You can also edit the ".gplot" files yourself to create graphs combining the results of multiple runs of Experiment (such as with this ".gplot" file and resulting pdf plot file) in order to compare different methods.

The existing Experiment assumes simple binary gold-standard relevance judgements. However, real-valued gold-standard ratings of relevance are more informative, and our CF data actually comes with ratings on a 3-level scale (0:not relevant, 1:marginally relevant, 2:very relevant) from 4 judges. In order to produce a single relevance rating, I averaged the scores of the 4 judges and scaled the result to produce a real-valued rating between 0 and 1. The rated query file in /u/mooney/ir-code/queries/cf/queries-rated has the results. For each query, each relevant document is followed by a 0-1 relevance rating.

The variant of Experiment in ExperimentRated evaluates on rated queries such as queries-rated. In addition to producing recall-precision curves (which should be the same as those from Experiment), it also produces NDCG results that utilize the continuous relevance ratings. Average NDCG values for all ranks 1-10 are printed at the end of the run and also written to a file with an ".ndcg" extension. In addition, a *.ndcg.gplot file is created that allows for the creation of NDCG plot similar to the recall-precision plot (an example NDCG plot). Here is a trace of running an ExperimentRated experiment.

These evaluation metrics (Recall-precision curves and NDCG) were covered in lecture 4 (Performance Evaluation of Information Retrieval Systems).

Your Task

Relevance-Rated Feedback

Code for performing binary relevance feedback is included in the VSR system. It is invoked by using the "-feedback" flag, in which case, after viewing a retrieved document, the user is asked to rate it as either relevant or irrelevant to the query. Then, by using the "r" (redo) command, this feedback will be used to revise the query vector (using the Ide_Regular method), which is then used to produce a new set of retrievals. You can see a trace of using relevance feedback in VSR.

Part of your task is to modify the existing code for relevance feedback to accept continuous real-valued feedback rather than just binary feedback. Create a new version of the Feedback class called FeedbackRated that allows continuous relevance rating. The new versions of the addGood and addBad methods for this class should also be given a real-valued rating of how good or bad the given document is. In order to modify the Ide Regular algorithm implemented in Feedback to handle real-values ratings, just multiply a document vector by its corresponding rating value before adding or subtracting it from the query. Implement a specialization of InvertedIndex called InvertedIndexRated that uses continuous relevance-rated feedback. Allow the user to provide real-valued ratings between -1 (very irrelevant) and +1 (very relevant). You can see a trace of using my version of relevance-rated feedback.

Relevance feedback was covered in lecture 5 (Query Operations (Relevance Feedback / Query Expansion)).

Evaluating Relevance-Rated Feedback

An important question that can be addressed experimentally is: Does relevance feedback improve retrieval results? As discussed in class and the text, when evaluating relevance feedback, one must be sure not to include in the final evaluated results, results on documents for which feedback has been explicitly provided (in machine learning, this error is called "testing on the training data"). Results must be evaluated only for the documents for which no feedback has been provided.

Your assignment is to produce a specialization (subclass) of the ExperimentRated object (ExperimentRelFeedbackRated) that supports fair evaluation of rated relevance feedback, and then use this code to produce recall-precision curves and NDCG results that evaluate the effect of different types of relevance feedback on retrieval performance on the CF corpus.

The main method for your new experiment class should accept an additional (4th) command-line argument, the number of documents, N, for which to simulate feedback (in addition to the existing inputs (corpus directory, query file, and output file) and option flags accepted by ExperimentRated). Then after each test query, the system should use the information on the correct relevant documents from the query file to simulate user relevance feedback for the top N documents in the set of initial ranked retrievals. Then it should use this feedback to revise and re-execute the query to produce a new set of ranked retrievals. These final retrieval results should then be evaluated, but first all documents for which feedback has been provided must be removed from the retrieval array and the list of correct retrievals. The final output should be a recall-precision graph and NDCG results for the reduced ("residual") test corpus.

You should compare your rated-relevance feedback, to normal "binary" relevance feedback, and to no feedback. For rated-relevance feedback, since only relevant document ratings are included in queries-rated, you should use a consistent rating of (negative) 1 for all irrelevant documents (i.e. those that do not appear in the list of gold-standard relevant documents for that query), and the gold-standard 0-1 positive rating for relevant documents listed for that query in queries-rated. To produce binary feedback, ExperimentRelFeedbackRated should accept a flag "-binary" that reduces all ratings to positive or negative 1. To compare to no feedback as a control condition, ExperimentRelFeedbackRated should accept a flag "-control" that prevents any revision of the query and simply evaluates the original results on the "residual" corpus (the corpus after removing the top N ranked retrievals). Example commands:

java ir.eval.ExperimentRelFeedbackRated [Optional FLAG] [CORPORA] [QUERIES] [OUTPUT_FILENAME] [NUM_SIMULATED_FEEDBACK]
java ir.eval.ExperimentRelFeedbackRated -control /u/mooney/ir-code/corpora/cf/ /u/mooney/ir-code/queries/cf/queries-rated control 3
java ir.eval.ExperimentRelFeedbackRated -binary /u/mooney/ir-code/corpora/cf/ /u/mooney/ir-code/queries/cf/queries-rated binary 3
java ir.eval.ExperimentRelFeedbackRated /u/mooney/ir-code/corpora/cf/ /u/mooney/ir-code/queries/cf/queries-rated rated 3

Produce recall-precision curves and NDCG plots for retrieval ranks 1 to 10 for these three approaches. For each approach, try N= 1, 3 and 5, i.e. try providing simulated relevance feedback for the top 1, 3, and 5 initial ranked retrieval results. Note that you cannot fairly compare results using different values of N since the residual test corpus is different for different values of N. Therefore, the control condition is different for different values of N and must use the appropriate residual corpus for that value of N. Therefore, you should produce three recall-precision curves and three NDCG plots, one for each value of N (1,3,5), comparing the three approaches using the same residual test corpus (i.e. There should be 6 plots in total: Prec-Recall plots for N=1, 3, and 5 and NDCG plots for N=1, 3, and 5. Each graph should have 3 curves for rated-feedback, binary-feedback, and no-feedback.).

Your report should summarize your approach, present the results in well organized graphs (you can shrink the graphs and insert them into your report), and answer at least the following questions (You should explicitly answer these questions, i.e. in your report put "Q1. Does..?" and then give the answer underneath so that we do not have to search for your answers.):

Does using feedback improve retrieval accuracy? Why or why not?
Does rated-relevance feedback improve retrieval accuracy over binary relevance feedback? Why or why not?
Do the recall-precision curves and the NDCG results show different results regarding the relative performance of these three different approaches? Why or why not?

The report length does not have to be less than the usual 2-page limit because the graphs take up a lot space.

Submission

You should submit your work on Gradescope. In submitting your solution, follow the general course instructions on submitting projects on the course homepage. Along with that, follow these specific instructions for Project 2:

Create at least the following new classes described above:
1. FeedbackRated which extends Feedback
2. InvertedIndexRated which extends InvertedIndex
3. ExperimentRelFeedbackRated which extends ExperimentRated
For this assignment, you need to submit the following files:

code/ - A folder containing all your code (*.java and *.class file). Please do not modify the original java files but extend each class and override the appropriate methods.
report.pdf - A PDF report of your experiment as described above with the 6 plots referenced in the instructions.
trace/rated.txt - Trace file of a test of InvertedIndexRated with the same commands as in this trace.

trace/exp.txt - Trace file of a test of ExperimentRelFeedbackRated similar to this ExperimentRated trace but also containing, for each query, information on the Feedback utilized. This Feedback information should be given as in the following example output:

Query 6: What is the effect of water or other therapeutic agents on the 
physical properties (viscosity, elasticity) of sputum or bronchial 
secretions from CF patients?
Returned 955 documents.
24 truly relevant documents.
Feedback:
Positive docs: [RN-00593, RN-00031, RN-00441]
Negative docs: [RN-00047, RN-00976]
Executing New Expanded and Reweighted Query: 
   1 is relevant; Recall =  4.762%; Precision =  100.0%
   ...

outputs/ - A folder containing the data files used to generate your graphs in the following format:
- n1 - Directory containing all results files generated by running ir.eval.ExperimentRelFeedbackRated with N = 1
- n2 - Same as above but with N = 2
- n3 - Same as above but with N = 3
- n5 - Same as above but with N = 5
Each of the directories should include the following contents:
- control, control.ndcg - Output files from running java ir.eval.ExperimentRelFeedbackRated -control /u/mooney/ir-code/corpora/cf/ /u/mooney/ir-code/queries/cf/queries-rated control [N]
- binary, binary.ndcg - Output files from running java ir.eval.ExperimentRelFeedbackRated -binary /u/mooney/ir-code/corpora/cf/ /u/mooney/ir-code/queries/cf/queries-rated binary [N]
- rated, rated.ndcg - Output files from running java ir.eval.ExperimentRelFeedbackRated /u/mooney/ir-code/corpora/cf/ /u/mooney/ir-code/queries/cf/queries-rated rated [N]
Make sure that these files match the output of the code that you submit.

**NOTE**: In the autograder, N2 will the visible test case which you can use to verify if your implementation is correct. N1, N3 and N5 will be hidden and will be used for grading.


  Length      Date    Time    Name
---------  ---------- -----   ----
    21067  2015-09-14 12:57   ir/vsr/InvertedIndexRated.java
    10049  2015-09-14 17:26   ir/vsr/InvertedIndexRated.class
    21067  2015-09-14 12:57   ir/vsr/FeedbackRated.java
    10049  2015-09-14 17:26   ir/vsr/FeedbackRated.class
    21067  2015-09-14 12:57   ir/eval/ExperimentRelFeedbackRated.java
    10049  2015-09-14 17:26   ir/eval/ExperimentRelFeedbackRated.class
---------                     -------
    91106                     6 files


  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2019-09-25 21:09   n1/
      259  2019-09-25 18:30   n1/binary
      215  2019-09-25 18:30   n1/binary.ndcg
      260  2019-09-25 18:29   n1/control
      213  2019-09-25 18:29   n1/control.ndcg
      259  2019-09-25 18:30   n1/rated
      212  2019-09-25 18:30   n1/rated.ndcg
        0  2019-09-25 21:09   n2/
      259  2019-09-25 18:30   n2/binary
      215  2019-09-25 18:30   n2/binary.ndcg
      260  2019-09-25 18:29   n2/control
      213  2019-09-25 18:29   n2/control.ndcg
      259  2019-09-25 18:30   n2/rated
      212  2019-09-25 18:30   n2/rated.ndcg
        0  2019-09-25 21:19   n3/
      259  2019-09-25 18:26   n3/binary
      216  2019-09-25 18:26   n3/binary.ndcg
      260  2019-09-25 18:26   n3/control
      213  2019-09-25 18:26   n3/control.ndcg
      260  2019-09-25 18:26   n3/rated
      214  2019-09-25 18:26   n3/rated.ndcg
        0  2019-09-25 21:19   n5/
      258  2019-09-25 18:31   n5/binary
      217  2019-09-25 18:31   n5/binary.ndcg
      262  2019-09-25 18:31   n5/control
      215  2019-09-25 18:31   n5/control.ndcg
      260  2019-09-25 18:31   n5/rated
      212  2019-09-25 18:31   n5/rated.ndcg
---------                     -------
     4264                     28 files

Please make sure that your code compiles and runs on the UTCS lab machines.

The submitted files on Gradescope should look like this:

The autograder output on Gradescope should look like this:

This project will have 35% of the grade for the report, so please ensure you answer all questions and follow the rubrics on Gradescope.

Grading Criteria

10%: Your program compiles successfully and functions normally without throwing exceptions.
10%: Your implementation is efficient. For example, it shouldn't change the overall time complexity. Also, it shouldn't significantly increase the average time it takes to respond to a query.
35% Working code that correctly implements rated feedback and experimental evaluation of feedback
10%: Good programming style with necessary comments, intuitive variable/function names, and appropriate indentation.
35%: Quality of report, clear presentation of results, good analysis & discussion, and your answers to the questions above.

Project 2 CS 371R: Information Retrieval and Web Search Evaluating the Performance of Relevance Rated Feedback