/u/mooney/ir-code/ir/eval/. See the Javadoc for this system. Use the main
method for Experiment to index a set of documents, then process queries,
evaluate the results compared to known relevant documents, and finally
generate a recall-precision curve using the interpolation method discussed in
class.
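The interpolation step can be sketched as follows. This is a minimal illustration, assuming the common definition in which interpolated precision at recall level r is the highest precision observed at any recall at or above r, reported at the 11 standard recall levels. The class and method names are hypothetical and are not part of the ir.eval package; the course code may differ in detail.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of ir.eval. */
class InterpolatedPrecisionSketch {

    /**
     * Returns interpolated precision at recall levels 0.0, 0.1, ..., 1.0
     * for one ranked list of document names.
     */
    static double[] interpolate(List<String> ranked, Set<String> relevant) {
        int n = ranked.size();
        double[] recall = new double[n];
        double[] precision = new double[n];
        int hits = 0;
        for (int i = 0; i < n; i++) {
            if (relevant.contains(ranked.get(i))) hits++;
            recall[i] = relevant.isEmpty() ? 0.0 : (double) hits / relevant.size();
            precision[i] = (double) hits / (i + 1);
        }
        double[] result = new double[11];
        for (int level = 0; level <= 10; level++) {
            double r = level / 10.0;
            double best = 0.0;
            for (int i = 0; i < n; i++) {
                // Small tolerance guards against floating-point rounding.
                if (recall[i] >= r - 1e-9 && precision[i] > best) best = precision[i];
            }
            result[level] = best;
        }
        return result;
    }

    public static void main(String[] args) {
        double[] p = interpolate(Arrays.asList("d1", "d2", "d3", "d4"),
                                 new HashSet<>(Arrays.asList("d1", "d3")));
        for (int level = 0; level <= 10; level++) {
            System.out.printf("recall %.1f -> precision %.3f%n", level / 10.0, p[level]);
        }
    }
}
```

In this toy run the relevant documents sit at ranks 1 and 3, so interpolated precision is 1.0 up to recall 0.5 and 2/3 beyond it.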
You can use the documents in the Cystic-Fibrosis (CF) corpus
(/u/mooney/ir-code/corpora/cf/) as a set of test documents. This corpus
contains 1,239 "documents" (actually just medical article titles and
abstracts). A set of 100 queries, together with the documents judged relevant
to each query, is in /u/mooney/ir-code/queries/cf/queries.
As discussed in class, Experiment can be used to produce recall-precision
curves for this document/query corpus. Here is a trace of
running such an experiment. The last argument in the command in the trace:
java ir.eval.Experiment /u/mooney/ir-code/corpora/cf /u/mooney/ir-code/queries/cf/queries /u/mooney/ir-code/results/cf/rp
is the output filename (outFile in the main method of
ir/eval/Experiment.java). You must change the output filename to a file path
where you have write permission.
The program also generates as output a "[outFile].gplot" file (substitute
[outFile] with the value of the variable) that gnuplot can use to generate a
recall-precision graph (plot), such as this graph.
To create a PDF plot file, execute the following command:
gnuplot [outFile].gplot | ps2pdf - your_filename.pdf
The gnuplot command writes PostScript (*.ps) output, which is piped directly
into the ps2pdf command (note the "-"), which then produces the PDF file
your_filename.pdf. The gnuplot command only works from the directory that
outFile is stored in.
A set of sample results files that I generated for the CF data are in
/u/mooney/ir-code/results/cf/. These files are accessible on the lab machines
in GDC.
You can also edit the ".gplot" files yourself to create graphs combining the results of multiple runs of Experiment (such as with this ".gplot" file and resulting pdf plot file) in order to compare different methods.
The existing Experiment assumes simple binary gold-standard relevance
judgements. However, real-valued gold-standard ratings of relevance are more
informative, and our CF data actually comes with ratings on a 3-level scale
(0:not relevant, 1:marginally relevant, 2:very relevant) from 4 judges. In
order to produce a single relevance rating, I averaged the scores of the 4
judges and scaled the result to produce a real-valued rating between 0 and 1.
The rated query file in /u/mooney/ir-code/queries/cf/queries-rated has the
results. For each query, each relevant document is followed by a 0-1
relevance rating.
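As a worked example of the averaging step, here is a minimal sketch. It assumes that "scaling" means dividing the mean judge score by the maximum score of 2; the text does not spell out the exact scaling, so treat that as an assumption. The class name is hypothetical.

```java
/** Hypothetical helper showing how four 0-2 judge scores could map to a 0-1 rating. */
class RatingScaleSketch {
    static final int MAX_SCORE = 2; // top of the 3-level scale (0, 1, 2)

    static double combine(int[] judgeScores) {
        double sum = 0.0;
        for (int s : judgeScores) sum += s;
        double mean = sum / judgeScores.length;
        return mean / MAX_SCORE; // assumed scaling: divide by the maximum score
    }

    public static void main(String[] args) {
        // Judges gave 2, 1, 2, 1: mean 1.5, scaled rating 0.75.
        System.out.println(combine(new int[] {2, 1, 2, 1}));
    }
}
```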
The variant of Experiment in ExperimentRated evaluates on rated queries such
as queries-rated. In addition to producing recall-precision curves (which
should be the same as those from Experiment), it also produces NDCG results
that utilize the continuous relevance ratings. Average NDCG values for all
ranks 1-10 are printed at the end of the run and also written to a file with
an ".ndcg" extension. In addition, a *.ndcg.gplot file is created that allows
for the creation of an NDCG plot similar to the recall-precision plot (an
example NDCG plot).
Here is a trace of
running an ExperimentRated experiment.
These evaluation metrics (Recall-precision curves and NDCG) were covered in lecture 4 (Performance Evaluation of Information Retrieval Systems).
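To make the NDCG computation concrete, here is a minimal sketch. It assumes the DCG variant that leaves rank 1 undiscounted and divides the gain at rank i >= 2 by log2(i); ExperimentRated may use a different discount, so check its source. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical NDCG helper for illustration only; not part of ir.eval. */
class NdcgSketch {

    /** Cumulative DCG at each rank 1..maxRank; missing ranks contribute a gain of 0. */
    static double[] dcg(List<Double> gains, int maxRank) {
        double[] out = new double[maxRank];
        double running = 0.0;
        for (int rank = 1; rank <= maxRank; rank++) {
            double gain = rank <= gains.size() ? gains.get(rank - 1) : 0.0;
            // Assumed discount: none at rank 1, log2(rank) afterwards.
            double discount = rank == 1 ? 1.0 : Math.log(rank) / Math.log(2);
            running += gain / discount;
            out[rank - 1] = running;
        }
        return out;
    }

    /** NDCG at each rank 1..maxRank; documents without a gold rating count as 0. */
    static double[] ndcg(List<String> ranked, Map<String, Double> goldRatings, int maxRank) {
        List<Double> gains = new ArrayList<>();
        for (String doc : ranked) gains.add(goldRatings.getOrDefault(doc, 0.0));
        // The ideal ranking lists the gold ratings in decreasing order.
        List<Double> ideal = new ArrayList<>(goldRatings.values());
        ideal.sort(Collections.reverseOrder());
        double[] actual = dcg(gains, maxRank);
        double[] best = dcg(ideal, maxRank);
        double[] out = new double[maxRank];
        for (int i = 0; i < maxRank; i++) {
            out[i] = best[i] == 0.0 ? 0.0 : actual[i] / best[i];
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Double> gold = new HashMap<>();
        gold.put("a", 1.0);
        gold.put("c", 0.5);
        System.out.println(Arrays.toString(ndcg(Arrays.asList("a", "b", "c"), gold, 3)));
    }
}
```

Averaging these per-query arrays over all queries would give the per-rank averages that the run reports.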
Code for performing binary relevance feedback is included in the VSR system. It is invoked by using the "-feedback" flag, in which case, after viewing a retrieved document, the user is asked to rate it as either relevant or irrelevant to the query. Then, by using the "r" (redo) command, this feedback will be used to revise the query vector (using the Ide_Regular method), which is then used to produce a new set of retrievals. You can see a trace of using relevance feedback in VSR.
Part of your task is to modify the existing code for relevance feedback to
accept continuous real-valued feedback rather than just binary feedback.
Create a new version of the Feedback class called FeedbackRated that allows
continuous relevance rating. The new versions of the addGood and addBad
methods for this class should also be given a real-valued rating of how good
or bad the given document is. In order to modify the Ide Regular algorithm
implemented in Feedback to handle real-valued ratings, just multiply a
document vector by its corresponding rating value before adding it to or
subtracting it from the query. Implement a specialization of InvertedIndex
called InvertedIndexRated that uses continuous relevance-rated feedback.
Allow the user to provide real-valued ratings between -1 (very irrelevant)
and +1 (very relevant). You can see a trace of using my version of
relevance-rated feedback.
Relevance feedback was covered in lecture 5 (Query Operations (Relevance Feedback / Query Expansion)).
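The rating-weighted update can be sketched on plain maps as below. This is only an illustration of the arithmetic: it does not use the ir.vsr vector classes, it ignores any weighting constants or normalization that the existing Feedback class may apply, and all names in it are hypothetical. A signed rating in [-1, +1] is used, so a negative rating subtracts the scaled document vector.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a rating-weighted Ide Regular update on token-weight maps. */
class RatedIdeRegularSketch {

    /** Adds scale * doc to target in place. */
    static void addScaled(Map<String, Double> target, Map<String, Double> doc, double scale) {
        for (Map.Entry<String, Double> e : doc.entrySet()) {
            target.merge(e.getKey(), scale * e.getValue(), Double::sum);
        }
    }

    /**
     * Returns a revised copy of the query: query + sum_i ratings[i] * docs[i].
     * Ratings are signed, so irrelevant documents (negative ratings) are subtracted.
     */
    static Map<String, Double> revise(Map<String, Double> query,
                                      List<Map<String, Double>> docs,
                                      double[] ratings) {
        Map<String, Double> revised = new HashMap<>(query); // leave the original query intact
        for (int i = 0; i < docs.size(); i++) {
            addScaled(revised, docs.get(i), ratings[i]);
        }
        return revised;
    }

    public static void main(String[] args) {
        Map<String, Double> query = new HashMap<>();
        query.put("cf", 1.0);
        Map<String, Double> good = new HashMap<>();
        good.put("cf", 1.0);
        good.put("lung", 2.0);
        Map<String, Double> bad = new HashMap<>();
        bad.put("lung", 1.0);
        bad.put("diet", 4.0);
        // Resulting weights: cf = 1.5, lung = 0.0, diet = -4.0
        System.out.println(revise(query, Arrays.asList(good, bad), new double[] {0.5, -1.0}));
    }
}
```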
An important question that can be addressed experimentally is: Does relevance feedback improve retrieval results? As discussed in class and the text, when evaluating relevance feedback, the final evaluated results must not include documents for which feedback has been explicitly provided (in machine learning, this error is called "testing on the training data"). Results must be evaluated only on the documents for which no feedback has been provided.
Your assignment is to produce a specialization (subclass) of the ExperimentRated object (ExperimentRelFeedbackRated) that supports fair evaluation of rated relevance feedback, and then use this code to produce recall-precision curves and NDCG results that evaluate the effect of different types of relevance feedback on retrieval performance on the CF corpus.
The main method for your new experiment class should accept an additional (4th) command-line argument, the number of documents, N, for which to simulate feedback (in addition to the existing inputs (corpus directory, query file, and output file) and option flags accepted by ExperimentRated). Then after each test query, the system should use the information on the correct relevant documents from the query file to simulate user relevance feedback for the top N documents in the set of initial ranked retrievals. Then it should use this feedback to revise and re-execute the query to produce a new set of ranked retrievals. These final retrieval results should then be evaluated, but first all documents for which feedback has been provided must be removed from the retrieval array and the list of correct retrievals. The final output should be a recall-precision graph and NDCG results for the reduced ("residual") test corpus.
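The residual step described above can be sketched as follows. This is a minimal illustration on document names only; the class and method names are hypothetical, and the real retrieval array and relevant-document list in ExperimentRated have their own types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of building the residual retrievals and residual relevant set. */
class ResidualSketch {

    /** The top n documents of the initial ranking, i.e. those that receive feedback. */
    static Set<String> topN(List<String> retrievals, int n) {
        return new LinkedHashSet<>(retrievals.subList(0, Math.min(n, retrievals.size())));
    }

    /** Ranked retrievals with every feedback document removed, order preserved. */
    static List<String> residualRetrievals(List<String> retrievals, Set<String> feedbackDocs) {
        List<String> out = new ArrayList<>();
        for (String doc : retrievals) {
            if (!feedbackDocs.contains(doc)) out.add(doc);
        }
        return out;
    }

    /** Gold-standard relevant documents with every feedback document removed. */
    static Set<String> residualRelevant(Set<String> relevant, Set<String> feedbackDocs) {
        Set<String> out = new HashSet<>(relevant);
        out.removeAll(feedbackDocs);
        return out;
    }

    public static void main(String[] args) {
        List<String> initial = Arrays.asList("d1", "d2", "d3", "d4", "d5");
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d3", "d5"));
        Set<String> feedback = topN(initial, 2); // feedback is simulated for d1 and d2
        // In the real experiment the revised query is re-run here; this sketch reuses
        // the initial ranking, which corresponds to the no-revision case.
        System.out.println(residualRetrievals(initial, feedback)); // [d3, d4, d5]
        System.out.println(residualRelevant(relevant, feedback).size()); // 2
    }
}
```

Note that the feedback set is taken from the initial ranking, while the removal is applied to the final ranking and to the relevant-document list, so recall is computed against the reduced set.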
You should compare your rated-relevance feedback to normal "binary" relevance
feedback and to no feedback. For rated-relevance feedback, since only
relevant document ratings are included in queries-rated, you should use a
consistent rating of -1 for all irrelevant documents (i.e., those that do not
appear in the list of gold-standard relevant documents for that query), and
the gold-standard 0-1 positive rating for relevant documents listed for that
query in queries-rated. To produce binary feedback,
ExperimentRelFeedbackRated should accept a flag "-binary" that reduces all
ratings to +1 or -1. To compare to no feedback as a control condition,
ExperimentRelFeedbackRated should accept a flag "-control" that prevents any
revision of the query and simply evaluates the original results on the
"residual" corpus (the corpus after removing the top N ranked retrievals).
Example commands:
java ir.eval.ExperimentRelFeedbackRated [Optional FLAG] [CORPORA] [QUERIES] [OUTPUT_FILENAME] [NUM_SIMULATED_FEEDBACK]
java ir.eval.ExperimentRelFeedbackRated -control /u/mooney/ir-code/corpora/cf/ /u/mooney/ir-code/queries/cf/queries-rated control 3
java ir.eval.ExperimentRelFeedbackRated -binary /u/mooney/ir-code/corpora/cf/ /u/mooney/ir-code/queries/cf/queries-rated binary 3
java ir.eval.ExperimentRelFeedbackRated /u/mooney/ir-code/corpora/cf/ /u/mooney/ir-code/queries/cf/queries-rated rated 3
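The rating assigned to each simulated-feedback document under the rated and "-binary" settings can be sketched as below. The class and method names are hypothetical, and the gold ratings are represented as a plain map from document name to its 0-1 rating in queries-rated. With "-control" no rating is needed, since the query is not revised.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of choosing the simulated feedback rating for one document. */
class SimulatedRatingSketch {

    /**
     * @param doc    name of a document among the top N initial retrievals
     * @param gold   gold-standard 0-1 ratings of the relevant documents for this query
     * @param binary true when the "-binary" flag was given
     * @return -1 for a document not listed as relevant; otherwise +1 in binary
     *         mode, or the gold 0-1 rating in rated mode
     */
    static double rating(String doc, Map<String, Double> gold, boolean binary) {
        Double goldRating = gold.get(doc);
        if (goldRating == null) return -1.0; // not listed, so treated as irrelevant
        return binary ? 1.0 : goldRating;
    }

    public static void main(String[] args) {
        Map<String, Double> gold = new HashMap<>();
        gold.put("RN-00593", 0.75); // made-up rating for illustration
        System.out.println(rating("RN-00593", gold, false)); // 0.75
        System.out.println(rating("RN-00593", gold, true));  // 1.0
        System.out.println(rating("RN-00047", gold, false)); // -1.0
    }
}
```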
Produce recall-precision curves and NDCG plots for retrieval ranks 1 to 10 for these three approaches. For each approach, try N = 1, 3, and 5, i.e., provide simulated relevance feedback for the top 1, 3, and 5 initial ranked retrieval results. Note that you cannot fairly compare results across different values of N, since the residual test corpus differs for each value of N. The control condition therefore also differs with N and must use the residual corpus for that value of N. You should thus produce three recall-precision plots and three NDCG plots, one for each value of N (1, 3, 5), each comparing the three approaches on the same residual test corpus. That is, there should be 6 plots in total: recall-precision plots for N = 1, 3, and 5 and NDCG plots for N = 1, 3, and 5, each with 3 curves for rated feedback, binary feedback, and no feedback.
Your report should summarize your approach, present the results in well organized graphs (you can shrink the graphs and insert them into your report), and answer at least the following questions (You should explicitly answer these questions, i.e. in your report put "Q1. Does..?" and then give the answer underneath so that we do not have to search for your answers.):
You should submit your work on Gradescope. In submitting your solution, follow the general course instructions on submitting projects on the course homepage. Along with that, follow these specific instructions for Project 2:
code/
- A folder containing all your code (*.java and *.class files). Please do not modify the original java files but extend each class and override the appropriate methods.
report.pdf
- A PDF report of your experiment as described above with the 6 plots referenced in the instructions.
trace/rated.txt
- Trace file of a test of InvertedIndexRated with the same commands as in this trace.
trace/exp.txt
- Trace file of a test of ExperimentRelFeedbackRated similar to this ExperimentRated trace but also containing, for each query, information on the Feedback utilized. This Feedback information should be given as in the following example output:
Query 6: What is the effect of water or other therapeutic agents on the physical properties (viscosity, elasticity) of sputum or bronchial secretions from CF patients?
Returned 955 documents.
24 truly relevant documents.
Feedback:
Positive docs: [RN-00593, RN-00031, RN-00441]
Negative docs: [RN-00047, RN-00976]
Executing New Expanded and Reweighted Query:
1 is relevant; Recall = 4.762%; Precision = 100.0%
...
outputs/
- A folder containing the data files used to generate your graphs in the following format:
n1
- Directory containing all results files generated by running ir.eval.ExperimentRelFeedbackRated with N = 1
n2
- Same as above but with N = 2
n3
- Same as above but with N = 3
n5
- Same as above but with N = 5
Each of the directories should include the following contents:
control, control.ndcg
- Output files from running java ir.eval.ExperimentRelFeedbackRated -control /u/mooney/ir-code/corpora/cf/ /u/mooney/ir-code/queries/cf/queries-rated control [N]
binary, binary.ndcg
- Output files from running java ir.eval.ExperimentRelFeedbackRated -binary /u/mooney/ir-code/corpora/cf/ /u/mooney/ir-code/queries/cf/queries-rated binary [N]
rated, rated.ndcg
- Output files from running java ir.eval.ExperimentRelFeedbackRated /u/mooney/ir-code/corpora/cf/ /u/mooney/ir-code/queries/cf/queries-rated rated [N]
Make sure that these files match the output of the code that you submit.
**NOTE**: In the autograder, N2 will be the visible test case, which you can use to verify that your implementation is correct. N1, N3, and N5 will be hidden and will be used for grading.
Length Date Time Name
--------- ---------- ----- ----
21067 2015-09-14 12:57 ir/vsr/InvertedIndexRated.java
10049 2015-09-14 17:26 ir/vsr/InvertedIndexRated.class
21067 2015-09-14 12:57 ir/vsr/FeedbackRated.java
10049 2015-09-14 17:26 ir/vsr/FeedbackRated.class
21067 2015-09-14 12:57 ir/eval/ExperimentRelFeedbackRated.java
10049 2015-09-14 17:26 ir/eval/ExperimentRelFeedbackRated.class
--------- -------
91106 6 files
Length Date Time Name
--------- ---------- ----- ----
0 2019-09-25 21:09 n1/
259 2019-09-25 18:30 n1/binary
215 2019-09-25 18:30 n1/binary.ndcg
260 2019-09-25 18:29 n1/control
213 2019-09-25 18:29 n1/control.ndcg
259 2019-09-25 18:30 n1/rated
212 2019-09-25 18:30 n1/rated.ndcg
0 2019-09-25 21:09 n2/
259 2019-09-25 18:30 n2/binary
215 2019-09-25 18:30 n2/binary.ndcg
260 2019-09-25 18:29 n2/control
213 2019-09-25 18:29 n2/control.ndcg
259 2019-09-25 18:30 n2/rated
212 2019-09-25 18:30 n2/rated.ndcg
0 2019-09-25 21:19 n3/
259 2019-09-25 18:26 n3/binary
216 2019-09-25 18:26 n3/binary.ndcg
260 2019-09-25 18:26 n3/control
213 2019-09-25 18:26 n3/control.ndcg
260 2019-09-25 18:26 n3/rated
214 2019-09-25 18:26 n3/rated.ndcg
0 2019-09-25 21:19 n5/
258 2019-09-25 18:31 n5/binary
217 2019-09-25 18:31 n5/binary.ndcg
262 2019-09-25 18:31 n5/control
215 2019-09-25 18:31 n5/control.ndcg
260 2019-09-25 18:31 n5/rated
212 2019-09-25 18:31 n5/rated.ndcg
--------- -------
4264 28 files
Please make sure that your code compiles and runs on the UTCS lab machines.