/u/mooney/ir-code/ir/eval/. See the Javadoc for this system. Use the main method of Experiment to index a set of documents, then process queries, evaluate the results against the known relevant documents, and finally generate a recall-precision curve using the interpolation method discussed in class.
You can use the documents in the Cystic-Fibrosis (CF) corpus
(/u/mooney/ir-code/corpora/cf/) as a set of test documents. This corpus
contains 1,239 "documents" (actually just the titles and abstracts of medical articles).
A set of 100 queries, together with the documents judged relevant to
each query, is in
As discussed in class, Experiment can be used to produce recall-precision
curves for this document/query corpus. Here is a trace of
running such an experiment. The program also generates as output a
file that gnuplot can
use to generate a recall-precision graph (plot), such as this graph.
To create a PDF plot file, execute the following command:
gnuplot filename.gplot | ps2pdf - filename.pdf
The gnuplot command creates a PostScript (*.ps) stream on standard output, which
is piped directly into the ps2pdf command (note the "-"), which then produces
the PDF file.
A set of sample results files that I generated for the CF data are in
You can also edit the ".gplot" files yourself to create graphs combining the results of multiple runs of Experiment (such as with this ".gplot" file and resulting pdf plot file) in order to compare different methods.
The existing Experiment assumes simple binary gold-standard relevance
judgements. However, real-valued gold-standard ratings of relevance are more
informative, and our CF data actually comes with ratings on a 3-level scale
(0:not relevant, 1:marginally relevant, 2:very relevant) from 4 judges. In
order to produce a single relevance rating, I averaged the scores of the 4
judges and scaled the result to produce a real-valued rating between 0 and 1.
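The averaging-and-scaling step can be sketched as follows (a minimal illustration, assuming the four 0-2 judge scores are simply averaged and then divided by 2 to land in [0, 1]; the class and method names are hypothetical):

```java
public class RatingDemo {
    // Combine four 0-2 judge scores into a single 0-1 rating by
    // averaging and scaling, as described above.
    static double combinedRating(int[] judgeScores) {
        double sum = 0;
        for (int s : judgeScores) sum += s;
        double average = sum / judgeScores.length; // in [0, 2]
        return average / 2.0;                      // scaled to [0, 1]
    }

    public static void main(String[] args) {
        // Two "very relevant" and two "marginally relevant" judgments.
        System.out.println(combinedRating(new int[]{2, 2, 1, 1})); // 0.75
    }
}
```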
The rated query file in
has the results. For each query, each relevant document is followed by its 0-1 relevance rating.
The variant of Experiment in ExperimentRated evaluates on rated queries such
as queries-rated. In addition to producing recall-precision
curves (which should be the same as those from Experiment), it also produces
NDCG results that utilize the continuous relevance ratings. Average
NDCG values for all ranks 1-10 are printed at the end of the run and also
written to a file with an
".ndcg" extension. In addition, a *.ndcg.gplot
file is created that allows for the creation of an NDCG plot similar to the Precision-Recall
plot (an example NDCG plot).
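NDCG with continuous relevance gains can be sketched as follows (this is the common textbook formulation with a log2(rank + 1) discount; the exact discount and normalization used by the course code may differ slightly):

```java
import java.util.Arrays;

public class NdcgDemo {
    // NDCG@k over real-valued gains listed in retrieved order:
    // DCG@k = sum of gain_i / log2(i + 1) for ranks i = 1..k,
    // normalized by the DCG of the ideal (decreasing-gain) ordering.
    static double ndcg(double[] gains, int k) {
        double[] ideal = gains.clone();
        Arrays.sort(ideal);
        // Reverse to get decreasing order for the ideal ranking.
        for (int i = 0; i < ideal.length / 2; i++) {
            double t = ideal[i];
            ideal[i] = ideal[ideal.length - 1 - i];
            ideal[ideal.length - 1 - i] = t;
        }
        double dcg = 0, idcg = 0;
        for (int i = 0; i < Math.min(k, gains.length); i++) {
            double discount = Math.log(i + 2) / Math.log(2); // log2(rank + 1)
            dcg += gains[i] / discount;
            idcg += ideal[i] / discount;
        }
        return idcg == 0 ? 0 : dcg / idcg;
    }

    public static void main(String[] args) {
        // Gains already in decreasing order: a perfect ranking scores 1.0.
        System.out.println(ndcg(new double[]{1.0, 0.5, 0.25}, 3)); // 1.0
        // A shuffled ranking scores strictly less.
        System.out.println(ndcg(new double[]{0.25, 1.0, 0.5}, 3));
    }
}
```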
Here is a trace of
running an ExperimentRated experiment.
NOTE: The Java code and class files for ExperimentRated have
been added to the code under
/u/mooney/ir-code/. Thus, if you
downloaded a local copy of the code, you need to download the new
*.java and *.class files. If you used the network code directly, you shouldn't have
to change anything.
Code for performing binary relevance feedback is included in the VSR system. It is invoked by using the "-feedback" flag, in which case, after viewing a retrieved document, the user is asked to rate it as either relevant or irrelevant to the query. Then, by using the "r" (redo) command, this feedback will be used to revise the query vector (using the Ide_Regular method), which is then used to produce a new set of retrievals. You can see a trace of using relevance feedback in VSR.
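The Ide_Regular update described above (add each relevant document vector to the query, subtract each irrelevant one) can be sketched on plain sparse vectors. VSR's actual Feedback class operates on its own vector types; the HashMap representation and method name here are illustrative only:

```java
import java.util.HashMap;
import java.util.Map;

public class IdeRegularDemo {
    // Merge a document vector into the query with weight w:
    // w = +1 for relevant documents (addGood), -1 for irrelevant (addBad).
    static void addScaled(Map<String, Double> q, Map<String, Double> d, double w) {
        for (Map.Entry<String, Double> e : d.entrySet())
            q.merge(e.getKey(), w * e.getValue(), Double::sum);
    }

    public static void main(String[] args) {
        Map<String, Double> query = new HashMap<>(Map.of("sputum", 1.0));
        Map<String, Double> relevantDoc = Map.of("sputum", 0.5, "viscosity", 0.8);
        Map<String, Double> irrelevantDoc = Map.of("sputum", 0.2, "asthma", 0.9);
        addScaled(query, relevantDoc, +1.0);   // user said: relevant
        addScaled(query, irrelevantDoc, -1.0); // user said: irrelevant
        // The revised query now leans toward relevant-document terms.
        System.out.println(query);
    }
}
```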
Part of your task is to modify the existing code for relevance feedback to
accept continuous real-valued feedback rather than just binary feedback.
Create a new version of the Feedback class called FeedbackRated that allows
continuous relevance ratings. The new versions of the
addGood and addBad methods for this class should also be given a
real-valued rating of how good or bad the given document is. In order to
modify the Ide Regular algorithm implemented in Feedback to handle real-valued
ratings, just multiply each document vector by its corresponding rating value
before adding or subtracting it from the query. Implement a specialization of
InvertedIndex called InvertedIndexRated that uses continuous relevance-rated
feedback. Allow the user to provide real-valued ratings between -1 (very
irrelevant) and +1 (very relevant). You can see
a trace of using my version of InvertedIndexRated.
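A minimal sketch of the rated update, assuming each document vector is simply scaled by its rating in [-1, +1] before being merged into the query, so a marginally relevant document (e.g. 0.25) moves the query less than a very relevant one (1.0). Names are illustrative, not VSR's actual API:

```java
import java.util.HashMap;
import java.util.Map;

public class RatedFeedbackDemo {
    // Rated Ide Regular step: scale the document vector by its rating
    // before adding it to the query. A negative rating subtracts.
    static void feedback(Map<String, Double> q, Map<String, Double> doc, double rating) {
        for (Map.Entry<String, Double> e : doc.entrySet())
            q.merge(e.getKey(), rating * e.getValue(), Double::sum);
    }

    public static void main(String[] args) {
        Map<String, Double> query = new HashMap<>(Map.of("cf", 1.0));
        feedback(query, Map.of("cf", 0.4, "mucus", 0.6), 0.25);  // weakly relevant
        feedback(query, Map.of("cf", 0.4, "asthma", 0.7), -1.0); // very irrelevant
        System.out.println(query);
    }
}
```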
An important question that can be addressed experimentally is: does relevance feedback improve retrieval results? As discussed in class and the text, when evaluating relevance feedback, one must be sure that the final evaluated results do not include documents for which feedback has been explicitly provided (in machine learning, this error is called "testing on the training data"). Results must be evaluated only on the documents for which no feedback has been provided.
Your assignment is to produce a specialization (subclass) of the ExperimentRated object (ExperimentRelFeedbackRated) that supports fair evaluation of rated relevance feedback, and then use this code to produce recall-precision curves and NDCG results that evaluate the effect of different types of relevance feedback on retrieval performance on the CF corpus.
The main method for your new experiment class should accept an additional (4th) command-line argument: the number of documents, N, for which to simulate feedback (in addition to the existing inputs (corpus directory, query file, and output file) and the option flags accepted by ExperimentRated). After each test query, the system should use the information on the correct relevant documents from the query file to simulate user relevance feedback for the top N documents in the initial ranked retrievals. It should then use this feedback to revise and re-execute the query, producing a new set of ranked retrievals. These final retrieval results should then be evaluated, but first all documents for which feedback was provided must be removed from both the retrieval array and the list of correct retrievals. The final output should be a recall-precision graph and NDCG results for the reduced ("residual") test corpus.
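The residual-corpus filtering step can be sketched as follows (a hypothetical helper, not part of the VSR code; the same filter would be applied to both the ranked retrievals and the gold-standard relevant list):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class ResidualEvalDemo {
    // Drop every document that received simulated feedback, preserving
    // the original rank order of the remaining documents.
    static List<String> residual(List<String> ranked, Set<String> feedbackDocs) {
        List<String> out = new ArrayList<>();
        for (String doc : ranked)
            if (!feedbackDocs.contains(doc)) out.add(doc);
        return out;
    }

    public static void main(String[] args) {
        List<String> reranked = List.of("RN-00593", "RN-00031", "RN-00441", "RN-00047");
        Set<String> feedback = Set.of("RN-00593", "RN-00031"); // top-N judged docs
        System.out.println(residual(reranked, feedback)); // [RN-00441, RN-00047]
    }
}
```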
You should compare your rated-relevance feedback to normal "binary" relevance
feedback and to no feedback. For rated-relevance feedback, since only
ratings for relevant documents are included in queries-rated, you
should use a consistent rating of negative 1 for all irrelevant documents
(i.e., those that do not appear in the list of gold-standard relevant documents
for that query), and the gold-standard 0-1 positive rating for the relevant
documents listed for that query in
queries-rated. To produce
binary feedback, ExperimentRelFeedbackRated should accept a flag
"-binary" that reduces all ratings to positive or negative 1. To
compare to no feedback as a control condition, ExperimentRelFeedbackRated
should accept a flag "-control" that prevents any revision of the
query and simply evaluates the original results on the "residual" corpus (the
corpus after removing the top N ranked retrievals). Example commands:
java ir.eval.ExperimentRelFeedbackRated [Optional FLAG] [CORPORA] [QUERIES] [OUTPUT_FILENAME] [NUM_SIMULATED_FEEDBACK]
java ir.eval.ExperimentRelFeedbackRated -control /u/mooney/ir-code/corpora/cf/ /u/mooney/ir-code/queries/cf/queries-rated control 3
java ir.eval.ExperimentRelFeedbackRated -binary /u/mooney/ir-code/corpora/cf/ /u/mooney/ir-code/queries/cf/queries-rated binary 3
java ir.eval.ExperimentRelFeedbackRated /u/mooney/ir-code/corpora/cf/ /u/mooney/ir-code/queries/cf/queries-rated rated 3
Produce recall-precision curves and NDCG plots for retrieval ranks 1 to 10 for these three approaches. For each approach, try N = 1, 3, and 5, i.e., provide simulated relevance feedback for the top 1, 3, and 5 initial ranked retrieval results. Note that you cannot fairly compare results across different values of N, since the residual test corpus differs for each value of N; consequently, the control condition is also different for each value of N and must use the appropriate residual corpus for that value. You should therefore produce three recall-precision curves and three NDCG plots, one for each value of N (1, 3, 5), each comparing the three approaches on the same residual test corpus. There should be 6 plots in total: precision-recall plots for N = 1, 3, 5 and NDCG plots for N = 1, 3, 5, where each graph has 3 curves for rated feedback, binary feedback, and no feedback.
Your report should summarize your approach, present the results in well organized graphs, and answer at least the following questions (You should explicitly answer these questions, i.e. in your report put "Q1. Does..?" and then give the answer underneath so that we do not have to search for your answers.):
In submitting your solution, follow the general course instructions on submitting projects on the course homepage. Along with that, follow these specific instructions for Project 2:
[PREFIX]_code.zip - Your code in a zip file (*.java and *.class files). Please do not modify the original Java files; instead, extend each class and override the appropriate methods.
[PREFIX]_report.pdf- A PDF report of your experiment as described above with the 6 plots referenced in the instructions.
[PREFIX]_rated_trace.txt- Trace file of a test of InvertedIndexRated with the same commands as in this trace.
[PREFIX]_exp_trace.txt- Trace file of a test of ExperimentRelFeedbackRated similar to this ExperimentRated trace but also containing, for each query, information on the Feedback utilized. This Feedback information should be given as in the following example output:
Query 6: What is the effect of water or other therapeutic agents on the physical properties (viscosity, elasticity) of sputum or bronchial secretions from CF patients?
Returned 955 documents. 24 truly relevant documents.
Feedback:
Positive docs: [RN-00593, RN-00031, RN-00441]
Negative docs: [RN-00047, RN-00976]
Executing New Expanded and Reweighted Query:
1 is relevant; Recall = 4.762%; Precision = 100.0%
...
proj1_jd1234_code.zip
proj1_jd1234_rated_trace.txt
proj1_jd1234_exp_trace.txt
proj1_jd1234_report.pdf
$ unzip -l proj1_jd1234_code.zip
Archive:  proj1_jd1234_code.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    21067  2015-09-14 12:57   ir/vsr/InvertedIndexRated.java
    10049  2015-09-14 17:26   ir/vsr/InvertedIndexRated.class
    21067  2015-09-14 12:57   ir/vsr/FeedbackRated.java
    10049  2015-09-14 17:26   ir/vsr/FeedbackRated.class
    21067  2015-09-14 12:57   ir/eval/ExperimentRelFeedbackRated.java
    10049  2015-09-14 17:26   ir/eval/ExperimentRelFeedbackRated.class
---------                     -------
    91106                     6 files