/u/mooney/ir-code/ir/vsr/. See the Javadoc for this system. Use the main method for InvertedIndex to index a set of documents and then process queries.
You can use the web pages in
/u/mooney/ir-code/corpora/dmoz-science/ as a set of test documents.
This corpus contains 900 pages, 300 random samples each from the DMOZ indices
physics, and chemistry.
See the sample trace of running the
system on this corpus. You can also use a corpus of UTCS department
faculty webpages in
/u/mooney/ir-code/corpora/cs-faculty/. This corpus contains
796 pages spidered from the department web site.
The first query in the trace, "einstein rosen", was intended as a query about a specific issue in quantum mechanics on which these two physicists wrote an article (the paper is actually by Einstein, Podolsky, and Rosen, and the issue is now called the "EPR paradox"). Einstein was very critical of quantum theory, and this paper presented an argument for why the theory must be "incomplete." Of the ten top-ranked documents, nine are just "Einstein" documents that do not mention "Rosen". The only one that contains both words and is actually about EPR is ranked third rather than first.
In the second query in the trace, "gravity waves", the 1st, 4th, 6th, and 7th documents contain only "gravity," while the 2nd, 3rd, 5th, 8th, 9th, 10th, and 11th contain both terms.
In the third query in the trace, "oxygen atom," the 1st, 2nd, 3rd, 4th, and 10th documents contain only "atom", the 5th and 6th contain only "oxygen", and only the 7th, 8th, and 9th contain both.
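The ranking behavior described above follows from how cosine similarity weighs term frequency: a document that mentions one query term many times can outscore a document that mentions every query term once. The following is a minimal, self-contained sketch of that effect. It does not use the ir.vsr classes; the class name CosineSingleTermDemo and all term weights are made up for illustration.

```java
import java.util.Map;

// Hypothetical toy example, independent of the ir.vsr package.
class CosineSingleTermDemo {

    // Cosine similarity between two sparse term-weight vectors.
    // Assumes both vectors are non-empty with at least one non-zero weight.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        return dot / (norm(a) * norm(b));
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Made-up weights: the first document repeats "einstein" but never
        // mentions "rosen"; the second contains both terms among others.
        Map<String, Double> query = Map.of("einstein", 1.0, "rosen", 1.0);
        Map<String, Double> einsteinOnly = Map.of("einstein", 5.0, "physics", 1.0);
        Map<String, Double> bothTerms =
            Map.of("einstein", 1.0, "rosen", 1.0, "paradox", 3.0, "quantum", 3.0);

        System.out.println("einstein only: " + cosine(query, einsteinOnly));
        System.out.println("both terms:    " + cosine(query, bothTerms));
    }
}
```

With these weights the single-term document receives the higher cosine score, which mirrors the pattern seen in the sample trace.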
Here is a sample solution trace produced by my solution to this problem. Replicating the minute details of this trace is not important (it uses a very simple approach to combine cosine and query fraction), but the trace for your system should be similar; in particular, the top retrieved documents should contain all of the query terms. Your solution should be general purpose and not just a hack that works with these specific queries.
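As one illustration of what combining cosine and query fraction could look like, the sketch below multiplies a document's cosine score by the fraction of distinct query terms it contains. This is an assumption made for illustration, not necessarily the approach used to produce the sample trace, and other combinations are possible. The class QueryFractionSketch, its method names, and the cosine values are hypothetical and are not part of ir.vsr.

```java
import java.util.Set;

// Hypothetical sketch; not part of the ir.vsr package.
class QueryFractionSketch {

    // Fraction of distinct query terms that appear in the document (0.0 to 1.0).
    static double queryFraction(Set<String> queryTerms, Set<String> docTerms) {
        if (queryTerms.isEmpty()) {
            return 0.0;
        }
        int matched = 0;
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                matched++;
            }
        }
        return (double) matched / queryTerms.size();
    }

    // One possible combination (an assumption): scale cosine by the fraction,
    // so documents missing query terms are penalized proportionally.
    static double combinedScore(double cosine, double fraction) {
        return cosine * fraction;
    }

    public static void main(String[] args) {
        Set<String> query = Set.of("oxygen", "atom");

        // Made-up cosine scores: the "atom"-only document starts out ahead.
        double atomOnly = combinedScore(
            0.60, queryFraction(query, Set.of("atom", "nucleus")));
        double bothTerms = combinedScore(
            0.40, queryFraction(query, Set.of("oxygen", "atom", "bond")));

        System.out.println("atom only:  " + atomOnly);
        System.out.println("both terms: " + bothTerms);
    }
}
```

With these made-up inputs the document containing both terms moves ahead of the single-term document. A purely multiplicative penalty is only one design choice; how strongly the fraction should influence the ranking is left to your experiments.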
Implement your new version as a specialized class that extends
accepts the same command line options as
[PREFIX]_code.zip - Your code, including the InvertedIndexWithQueryCount code, in a zip file (*.java and *.class files). Please try to put all modified code in a single file if at all possible.
[PREFIX]_trace.txt - Your solution trace file.
[PREFIX]_report.pdf - A PDF report of your experiment (see Project Submission Info for a description of the contents of the report).
proj1_jd1234_code.zip proj1_jd1234_trace.txt proj1_jd1234_report.pdf
$ unzip -l proj1_jd1234_code.zip
Archive:  proj1_jd1234_code.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    21067  2015-09-14 12:57   ir/vsr/InvertedIndexWithQueryCount.java
    10049  2015-09-14 17:26   ir/vsr/InvertedIndexWithQueryCount.class
---------                     -------
    31106                     2 files