/u/mooney/ir-code/ir/vsr/. See the Javadoc for this system. Use the main method for InvertedIndex to index a set of documents and then process queries.
You can use the web pages in
/u/mooney/ir-code/corpora/dmoz-science/ as a set of test documents.
This corpus contains 900 pages, 300 random samples each from the DMOZ indices
physics, and chemistry.
See the sample trace of running the
system on this corpus. You can also use a corpus of UTCS department
faculty webpages in
/u/mooney/ir-code/corpora/cs-faculty/. This corpus contains
pages spidered from the department web site.
The first query in the trace, "einstein podolsky", was intended as a query on a specific issue in quantum mechanics on which these two physicists wrote an article (actually the paper is by Einstein, Podolsky, and Rosen, and the issue is now called the "EPR paradox"). Einstein was very critical of quantum theory, and this paper presented an argument about why the theory must be "incomplete." Of the ten top-ranked documents, nine are just "Einstein" documents and do not reference "Podolsky". The only one that contains both words and is actually about EPR is ranked 4th rather than 1st.
For the second query in the trace, "gravity waves", most of the results contain "gravity" but not "waves." The 6th result contains "waves" but not "gravity." The 2nd, 4th, and 9th results contain both terms, but the terms do not appear together and none of these documents actually discusses gravity waves. Only the 10th result contains both terms and is relevant; however, it uses the phrase "gravitational waves" and never the exact phrase "gravity waves", illustrating the limitations of relying on exact phrase matching.
For the third query in the trace, "oxygen atom," the 2nd, 3rd, 7th, 8th, and 9th results contain only "atom", the 1st, 5th, and 6th contain only "oxygen", and only the 4th and 10th contain both.
Here is a sample trace produced by my solution to this problem. Replicating the minute details of this trace is not important (it uses a very simple approach to combine cosine and query fraction), but the trace for your system should be similar; in particular, the top retrieved documents should contain all of the query terms. Your solution should be general purpose and not just a hack that works with these specific queries.
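To make the idea of combining cosine similarity with query fraction concrete, here is a minimal sketch. It is not the sample solution and does not use the ir.vsr classes: the class name QueryFractionSketch, both method names, the multiplicative combination rule, and the cosine values in main are all assumptions chosen for illustration. The query fraction here is simply the share of distinct query terms that occur in a document.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper for illustration; not part of the ir.vsr package. */
public class QueryFractionSketch {

    /** Fraction of distinct query terms that occur in the document (0.0 to 1.0). */
    static double queryFraction(Set<String> queryTerms, Set<String> docTerms) {
        if (queryTerms.isEmpty()) {
            return 0.0;
        }
        int matched = 0;
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                matched++;
            }
        }
        return (double) matched / queryTerms.size();
    }

    /** Assumed combination rule: scale the cosine score by the query fraction. */
    static double combinedScore(double cosine, double fraction) {
        return cosine * fraction;
    }

    public static void main(String[] args) {
        Set<String> query = new HashSet<>(Arrays.asList("einstein", "podolsky"));
        Set<String> einsteinOnly = new HashSet<>(Arrays.asList("einstein", "relativity"));
        Set<String> epr = new HashSet<>(Arrays.asList("einstein", "podolsky", "rosen"));

        // Made-up cosine values, chosen only to illustrate the re-ranking effect.
        double einsteinScore = combinedScore(0.40, queryFraction(query, einsteinOnly));
        double eprScore = combinedScore(0.30, queryFraction(query, epr));

        System.out.println("einstein-only document: " + einsteinScore);
        System.out.println("EPR document:           " + eprScore);
    }
}
```

With these made-up numbers, the document containing both query terms ends up with the higher combined score even though its raw cosine value is lower, which is the behavior the trace is meant to show. Other combination rules (for example, a weighted sum) would also be reasonable.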
Implement your new version as a specialized class that extends
accepts the same command line options as
[PREFIX]_code.zip - Your code, including the InvertedIndexWithQueryCount code, in a zip file (*.java and *.class files). Please try to put all modified code in a single file if at all possible.
[PREFIX]_trace.txt - Your solution trace file.
[PREFIX]_report.pdf - A PDF report of your experiment (see Project Submission Info for a description of the contents of the report).
proj1_jd1234_code.zip proj1_jd1234_trace.txt proj1_jd1234_report.pdf
$ unzip -l proj1_jd1234_code.zip
Archive:  proj1_jd1234_code.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    21067  2015-09-14 12:57   ir/vsr/InvertedIndexWithQueryCount.java
    10049  2015-09-14 17:26   ir/vsr/InvertedIndexWithQueryCount.class
---------                     -------
    31106                     2 files