/u/mooney/ir-code/ir/vsr/. See the Javadoc for this system. Use the main method for InvertedIndex to index a set of documents and then process queries.
You can use the web pages in /u/mooney/ir-code/corpora/cs-faculty/ as a set of test documents. This dataset contains 1000 pages recently crawled from the UTCS website starting from the faculty information page (https://www.cs.utexas.edu/people).
See the sample trace of using the system on the UTCS dataset.
For example, in the sample trace for the UTCS dataset, for the first query "learning theory", the top two results are just a list of theory faculty, which lists "learning theory" but is not focused on that topic. The third result is a list of theory courses that includes learning theory but does not focus on it. The fourth is a general list of courses that does not include learning theory. The fifth is at least about a former PhD student who did work in learning theory.
The next sample query "information processing" has even worse problems. The top results talk about information but don't even contain the word "processing". The seventh result has "information" and talks about "natural language processing".
For the next query "number theory", again the top two results are just a list of theory faculty, which has "number" but not "number theory" and is not focused on that topic. The third result has several occurrences of "number" but no "theory". The fourth result does include "number theory", but the fifth is just a list of theory courses without "number theory", and the sixth contains theory courses but just talks about "course number".
For the next query "information theory", again the top two results are just a list of theory faculty, which lists "information theory" but is not focused on that topic. The third result is again a list of theory courses that does not even include information theory. The fourth is a general list of courses that does not include information theory and the fifth and sixth results also do not discuss information theory.
For the final query "real world", the top two results are web pages with nothing related to "real world" but different "real time" projects. The third to sixth just say "hello world". The seventh is back to "real time", and finally the eighth talks about "real world" robotics.
Here is a sample solution trace produced by my solution to this problem for the UTCS dataset. Note that the top documents now contain all the query words, close together and in the correct order.
In addition to the normal cosine-similarity metric, I calculated a proximity score for each retrieved document that measured how far apart the query words appeared in the document. The final score was the ratio of the vector score to the proximity score (both components are shown in the trace). The proximity score was the closest distance in the document (measured in number of words, excluding stop words) between an occurrence of one query word and an occurrence of another query word, averaged across all pairs of words in the query and all occurrences of the words in the document. A multiplicative penalty factor was included in the distance metric when a pair of words appeared in the reverse order from that in the query. This is only a sketch of what I did; many details are omitted.
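As a rough illustration of the scoring described above, the following Java sketch averages, over every pair of query terms (taken in query order) and every occurrence of the first term, the distance to the closest occurrence of the second term, and multiplies that distance by a penalty when the second term appears before the first. The class name ProximitySketch, the method names, the penalty value of 2.0, and the fallback of 1.0 when no pair of terms co-occurs are all assumptions made for illustration. They are not taken from the sample solution or from the ir.vsr code, and the sketch assumes the query terms are distinct.

```java
import java.util.List;

final class ProximitySketch {
    // Hypothetical penalty applied when a pair appears in reverse query order.
    static final double REVERSE_PENALTY = 2.0;

    /**
     * positions.get(i) holds the token positions (counted with stop words
     * removed) of the i-th query term in a single document.
     * Returns the average penalized distance to the closest partner term.
     */
    static double proximityScore(List<int[]> positions) {
        double total = 0.0;
        int count = 0;
        for (int i = 0; i < positions.size(); i++) {
            for (int j = i + 1; j < positions.size(); j++) {
                for (int p : positions.get(i)) {
                    double best = Double.POSITIVE_INFINITY;
                    for (int q : positions.get(j)) {
                        // In-order pairs cost their distance; reversed pairs are penalized.
                        double d = (q > p) ? (q - p) : REVERSE_PENALTY * (p - q);
                        best = Math.min(best, d);
                    }
                    if (best != Double.POSITIVE_INFINITY) {
                        total += best;
                        count++;
                    }
                }
            }
        }
        // Assumption: with no co-occurring pair, return 1.0 so the vector score is unchanged.
        return count == 0 ? 1.0 : total / count;
    }

    /** Final score as the ratio of the vector score to the proximity score. */
    static double combinedScore(double vectorScore, double proximityScore) {
        return vectorScore / proximityScore;
    }
}
```

For example, with the first query term at positions 0 and 10 and the second at positions 1 and 8, the closest distances are 1 (in order) and 2 × 2 = 4 (reversed), so the sketch returns 2.5. The inner linear scan is kept for readability; a real index would likely replace it with a faster lookup over sorted positions.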
You do not have to adopt this exact approach. Feel free to be creative. However, your solution should be general-purpose (not hacked to the specific test queries), address the fundamental issue of proximity, and produce similarly improved results for the sample queries. Note that you may need to change many of the fundamental classes and methods in the code to extract and store information on the position of tokens in documents. When making changes, try to add new methods and classes rather than changing existing ones. The final system should support both the original approach and the new proximity-enhanced one (e.g., I created a specialization of InvertedIndex called InvertedPosIndex for the new version). Hint: I found it useful to use the Java Arrays.binarySearch method to efficiently find the closest position of a token to the occurrence of another token, given a sorted array of token positions.
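One way the hint might be applied is sketched below. Arrays.binarySearch returns the index of the key when it is present, and otherwise returns -(insertion point) - 1, so the two neighbors of the insertion point are the only candidates for the closest position. The class name ClosestPosition, the method name closest, the -1 return value for an empty array, and the tie-breaking rule are assumptions for illustration, not part of the provided code.

```java
import java.util.Arrays;

final class ClosestPosition {
    /**
     * Returns the value in sortedPositions closest to target, or -1 if the
     * array is empty. Ties are resolved in favor of the earlier position.
     */
    static int closest(int[] sortedPositions, int target) {
        if (sortedPositions.length == 0) {
            return -1;
        }
        int idx = Arrays.binarySearch(sortedPositions, target);
        if (idx >= 0) {
            return sortedPositions[idx]; // exact match
        }
        int insertion = -idx - 1; // index of the first element greater than target
        if (insertion == 0) {
            return sortedPositions[0];
        }
        if (insertion == sortedPositions.length) {
            return sortedPositions[sortedPositions.length - 1];
        }
        int before = sortedPositions[insertion - 1];
        int after = sortedPositions[insertion];
        return (target - before <= after - target) ? before : after;
    }
}
```

Each lookup takes time logarithmic in the number of occurrences of the token, which matters when a common query word appears many times in a long document.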
Make sure to include the following in your report:
Please submit the following to Gradescope:
code/. This must include the main Java class, InvertedPosIndex.java. In the autograder, the code will be executed as follows: java ir.vsr.InvertedPosIndex -html <path-to-dataset>
report.pdf
trace/curlie.txt and trace/faculty.txt. Your trace/curlie.txt must include the query "background radiation" and your trace/faculty.txt must include the query "academic achievements" for the autograder to validate your system.
On submitting to Gradescope, your files should look something like this:
After the deadline, additional hidden tests will become visible that evaluate your code on more query keywords for both the curlie-science and cs-faculty datasets. The autograder scoring is based on whether your proximity-enhanced retrieval system can successfully find relevant documents containing the query terms. Each test awards full credit if your system meets the baseline performance threshold and partial credit if its results fall below the threshold.
The grading breakdown for this assignment is: