/u/mooney/ir-code/ir/vsr/. See the Javadoc for this system. Use the main method of InvertedIndex to index a set of documents and then process queries.
You can use the web pages in
/u/mooney/ir-code/corpora/yahoo-science/ as a set of test documents.
This corpus contains 900 pages, 300 random samples each from the Yahoo indices
for biology, physics, and chemistry.
See the sample trace of using the system.
For example, in the sample trace, for the query "cold fusion", the first two results contain the phrase, but results 3 through 5 do not; they do not even contain the word "cold". For the query "quantum mechanics", the first result contains the phrase, but the second result is about fluid mechanics and does not even contain the word "quantum", while the third and fourth results are about quantum computing and do not contain the word "mechanics". For the query "chemical physics", the first result is relevant, but the second result contains only "physics", the third only "chemical", and the fourth through sixth only "physics". In such situations, cosine similarity can prefer documents that contain a high density of some of the query words while completely ignoring other query words. In addition, cosine similarity never considers multi-word phrases or the proximity or ordering of words.
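To make this failure mode concrete, here is a toy sketch (plain Java, not part of the ir.vsr code; the class name CosineDemo and the raw term-frequency weighting are invented for illustration) showing how cosine similarity can rank a document that merely repeats "fusion" above one that actually contains both query words:

```java
import java.util.Map;

public class CosineDemo {
    // Cosine similarity between sparse term-frequency vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double bv = b.get(e.getKey());
            if (bv != null) dot += e.getValue() * bv;
            na += e.getValue() * e.getValue();
        }
        for (double v : b.values()) nb += v * v;
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        Map<String, Double> query = Map.of("cold", 1.0, "fusion", 1.0);
        // Document containing both query words once, among other terms.
        Map<String, Double> both = Map.of("cold", 1.0, "fusion", 1.0,
                "energy", 1.0, "research", 1.0, "weather", 1.0, "winter", 1.0);
        // Document that only repeats "fusion" and never mentions "cold".
        Map<String, Double> fusionOnly = Map.of("fusion", 3.0);
        System.out.printf("both words:  %.3f%n", cosine(query, both));      // ~0.577
        System.out.printf("fusion only: %.3f%n", cosine(query, fusionOnly)); // ~0.707
    }
}
```

The "fusion only" document scores higher even though it entirely ignores half the query, which is exactly the behavior visible in the sample trace.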
Appropriate retrieval for many such queries can be aided by noticing that certain phrases such as "cold fusion," "quantum mechanics," and "chemical physics" are important as multi-word phrases and are not well represented by a bag of words.
A simple statistical approach to discovering useful phrases is to look
for frequently occurring sequences of words. In a first pass through the
corpus, your system should find all two-word phrases in the corpus (so-called
"bigrams") and determine the frequency of each bigram across the entire corpus.
Treat a bigram as two indexed tokens produced in sequence by the current
Document token generator; consequently, bigrams never include stop words. After
finding all bigrams, your program should determine the set of most frequent
bigrams and store them as known phrases. Your system should have a parameter,
maxPhrases, that determines the maximum number of phrases
to be remembered (which should default to a value of 1,000). You may find the
Java sorting methods useful for ranking bigrams by frequency.
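The first pass described above can be sketched as follows. This is only an illustration of the count-then-sort idea: the class and method names (BigramCounter, topPhrases) are invented for this sketch, and a plain token list stands in for the stream that the Document token generator actually produces (with stop words already removed):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BigramCounter {
    // First pass: count every adjacent token pair across the corpus and
    // return the maxPhrases most frequent bigrams as known phrases.
    public static List<String> topPhrases(List<List<String>> docs, int maxPhrases) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> tokens : docs) {
            for (int i = 0; i + 1 < tokens.size(); i++) {
                String bigram = tokens.get(i) + " " + tokens.get(i + 1);
                counts.merge(bigram, 1, Integer::sum);
            }
        }
        // Sort bigrams by descending corpus frequency.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        // Keep at most maxPhrases of the most common bigrams.
        List<String> phrases = new ArrayList<>();
        for (int i = 0; i < Math.min(maxPhrases, entries.size()); i++)
            phrases.add(entries.get(i).getKey());
        return phrases;
    }
}
```

A real implementation would stream tokens document by document through the existing token generator rather than hold tokenized documents in memory.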
Then, when producing the vector representations of documents and queries, your system should notice instances of the known phrases (two tokens generated in order) and create a single token for the entire phrase, but no tokens for the individual words. For example, the query "cold fusion" should result in a vector containing the single phrasal token "cold fusion" and not the individual tokens "cold" and "fusion".
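The second pass might look roughly like this. Again, the names (PhraseTokenizer, tokenize) are hypothetical, and a real version would intercept tokens inside the Document token generator rather than operate on a finished token list:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class PhraseTokenizer {
    // Second pass: whenever two adjacent tokens form a known phrase, emit a
    // single phrasal token and suppress the tokens for the individual words.
    public static List<String> tokenize(List<String> tokens, Set<String> knownPhrases) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (i + 1 < tokens.size()
                    && knownPhrases.contains(tokens.get(i) + " " + tokens.get(i + 1))) {
                result.add(tokens.get(i) + " " + tokens.get(i + 1));
                i++; // skip the second word of the phrase
            } else {
                result.add(tokens.get(i));
            }
        }
        return result;
    }
}
```

With "cold fusion" as a known phrase, the tokens ["cold", "fusion", "research"] become ["cold fusion", "research"], matching the required behavior for queries as well as documents.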
Here is a sample trace produced by my solution to this problem. After the first pass through the corpus, the system prints out the 1,000 most common phrases with their frequencies. Notice that, although HTMLFileDocument does its best to parse the HTML and remove formatting commands, a number of the most common phrases are actually HTML commands that still get indexed. You can verify that all of the retrieved documents now contain the complete two-word query phrases. Replicating the minute details of this trace is not important, but your system's trace should be similar and should only retrieve documents that contain these complete common phrases. Your solution should be a general-purpose phrase indexer, not just a hack that works for these specific queries.
Implement your new version as a specialized class of
InvertedIndex, InvertedPhraseIndex, that accepts the same command-line
arguments as InvertedIndex. You may also need to add methods to other
classes. In particular, my solution added methods to at least
In submitting your solution, follow the general course instructions on submitting projects on the course homepage.
Along with that, follow these specific instructions for Project 1:
enceladus.cs.utexas.edu$ turnin -list jiho cs371r-proj1
4064224  4 drwx------ 3 jiho grad  4096 Sep 17 21:43 .
4064225  4 drwxr-xr-x 3 jiho grad  4096 Sep 17 21:43 ./proj1
4064226  6 -rw-r--r-- 1 jiho grad  5249 Sep 12 15:05 ./proj1/soln-trace
4064227  6 -rw-r--r-- 1 jiho grad  5249 Sep 12 15:05 ./proj1/REPORT.html
4064228  4 drwxr-xr-x 3 jiho grad  4096 Sep 17 21:43 ./proj1/ir
4064229  4 drwxr-xr-x 2 jiho grad  4096 Sep 17 21:43 ./proj1/ir/vsr
4064230 22 -rw-r--r-- 1 jiho grad 22478 Sep 12 15:06 ./proj1/ir/vsr/InvertedPhraseIndex.java
4064231  1 -rw-r--r-- 1 jiho grad   623 Aug 16 10:40 ./proj1/ir/vsr/HashMapVector$1.class
4064232 10 -rw-r--r-- 1 jiho grad  9823 Sep 12 15:06 ./proj1/ir/vsr/InvertedPhraseIndex.class
4064233  4 -rw-r--r-- 1 jiho grad  3922 Aug 16 10:40 ./proj1/ir/vsr/HashMapVector.class
4064234  9 -rw-r--r-- 1 jiho grad  8692 Aug 14 17:17 ./proj1/ir/vsr/HashMapVector.java
4064235  4 -rw-r--r-- 1 jiho grad  3151 Sep 17 01:18 ./proj1/ir/vsr/Document.class
4064236  8 -rw-r--r-- 1 jiho grad  7523 Sep 17 01:18 ./proj1/ir/vsr/Document.java
enceladus.cs.utexas.edu$