ir.vsr
Class InvertedIndex

java.lang.Object
  extended by ir.vsr.InvertedIndex

public class InvertedIndex
extends java.lang.Object

An inverted index for vector-space information retrieval. Contains methods for creating an inverted index from a set of documents and retrieving ranked matches to queries using standard TF/IDF weighting and cosine similarity.


Field Summary
 java.io.File dirFile
          The directory from which the indexed documents come.
 java.util.List<DocumentReference> docRefs
          A list of all indexed documents.
 short docType
          The type of Documents (text, HTML).
 boolean feedback
          Whether relevance feedback using the Ide_regular algorithm is used.
static int MAX_RETRIEVALS
          The maximum number of retrieved documents for a query to present to the user at a time.
 boolean stem
          Whether tokens should be stemmed with the Porter stemmer.
 java.util.Map<java.lang.String,TokenInfo> tokenHash
          A HashMap where tokens are indexed.
 
Constructor Summary
InvertedIndex(java.io.File dirFile, short docType, boolean stem, boolean feedback)
          Create an inverted index of the documents in a directory.
InvertedIndex(java.util.List<Example> examples)
          Create an inverted index of the documents in a List of Example objects for text categorization.
 
Method Summary
 void clear()
          Clear all documents from the inverted index.
protected  void computeIDFandDocumentLengths()
          Compute the IDF factor for every token in the index and the length of the document vector for every document referenced in the index.
protected  Retrieval getRetrieval(double queryLength, DocumentReference docRef, double score)
          Calculate the final score for a retrieval and return a Retrieval object representing the retrieval with its final score.
 double incorporateToken(java.lang.String token, double count, java.util.Map<DocumentReference,DoubleValue> retrievalHash)
          Retrieve the documents indexed by this token in the inverted index, add each one to the retrievalHash if needed, and update its running total score.
protected  void indexDocument(FileDocument doc, HashMapVector vector)
          Index the given document using its corresponding vector.
protected  void indexDocuments()
          Index the documents in dirFile.
 void indexDocuments(java.util.List<Example> examples)
          Index the documents in the List of Examples for text categorization.
protected  void indexToken(java.lang.String token, int count, DocumentReference docRef)
          Add a token occurrence to the index.
static void main(java.lang.String[] args)
          Index a directory of files and then interactively accept retrieval queries.
 void presentRetrievals(HashMapVector queryVector, Retrieval[] retrievals)
          Print out a ranked set of retrievals.
 void print()
          Print out an inverted index by listing each token and the documents it occurs in.
 void printRetrievals(Retrieval[] retrievals, int start)
          Print out at most MAX_RETRIEVALS ranked retrievals starting at given starting rank number.
 void processQueries()
          Enter an interactive user-query loop, accepting queries and showing the retrieved documents in ranked order.
 Retrieval[] retrieve(Document doc)
          Perform ranked retrieval on this input query Document.
 Retrieval[] retrieve(HashMapVector vector)
          Perform ranked retrieval on this input query Document vector.
 Retrieval[] retrieve(java.lang.String input)
          Perform ranked retrieval on this input query.
 boolean showRetrievals(Retrieval[] retrievals)
          Show the top retrievals to the user if there are any.
 int size()
          Return the number of tokens indexed.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MAX_RETRIEVALS

public static final int MAX_RETRIEVALS
The maximum number of retrieved documents for a query to present to the user at a time.

See Also:
Constant Field Values

tokenHash

public java.util.Map<java.lang.String,TokenInfo> tokenHash
A HashMap where tokens are indexed. Each indexed token maps to a TokenInfo.


docRefs

public java.util.List<DocumentReference> docRefs
A list of all indexed documents. Elements are DocumentReference objects.


dirFile

public java.io.File dirFile
The directory from which the indexed documents come.


docType

public short docType
The type of Documents (text, HTML). See docType in DocumentIterator.


stem

public boolean stem
Whether tokens should be stemmed with the Porter stemmer.


feedback

public boolean feedback
Whether relevance feedback using the Ide_regular algorithm is used.

Constructor Detail

InvertedIndex

public InvertedIndex(java.io.File dirFile,
                     short docType,
                     boolean stem,
                     boolean feedback)
Create an inverted index of the documents in a directory.

Parameters:
dirFile - The directory of files to index.
docType - The type of documents to index (See docType in DocumentIterator)
stem - Whether tokens should be stemmed with the Porter stemmer.
feedback - Whether relevance feedback should be used.

InvertedIndex

public InvertedIndex(java.util.List<Example> examples)
Create an inverted index of the documents in a List of Example objects for text categorization.

Parameters:
examples - A List of the Example objects to index for text categorization.
Method Detail

indexDocuments

protected void indexDocuments()
Index the documents in dirFile.


indexDocuments

public void indexDocuments(java.util.List<Example> examples)
Index the documents in the List of Examples for text categorization.


indexDocument

protected void indexDocument(FileDocument doc,
                             HashMapVector vector)
Index the given document using its corresponding vector.


indexToken

protected void indexToken(java.lang.String token,
                          int count,
                          DocumentReference docRef)
Add a token occurrence to the index.

Parameters:
token - The token to index.
count - The number of times it occurs in the document.
docRef - A reference to the Document it occurs in.

computeIDFandDocumentLengths

protected void computeIDFandDocumentLengths()
Compute the IDF factor for every token in the index and the length of the document vector for every document referenced in the index.
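
As an illustration only, the computation described above might look like the following sketch. It assumes that IDF is log2(N/df) and that each document weight is the raw token count times IDF; this page does not state the exact formula. The class IdfLengthSketch, its methods, and the nested-map posting layout (token -> document id -> count) are hypothetical stand-ins for tokenHash, TokenInfo, and DocumentReference.

```java
import java.util.HashMap;
import java.util.Map;

public class IdfLengthSketch {

    /** Assumed IDF: log2(numDocs / df), where df is the number of documents containing the token. */
    static Map<String, Double> computeIdf(Map<String, Map<String, Integer>> postings, int numDocs) {
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : postings.entrySet()) {
            int df = e.getValue().size();
            idf.put(e.getKey(), Math.log((double) numDocs / df) / Math.log(2.0));
        }
        return idf;
    }

    /** Euclidean length of each document vector, with assumed weights count * idf. */
    static Map<String, Double> computeLengths(Map<String, Map<String, Integer>> postings,
                                              Map<String, Double> idf) {
        // First accumulate the sum of squared weights per document.
        Map<String, Double> sumSquares = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : postings.entrySet()) {
            double tokenIdf = idf.get(e.getKey());
            for (Map.Entry<String, Integer> occ : e.getValue().entrySet()) {
                double weight = occ.getValue() * tokenIdf;
                sumSquares.merge(occ.getKey(), weight * weight, Double::sum);
            }
        }
        // Then take the square root to obtain each vector length.
        Map<String, Double> lengths = new HashMap<>();
        for (Map.Entry<String, Double> e : sumSquares.entrySet()) {
            lengths.put(e.getKey(), Math.sqrt(e.getValue()));
        }
        return lengths;
    }
}
```

IDF has to be computed for every token before any length is taken, because each document weight depends on the IDF of its token. That ordering is one plausible reason both steps live in a single method.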


print

public void print()
Print out an inverted index by listing each token and the documents it occurs in. Include info on IDF factors, occurrence counts, and document vector lengths.


size

public int size()
Return the number of tokens indexed.


clear

public void clear()
Clear all documents from the inverted index.


retrieve

public Retrieval[] retrieve(java.lang.String input)
Perform ranked retrieval on this input query.


retrieve

public Retrieval[] retrieve(Document doc)
Perform ranked retrieval on this input query Document.


retrieve

public Retrieval[] retrieve(HashMapVector vector)
Perform ranked retrieval on this input query Document vector.


getRetrieval

protected Retrieval getRetrieval(double queryLength,
                                 DocumentReference docRef,
                                 double score)
Calculate the final score for a retrieval and return a Retrieval object representing the retrieval with its final score.

Parameters:
queryLength - The length of the query vector, incorporated into the final score
docRef - The document reference for the document concerned
score - The partially computed score
Returns:
The retrieval object for the document described by docRef and score under the query with length queryLength
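
A minimal sketch of the final scoring step, assuming the partially computed score is the dot product of the query and document vectors and that the final score is their cosine similarity, that is, the dot product divided by both vector lengths. FinalScoreSketch and finalScore are hypothetical names, the document length is passed in directly rather than read from a DocumentReference, and the zero-length guard is an assumption rather than documented behavior.

```java
public class FinalScoreSketch {

    /**
     * Normalizes a partial (dot-product) score into a cosine similarity.
     * Returns 0.0 when either vector has zero length (an assumed convention).
     */
    static double finalScore(double queryLength, double docLength, double partialScore) {
        if (queryLength == 0.0 || docLength == 0.0) {
            return 0.0;
        }
        return partialScore / (queryLength * docLength);
    }
}
```

For example, a partial score of 8.0 with a query length of 2.0 and a document length of 4.0 normalizes to 1.0, the value expected when the two vectors point in the same direction.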

incorporateToken

public double incorporateToken(java.lang.String token,
                               double count,
                               java.util.Map<DocumentReference,DoubleValue> retrievalHash)
Retrieve the documents indexed by this token in the inverted index, add each one to the retrievalHash if needed, and update its running total score.

Parameters:
token - The token in the query to incorporate.
count - The count of this token in the query.
retrievalHash - The hash table of retrieved DocumentReferences and current scores.
Returns:
The square of the weight of this token in the query vector for use in calculating the length of the query vector.
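
The accumulation described above can be sketched as follows. This is not the library's code: it assumes the query weight is count times IDF, that each document weight is its occurrence count times IDF, and that a token missing from the index contributes nothing. IncorporateTokenSketch and the plain String/Double maps are hypothetical stand-ins for TokenInfo, DocumentReference, and DoubleValue.

```java
import java.util.Map;

public class IncorporateTokenSketch {

    /**
     * Adds one query token's contribution to the running score of every document
     * that contains it, and returns the squared query weight of the token.
     */
    static double incorporateToken(String token, double count,
                                   Map<String, Map<String, Integer>> postings,
                                   Map<String, Double> idf,
                                   Map<String, Double> scores) {
        Map<String, Integer> occurrences = postings.get(token);
        if (occurrences == null) {
            return 0.0; // assumption: an unindexed token is ignored
        }
        double tokenIdf = idf.get(token);
        double queryWeight = count * tokenIdf;
        for (Map.Entry<String, Integer> occ : occurrences.entrySet()) {
            double docWeight = occ.getValue() * tokenIdf;
            // merge() inserts the document on first sight, otherwise adds to its total.
            scores.merge(occ.getKey(), queryWeight * docWeight, Double::sum);
        }
        return queryWeight * queryWeight;
    }
}
```

Summing the returned values over all query tokens and taking the square root would give the query vector length that getRetrieval expects as queryLength.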

processQueries

public void processQueries()
Enter an interactive user-query loop, accepting queries and showing the retrieved documents in ranked order.


presentRetrievals

public void presentRetrievals(HashMapVector queryVector,
                              Retrieval[] retrievals)
Print out a ranked set of retrievals. Show the file name and score for the top retrieved documents in order. Then allow user to see more or display individual documents.


showRetrievals

public boolean showRetrievals(Retrieval[] retrievals)
Show the top retrievals to the user if there are any.

Returns:
true if retrievals are non-empty.

printRetrievals

public void printRetrievals(Retrieval[] retrievals,
                            int start)
Print out at most MAX_RETRIEVALS ranked retrievals starting at given starting rank number. Include the rank number and the score.


main

public static void main(java.lang.String[] args)
Index a directory of files and then interactively accept retrieval queries. Command format: "InvertedIndex [OPTION]* [DIR]", where DIR is the name of the directory whose files should be indexed and each OPTION can be "-html" to specify HTML files whose HTML tags should be removed, "-stem" to specify that tokens should be stemmed with the Porter stemmer, or "-feedback" to allow relevance feedback from the user.
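
To make the command format concrete, here is a hedged sketch of how such arguments could be parsed. ArgsSketch and its Options holder are hypothetical and not part of ir.vsr. Treating the last non-option argument as DIR and rejecting any other unrecognized argument are assumptions about behavior this page does not specify.

```java
public class ArgsSketch {

    /** Parsed command-line settings (hypothetical holder, not part of ir.vsr). */
    static final class Options {
        boolean html;
        boolean stem;
        boolean feedback;
        String dir; // null when no directory was given
    }

    /** Parses "[OPTION]* [DIR]" into an Options value. */
    static Options parse(String[] args) {
        Options opts = new Options();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            boolean last = (i == args.length - 1);
            if (arg.equals("-html")) {
                opts.html = true;
            } else if (arg.equals("-stem")) {
                opts.stem = true;
            } else if (arg.equals("-feedback")) {
                opts.feedback = true;
            } else if (last) {
                opts.dir = arg; // assumed: the final non-option argument is DIR
            } else {
                throw new IllegalArgumentException("Unknown option: " + arg);
            }
        }
        return opts;
    }
}
```

With the arguments "-html -stem docs", this sketch sets html and stem, leaves feedback unset, and records "docs" as the directory.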