InvertedIndex

ir.vsr
Class InvertedIndex

java.lang.Object
  ir.vsr.InvertedIndex

public class InvertedIndex
extends java.lang.Object

An inverted index for vector-space information retrieval. Contains methods for creating an inverted index from a set of documents and retrieving ranked matches to queries using standard TF/IDF weighting and cosine similarity.
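The cosine-similarity ranking mentioned above can be illustrated with a minimal, standalone sketch. It does not use the ir.vsr classes: the class name `CosineSketch` and the representation of a weighted vector as a `Map<String, Double>` are assumptions made only for illustration.

```java
import java.util.Map;

/** Hypothetical helper, not part of ir.vsr: cosine similarity of sparse weighted vectors. */
final class CosineSketch {
    /** Euclidean length of a sparse vector (square root of the sum of squared weights). */
    static double length(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    /** Dot product over shared tokens, divided by the product of the two lengths. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        double denom = length(a) * length(b);
        // An empty vector has no direction; treat its similarity as 0.
        return denom == 0.0 ? 0.0 : dot / denom;
    }
}
```

Vectors that point in the same direction score 1, and vectors with no shared tokens score 0, regardless of document size.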

Field Summary
`java.io.File`	`dirFile` The directory from which the indexed documents come.
`java.util.List<DocumentReference>`	`docRefs` A list of all indexed documents.
`short`	`docType` The type of Documents (text, HTML).
`boolean`	`feedback` Whether relevance feedback using the Ide_regular algorithm is used
`static int`	`MAX_RETRIEVALS` The maximum number of retrieved documents for a query to present to the user at a time
`boolean`	`stem` Whether tokens should be stemmed with Porter stemmer
`java.util.Map<java.lang.String,TokenInfo>`	`tokenHash` A HashMap where tokens are indexed.

Constructor Summary
`InvertedIndex(java.io.File dirFile, short docType, boolean stem, boolean feedback)` Create an inverted index of the documents in a directory.
`InvertedIndex(java.util.List<Example> examples)` Create an inverted index of the documents contained in a List of Example objects used for text categorization.

Method Summary
`void`	`clear()` Clear all documents from the inverted index
`protected void`	`computeIDFandDocumentLengths()` Compute the IDF factor for every token in the index and the length of the document vector for every document referenced in the index.
`protected Retrieval`	`getRetrieval(double queryLength, DocumentReference docRef, double score)` Calculate the final score for a retrieval and return a Retrieval object representing the retrieval with its final score.
`double`	`incorporateToken(java.lang.String token, double count, java.util.Map<DocumentReference,DoubleValue> retrievalHash)` Retrieve the documents indexed by this token in the inverted index, add each one to the retrievalHash if needed, and update its running total score.
`protected void`	`indexDocument(FileDocument doc, HashMapVector vector)` Index the given document using its corresponding vector
`protected void`	`indexDocuments()` Index the documents in dirFile.
`void`	`indexDocuments(java.util.List<Example> examples)` Index the documents in the List of Examples for text categorization.
`protected void`	`indexToken(java.lang.String token, int count, DocumentReference docRef)` Add a token occurrence to the index.
`static void`	`main(java.lang.String[] args)` Index a directory of files and then interactively accept retrieval queries.
`void`	`presentRetrievals(HashMapVector queryVector, Retrieval[] retrievals)` Print out a ranked set of retrievals.
`void`	`print()` Print out an inverted index by listing each token and the documents it occurs in.
`void`	`printRetrievals(Retrieval[] retrievals, int start)` Print out at most MAX_RETRIEVALS ranked retrievals starting at given starting rank number.
`void`	`processQueries()` Enter an interactive user-query loop, accepting queries and showing the retrieved documents in ranked order.
`Retrieval[]`	`retrieve(Document doc)` Perform ranked retrieval on this input query Document.
`Retrieval[]`	`retrieve(HashMapVector vector)` Perform ranked retrieval on this input query Document vector.
`Retrieval[]`	`retrieve(java.lang.String input)` Perform ranked retrieval on this input query.
`boolean`	`showRetrievals(Retrieval[] retrievals)` Show the top retrievals to the user if there are any.
`int`	`size()` Return the number of tokens indexed.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

MAX_RETRIEVALS

public static final int MAX_RETRIEVALS

The maximum number of retrieved documents for a query to present to the user at a time

See Also: Constant Field Values

tokenHash

public java.util.Map<java.lang.String,TokenInfo> tokenHash

A HashMap where tokens are indexed. Each indexed token maps to a TokenInfo.

docRefs

public java.util.List<DocumentReference> docRefs

A list of all indexed documents. Elements are DocumentReference objects.

dirFile

public java.io.File dirFile

The directory from which the indexed documents come.

docType

public short docType

The type of Documents (text, HTML). See docType in DocumentIterator.

stem

public boolean stem

Whether tokens should be stemmed with Porter stemmer

feedback

public boolean feedback

Whether relevance feedback using the Ide_regular algorithm is used

Constructor Detail

InvertedIndex

public InvertedIndex(java.io.File dirFile,
                     short docType,
                     boolean stem,
                     boolean feedback)

Create an inverted index of the documents in a directory.

Parameters:
    dirFile - The directory of files to index.
    docType - The type of documents to index (see docType in DocumentIterator).
    stem - Whether tokens should be stemmed with Porter stemmer.
    feedback - Whether relevance feedback should be used.

InvertedIndex

public InvertedIndex(java.util.List<Example> examples)

Create an inverted index of the documents contained in a List of Example objects used for text categorization.

Parameters:
    examples - A List containing the Example objects for text categorization to index.

Method Detail

indexDocuments

protected void indexDocuments()

Index the documents in dirFile.

indexDocuments

public void indexDocuments(java.util.List<Example> examples)

Index the documents in the List of Examples for text categorization.

indexDocument

protected void indexDocument(FileDocument doc,
                             HashMapVector vector)

Index the given document using its corresponding vector

indexToken

protected void indexToken(java.lang.String token,
                          int count,
                          DocumentReference docRef)

Add a token occurrence to the index.

Parameters:
    token - The token to index.
    count - The number of times it occurs in the document.
    docRef - A reference to the Document it occurs in.
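As a rough illustration of what adding a token occurrence involves, the sketch below keeps one postings list per token. It is a simplified stand-in, not the ir.vsr implementation: `PostingsSketch`, `Posting`, and the use of a plain `String` document id in place of DocumentReference are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the token index; not the ir.vsr implementation. */
final class PostingsSketch {
    /** One occurrence record: which document, and how often the token occurs in it. */
    static final class Posting {
        final String docId;
        final int count;

        Posting(String docId, int count) {
            this.docId = docId;
            this.count = count;
        }
    }

    /** Maps each token to the list of documents it occurs in. */
    final Map<String, List<Posting>> postings = new HashMap<>();

    /** Record that token occurs count times in the document docId. */
    void indexToken(String token, int count, String docId) {
        postings.computeIfAbsent(token, k -> new ArrayList<>())
                .add(new Posting(docId, count));
    }

    /** Number of distinct tokens indexed. */
    int size() {
        return postings.size();
    }
}
```

A token seen for the first time gets a fresh postings list; later occurrences in other documents are appended to the same list, so `size()` counts distinct tokens rather than occurrences.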

computeIDFandDocumentLengths

protected void computeIDFandDocumentLengths()

Compute the IDF factor for every token in the index and the length of the document vector for every document referenced in the index.
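The two passes described here can be sketched as follows. This is an illustration under stated assumptions rather than the class's actual code: IDF is taken to be log2(N / df), a common textbook form, document weights are taken to be count times IDF, and `IdfSketch` with its nested-map postings layout is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; the exact IDF formula used by ir.vsr is an assumption here. */
final class IdfSketch {
    /** Assumed form: idf(t) = log2(numDocs / df(t)), where df is the number of documents containing t. */
    static Map<String, Double> computeIdf(Map<String, Map<String, Integer>> postings, int numDocs) {
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : postings.entrySet()) {
            int df = e.getValue().size();
            idf.put(e.getKey(), Math.log((double) numDocs / df) / Math.log(2.0));
        }
        return idf;
    }

    /** Length of each document vector, with weight(t, d) = count(t, d) * idf(t). */
    static Map<String, Double> computeLengths(Map<String, Map<String, Integer>> postings,
                                              Map<String, Double> idf) {
        Map<String, Double> sumSquares = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : postings.entrySet()) {
            double tokenIdf = idf.get(e.getKey());
            for (Map.Entry<String, Integer> d : e.getValue().entrySet()) {
                double w = d.getValue() * tokenIdf;
                sumSquares.merge(d.getKey(), w * w, Double::sum);
            }
        }
        Map<String, Double> lengths = new HashMap<>();
        for (Map.Entry<String, Double> e : sumSquares.entrySet()) {
            lengths.put(e.getKey(), Math.sqrt(e.getValue()));
        }
        return lengths;
    }
}
```

Document lengths depend on the IDF values, so the IDF pass has to finish before the lengths are computed. Under this formula a token that appears in every document gets an IDF of 0 and contributes nothing to any length.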

print

public void print()

Print out an inverted index by listing each token and the documents it occurs in. Include info on IDF factors, occurrence counts, and document vector lengths.

size

public int size()

Return the number of tokens indexed.

clear

public void clear()

Clear all documents from the inverted index

retrieve

public Retrieval[] retrieve(java.lang.String input)

Perform ranked retrieval on this input query.

retrieve

public Retrieval[] retrieve(Document doc)

Perform ranked retrieval on this input query Document.

retrieve

public Retrieval[] retrieve(HashMapVector vector)

Perform ranked retrieval on this input query Document vector.

getRetrieval

protected Retrieval getRetrieval(double queryLength,
                                 DocumentReference docRef,
                                 double score)

Calculate the final score for a retrieval and return a Retrieval object representing the retrieval with its final score.

Parameters:
    queryLength - The length of the query vector, incorporated into the final score.
    docRef - The document reference for the document concerned.
    score - The partially computed score.
Returns:
    The retrieval object for the document described by docRef and score under the query with length queryLength.

incorporateToken

public double incorporateToken(java.lang.String token,
                               double count,
                               java.util.Map<DocumentReference,DoubleValue> retrievalHash)

Retrieve the documents indexed by this token in the inverted index, add each one to the retrievalHash if needed, and update its running total score.

Parameters:
    token - The token in the query to incorporate.
    count - The count of this token in the query.
    retrievalHash - The hash table of retrieved DocumentReferences and current scores.
Returns:
    The square of the weight of this token in the query vector, for use in calculating the length of the query vector.
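A minimal sketch of this accumulation step is given below. It is not the ir.vsr code: `IncorporateSketch`, the nested-map postings, the `String` document ids, and the choice to return 0 for a token that is not in the index are assumptions. Weights are assumed to be count times IDF on both the query and the document side.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of per-token score accumulation; not the ir.vsr implementation. */
final class IncorporateSketch {
    /** token -> (docId -> occurrence count); an assumed, simplified postings layout. */
    final Map<String, Map<String, Integer>> postings = new HashMap<>();
    /** token -> IDF factor, assumed to be precomputed. */
    final Map<String, Double> idf = new HashMap<>();

    /**
     * Add this query token's contribution to the running score of every document
     * it occurs in, and return the squared query weight of the token.
     */
    double incorporateToken(String token, double count, Map<String, Double> scores) {
        Map<String, Integer> docs = postings.get(token);
        if (docs == null) {
            return 0.0; // assumption: an unindexed token contributes nothing
        }
        double tokenIdf = idf.getOrDefault(token, 0.0);
        double queryWeight = count * tokenIdf;
        for (Map.Entry<String, Integer> e : docs.entrySet()) {
            double docWeight = e.getValue() * tokenIdf;
            // Creates the entry on first sight of the document, otherwise adds to it.
            scores.merge(e.getKey(), queryWeight * docWeight, Double::sum);
        }
        return queryWeight * queryWeight;
    }
}
```

A caller would sum the returned values over all query tokens and take the square root to obtain the query vector length, which is the quantity that getRetrieval receives as queryLength.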

processQueries

public void processQueries()

Enter an interactive user-query loop, accepting queries and showing the retrieved documents in ranked order.

presentRetrievals

public void presentRetrievals(HashMapVector queryVector,
                              Retrieval[] retrievals)

Print out a ranked set of retrievals. Show the file name and score for the top retrieved documents in order. Then allow user to see more or display individual documents.

showRetrievals

public boolean showRetrievals(Retrieval[] retrievals)

Show the top retrievals to the user if there are any.

Returns:
    true if retrievals are non-empty.

printRetrievals

public void printRetrievals(Retrieval[] retrievals,
                            int start)

Print out at most MAX_RETRIEVALS ranked retrievals starting at given starting rank number. Include the rank number and the score.
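The paging behaviour can be sketched as below. Several details are assumptions, not facts about the class: the value 10 for `MAX_RETRIEVALS` (the real value is listed under Constant Field Values), the treatment of `start` as a zero-based offset with ranks printed one-based, and the output format. `PagingSketch` and its parallel-array input are hypothetical.

```java
/** Hypothetical sketch of printing one page of ranked results; not the ir.vsr implementation. */
final class PagingSketch {
    /** Assumed page size for illustration only. */
    static final int MAX_RETRIEVALS = 10;

    /**
     * Format at most MAX_RETRIEVALS results beginning at the zero-based offset start.
     * Each line shows the one-based rank, the document name, and the score.
     */
    static String formatPage(String[] names, double[] scores, int start) {
        StringBuilder sb = new StringBuilder();
        int end = Math.min(start + MAX_RETRIEVALS, names.length);
        for (int i = start; i < end; i++) {
            sb.append(i + 1).append(". ")
              .append(names[i])
              .append(" Score: ").append(scores[i])
              .append('\n');
        }
        return sb.toString();
    }
}
```

Clamping the end index to the array length means the last page is simply shorter rather than an error.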

main

public static void main(java.lang.String[] args)

Index a directory of files and then interactively accept retrieval queries. Command format: "InvertedIndex [OPTION]* [DIR]", where DIR is the name of the directory whose files should be indexed and each OPTION is one of: "-html" to specify HTML files whose HTML tags should be removed; "-stem" to specify that tokens should be stemmed with the Porter stemmer; "-feedback" to allow relevance feedback from the user.
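The command format can be read as a simple flag scan. The sketch below shows one way such arguments might be parsed. `ArgsSketch` is hypothetical, and both rejecting unknown options and letting the last non-option argument win as DIR are assumptions rather than documented behaviour.

```java
/** Hypothetical parser for the documented command format; not the ir.vsr implementation. */
final class ArgsSketch {
    boolean html;
    boolean stem;
    boolean feedback;
    String dir;

    /** Scan "[OPTION]* [DIR]": the three documented flags, then the directory name. */
    static ArgsSketch parse(String[] args) {
        ArgsSketch o = new ArgsSketch();
        for (String a : args) {
            switch (a) {
                case "-html":
                    o.html = true;
                    break;
                case "-stem":
                    o.stem = true;
                    break;
                case "-feedback":
                    o.feedback = true;
                    break;
                default:
                    if (a.startsWith("-")) {
                        // Assumption: unknown options are an error.
                        throw new IllegalArgumentException("Unknown option: " + a);
                    }
                    o.dir = a; // last non-option argument is taken as DIR
            }
        }
        return o;
    }
}
```

For example, the arguments `-html -stem corpus` would enable HTML tag removal and stemming, leave feedback off, and index the directory named `corpus`.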
