ir.vsr
Class DocumentIterator

java.lang.Object
  extended by ir.vsr.DocumentIterator

public class DocumentIterator
extends java.lang.Object

An object for iterating over a set of documents in a directory. Produces FileDocument objects that are either TextFileDocuments or HTMLFileDocuments, depending on whether docType is TYPE_TEXT or TYPE_HTML.
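The iteration pattern described here can be illustrated with a minimal sketch. DirectoryIteratorSketch is a hypothetical stand-in, not the ir.vsr source: it mirrors only the documented files and position fields and the hasMoreDocuments/nextDocument pair, and it hands out plain java.io.File objects rather than FileDocuments. Sorting the listing and returning null once the iterator is exhausted are assumptions of the sketch, as are the helper names visitAll and makeSampleDir.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in, NOT the ir.vsr source. It mirrors only the documented
// fields (files, position) and the hasMoreDocuments/nextDocument pair.
class DirectoryIteratorSketch {
    protected File[] files;      // entries of the directory
    protected int position = 0;  // index of the next entry to return

    DirectoryIteratorSketch(File dirFile) {
        File[] listed = dirFile.listFiles();  // null if dirFile is not a readable directory
        files = (listed == null) ? new File[0] : listed;
        Arrays.sort(files);  // assumption: sort so the visit order is repeatable
    }

    boolean hasMoreDocuments() {
        return position < files.length;
    }

    // Returns a plain File; the documented method returns a FileDocument instead.
    // Returning null when exhausted is an assumption of this sketch.
    File nextDocument() {
        return hasMoreDocuments() ? files[position++] : null;
    }

    // Drains an iterator over dir and collects the file names in visit order.
    static List<String> visitAll(File dir) {
        List<String> names = new ArrayList<>();
        DirectoryIteratorSketch it = new DirectoryIteratorSketch(dir);
        while (it.hasMoreDocuments()) {
            names.add(it.nextDocument().getName());
        }
        return names;
    }

    // Creates a temporary directory holding two small text files.
    static File makeSampleDir() {
        try {
            File dir = Files.createTempDirectory("docs").toFile();
            Files.write(new File(dir, "a.txt").toPath(), "first document".getBytes());
            Files.write(new File(dir, "b.txt").toPath(), "second document".getBytes());
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(visitAll(makeSampleDir()));
    }
}
```

With the real class, the same while loop would presumably call hasMoreDocuments() and nextDocument() on a DocumentIterator and receive FileDocument objects.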


Field Summary
protected  short docType
          The type of documents to be created
protected  java.io.File[] files
          An array of files in the directory
protected  int position
          The current position of the iterator in this array
protected  boolean stem
          Whether tokens should be stemmed with Porter stemmer
static short TYPE_HTML
          docType for HTML files
static short TYPE_TEXT
          docType for ASCII text files
 
Constructor Summary
DocumentIterator(java.io.File dirFile)
          Create an iterator for TextFileDocuments
DocumentIterator(java.io.File dirFile, short docType, boolean stem)
          Create an iterator over the documents in dirFile with the given document type and stemming setting
DocumentIterator(java.io.File dirFile, short docType, boolean stem, java.io.FilenameFilter filter)
          Create an iterator over the documents in dirFile selected by filter, with the given document type and stemming setting
 
Method Summary
 boolean hasMoreDocuments()
          Returns true iff there are more documents in this directory
static void main(java.lang.String[] args)
          Test by printing the bag-of-words for each file in the given directory
 FileDocument nextDocument()
          Get the next document
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

TYPE_TEXT

public static final short TYPE_TEXT
docType for ASCII text files

See Also:
Constant Field Values

TYPE_HTML

public static final short TYPE_HTML
docType for HTML files

See Also:
Constant Field Values

files

protected java.io.File[] files
An array of files in the directory


position

protected int position
The current position of the iterator in this array


docType

protected short docType
The type of documents to be created


stem

protected boolean stem
Whether tokens should be stemmed with Porter stemmer

Constructor Detail

DocumentIterator

public DocumentIterator(java.io.File dirFile,
                        short docType,
                        boolean stem,
                        java.io.FilenameFilter filter)
Create an iterator over the documents in dirFile selected by filter, with the given document type and stemming setting

Parameters:
dirFile - The directory to use as a source of documents.
docType - The type of document to create, e.g. TYPE_TEXT or TYPE_HTML.
stem - Whether tokens should be stemmed with Porter stemmer.
filter - A filter to select a subset of the docs in the directory
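As an illustration of the filter parameter, the sketch below defines a java.io.FilenameFilter that accepts only names ending in .html or .htm. The class name HtmlFilterSketch and the constant HTML_ONLY are hypothetical and not part of ir.vsr; only the standard FilenameFilter interface and File.list(FilenameFilter) are used.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.Locale;

// Hypothetical helper, not part of ir.vsr.
class HtmlFilterSketch {
    // Accepts file names ending in .html or .htm, ignoring case.
    static final FilenameFilter HTML_ONLY = (dir, name) -> {
        String lower = name.toLowerCase(Locale.ROOT);
        return lower.endsWith(".html") || lower.endsWith(".htm");
    };

    public static void main(String[] args) {
        // Lists the matching names in the given directory (default: current directory).
        File dir = new File(args.length > 0 ? args[0] : ".");
        String[] names = dir.list(HTML_ONLY);  // null if dir is not a readable directory
        if (names != null) {
            for (String name : names) {
                System.out.println(name);
            }
        }
        // With ir.vsr on the classpath, the documented constructor would take the
        // filter as its fourth argument, e.g.
        // new DocumentIterator(dir, DocumentIterator.TYPE_HTML, false, HTML_ONLY)
    }
}
```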

DocumentIterator

public DocumentIterator(java.io.File dirFile,
                        short docType,
                        boolean stem)
Create an iterator over the documents in dirFile with the given document type and stemming setting

Parameters:
dirFile - The directory to use as a source of documents.
docType - The type of document to create, e.g. TYPE_TEXT or TYPE_HTML.
stem - Whether tokens should be stemmed with Porter stemmer.

DocumentIterator

public DocumentIterator(java.io.File dirFile)
Create an iterator for TextFileDocuments

Parameters:
dirFile - The directory to use as a source of documents.

Method Detail

nextDocument

public FileDocument nextDocument()
Get the next document


hasMoreDocuments

public boolean hasMoreDocuments()
Returns true iff there are more documents in this directory


main

public static void main(java.lang.String[] args)
Test by printing the bag-of-words for each file in the given directory
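To make "bag-of-words" concrete, here is a minimal sketch that counts how often each token occurs in a string. It is an illustration only: the class name BagOfWordsSketch, the lowercasing, and the split on non-alphanumeric characters are assumptions, it applies no Porter stemming, and it does not reproduce the tokenizer or output format used by main.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not the ir.vsr tokenizer.
class BagOfWordsSketch {
    // Maps each lowercased token to its number of occurrences in text.
    static Map<String, Integer> bagOfWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();  // sorted keys for stable printing
        // Assumption: tokens are maximal runs of ASCII letters and digits.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue;  // split yields an empty first element when text starts with a separator
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {cat=2, other=1, saw=1, the=2}
        System.out.println(bagOfWords("The cat saw the other cat."));
    }
}
```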