ir.vsr
Class HTMLFileDocument

java.lang.Object
  extended by ir.vsr.Document
      extended by ir.vsr.FileDocument
          extended by ir.vsr.HTMLFileDocument

public class HTMLFileDocument
extends FileDocument

An HTML file document whose token stream excludes HTML markup (tags). To keep the HTML markup as tokens, create a TextFileDocument from the HTML file instead.
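
The difference between the two token streams can be illustrated with a minimal sketch that uses only the JDK. It is not the implementation of this class: the class reads the output of an HTML parser, whereas the sketch approximates that step with a regular expression. The names TagStripSketch, stripTags, and whitespaceTokens are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TagStripSketch {
    // Hypothetical helper: replaces every tag-like span "<...>" with a space.
    // A regular expression is only an approximation of a real HTML parser.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }

    // Splits text on whitespace and collects the tokens in order.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String html = "<p>Vector <b>space</b> retrieval</p>";
        // Markup removed before tokenizing, analogous to HTMLFileDocument.
        System.out.println(whitespaceTokens(stripTags(html)));
        // prints [Vector, space, retrieval]
        // Markup left in place, analogous to tokenizing the raw file.
        System.out.println(whitespaceTokens(html));
        // prints [<p>Vector, <b>space</b>, retrieval</p>]
    }
}
```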


Field Summary
protected  java.io.BufferedReader textReader
          The I/O reader for accessing the output of the HTML parser.
protected  java.util.StringTokenizer tokenizer
          The tokenizer for lines read from this document.
static java.lang.String tokenizerDelim
          Delimiter string passed to StringTokenizer so that only alphabetic strings are returned as tokens.
 
Fields inherited from class ir.vsr.FileDocument
file, reader
 
Fields inherited from class ir.vsr.Document
nextToken, numStopWords, numTokens, stem, stemmer, stopWords, stopWordsFile
 
Constructor Summary
HTMLFileDocument(java.io.File file, boolean stem)
          Create a new HTML document for the given file.
HTMLFileDocument(java.lang.String fileName, boolean stem)
          Create a new HTML document for the given file name.
 
Method Summary
protected  java.lang.String getNextCandidateToken()
          Return the next purely alphabetic token in the document, or null if none are left.
static void main(java.lang.String[] args)
          For testing: prints the bag-of-words vector for a given HTML file.
 
Methods inherited from class ir.vsr.Document
allLetters, hashMapVector, hasMoreTokens, loadStopWords, nextToken, numberOfTokens, prepareNextToken, printVector
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

tokenizerDelim

public static final java.lang.String tokenizerDelim
Delimiter string passed to StringTokenizer so that only alphabetic strings are returned as tokens.

See Also:
Constant Field Values

tokenizer

protected java.util.StringTokenizer tokenizer
The tokenizer for lines read from this document.


textReader

protected java.io.BufferedReader textReader
The I/O reader for accessing the output of the HTML parser.

Constructor Detail

HTMLFileDocument

public HTMLFileDocument(java.io.File file,
                        boolean stem)
Create a new HTML document for the given file.


HTMLFileDocument

public HTMLFileDocument(java.lang.String fileName,
                        boolean stem)
Create a new HTML document for the given file name.

Method Detail

getNextCandidateToken

protected java.lang.String getNextCandidateToken()
Return the next purely alphabetic token in the document, or null if none are left.

Specified by:
getNextCandidateToken in class Document
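
As an illustration of this contract, the following sketch reads text line by line through a BufferedReader and tokenizes each line with a StringTokenizer, returning null once the input is exhausted. It is a sketch under stated assumptions, not the source of this class: the class name CandidateTokenSketch and the contents of DELIM are hypothetical, since the actual value of tokenizerDelim is not shown on this page.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.StringTokenizer;

class CandidateTokenSketch {
    // Assumed delimiter set: whitespace, digits, and common ASCII punctuation.
    // Characters outside this set would remain inside tokens.
    static final String DELIM =
        " \t\n\r\f0123456789.,;:!?'\"()[]{}<>-_/\\@#$%^&*+=|~`";

    private final BufferedReader textReader;
    // Starts empty so the first call reads the first line.
    private StringTokenizer tokenizer = new StringTokenizer("");
    private boolean exhausted = false;

    CandidateTokenSketch(String text) {
        this.textReader = new BufferedReader(new StringReader(text));
    }

    // Returns the next token, or null when no input is left.
    String getNextCandidateToken() {
        try {
            // Advance to the next line that still yields a token.
            while (!exhausted && !tokenizer.hasMoreTokens()) {
                String line = textReader.readLine();
                if (line == null) {
                    exhausted = true;
                } else {
                    tokenizer = new StringTokenizer(line, DELIM);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return exhausted ? null : tokenizer.nextToken();
    }

    public static void main(String[] args) {
        CandidateTokenSketch doc = new CandidateTokenSketch("Hello, world!\n\n42 times");
        String token;
        while ((token = doc.getNextCandidateToken()) != null) {
            System.out.println(token); // prints Hello, world, times on separate lines
        }
    }
}
```

Blank lines and lines that contain only delimiter characters produce no tokens, so the loop skips them and moves on to the next line.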

main

public static void main(java.lang.String[] args)
                 throws java.io.IOException
For testing: prints the bag-of-words vector for a given HTML file.

Throws:
java.io.IOException
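
A bag-of-words vector maps each distinct token to the number of times it occurs. The sketch below shows that idea with plain JDK collections. It is an assumption-laden illustration and does not reproduce the inherited hashMapVector or printVector methods: the name BagOfWordsSketch is hypothetical, tokens are split on whitespace only, and stop-word removal and stemming are left out.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class BagOfWordsSketch {
    // Counts how often each whitespace-separated token occurs.
    // A TreeMap keeps the tokens sorted so the printed order is stable.
    static Map<String, Integer> bagOfWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            counts.merge(st.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(bagOfWords("to be or not to be"));
        // prints {be=2, not=1, or=1, to=2}
    }
}
```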