HTMLFileDocument

ir.vsr
Class HTMLFileDocument

java.lang.Object
  ir.vsr.Document
      ir.vsr.FileDocument
          ir.vsr.HTMLFileDocument

public class HTMLFileDocument
extends FileDocument

An HTML file document in which HTML markup is removed from the token stream. To keep HTML tags as tokens, create a TextFileDocument from the HTML file instead.
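
As a rough illustration of the distinction drawn above, the following standalone sketch tokenizes the same HTML snippet twice: once with the markup stripped and once as raw text. It does not use the ir.vsr classes. The class name `MarkupStrippingSketch`, the regex-based tag removal, and the sample string are assumptions made for illustration; the field documentation indicates that the real class reads the output of an HTML parser instead.

```java
import java.util.ArrayList;
import java.util.List;

class MarkupStrippingSketch {

    // Hypothetical stand-in for an HTML parser: replace anything between '<' and '>' with a space.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }

    // Split on runs of non-letters and keep the non-empty pieces.
    static List<String> alphaTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^A-Za-z]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String html = "<p>Vector <b>space</b> retrieval</p>";
        // Markup removed first: prints [Vector, space, retrieval]
        System.out.println(alphaTokens(stripTags(html)));
        // Raw text: tag names leak into the stream, prints [p, Vector, b, space, b, retrieval, p]
        System.out.println(alphaTokens(html));
    }
}
```

The second call shows why tag names such as `p` and `b` would pollute a bag-of-words vector if the file were treated as plain text.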

Field Summary
`protected java.io.BufferedReader`	`textReader` The I/O reader for accessing the output of the HTML parser.
`protected java.util.StringTokenizer`	`tokenizer` The tokenizer for lines read from this document.
`static java.lang.String`	`tokenizerDelim` StringTokenizer delim for tokenizing only alphabetic strings.

Fields inherited from class ir.vsr.FileDocument
`file, reader`

Fields inherited from class ir.vsr.Document
`nextToken, numStopWords, numTokens, stem, stemmer, stopWords, stopWordsFile`

Constructor Summary
`HTMLFileDocument(java.io.File file, boolean stem)` Create a new HTML document for the given file.
`HTMLFileDocument(java.lang.String fileName, boolean stem)` Create a new HTML document for the given file name.

Method Summary
`protected java.lang.String`	`getNextCandidateToken()` Return the next purely alphabetic token in the document, or null if none are left.
`static void`	`main(java.lang.String[] args)` For testing, print the bag-of-words vector for a given HTML file.

Methods inherited from class ir.vsr.Document
`allLetters, hashMapVector, hasMoreTokens, loadStopWords, nextToken, numberOfTokens, prepareNextToken, printVector`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

tokenizerDelim

public static final java.lang.String tokenizerDelim

StringTokenizer delim for tokenizing only alphabetic strings.

See Also: Constant Field Values
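
To show how a delimiter string can make `java.util.StringTokenizer` return only alphabetic runs, here is a minimal sketch. The constant `DELIM` below is a hypothetical delimiter set chosen for illustration; it is not the actual value of `tokenizerDelim`, which this page does not list. The class name `DelimSketch` is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class DelimSketch {

    // Hypothetical delimiter set (NOT the library's actual constant):
    // whitespace, digits, and common ASCII punctuation.
    static final String DELIM =
        " \t\n\r\f0123456789`~!@#$%^&*()-_=+[]{}\\|;:'\",.<>/?";

    // Every character in DELIM separates tokens, so only letter runs survive.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, DELIM);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [IR, vector, space, model, v]
        System.out.println(tokenize("IR-2004: vector_space model, v2.0!"));
    }
}
```

A delimiter list like this only covers the characters it names, so any character left out of it (for example a non-ASCII symbol) would remain inside a token.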

tokenizer

protected java.util.StringTokenizer tokenizer

The tokenizer for lines read from this document.

textReader

protected java.io.BufferedReader textReader

The I/O reader for accessing the output of the HTML parser.

Constructor Detail

HTMLFileDocument

public HTMLFileDocument(java.io.File file,
                        boolean stem)

Create a new HTML document for the given file.

HTMLFileDocument

public HTMLFileDocument(java.lang.String fileName,
                        boolean stem)

Create a new HTML document for the given file name.

Method Detail

getNextCandidateToken

protected java.lang.String getNextCandidateToken()

Return the next purely alphabetic token in the document, or null if none are left.

Specified by: getNextCandidateToken in class Document
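
The contract described above (return the next all-letter token, or null when the input is exhausted) might be sketched as follows. This is an assumption about the general shape, not the library's source: the class `CandidateTokenSketch`, its `DELIM` constant, its string-based constructor, and the local `allLetters` helper are all hypothetical, and the sketch reads from an in-memory string rather than from an HTML parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.StringTokenizer;

class CandidateTokenSketch {

    // Hypothetical delimiters; the real tokenizerDelim value is not shown on this page.
    static final String DELIM = " \t\n\r\f.,;:!?()\"'-";

    private final BufferedReader textReader;
    private StringTokenizer tokenizer = new StringTokenizer("");

    CandidateTokenSketch(String text) {
        this.textReader = new BufferedReader(new StringReader(text));
    }

    // True only for non-empty strings made up entirely of letters.
    static boolean allLetters(String s) {
        return !s.isEmpty() && s.chars().allMatch(Character::isLetter);
    }

    // Next all-letter token, or null once the reader is exhausted.
    String getNextCandidateToken() {
        try {
            while (true) {
                // Drain the current line, skipping tokens that contain non-letters.
                while (tokenizer.hasMoreTokens()) {
                    String candidate = tokenizer.nextToken();
                    if (allLetters(candidate)) {
                        return candidate;
                    }
                }
                // Current line is used up: fetch the next one, or stop at end of input.
                String line = textReader.readLine();
                if (line == null) {
                    return null;
                }
                tokenizer = new StringTokenizer(line, DELIM);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CandidateTokenSketch doc = new CandidateTokenSketch("Ranked retrieval, 2nd pass.\n\nDone");
        // Prints Ranked, retrieval, pass, Done on separate lines ("2nd" is skipped).
        for (String t = doc.getNextCandidateToken(); t != null; t = doc.getNextCandidateToken()) {
            System.out.println(t);
        }
    }
}
```

Keeping a per-line tokenizer and refilling it from the reader means blank lines and lines with no alphabetic tokens are skipped without special cases.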

main

public static void main(java.lang.String[] args)
                 throws java.io.IOException

For testing, print the bag-of-words vector for a given HTML file.

Throws: java.io.IOException
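
For readers unfamiliar with the term, a bag-of-words vector maps each distinct token to the number of times it occurs. The sketch below builds and prints one from a fixed token list. It is an illustration only: the class `BagOfWordsSketch` and its method are hypothetical, and the library's own `hashMapVector` and `printVector` methods may differ in representation and may also apply stemming or stop-word removal.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

class BagOfWordsSketch {

    // Count occurrences of each token; TreeMap keeps keys sorted for stable printing.
    static Map<String, Integer> bagOfWords(Iterable<String> tokens) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> bag = bagOfWords(Arrays.asList("vector", "space", "vector", "model"));
        // Prints: model: 1, space: 1, vector: 2 (one entry per line)
        for (Map.Entry<String, Integer> e : bag.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```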
