- Using HTML Structure and Linked Pages to Improve Learning for Text Categorization
Michael B. Cline
Undergraduate Honors Thesis, Department of Computer Sciences, University of Texas at Austin, May 1999.
Also appears as Technical Report AI 98-270, Artificial Intelligence Lab, University of Texas at Austin.
21 pages
Paper ID: 91
Category: Text Categorization and Clustering
Classifying web pages is an important task in automating the organization of information on the WWW, and learning for text categorization can help automate the development of such classification systems. This project explores two ways of using HTML to improve learning for text categorization: 1) using HTML tags such as titles, links, and headings to partition the text on a page, and 2) using the pages linked from a given page to augment its description. Initial experimental results on 26 categories from the Yahoo hierarchy demonstrate the promise of these two methods for improving the accuracy of a bag-of-words text classifier that uses a simple Bayesian learning algorithm.
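
As a rough illustration of the two ideas in the abstract, the sketch below tags each word with the HTML field it appears in, adds the words of linked pages as a separate partition, and feeds the resulting bag of words to a multinomial naive Bayes classifier with add-one smoothing. It is not the thesis's implementation: the field names in FIELD_TAGS, the "linked:" prefix, the smoothing scheme, and the names FieldTokenizer, page_features, and NaiveBayes are all assumptions made for this example.

```python
import math
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to text partitions (an assumption,
# not the partitioning used in the thesis).
FIELD_TAGS = {"title": "title", "a": "link",
              "h1": "heading", "h2": "heading", "h3": "heading"}


class FieldTokenizer(HTMLParser):
    """Collect lowercase word tokens, each prefixed with its enclosing field."""

    def __init__(self):
        super().__init__()
        self.stack = ["body"]  # innermost tracked field is last
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag in FIELD_TAGS:
            self.stack.append(FIELD_TAGS[tag])

    def handle_endtag(self, tag):
        if tag in FIELD_TAGS and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        field = self.stack[-1]
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            self.tokens.append(f"{field}:{word}")


def tokenize(html):
    parser = FieldTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def page_features(html, linked_pages=()):
    """Bag of field-tagged words for a page, plus words from linked pages."""
    bag = Counter(tokenize(html))
    for linked_html in linked_pages:
        # Linked-page words go into one separate partition, whatever
        # field they had on the linked page.
        bag.update("linked:" + t.split(":", 1)[1] for t in tokenize(linked_html))
    return bag


class NaiveBayes:
    """Multinomial naive Bayes over bags of words, with add-one smoothing."""

    def __init__(self):
        self.class_docs = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, bag, label):
        self.class_docs[label] += 1
        self.word_counts[label].update(bag)
        self.vocab.update(bag)

    def classify(self, bag):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word, n in bag.items():
                if word in self.vocab:  # ignore words never seen in training
                    score += n * math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Because every token carries its field prefix, a word in a title and the same word in body text are counted as different features, which is one simple way to let the classifier weight the partitions differently.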

mooney@cs.utexas.edu