Using HTML Structure and Linked Pages to Improve Learning for Text Categorization (1999)
Michael B. Cline
Classifying web pages is an important task in automating the organization of information on the WWW, and learning for text categorization can help automate the development of such classifiers. This project explores using two aspects of HTML to improve learning for text categorization: (1) using HTML tags such as titles, links, and headings to partition the text on a page, and (2) using the pages linked from a given page to augment its description. Initial experimental results on 26 categories from the Yahoo hierarchy demonstrate the promise of these two methods for improving the accuracy of a bag-of-words text classifier that uses a simple Bayesian learning algorithm.
Technical Report AI 98-270, Department of Computer Sciences, University of Texas at Austin. Undergraduate Honors Thesis.
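
The two ideas in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the field names, the `linked_` prefix, the tokenizer, and the add-one smoothing are all assumptions made for illustration. Each word is tagged with the HTML field it appears in (title, link, heading, or body), words from linked pages are added under a separate prefix, and a small multinomial naive Bayes classifier is trained on the resulting bag of words.

```python
import math
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to text partitions ("fields").
FIELD_OF_TAG = {"title": "title", "a": "link"}
FIELD_OF_TAG.update({f"h{i}": "heading" for i in range(1, 7)})


class FieldTokenizer(HTMLParser):
    """Collects lowercase words, each prefixed with the field it occurred in."""

    def __init__(self):
        super().__init__()
        self.stack = ["body"]  # innermost open field is last
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag in FIELD_OF_TAG:
            self.stack.append(FIELD_OF_TAG[tag])

    def handle_endtag(self, tag):
        # Only close a field that is actually the innermost open one.
        if FIELD_OF_TAG.get(tag) == self.stack[-1] and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        field = self.stack[-1]
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            self.tokens.append(f"{field}:{word}")


def _tokenize(html):
    parser = FieldTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def page_tokens(html, linked_html=()):
    """Tokens of a page, augmented with tokens of the pages it links to."""
    tokens = list(_tokenize(html))
    for other in linked_html:
        # Assumed convention: keep linked-page words in their own partition.
        tokens.extend("linked_" + t for t in _tokenize(other))
    return tokens


class NaiveBayes:
    """Multinomial naive Bayes over token bags, with add-one smoothing."""

    def __init__(self):
        self.doc_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, tokens, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def classify(self, tokens):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for t in tokens:
                if t in self.vocab:  # ignore tokens never seen in training
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Because the field is part of each token, the same word in a title and in body text is counted as two different features, which lets the classifier weight them differently. Fetching the linked pages is left out; `linked_html` is assumed to hold their HTML already.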