Text Mining with Information Extraction

Text mining is a relatively new research area at the intersection of data mining, natural-language processing, machine learning, and information retrieval. Its goal is to discover knowledge in unstructured text. The related task of information extraction (IE) concerns locating specific items of data in natural-language documents, thereby transforming unstructured text into a structured database. Although hand-built IE systems have existed for some time, automatic construction of IE systems using machine learning is more recent. This proposal presents a new framework for text mining, called DiscoTEX (Discovery from Text EXtraction), which uses a learned IE system to transform text into more structured data, which is then mined for interesting relationships.
DiscoTEX combines IE and standard data mining methods to perform text mining as well as to improve the performance of the underlying IE system. It discovers prediction rules from natural-language corpora, and these rules are used to predict additional information to extract from future documents, thereby improving the recall of IE. The initial version of DiscoTEX integrates an IE module acquired by the Rapier learning system and a standard rule induction module such as C4.5rules or Ripper. Encouraging initial results are presented on applying these techniques to a corpus of computer job announcements posted on an Internet newsgroup. However, this approach has problems when the same extracted entity or feature is represented by similar but not identical strings in different documents. Consequently, we are also developing an alternative rule induction system for DiscoTEX, called TextRISE, that allows for partial matching of string-valued features. We also present initial results from applying the TextRISE rule learner to corpora of book descriptions and patent documents retrieved from the World Wide Web (WWW). Future research will involve thorough testing on several domains, further development of this approach, and extensions of the proposed framework (currently limited to prediction rule discovery) to additional text mining tasks.
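
The way prediction rules can raise recall, and why exact string matching limits them, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the slot names, the sample rule, the `apply_rules` and `similarity` helpers, and the use of Python's `difflib` as the similarity measure. It is not the DiscoTEX or TextRISE implementation, and the abstract does not say which similarity measure TextRISE uses.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (a stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def slot_contains(fillers, item, threshold):
    """True if any filler matches item at or above the threshold."""
    return any(similarity(f, item) >= threshold for f in fillers)


def apply_rules(template, rules, threshold=1.0):
    """Return a copy of an extracted template with rule predictions added.

    template: dict mapping slot name -> set of extracted strings.
    rules: list of (antecedents, consequent), where antecedents is a list
           of (slot, value) pairs and consequent is one (slot, value) pair.
    threshold: 1.0 requires an exact (case-insensitive) match; lower
               values allow partial matches.
    """
    result = {slot: set(values) for slot, values in template.items()}
    for antecedents, (slot, value) in rules:
        fires = all(
            slot_contains(template.get(s, ()), v, threshold)
            for s, v in antecedents
        )
        if fires and not slot_contains(result.get(slot, ()), value, threshold):
            result.setdefault(slot, set()).add(value)
    return result


# Hypothetical output of an IE module for one job announcement.
doc = {"application": {"Oracle 8"}, "language": {"Java"}}

# Hypothetical mined rule: "Oracle" in application -> "SQL" in language.
rules = [([("application", "Oracle")], ("language", "SQL"))]

exact = apply_rules(doc, rules, threshold=1.0)  # "Oracle 8" != "Oracle"
soft = apply_rules(doc, rules, threshold=0.8)   # partial match succeeds
```

With exact matching the rule never fires, because "Oracle 8" and "Oracle" are different strings. With the looser threshold the rule fires and "SQL" is added to the `language` slot, which is the kind of additional extraction the abstract describes.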

Citation:

Unpublished Ph.D. proposal, Department of Computer Sciences, University of Texas at Austin.

People

Un Yong Nahm

Ph.D. Alumni

pebronia [at] acm org

Labs

Machine Learning