Two Approaches to Handling Noisy Variation in Text Mining (2002)
Variation and noise in textual database entries can prevent text mining algorithms from discovering important regularities. We present two novel methods to cope with this problem: (1) an adaptive approach to "hardening" noisy databases by identifying duplicate records, and (2) mining "soft" association rules. For identifying approximately duplicate records, we present a domain-independent two-level method for improving duplicate detection accuracy based on machine learning. For mining soft matching rules, we introduce an algorithm that discovers association rules by allowing partial matching of items based on a textual similarity metric such as edit distance or cosine similarity. Experimental results on real and synthetic datasets show that our methods outperform traditional techniques for noisy textual databases.
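As a rough illustration of what partial matching of items might look like, the sketch below counts the support of an itemset while treating two strings as the same item when their normalized edit-distance similarity reaches a threshold. It is not the algorithm from the paper: the function names (edit_distance, similar, soft_support), the normalization by the longer string's length, the 0.8 threshold, and the toy records are all assumptions made for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Assumed similarity: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return 1 - edit_distance(a, b) / longest >= threshold


def soft_support(itemset, records, threshold: float = 0.8) -> float:
    """Fraction of records in which every item has a similar-enough match."""
    hits = sum(
        all(any(similar(item, value, threshold) for value in record)
            for item in itemset)
        for record in records
    )
    return hits / len(records)


# Hypothetical noisy records: the same company and city, spelled three ways.
records = [
    ["microsoft", "seattle"],
    ["microsft", "seattle"],
    ["micro-soft", "seatle"],
    ["ibm", "armonk"],
]

soft = soft_support(["microsoft", "seattle"], records)                  # partial matching
exact = soft_support(["microsoft", "seattle"], records, threshold=1.0)  # exact matching only
```

On these toy records, exact matching finds the itemset in only one of the four records, while the soft version also counts the two misspelled variants. That is the kind of regularity the abstract describes as being hidden by noise.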
In Papers from the Nineteenth International Conference on Machine Learning (ICML-2002) Workshop on Text Learning, pp. 18-27, Sydney, Australia, July 2002.

Mikhail Bilenko Ph.D. Alumni mbilenko [at] microsoft com
Raymond J. Mooney Faculty mooney [at] cs utexas edu
Un Yong Nahm Ph.D. Alumni pebronia [at] acm org