- Two Approaches to Handling Noisy Variation in Text Mining
Un Yong Nahm, Mikhail Bilenko, and Raymond J. Mooney
Proceedings of the ICML-2002 Workshop on Text Learning (TextML'2002), pp. 18-27, Sydney, Australia, July 2002.
Paper ID: 115
Category: Record Linkage & Duplicate Detection, Text Data Mining
Variation and noise in textual database entries can prevent text mining algorithms from discovering important regularities. We present two novel methods to cope with this problem: (1) an adaptive approach to "hardening" noisy databases by identifying duplicate records, and (2) mining "soft" association rules. For identifying approximately duplicate records, we present a domain-independent two-level method for improving duplicate detection accuracy based on machine learning. For mining soft matching rules, we introduce an algorithm that discovers association rules by allowing partial matching of items based on a textual similarity metric such as edit distance or cosine similarity. Experimental results on real and synthetic datasets show that our methods outperform traditional techniques for noisy textual databases.
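As an illustration of the soft-matching idea in the abstract, the following is a minimal sketch, not the authors' algorithm: it counts support and confidence for a rule while letting an item match any record value whose normalized edit-distance similarity reaches a threshold. The function names, the 0.8 threshold, the normalization by the longer string, and the sample records are all assumptions made for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; normalizing by the longer string is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def soft_contains(record, item, threshold):
    """True if some value in the record is similar enough to the item."""
    return any(similarity(item, value) >= threshold for value in record)


def soft_support(records, itemset, threshold=0.8):
    """Fraction of records that softly contain every item in the itemset."""
    hits = sum(
        all(soft_contains(record, item, threshold) for item in itemset)
        for record in records
    )
    return hits / len(records)


def soft_confidence(records, antecedent, consequent, threshold=0.8):
    """Soft support of the whole rule divided by soft support of its antecedent."""
    base = soft_support(records, antecedent, threshold)
    if base == 0:
        return 0.0
    return soft_support(records, antecedent | consequent, threshold) / base


# Made-up records with spelling variation, for illustration only.
records = [
    {"microsoft", "seattle"},
    {"micro soft", "seattle"},
    {"microsft", "seatle"},
    {"oracle", "redwood city"},
]

exact = soft_support(records, {"microsoft"}, threshold=1.0)
soft = soft_support(records, {"microsoft"}, threshold=0.8)
rule = soft_confidence(records, {"microsoft"}, {"seattle"})
```

With a threshold of 1.0 the sketch reduces to exact matching, so only the first record supports the item "microsoft"; lowering the threshold lets the misspelled variants count as well, which is the effect the abstract attributes to soft association rules. Cosine similarity over token vectors could be swapped in for `similarity` without changing the rest.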

mooney@cs.utexas.edu