Two Approaches to Handling Noisy Variation in Text Mining (2002)

Un Yong Nahm, Mikhail Bilenko, and Raymond J. Mooney

Variation and noise in textual database entries can prevent text mining algorithms from discovering important regularities. We present two novel methods to cope with this problem: (1) an adaptive approach to "hardening" noisy databases by identifying duplicate records, and (2) mining "soft" association rules. For identifying approximately duplicate records, we present a domain-independent two-level method for improving duplicate detection accuracy based on machine learning. For mining soft matching rules, we introduce an algorithm that discovers association rules by allowing partial matching of items based on a textual similarity metric such as edit distance or cosine similarity. Experimental results on real and synthetic datasets show that our methods outperform traditional techniques for noisy textual databases.
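To make the idea of partial matching concrete, the following is a minimal sketch of how support and confidence for a rule might be counted when items match softly rather than exactly. It is an illustration only, not the algorithm from the paper: the function names, the normalization of edit distance into a similarity score, the 0.8 threshold, and the toy records are all assumptions introduced here.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Map edit distance to [0, 1]; this normalization is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def soft_contains(record, item, threshold):
    """True if any value in the record is similar enough to the item."""
    return any(similarity(item, value) >= threshold for value in record)


def soft_support(records, itemset, threshold=0.8):
    """Fraction of records that softly contain every item in the itemset."""
    if not records:
        return 0.0
    hits = sum(
        all(soft_contains(record, item, threshold) for item in itemset)
        for record in records
    )
    return hits / len(records)


def soft_confidence(records, antecedent, consequent, threshold=0.8):
    """Soft support of the whole rule divided by that of its antecedent."""
    base = soft_support(records, antecedent, threshold)
    if base == 0:
        return 0.0
    return soft_support(records, antecedent | consequent, threshold) / base


# Hypothetical noisy records: the same company name is spelled three ways.
records = [
    {"microsoft", "seattle"},
    {"micro soft", "seattle"},
    {"microsft", "redmond"},
    {"oracle", "austin"},
]

exact = sum("microsoft" in r for r in records) / len(records)
soft = soft_support(records, {"microsoft"})
conf = soft_confidence(records, {"microsoft"}, {"seattle"})
```

With exact matching, only one of the four toy records supports the item "microsoft"; with the soft match, the two misspelled variants count as well, so a regularity hidden by the noise becomes visible to the rule miner. Cosine similarity over token vectors could be substituted for `similarity` without changing the rest of the sketch.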

Citation:

In Papers from the Nineteenth International Conference on Machine Learning (ICML-2002) Workshop on Text Learning, pp. 18-27, Sydney, Australia, July 2002.

People

Mikhail Bilenko	Ph.D. Alumni	mbilenko [at] microsoft com
Raymond J. Mooney	Faculty	mooney [at] cs utexas edu
Un Yong Nahm	Ph.D. Alumni	pebronia [at] acm org

Areas of Interest

Machine Learning, Record Linkage & Duplicate Detection, Text Data Mining

Labs

Machine Learning