Mining Soft-Matching Association Rules (2002)
Variation and noise in database entries can prevent data mining algorithms, such as association rule mining, from discovering important regularities. In particular, textual fields can exhibit variation due to typographical errors, misspellings, abbreviations, etc. By allowing partial or "soft matching" of items based on a similarity metric such as edit distance or cosine similarity, additional important patterns can be detected. This paper introduces an algorithm, SoftApriori, that discovers soft-matching association rules given a user-supplied similarity metric for each field. Experimental results on several "noisy" datasets extracted from text demonstrate that SoftApriori discovers additional relationships that more accurately reflect regularities in the data.
In Proceedings of the Eleventh International Conference on Information and Knowledge Management (CIKM-2002), pp. 681-683, McLean, VA, November 2002.
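
The soft-matching idea in the abstract can be illustrated with a small sketch. This is not the SoftApriori algorithm from the paper; it only shows how the support of an itemset changes when items are compared with a similarity metric instead of exact equality. The function names, the normalized edit-distance similarity, the 0.8 threshold, the use of one metric for all fields, and the toy records are all assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def soft_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical soft-match test: case-insensitive similarity above a threshold."""
    return similarity(a.lower(), b.lower()) >= threshold


def soft_support(itemset, records, threshold: float = 0.8) -> float:
    """Fraction of records in which every item is soft-matched by some entry."""
    if not records:
        return 0.0
    hits = sum(
        all(any(soft_match(item, entry, threshold) for entry in record)
            for item in itemset)
        for record in records
    )
    return hits / len(records)


# Toy records (made up for this sketch) with noisy publisher names.
records = [
    ["Addison-Wesley", "databases"],
    ["Addison Wesley", "databases"],
    ["Adison-Wesley", "databases"],
    ["Springer", "databases"],
]
itemset = ["Addison-Wesley", "databases"]

exact = soft_support(itemset, records, threshold=1.0)  # exact matching only
soft = soft_support(itemset, records, threshold=0.8)   # tolerate small variations
print(exact, soft)
```

With a threshold of 1.0 only the first record counts toward the itemset, while the 0.8 threshold also accepts the two misspelled variants, so a rule that exact matching would miss at a typical minimum-support setting becomes visible.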

Raymond J. Mooney Faculty mooney [at] cs utexas edu
Un Yong Nahm Ph.D. Alumni pebronia [at] acm org