Record Linkage & Duplicate Detection

Record linkage is the process of identifying database records that are syntactically different but refer to the same entity. The problem has also been studied under the names duplicate detection, name matching, identity uncertainty, database hardening, and citation matching. Our work focuses primarily on using machine learning to train similarity metrics and comparison methods that improve matching accuracy. It is related to our work on text mining.
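As a rough illustration of the matching problem, and not the method of any publication listed on this page, the Python sketch below scores two records with a weighted combination of per-field string similarities based on edit distance. The function names, the field weights, and the 0.8 threshold are all assumptions made for this example; in a learned approach, quantities like the edit costs and field weights would be estimated from labeled duplicate and non-duplicate pairs rather than set by hand.

```python
def edit_distance(a, b, sub_cost=1.0, gap_cost=1.0):
    """Levenshtein-style distance with adjustable substitution and gap costs."""
    prev = [j * gap_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + gap_cost,                             # delete ca
                curr[j - 1] + gap_cost,                         # insert cb
                prev[j - 1] + (0.0 if ca == cb else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def string_similarity(a, b):
    """Map unit-cost edit distance to a similarity in [0, 1], ignoring case and outer whitespace."""
    a, b = a.lower().strip(), b.lower().strip()
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def record_similarity(r1, r2, weights):
    """Weighted average of per-field similarities; weights are hand-set placeholders here."""
    total = sum(weights.values())
    return sum(w * string_similarity(r1[f], r2[f]) for f, w in weights.items()) / total


def is_duplicate(r1, r2, weights, threshold=0.8):
    """Declare a match when the combined similarity reaches the (assumed) threshold."""
    return record_similarity(r1, r2, weights) >= threshold


if __name__ == "__main__":
    # Hypothetical records: the same person entered with a typo in the name.
    a = {"name": "John Smith", "city": "Austin"}
    b = {"name": "Jon Smith", "city": "austin"}
    print(record_similarity(a, b, {"name": 2.0, "city": 1.0}))
```

A fixed, hand-tuned metric like this one is the baseline that trainable similarity functions aim to improve on: the same structure applies, but the costs and weights adapt to the kinds of variation found in each field of a particular database.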

See the RIDDLE Repository on Identity Uncertainty, Duplicate Detection, and Record Linkage for datasets, bibliography, and more information on this topic.

Publications

Adaptive Blocking: Learning to Scale Up Record Linkage	2006
Mikhail Bilenko, Beena Kamath, Raymond J. Mooney, In Proceedings of the Sixth IEEE International Conference on Data Mining (ICDM-06), pp. 87-96, Hong Kong, December 2006.
Learnable Similarity Functions and Their Application to Record Linkage and Clustering	2006
Mikhail Bilenko, PhD Thesis, Department of Computer Sciences, University of Texas at Austin. 136 pages.
Adaptive Product Normalization: Using Online Learning for Record Linkage in Comparison Shopping	2005
Mikhail Bilenko, Sugato Basu, and Mehran Sahami, In Proceedings of the 5th International Conference on Data Mining (ICDM-2005), pp. 58-65, Houston, TX, November 2005.
Alignments and String Similarity in Information Integration: A Random Field Approach	2005
Mikhail Bilenko and Raymond J. Mooney, In Proceedings of the 2005 Dagstuhl Seminar on Machine Learning for the Semantic Web, Dagstuhl, Germany, February 2005.
Learnable Similarity Functions and Their Applications to Clustering and Record Linkage	2004
Mikhail Bilenko, In Proceedings of the Ninth AAAI/SIGART Doctoral Consortium, pp. 981-982, San Jose, CA, July 2004.
Adaptive Duplicate Detection Using Learnable String Similarity Measures	2003
Mikhail Bilenko and Raymond J. Mooney, In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), pp. 39-48, Washington, DC, August 2003.
Adaptive Name-Matching in Information Integration	2003
Mikhail Bilenko, William W. Cohen, Stephen Fienberg, Raymond J. Mooney, and Pradeep Ravikumar, IEEE Intelligent Systems, Vol. 18, 5 (2003), pp. 16-23.
Employing Trainable String Similarity Metrics for Information Integration	2003
Mikhail Bilenko and Raymond J. Mooney, In Proceedings of the IJCAI-03 Workshop on Information Integration on the Web, pp. 67-72, Acapulco, Mexico, August 2003.
Learnable Similarity Functions and Their Applications to Record Linkage and Clustering	2003
Mikhail Bilenko, unpublished. Doctoral Dissertation Proposal, University of Texas at Austin.
On Evaluation and Training-Set Construction for Duplicate Detection	2003
Mikhail Bilenko and Raymond J. Mooney, In Proceedings of the KDD-03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation, pp. 7-12, Washington, DC, August 2003.
Learning to Combine Trained Distance Metrics for Duplicate Detection in Databases	2002
Mikhail Bilenko and Raymond J. Mooney, Technical Report AI 02-296, Artificial Intelligence Laboratory, University of Texas at Austin.
Two Approaches to Handling Noisy Variation in Text Mining	2002
Un Yong Nahm, Mikhail Bilenko, and Raymond J. Mooney, In Papers from the Nineteenth International Conference on Machine Learning (ICML-2002) Workshop on Text Learning, pp. 18-27, Sydney, Australia, July 2002.

Labs

Machine Learning