UT ML Group: Record Linkage & Duplicate Detection
Record linkage is the process of identifying database records that
are syntactically different but refer to the same entity. This problem
has also been studied as duplicate detection, name matching, identity
uncertainty, database hardening, and citation matching. Our work
focuses primarily on using machine learning algorithms to train
similarity metrics and comparison methods that improve matching accuracy.
It is related to our work on text mining.
See the RIDDLE Repository on Identity Uncertainty,
Duplicate Detection, and Record Linkage for datasets, bibliography, and more information on this topic.
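As a rough illustration of what a trainable similarity metric can look like, the sketch below computes an edit distance whose substitution costs are parameters rather than fixed constants. It is a minimal sketch and not the method from any of the publications listed on this page: the function names, the cost table, and the 0.8 threshold are hypothetical, and the costs are set by hand here where a learning algorithm would estimate them from labeled duplicate and non-duplicate pairs.

```python
from typing import Dict, Tuple

# Hypothetical type: maps a (source char, target char) pair to a cost.
CostTable = Dict[Tuple[str, str], float]


def weighted_edit_distance(a: str, b: str, sub_cost: CostTable,
                           default_sub: float = 1.0,
                           indel: float = 1.0) -> float:
    """Edit distance with per-character-pair substitution costs."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * indel for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel]
        for j, cb in enumerate(b, start=1):
            sub = 0.0 if ca == cb else sub_cost.get((ca, cb), default_sub)
            curr.append(min(prev[j] + indel,       # delete ca
                            curr[j - 1] + indel,   # insert cb
                            prev[j - 1] + sub))    # substitute ca -> cb
        prev = curr
    return prev[-1]


def similarity(a: str, b: str, sub_cost: CostTable) -> float:
    """Case-insensitive similarity: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - weighted_edit_distance(a.lower(), b.lower(), sub_cost) / longest


def is_duplicate(a: str, b: str, sub_cost: CostTable,
                 threshold: float = 0.8) -> bool:
    """Declare two field values duplicates if they are similar enough."""
    return similarity(a, b, sub_cost) >= threshold


if __name__ == "__main__":
    # Hand-set stand-in for learned costs: confusing "." with "," is cheap.
    costs: CostTable = {(".", ","): 0.1, (",", "."): 0.1}
    print(similarity("Mooney, R. J.", "Mooney, R, J.", costs))
    print(is_duplicate("Mooney, R. J.", "Bilenko, M.", costs))
```

Lowering the cost of substitutions that are common among true duplicates and raising it for substitutions that distinguish different entities is what lets a metric of this kind adapt to a particular database field.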
Publications
- Adaptive Blocking: Learning to Scale Up Record Linkage [Abstract] [PDF]
Mikhail Bilenko, Beena Kamath, and Raymond J. Mooney
Proceedings of the 6th IEEE International Conference on Data Mining (ICDM-2006), pp. 87-96, Hong Kong, December 2006.
- Learnable Similarity Functions and Their Application to Record Linkage and Clustering [Abstract] [PDF]
Mikhail Bilenko
Ph.D. Thesis, Department of Computer Sciences, University of Texas at Austin, August 2006.
136 pages.
- Adaptive Product Normalization: Using Online Learning for Record Linkage in Comparison Shopping [Abstract] [PDF]
Mikhail Bilenko, Sugato Basu, and Mehran Sahami
Proceedings of the 5th IEEE International Conference on Data Mining (ICDM-2005), pp. 58-65, Houston, TX, November 2005.
- Alignments and String Similarity in Information Integration: A Random Field Approach [Abstract] [PDF]
Mikhail Bilenko and Raymond J. Mooney
Proceedings of the 2005 Dagstuhl Seminar on Machine Learning for the Semantic Web, Dagstuhl, Germany, February 2005.
- Learnable Similarity Functions and Their Applications to Clustering and Record Linkage [Abstract] [PDF]
Mikhail Bilenko
Proceedings of the Ninth AAAI/SIGART Doctoral Consortium, pp. 981-982, San Jose, CA, July 2004.
- Learnable Similarity Functions and Their Applications to Record Linkage and Clustering [Abstract] [PDF]
Mikhail Bilenko
Ph.D. proposal, Department of Computer Sciences, University of Texas at Austin, October 2003.
47 pages.
Also appears as Technical Report UT-AI-TR-03-305, Artificial Intelligence Lab, University of Texas at Austin, December 2003.
- Adaptive Name-Matching in Information Integration [Abstract] [PDF]
Mikhail Bilenko, William W. Cohen, Stephen Fienberg, Raymond J. Mooney, and Pradeep Ravikumar
IEEE Intelligent Systems, 18(5), pp. 16-23, September/October 2003.
- On Evaluation and Training-Set Construction for Duplicate Detection [Abstract] [PDF]
Mikhail Bilenko and Raymond J. Mooney
Proceedings of the KDD-2003 Workshop on Data Cleaning, Record Linkage, and Object Consolidation, pp. 7-12, Washington DC, August 2003.
- Adaptive Duplicate Detection Using Learnable String Similarity Measures [Abstract] [PDF]
Mikhail Bilenko and Raymond J. Mooney
Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), pp. 39-48, Washington DC, August 2003.
- Employing Trainable String Similarity Metrics for Information Integration [Abstract] [PDF]
Mikhail Bilenko and Raymond J. Mooney
Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web, pp. 67-72, Acapulco, Mexico, August 2003.
- Two Approaches to Handling Noisy Variation in Text Mining [Abstract] [PDF]
Un Yong Nahm, Mikhail Bilenko, and Raymond J. Mooney
Proceedings of the ICML-2002 Workshop on Text Learning (TextML'2002), pp. 18-27, Sydney, Australia, July 2002.
- Learning to Combine Trained Distance Metrics for Duplicate Detection in Databases [Abstract] [PDF]
Mikhail Bilenko and Raymond J. Mooney
Technical Report AI 02-296, Artificial Intelligence Lab, University of Texas at Austin, February 2002.
mooney@cs.utexas.edu