Department of Computer Science

Machine Learning Research Group

University of Texas at Austin Artificial Intelligence Lab

Publications: Record Linkage & Duplicate Detection

Record linkage is the process of identifying database records that are syntactically different but refer to the same entity. This problem has also been studied as duplicate detection, name matching, identity uncertainty, database hardening, and citation matching. Our work focuses primarily on using machine learning algorithms to train similarity metrics and comparison methods that improve matching accuracy. It is related to our work on text mining.
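
As a rough illustration of the problem setting only, the sketch below flags pairs of strings whose similarity exceeds a cutoff. It assumes a fixed, untrained similarity (Python's standard `difflib.SequenceMatcher`) and a hand-picked threshold of 0.8. The function names and sample records are hypothetical, and this is not the learned-metric approach described in the publications listed here, where the similarity function itself is trained from labeled examples.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a score in [0, 1]; 1.0 means identical after lowercasing."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_duplicates(records, threshold=0.8):
    """Return (i, j, score) for every pair of records scoring at or above threshold.

    The threshold is an illustrative assumption; a learned approach would
    fit both the similarity function and the decision boundary from data.
    """
    pairs = []
    for i, j in combinations(range(len(records)), 2):
        score = similarity(records[i], records[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


if __name__ == "__main__":
    # Hypothetical records: the first two differ only in letter case.
    sample = ["Raymond J. Mooney", "raymond j. mooney", "William W. Cohen"]
    for i, j, score in candidate_duplicates(sample):
        print(sample[i], "<->", sample[j], round(score, 2))
```

Comparing every pair is quadratic in the number of records, which is why larger collections usually need a blocking step that first narrows the set of candidate pairs.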

See the RIDDLE Repository on Identity Uncertainty, Duplicate Detection, and Record Linkage for datasets, bibliography, and more information on this topic.

  1. Adaptive Blocking: Learning to Scale Up Record Linkage
    Mikhail Bilenko, Beena Kamath, and Raymond J. Mooney
    In Proceedings of the Sixth IEEE International Conference on Data Mining (ICDM-06), 87-96, Hong Kong, December 2006.
  2. Learnable Similarity Functions and Their Application to Record Linkage and Clustering
    Mikhail Bilenko
    PhD Thesis, Department of Computer Sciences, University of Texas at Austin, Austin, TX, August 2006. 136 pages.
  3. Adaptive Product Normalization: Using Online Learning for Record Linkage in Comparison Shopping
    Mikhail Bilenko, Sugato Basu, and Mehran Sahami
    In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM-2005), 58-65, Houston, TX, November 2005.
  4. Alignments and String Similarity in Information Integration: A Random Field Approach
    Mikhail Bilenko and Raymond J. Mooney
    In Proceedings of the 2005 Dagstuhl Seminar on Machine Learning for the Semantic Web, Dagstuhl, Germany, February 2005.
  5. Learnable Similarity Functions and Their Applications to Clustering and Record Linkage
    Mikhail Bilenko
    In Proceedings of the Ninth AAAI/SIGART Doctoral Consortium, 981--982, San Jose, CA, July 2004.
  6. Learnable Similarity Functions and Their Applications to Record Linkage and Clustering
    Mikhail Bilenko
    Doctoral Dissertation Proposal, University of Texas at Austin, 2003.
  7. Adaptive Name-Matching in Information Integration
    Mikhail Bilenko, William W. Cohen, Stephen Fienberg, Raymond J. Mooney, and Pradeep Ravikumar
    IEEE Intelligent Systems, 18(5):16-23, 2003.
  8. On Evaluation and Training-Set Construction for Duplicate Detection
    Mikhail Bilenko and Raymond J. Mooney
    In Proceedings of the KDD-03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation, 7-12, Washington, DC, August 2003.
  9. Adaptive Duplicate Detection Using Learnable String Similarity Measures
    Mikhail Bilenko and Raymond J. Mooney
    In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), 39-48, Washington, DC, August 2003.
  10. Employing Trainable String Similarity Metrics for Information Integration
    Mikhail Bilenko and Raymond J. Mooney
    In Proceedings of the IJCAI-03 Workshop on Information Integration on the Web, 67-72, Acapulco, Mexico, August 2003.
  11. Two Approaches to Handling Noisy Variation in Text Mining
    Un Yong Nahm, Mikhail Bilenko, and Raymond J. Mooney
    In Papers from the Nineteenth International Conference on Machine Learning (ICML-2002) Workshop on Text Learning, 18-27, Sydney, Australia, July 2002.
  12. Learning to Combine Trained Distance Metrics for Duplicate Detection in Databases
    Mikhail Bilenko and Raymond J. Mooney
    Technical Report AI 02-296, Artificial Intelligence Laboratory, University of Texas at Austin, Austin, TX, February 2002.