Probabilistic Semi-Supervised Clustering with Constraints (2006)
In certain clustering tasks it is possible to obtain limited supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to same or different clusters. The resulting problem is known as semi-supervised clustering, an instance of semi-supervised learning stemming from a traditional unsupervised learning setting. Several algorithms exist for enhancing clustering quality by using supervision in the form of constraints. These algorithms typically utilize the pairwise constraints to either modify the clustering objective function or to learn the clustering distortion measure. This chapter describes an approach that employs Hidden Markov Random Fields (HMRFs) as a probabilistic generative model for semi-supervised clustering, thereby providing a principled framework for incorporating constraint-based supervision into prototype-based clustering. The HMRF-based model allows the use of a broad range of clustering distortion measures, including Bregman divergences (e.g., squared Euclidean distance, KL divergence) and directional distance measures (e.g., cosine distance), making it applicable to a number of domains. The model leads to the HMRF-KMeans algorithm which minimizes an objective function derived from the joint probability of the model, and allows unification of constraint-based and distance-based semi-supervised clustering methods. Additionally, a two-phase active learning algorithm for selecting informative pairwise constraints in a query-driven framework is derived from the HMRF model, facilitating improved clustering performance with relatively small amounts of supervision from the user.
View:
PDF, PS
Citation:
In Semi-Supervised Learning, O. Chapelle and B. Sch{"{o}}lkopf and A. Zien (Eds.), Cambridge, MA 2006. MIT Press.
Bibtex:

Sugato Basu Ph.D. Alumni sugato [at] cs utexas edu
Mikhail Bilenko Ph.D. Alumni mbilenko [at] microsoft com
Raymond J. Mooney Faculty mooney [at] cs utexas edu