
Publications: Semi-Supervised Learning

In many learning tasks there is a large supply of unlabeled data, but labeled data is scarce because it is expensive to generate. Semi-supervised learning combines labeled and unlabeled data during training to improve performance. Semi-supervised learning is applicable to both classification and clustering. In supervised classification, there is a known, fixed set of categories, and category-labeled training data is used to induce a classification function. In semi-supervised classification, training also exploits additional unlabeled data, frequently resulting in a more accurate classification function. In semi-supervised clustering, some labeled data is used along with the unlabeled data to obtain a better clustering.
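Before the listings, a concrete picture of the classification side of this idea may help. The sketch below uses self-training, one simple semi-supervised scheme: fit a classifier on the few labeled points, then let its confident predictions on unlabeled points become additional training data. It assumes scikit-learn, whose convention marks unlabeled examples with the label -1; the papers below develop more principled methods.

```python
# A minimal self-training sketch of semi-supervised classification.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=500, random_state=0)
y_partial = y.copy()
y_partial[50:] = -1  # keep only 50 labels; the rest are treated as unlabeled

clf = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
clf.fit(X, y_partial)  # iteratively adds its own confident predictions
print("accuracy on all data:", clf.score(X, y))
```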
  1. Dialog as a Vehicle for Lifelong Learning of Grounded Language Understanding Systems
    [Details] [PDF] [Slides (PDF)]
    Aishwarya Padmakumar
    PhD Thesis, Department of Computer Science, The University of Texas at Austin, August 2020.
    Natural language interfaces have the potential to make various forms of technology, including mobile phones and computers as well as robots or other machines such as ATMs and self-checkout counters, more accessible and less intimidating to users who are unfamiliar or uncomfortable with other types of interfaces. In particular, natural language understanding systems on physical robots face a number of challenges, including the need to ground language in perception, the ability to adapt to changes in the environment and novel uses of language, and the need to deal with uncertainty in understanding. To effectively handle these challenges, such systems need to perform lifelong learning - continually updating the scope and predictions of the model with user interactions. In this thesis, we discuss ways in which dialog interaction with users can be used to improve grounded natural language understanding systems, motivated by service robot applications. We focus on two types of queries that can be used in such dialog systems – active learning queries to elicit knowledge about the environment that can be used to improve perceptual models, and clarification questions that confirm the system’s hypotheses or elicit specific information required to complete a task. Our goal is to build a system that can learn how to interact with users, balancing quick completion of the tasks desired by the user with asking additional active learning questions to improve the underlying grounded language understanding components. We present work on jointly improving semantic parsers from, and learning dialog policies for, clarification dialogs that improve a robot’s ability to understand natural language commands. We introduce the framework of opportunistic active learning, where a robot introduces opportunistic queries that may not be immediately relevant into an interaction, in the hope of improving performance in future interactions. We demonstrate the usefulness of this framework in learning to ground natural language descriptions of objects and in learning a dialog policy for such interactions. We also learn dialog policies that balance task completion, opportunistic active learning, and attribute-based clarification questions. Finally, we attempt to expand this framework to different types of underlying models of grounded language understanding.
    ML ID: 389
  2. Interaction and Autonomy in RoboCup@Home and Building-Wide Intelligence
    [Details] [PDF]
    Justin Hart, Harel Yedidsion, Yuqian Jiang, Nick Walker, Rishi Shah, Jesse Thomason, Aishwarya Padmakumar, Rolando Fernandez, Jivko Sinapov, Raymond Mooney, Peter Stone
    In Artificial Intelligence (AI) for Human-Robot Interaction (HRI) Symposium, AAAI Fall Symposium Series, Arlington, Virginia, October 2018.
    Efforts are underway at UT Austin to build autonomous robot systems that address the challenges of long-term deployments in office environments and of the more prescribed domestic service tasks of the RoboCup@Home competition. We discuss the contrasts and synergies of these efforts, highlighting how our work to build a RoboCup@Home Domestic Standard Platform League entry led us to identify an integrated software architecture that could support both projects. Further, naturalistic deployments of our office robot platform as part of the Building-Wide Intelligence project have led us to identify and research new problems in a traditional laboratory setting.
    ML ID: 366
  3. Jointly Improving Parsing and Perception for Natural Language Commands through Human-Robot Dialog
    [Details] [PDF]
    Jesse Thomason, Aishwarya Padmakumar, Jivko Sinapov, Nick Walker, Yuqian Jiang, Harel Yedidsion, Justin Hart, Peter Stone, and Raymond J. Mooney
    In Late-breaking Track at the SIGDIAL Special Session on Physically Situated Dialogue (RoboDIAL-18), Melbourne, Australia, July 2018.
    In this work, we present methods for parsing natural language to underlying meanings, and using robotic sensors to create multi-modal models of perceptual concepts. We combine these steps towards language understanding into a holistic agent for jointly improving parsing and perception on a robotic platform through human-robot dialog. We train and evaluate this agent on Amazon Mechanical Turk, then demonstrate it on a robotic platform initialized from that conversational data. Our experiments show that improving both parsing and perception components from conversations improves communication quality and human ratings of the agent.
    ML ID: 365
  4. Jointly Improving Parsing and Perception for Natural Language Commands through Human-Robot Dialog
    [Details] [PDF]
    Jesse Thomason, Aishwarya Padmakumar, Jivko Sinapov, Nick Walker, Yuqian Jiang, Harel Yedidsion, Justin Hart, Peter Stone, and Raymond J. Mooney
    In the RSS Workshop on Models and Representations for Natural Human-Robot Communication (MRHRC-18), Robotics: Science and Systems (RSS), June 2018.
    Natural language understanding in robots needs to be robust to a wide range of both human speakers and human environments. Rather than force humans to use language that robots can understand, robots in human environments should dynamically adapt—continuously learning new language constructions and perceptual concepts as they are used in context. In this work, we present methods for parsing natural language to underlying meanings, and using robotic sensors to create multi-modal models of perceptual concepts. We combine these steps towards language understanding into a holistic agent for jointly improving parsing and perception on a robotic platform through human-robot dialog. We train and evaluate this agent on Amazon Mechanical Turk, then demonstrate it on a robotic platform initialized from conversational data gathered from Mechanical Turk. Our experiments show that improving both parsing and perception components from conversations improves communication quality and human ratings of the agent.
    ML ID: 363
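The two entries above are workshop versions of the same paper. The toy below illustrates, with invented stub names rather than the authors' actual system, the flow of training signal they describe: confirmations gathered in dialog supervise the semantic parser, and per-predicate labels on grounded objects supervise the perceptual concept models.

```python
# All names below are illustrative stubs, not the authors' actual system.
parser_data = []      # (utterance, semantic form) pairs for parser retraining
perception_data = []  # (predicate, object, label) triples for concept models

def simulated_user(question):
    # Stand-in for a human: here, always confirms the agent's hypothesis.
    print("agent:", question)
    return True

def dialog(utterance, hypothesis, grounded_object):
    if simulated_user(f"Did you mean: {hypothesis}?"):
        # The confirmed parse supervises the semantic parser ...
        parser_data.append((utterance, hypothesis))
        # ... and each predicate in it labels the grounded object.
        for predicate in hypothesis.split():
            perception_data.append((predicate, grounded_object, True))

dialog("bring the heavy red mug", "heavy red mug", "object_03")
print(parser_data)       # [('bring the heavy red mug', 'heavy red mug')]
print(perception_data)   # [('heavy', 'object_03', True), ...]
```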
  5. Continually Improving Grounded Natural Language Understanding through Human-Robot Dialog
    [Details] [PDF]
    Jesse Thomason
    PhD Thesis, Department of Computer Science, The University of Texas at Austin, April 2018.
    As robots become ubiquitous in homes and workplaces such as hospitals and factories, they must be able to communicate with humans. Several kinds of knowledge are required to understand and respond to a human's natural language commands and questions. If a person requests an assistant robot to "take me to Alice's office", the robot must know that Alice is a person who owns some unique office, and that "take me" means it should navigate there. Similarly, if a person requests "bring me the heavy, green mug", the robot must have accurate mental models of the physical concepts "heavy", "green", and "mug". To avoid forcing humans to use key phrases or words robots already know, this thesis focuses on helping robots understanding new language constructs through interactions with humans and with the world around them. To understand a command in natural language, a robot must first convert that command to an internal representation that it can reason with. Semantic parsing is a method for performing this conversion, and the target representation is often semantic forms represented as predicate logic with lambda calculus. Traditional semantic parsing relies on hand-crafted resources from a human expert: an ontology of concepts, a lexicon connecting language to those concepts, and training examples of language with abstract meanings. One thrust of this thesis is to perform semantic parsing with sparse initial data. We use the conversations between a robot and human users to induce pairs of natural language utterances with the target semantic forms a robot discovers through its questions, reducing the annotation effort of creating training examples for parsing. We use this data to build more dialog-capable robots in new domains with much less expert human effort. Meanings of many language concepts are bound to the physical world. Understanding object properties and categories, such as "heavy", "green", and "mug" requires interacting with and perceiving the physical world. Embodied robots can use manipulation capabilities, such as pushing, picking up, and dropping objects to gather sensory data about them. This data can be used to understand non-visual concepts like "heavy" and "empty" (e.g. "get the empty carton of milk from the fridge"), and assist with concepts that have both visual and non-visual expression (e.g. "tall" things look big and also exert force sooner than "short" things when pressed down on). A second thrust of this thesis focuses on strategies for learning these concepts using multi-modal sensory information. We use human-in-the-loop learning to get labels between concept words and actual objects in the environment. We also explore ways to tease out polysemy and synonymy in concept words such as "light", which can refer to a weight or a color, the latter sense being synonymous with "pale". Additionally, pushing, picking up, and dropping objects to gather sensory information is prohibitively time-consuming, so we investigate strategies for using linguistic information and human input to expedite exploration when learning a new concept. Finally, we build an integrated agent with both parsing and perception capabilities that learns from conversations with users to improve both components over time. We demonstrate that parser learning from conversations can be combined with multi-modal perception using predicate-object labels gathered through opportunistic active learning during those conversations to improve performance for understanding natural language commands from humans. 
Human users also qualitatively rate this integrated learning agent as more usable after it has improved from conversation-based learning.
    ML ID: 361
  6. Guiding Exploratory Behaviors for Multi-Modal Grounding of Linguistic Descriptions
    [Details] [PDF]
    Jesse Thomason, Jivko Sinapov, Raymond Mooney, Peter Stone
    In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), February 2018.
    A major goal of grounded language learning research is to enable robots to connect language predicates to a robot’s physical interactive perception of the world. Coupling object exploratory behaviors such as grasping, lifting, and looking with multiple sensory modalities (e.g., audio, haptics, and vision) enables a robot to ground non-visual words like “heavy” as well as visual words like “red”. A major limitation of existing approaches to multi-modal language grounding is that a robot has to exhaustively explore training objects with a variety of actions when learning each new language predicate. This paper proposes a method for guiding a robot’s behavioral exploration policy when learning a novel predicate based on known grounded predicates and the novel predicate’s linguistic relationship to them. We demonstrate our approach on two datasets in which a robot explored large sets of objects and was tasked with learning to recognize whether novel words applied to those objects.
    ML ID: 357
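As a rough illustration of the guidance idea in the entry above: score each exploratory behavior for a novel predicate by how reliable that behavior was for linguistically related, already-grounded predicates. The embeddings and reliability numbers below are invented, and the similarity-weighted average is only one plausible reading of the approach.

```python
import numpy as np

# reliability[predicate][behavior]: how well that behavior's sensory data
# predicted the predicate on previously explored objects (hypothetical).
reliability = {
    "heavy": {"lift": 0.9, "look": 0.2, "grasp": 0.5},
    "red":   {"lift": 0.1, "look": 0.9, "grasp": 0.2},
}
vectors = {  # stand-in word embeddings
    "heavy": np.array([0.9, 0.1]),
    "red":   np.array([0.1, 0.9]),
    "light": np.array([0.8, 0.2]),  # novel predicate, related to "heavy"
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def behavior_scores(novel):
    # similarity-weighted average of each behavior's known reliability
    scores = {}
    for behavior in ["lift", "look", "grasp"]:
        weights = [cosine(vectors[novel], vectors[p]) for p in reliability]
        scores[behavior] = sum(
            w * reliability[p][behavior] for w, p in zip(weights, reliability)
        ) / sum(weights)
    return scores

print(behavior_scores("light"))  # "lift" should score highest
```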
  7. Improving Black-box Speech Recognition using Semantic Parsing
    [Details] [PDF] [Poster]
    Rodolfo Corona and Jesse Thomason and Raymond J. Mooney
    In Proceedings of the 8th International Joint Conference on Natural Language Processing (IJCNLP-17), 122-127, Taipei, Taiwan, November 2017.
    Speech is a natural channel for human-computer interaction in robotics and consumer applications. Natural language understanding pipelines that start with speech can have trouble recovering from speech recognition errors. Black-box automatic speech recognition (ASR) systems, built for general purpose use, are unable to take advantage of in-domain language models that could otherwise ameliorate these errors. In this work, we present a method for re-ranking black-box ASR hypotheses using an in-domain language model and semantic parser trained for a particular task. Our re-ranking method significantly improves both transcription accuracy and semantic understanding over a state-of-the-art ASR’s vanilla output.
    ML ID: 351
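The re-ranking itself can be pictured as a weighted combination of the black-box ASR score with in-domain scores. A minimal sketch, with illustrative weights and stand-in scoring functions (the paper trains its language model and semantic parser for the task):

```python
# Re-rank an ASR n-best list: combine the acoustic score with an in-domain
# language model score and a semantic parser confidence. alpha and beta are
# illustrative weights, not the paper's tuned values.
def rerank(nbest, lm_score, parser_confidence, alpha=1.0, beta=1.0):
    def score(hyp):
        text, asr_score = hyp
        return asr_score + alpha * lm_score(text) + beta * parser_confidence(text)
    return max(nbest, key=score)

# Toy usage: the black-box ASR slightly prefers "way the table", but the
# in-domain scores prefer the parseable command.
nbest = [("weigh the table", -4.0), ("way the table", -3.5)]
best = rerank(
    nbest,
    lm_score=lambda t: 0.0 if t.startswith("weigh") else -2.0,
    parser_confidence=lambda t: 1.0 if t.startswith("weigh") else 0.0,
)
print(best)  # ('weigh the table', -4.0)
```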
  8. Knowledge Transfer Using Latent Variable Models
    [Details] [PDF] [Slides (PDF)]
    Ayan Acharya
    PhD Thesis, Department of Electrical and Computer Engineering, The University of Texas at Austin, August 2015.
    In several applications, scarcity of labeled data is a challenging problem that hinders the predictive capabilities of machine learning algorithms. Additionally, the distribution of the data changes over time, rendering models trained with older data less capable of discovering useful structure from newly available data. Transfer learning is a convenient framework for overcoming such problems: the learning of a model specific to one domain can benefit the learning of models in other domains, either through simultaneous training of the domains or through sequential transfer of knowledge from one domain to the others. This thesis explores the opportunities of knowledge transfer in the context of a few applications pertaining to object recognition from images, text analysis, network modeling, and recommender systems, using probabilistic latent variable models as building blocks. Both simultaneous and sequential knowledge transfer are achieved through the latent variables, either by sharing them across multiple related domains (for simultaneous learning) or by adapting their distributions to fit data from a new domain (for sequential learning).
    ML ID: 322
  9. Inducing Grammars from Linguistic Universals and Realistic Amounts of Supervision
    [Details] [PDF]
    Dan Garrette
    PhD Thesis, Department of Computer Science, The University of Texas at Austin, 2015.
    The best performing NLP models to date are learned from large volumes of manually-annotated data. For tasks like part-of-speech tagging and grammatical parsing, high performance can be achieved with plentiful supervised data. However, such resources are extremely costly to produce, making them an unlikely option for building NLP tools in under-resourced languages or domains.

    This dissertation is concerned with reducing the annotation required to learn NLP models, with the goal of opening up the range of domains and languages to which NLP technologies may be applied. In this work, we explore the possibility of learning from a degree of supervision that is at or close to the amount that could reasonably be collected from annotators for a particular domain or language that currently has none. We show that just a small amount of annotation input — even that which can be collected in just a few hours — can provide enormous advantages if we have learning algorithms that can appropriately exploit it.

    This work presents new algorithms, models, and approaches designed to learn grammatical information from weak supervision. In particular, we look at ways of intersecting a variety of different forms of supervision in complementary ways, thus lowering the overall annotation burden. Sources of information include tag dictionaries, morphological analyzers, constituent bracketings, and partial tree annotations, as well as unannotated corpora. For example, we present algorithms that are able to combine faster-to-obtain type-level annotation with unannotated text to remove the need for slower-to-obtain token-level annotation.

    Much of this dissertation describes work on Combinatory Categorial Grammar (CCG), a grammatical formalism notable for its use of structured, logic-backed categories that describe how each word and constituent fits into the overall syntax of the sentence. This work shows how linguistic universals intrinsic to the CCG formalism itself can be encoded as Bayesian priors to improve learning.

    ML ID: 321
  10. A Supertag-Context Model for Weakly-Supervised CCG Parser Learning
    [Details] [PDF] [Slides (PDF)]
    Dan Garrette and Chris Dyer and Jason Baldridge and Noah A. Smith
    In Proceedings of the 2015 Conference on Computational Natural Language Learning (CoNLL-2015), 22-31, Beijing, China, 2015.
    Combinatory Categorial Grammar (CCG) is a lexicalized grammar formalism in which words are associated with categories that specify the syntactic configurations in which they may occur. We present a novel parsing model with the capacity to capture the associative adjacent-category relationships intrinsic to CCG by parameterizing the relationships between each constituent label and the preterminal categories directly to its left and right, biasing the model toward constituent categories that can combine with their contexts. This builds on the intuitions of Klein and Manning's (2002) "constituent-context" model, which demonstrated the value of modeling context, but has the advantage of being able to exploit the properties of CCG. Our experiments show that our model outperforms a baseline in which this context information is not captured.
    ML ID: 320
  11. Weakly-Supervised Grammar-Informed Bayesian CCG Parser Learning
    [Details] [PDF] [Slides (PDF)]
    Dan Garrette, Chris Dyer, Jason Baldridge, Noah A. Smith
    In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15), Austin, TX, January 2015.
    Combinatory Categorial Grammar (CCG) is a lexicalized grammar formalism in which words are associated with categories that, in combination with a small universal set of rules, specify the syntactic configurations in which they may occur. Previous work has shown that learning sequence models for CCG tagging can be improved by using priors that are sensitive to the formal properties of CCG as well as cross-linguistic universals. We extend this approach to the task of learning a full CCG parser from weak supervision. We present a Bayesian formulation for CCG parser induction that assumes only supervision in the form of an incomplete tag dictionary mapping some word types to sets of potential categories. Our approach outperforms a baseline model trained with uniform priors by exploiting universal, intrinsic properties of the CCG formalism to bias the model toward simpler, more cross-linguistically common categories.
    ML ID: 310
  12. Weakly-Supervised Bayesian Learning of a CCG Supertagger
    [Details] [PDF] [Slides (PDF)] [Poster]
    Dan Garrette and Chris Dyer and Jason Baldridge and Noah A. Smith
    In Proceedings of the Eighteenth Conference on Computational Natural Language Learning (CoNLL-2014), 141-150, Baltimore, MD, June 2014.
    We present a Bayesian formulation for weakly-supervised learning of a Combinatory Categorial Grammar (CCG) supertagger with an HMM. We assume supervision in the form of a tag dictionary, and our prior encourages the use of cross-linguistically common category structures as well as transitions between tags that can combine locally according to CCG's combinators. Our prior is theoretically appealing since it is motivated by language-independent, universal properties of the CCG formalism. Empirically, we show that it yields substantial improvements over previous work that used similar biases to initialize an EM-based learner. Additional gains are obtained by further shaping the prior with corpus-specific information that is extracted automatically from raw text and a tag dictionary.
    ML ID: 307
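One ingredient of such a prior is easy to picture: cross-linguistically common categories tend to be structurally simple, so the prior can penalize each slash operator. The sketch below shows only that simplicity bias; the paper's actual prior is a richer generative model that also favors locally combinable tag transitions.

```python
# A toy complexity-based prior over CCG categories: simpler (less deeply
# slashed) categories get higher prior probability. Illustrative only.
import math

def complexity(category):
    # proxy for structural complexity: count slash operators
    return category.count("/") + category.count("\\")

def log_prior(category, decay=0.5):
    # probability decreases geometrically with the operator count
    return complexity(category) * math.log(decay)

for cat in ["N", "NP", "S\\NP", "(S\\NP)/NP"]:
    print(cat, round(log_prior(cat), 3))
# Atomic categories (N, NP) get log-prior 0; the transitive verb category
# is penalized once per slash.
```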
  13. Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages
    [Details] [PDF]
    Dan Garrette and Jason Mielens and Jason Baldridge
    In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL-2013), 583-592, Sofia, Bulgaria, August 2013.
    Developing natural language processing tools for low-resource languages often requires creating resources from scratch. While a variety of semi-supervised methods exist for training from incomplete data, there are open questions regarding what types of training data should be used and how much is necessary. We discuss a series of experiments designed to shed light on such questions in the context of part-of-speech tagging. We obtain timed annotations from linguists for the low-resource languages Kinyarwanda and Malagasy (as well as English) and evaluate how the amounts of various kinds of data affect performance of a trained POS-tagger. Our results show that annotation of word types is the most important, provided a sufficiently capable semi-supervised learning infrastructure is in place to project type information onto a raw corpus. We also show that finite-state morphological analyzers are effective sources of type information when few labeled examples are available.
    ML ID: 288
  14. Learning a Part-of-Speech Tagger from Two Hours of Annotation
    [Details] [PDF] [Slides (PDF)] [Video]
    Dan Garrette, Jason Baldridge
    In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT-13), 138-147, Atlanta, GA, June 2013.
    Most work on weakly-supervised learning for part-of-speech taggers has been based on unrealistic assumptions about the amount and quality of training data. For this paper, we attempt to create true low-resource scenarios by allowing a linguist just two hours to annotate data and evaluating on the languages Kinyarwanda and Malagasy. Given these severely limited amounts of either type supervision (tag dictionaries) or token supervision (labeled sentences), we are able to dramatically improve the learning of a hidden Markov model through our method of automatically generalizing the annotations, reducing noise, and inducing word-tag frequency information.
    ML ID: 283
  15. Type-Supervised Hidden Markov Models for Part-of-Speech Tagging with Incomplete Tag Dictionaries
    [Details] [PDF]
    Dan Garrette and Jason Baldridge
    In Proceedings of the Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2012), 821-831, Jeju, Korea, July 2012.
    Past work on learning part-of-speech taggers from tag dictionaries and raw data has reported good results, but the assumptions made about those dictionaries are often unrealistic: due to historical precedents, they assume access to information about labels in the raw and test sets. Here, we demonstrate ways to learn hidden Markov model taggers from incomplete tag dictionaries. Taking the MIN-GREEDY algorithm (Ravi et al., 2010) as a starting point, we improve it with several intuitive heuristics. We also define a simple HMM emission initialization that takes advantage of the tag dictionary and raw data to capture both the openness of a given tag and its estimated prevalence in the raw data. Altogether, our augmentations produce improvements to performance over the original MIN-GREEDY algorithm for both English and Italian data.
    ML ID: 279
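The starting point of this line of work can be sketched concretely: mask an HMM's emission matrix with the tag dictionary, and spread unknown words over tags in proportion to a rough openness estimate. The initialization below is a simplified stand-in for the one described in the paper.

```python
# Tag-dictionary-constrained HMM emission initialization (simplified).
import numpy as np

tags = ["DET", "NOUN", "VERB"]
vocab = ["the", "dog", "runs", "zorbs"]          # "zorbs" is unknown
tag_dict = {"the": {"DET"}, "dog": {"NOUN"}, "runs": {"VERB", "NOUN"}}

emissions = np.zeros((len(tags), len(vocab)))
# crude openness proxy: number of dictionary word types allowing each tag
type_counts = {t: sum(t in v for v in tag_dict.values()) for t in tags}
for j, w in enumerate(vocab):
    allowed = tag_dict.get(w)
    for i, t in enumerate(tags):
        if allowed is not None:
            emissions[i, j] = 1.0 if t in allowed else 0.0  # dictionary mask
        else:
            emissions[i, j] = type_counts[t]  # unknown word: open tags favored
emissions /= emissions.sum(axis=1, keepdims=True)  # row-normalize per tag
print(np.round(emissions, 2))
```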
  16. Semi-supervised graph clustering: a kernel approach
    [Details] [PDF]
    Brian Kulis, Sugato Basu, Inderjit Dhillon, and Raymond Mooney
    Machine Learning Journal, 74(1):1-22, 2009.
    Semi-supervised clustering algorithms aim to improve clustering results using limited supervision. The supervision is generally given as pairwise constraints; such constraints are natural for graphs, yet most semi-supervised clustering algorithms are designed for data represented as vectors. In this paper, we unify vector-based and graph-based approaches. We first show that a recently-proposed objective function for semi-supervised clustering based on Hidden Markov Random Fields, with squared Euclidean distance and a certain class of constraint penalty functions, can be expressed as a special case of the weighted kernel k-means objective (Dhillon et al., in Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining, 2004a). A recent theoretical connection between weighted kernel k-means and several graph clustering objectives enables us to perform semi-supervised clustering of data given either as vectors or as a graph. For graph data, this result leads to algorithms for optimizing several new semi-supervised graph clustering objectives. For vector data, the kernel approach also enables us to find clusters with non-linear boundaries in the input data space. Furthermore, we show that recent work on spectral learning (Kamvar et al., in Proceedings of the 17th International Joint Conference on Artificial Intelligence, 2003) may be viewed as a special case of our formulation. We empirically show that our algorithm is able to outperform current state-of-the-art semi-supervised algorithms on both vector-based and graph-based data sets.
    ML ID: 224
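The construction at the heart of this paper is compact enough to sketch: add a reward to the kernel entries of must-linked pairs, subtract a penalty for cannot-linked pairs, shift the diagonal to keep the matrix a valid kernel, and run a kernel clustering routine. The sketch below substitutes scikit-learn's SpectralClustering for weighted kernel k-means, leaning on the connection between the two objectives that the paper itself establishes.

```python
# Fold pairwise constraints into a kernel matrix, then cluster it.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.cluster import SpectralClustering

X = np.random.RandomState(0).randn(20, 2)
K = rbf_kernel(X)

w = 0.5
for i, j in [(0, 1), (2, 3)]:        # must-link pairs: reward similarity
    K[i, j] += w; K[j, i] += w
for i, j in [(0, 10), (1, 11)]:      # cannot-link pairs: penalize it
    K[i, j] -= w; K[j, i] -= w

K += (w * 4) * np.eye(len(K))        # diagonal shift to stay PSD

labels = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
).fit_predict(np.clip(K, 0, None))   # spectral code expects nonneg affinities
print(labels)
```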
  17. Watch, Listen & Learn: Co-training on Captioned Images and Videos
    [Details] [PDF]
    Sonal Gupta, Joohyun Kim, Kristen Grauman and Raymond Mooney
    In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), 457-472, Antwerp, Belgium, September 2008.
    Recognizing visual scenes and activities is challenging: often visual cues alone are ambiguous, and it is expensive to obtain manually labeled examples from which to learn. To cope with these constraints, we propose to leverage the text that often accompanies visual data to learn robust models of scenes and actions from partially labeled collections. Our approach uses co-training, a semi-supervised learning method that accommodates multi-modal views of data. To classify images, our method learns from captioned images of natural scenes; and to recognize human actions, it learns from videos of athletic events with commentary. We show that by exploiting both multi-modal representations and unlabeled data our approach learns more accurate image and video classifiers than standard baseline algorithms.
    ML ID: 221
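Co-training itself is a short algorithm. A minimal two-view sketch under simplifying assumptions (synthetic features, a shared pool of pseudo-labels, logistic regression in place of the paper's classifiers):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def cotrain(Xv, Xt, y, labeled, unlabeled, rounds=5, per_round=10):
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for X in (Xv, Xt):  # alternate between the two views
            if not unlabeled:
                break
            clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
            proba = clf.predict_proba(X[unlabeled])
            best = np.argsort(-proba.max(axis=1))[:per_round]
            for b in best:                       # most confident predictions
                y[unlabeled[b]] = proba[b].argmax()
                labeled.append(unlabeled[b])     # join the labeled pool
            unlabeled = [u for k, u in enumerate(unlabeled)
                         if k not in set(best)]
    # final classifier sees both views concatenated
    return LogisticRegression(max_iter=1000).fit(
        np.hstack([Xv, Xt])[labeled], y[labeled])

rng = np.random.RandomState(0)
Xv, Xt = rng.randn(100, 5), rng.randn(100, 5)   # two synthetic "views"
y = (Xv[:, 0] + Xt[:, 0] > 0).astype(int)
model = cotrain(Xv, Xt, y.copy(), labeled=range(10), unlabeled=range(10, 100))
```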
  18. Semi-Supervised Learning for Semantic Parsing using Support Vector Machines
    [Details] [PDF] [Slides (PPT)]
    Rohit J. Kate and Raymond J. Mooney
    In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Short Papers (NAACL/HLT-2007), 81-84, Rochester, NY, April 2007.
    We present a method for utilizing unannotated sentences to improve a semantic parser which maps natural language (NL) sentences into their formal meaning representations (MRs). Given NL sentences annotated with their MRs, the initial supervised semantic parser learns the mapping by training Support Vector Machine (SVM) classifiers for every production in the MR grammar. Our new method applies the learned semantic parser to the unannotated sentences and collects unlabeled examples which are then used to retrain the classifiers using a variant of transductive SVMs. Experimental results show the improvements obtained over the purely supervised parser, particularly when the annotated training set is small.
    ML ID: 198
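A faithful transductive SVM optimizes the margin over the unlabeled points directly; the loop below is only a rough stand-in that captures the retraining idea, down-weighting pseudo-labeled examples via scikit-learn's sample_weight. The per-production classifier structure of the actual parser is omitted.

```python
import numpy as np
from sklearn.svm import SVC

def transductive_style_svm(X_l, y_l, X_u, c_u=0.2, iters=3):
    clf = SVC(kernel="linear").fit(X_l, y_l)
    for _ in range(iters):
        y_u = clf.predict(X_u)                      # current pseudo-labels
        X = np.vstack([X_l, X_u])
        y = np.concatenate([y_l, y_u])
        weights = np.concatenate([np.ones(len(y_l)),         # annotated: full
                                  c_u * np.ones(len(y_u))])  # pseudo: reduced
        clf = SVC(kernel="linear").fit(X, y, sample_weight=weights)
    return clf

rng = np.random.RandomState(0)
X = rng.randn(60, 2)
y = (X[:, 0] > 0).astype(int)
model = transductive_style_svm(X[:10], y[:10], X[10:])
```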
  19. Learnable Similarity Functions and Their Application to Record Linkage and Clustering
    [Details] [PDF]
    Mikhail Bilenko
    PhD Thesis, Department of Computer Sciences, University of Texas at Austin, Austin, TX, August 2006. 136 pages.
    Many machine learning and data mining tasks depend on functions that estimate similarity between instances. Similarity computations are particularly important in clustering and information integration applications, where pairwise distances play a central role in many algorithms. Typically, algorithms for these tasks rely on pre-defined similarity measures, such as edit distance or cosine similarity for strings, or Euclidean distance for vector-space data. However, standard distance functions are frequently suboptimal as they do not capture the appropriate notion of similarity for a particular domain, dataset, or application.

    In this thesis, we present several approaches for addressing this problem by employing learnable similarity functions. Given supervision in the form of similar or dissimilar pairs of instances, learnable similarity functions can be trained to provide accurate estimates for the domain and task at hand. We study the problem of adapting similarity functions in the context of several tasks: record linkage, clustering, and blocking. For each of these tasks, we present learnable similarity functions and training algorithms that lead to improved performance.

    In record linkage, also known as duplicate detection and entity matching, the goal is to identify database records referring to the same underlying entity. This requires estimating similarity between corresponding field values of records, as well as overall similarity between records. For computing field-level similarity between strings, we describe two learnable variants of edit distance that lead to improvements in linkage accuracy. For learning record-level similarity functions, we employ Support Vector Machines to combine similarities of individual record fields in proportion to their relative importance, yielding a high-accuracy linkage system. We also investigate strategies for efficient collection of training data which can be scarce due to the pairwise nature of the record linkage task.

    In clustering, similarity functions are essential as they determine the grouping of instances that is the goal of clustering. We describe a framework for integrating learnable similarity functions within a probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs). The framework accommodates learning various distance measures, including those based on Bregman divergences (e.g., parameterized Mahalanobis distance and parameterized KL-divergence), as well as directional measures (e.g., cosine similarity). Thus, it is applicable to a wide range of domains and data representations. Similarity functions are learned within the HMRF-KMeans algorithm derived from the framework, leading to significant improvements in clustering accuracy.

    The third application we consider, blocking, is critical in making record linkage and clustering algorithms scalable to large datasets, as it facilitates efficient selection of approximately similar instance pairs without explicitly considering all possible pairs. Previously proposed blocking methods require manually constructing a similarity function or a set of similarity predicates, followed by hand-tuning of parameters. We propose learning blocking functions automatically from linkage and semi-supervised clustering supervision, which allows automatic construction of blocking methods that are efficient and accurate. This approach yields computationally cheap learnable similarity functions that can be used for scaling up in a variety of tasks that rely on pairwise distance computations, including record linkage and clustering.

    ML ID: 193
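The record-level half of this design reduces to a small pattern: compute one similarity per field, then train a classifier to weigh the fields. The sketch below uses difflib's ratio in place of the thesis's learnable edit distances and logistic regression in place of its SVM, on invented record pairs.

```python
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def field_sims(rec_a, rec_b):
    # one similarity score per corresponding field
    return [SequenceMatcher(None, a, b).ratio() for a, b in zip(rec_a, rec_b)]

# (record pair, is-duplicate) supervision, invented for illustration
pairs = [
    ((("John Smith", "12 Oak St"), ("Jon Smith", "12 Oak Street")), 1),
    ((("John Smith", "12 Oak St"), ("Mary Jones", "99 Elm Ave")), 0),
    ((("ACME Corp.", "PO Box 7"), ("ACME Corporation", "P.O. Box 7")), 1),
    ((("ACME Corp.", "PO Box 7"), ("Apex Inc.", "14 Main St")), 0),
]
X = [field_sims(a, b) for (a, b), _ in pairs]
y = [label for _, label in pairs]

linker = LogisticRegression().fit(X, y)   # learns the field weighting
query = field_sims(("Jon Smyth", "12 Oak St."), ("John Smith", "12 Oak St"))
print(linker.predict_proba([query])[:, 1])  # probability of a duplicate
```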
  20. Probabilistic Semi-Supervised Clustering with Constraints
    [Details] [PDF]
    Sugato Basu, Mikhail Bilenko, Arindam Banerjee and Raymond J. Mooney
    In O. Chapelle, B. Schölkopf, and A. Zien, editors, Semi-Supervised Learning, Cambridge, MA, 2006. MIT Press.
    In certain clustering tasks it is possible to obtain limited supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to same or different clusters. The resulting problem is known as semi-supervised clustering, an instance of semi-supervised learning stemming from a traditional unsupervised learning setting. Several algorithms exist for enhancing clustering quality by using supervision in the form of constraints. These algorithms typically utilize the pairwise constraints to either modify the clustering objective function or to learn the clustering distortion measure. This chapter describes an approach that employs Hidden Markov Random Fields (HMRFs) as a probabilistic generative model for semi-supervised clustering, thereby providing a principled framework for incorporating constraint-based supervision into prototype-based clustering. The HMRF-based model allows the use of a broad range of clustering distortion measures, including Bregman divergences (e.g., squared Euclidean distance, KL divergence) and directional distance measures (e.g., cosine distance), making it applicable to a number of domains. The model leads to the HMRF-KMeans algorithm which minimizes an objective function derived from the joint probability of the model, and allows unification of constraint-based and distance-based semi-supervised clustering methods. Additionally, a two-phase active learning algorithm for selecting informative pairwise constraints in a query-driven framework is derived from the HMRF model, facilitating improved clustering performance with relatively small amounts of supervision from the user.
    ML ID: 176
  21. Semi-supervised Clustering: Probabilistic Models, Algorithms and Experiments
    [Details] [PDF]
    Sugato Basu
    PhD Thesis, University of Texas at Austin, 2005.
    Clustering is one of the most common data mining tasks, used frequently for data categorization and analysis in both industry and academia. The focus of our research is on semi-supervised clustering, where we study how prior knowledge, gathered either from automated information sources or human supervision, can be incorporated into clustering algorithms. In this thesis, we present probabilistic models for semi-supervised clustering, develop algorithms based on these models, and empirically validate their performance through extensive experiments on data sets from different domains, e.g., text analysis, hand-written character recognition, and bioinformatics.

    In many domains where clustering is applied, some prior knowledge is available either in the form of labeled data (specifying the category to which an instance belongs) or pairwise constraints on some of the instances (specifying whether two instances should be in same or different clusters). In this thesis, we first analyze effective methods of incorporating labeled supervision into prototype-based clustering algorithms, and propose two variants of the well-known KMeans algorithm that can improve their performance with limited labeled data.

    We then focus on the problem of semi-supervised clustering with constraints and show how this problem can be studied in the framework of a well-defined probabilistic generative model of a Hidden Markov Random Field. We derive an efficient KMeans-type iterative algorithm, HMRF-KMeans, for optimizing a semi-supervised clustering objective function defined on the HMRF model. We also give convergence guarantees of our algorithm for a large class of clustering distortion measures (e.g., squared Euclidean distance, KL divergence, and cosine distance).

    Finally, we develop an active learning algorithm for acquiring maximally informative pairwise constraints in an interactive query-driven framework, which to our knowledge is the first active learning algorithm for semi-supervised clustering with constraints.

    Other interesting problems of semi-supervised clustering that we discuss in this thesis include (1) semi-supervised graph-based clustering using kernels, (2) using prior knowledge to improve overlapping clustering of data, (3) integration of both constraint-based and distance-based semi-supervised clustering methods using the HMRF model, and (4) model selection techniques that use the available supervision to automatically select the right number of clusters.
    ML ID: 174
  22. Semi-supervised Graph Clustering: A Kernel Approach
    [Details] [PDF]
    Brian Kulis, Sugato Basu, Inderjit Dhillon, and Raymond J. Mooney
    In Proceedings of the 22nd International Conference on Machine Learning, 457-464, Bonn, Germany, August 2005. (Distinguished Student Paper Award).
    Semi-supervised clustering algorithms aim to improve clustering results using limited supervision. The supervision is generally given as pairwise constraints; such constraints are natural for graphs, yet most of the semi-supervised clustering algorithms are designed for data represented as vectors. In this paper, we unify vector-based and graph-based approaches. We show that a recently-proposed objective function for semi-supervised clustering based on Hidden Markov Random Fields, with squared Euclidean distance and a certain class of constraint penalty functions, can be expressed as a special case of the weighted kernel k-means objective. A recent theoretical connection between kernel k-means and several graph clustering objectives enables us to perform semi-supervised clustering of data given either as vectors or as a graph. For vector data, the kernel approach also enables us to find clusters with non-linear boundaries in the input data space. Furthermore, we show that recent work on spectral learning may be viewed as a special case of our formulation. We empirically show that our algorithm is able to outperform current state-of-the-art semi-supervised algorithms on both vector-based and graph-based data sets.
    ML ID: 167
  23. A Probabilistic Framework for Semi-Supervised Clustering
    [Details] [PDF]
    Sugato Basu, Mikhail Bilenko, and Raymond J. Mooney
    In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), 59-68, Seattle, WA, August 2004.
    Unsupervised clustering can be significantly improved using supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to same or different clusters. In recent years, a number of algorithms have been proposed for enhancing clustering quality by employing such supervision. Such methods use the constraints to either modify the objective function, or to learn the distance measure. We propose a probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs) that provides a principled framework for incorporating supervision into prototype-based clustering. The model generalizes a previous approach that combines constraints and Euclidean distance learning, and allows the use of a broad range of clustering distortion measures, including Bregman divergences (e.g., Euclidean distance and I-divergence) and directional similarity measures (e.g., cosine similarity). We present an algorithm that performs partitional semi-supervised clustering of data by minimizing an objective function derived from the posterior energy of the HMRF model. Experimental results on several text data sets demonstrate the advantages of the proposed framework.
    ML ID: 154
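The objective function described above yields a KMeans-like loop whose assignment step trades off distance against constraint-violation penalties. A compact, greedy (ICM-style) sketch of one such iteration, not the full HMRF-KMeans algorithm:

```python
import numpy as np

def hmrf_kmeans_step(X, centroids, must, cannot, labels, w=1.0):
    for i in range(len(X)):                 # greedy collective assignment
        costs = np.linalg.norm(centroids - X[i], axis=1) ** 2
        for h in range(len(centroids)):
            # a + b - i is the other endpoint of a constraint pair touching i
            costs[h] += sum(w for a, b in must
                            if i in (a, b) and labels[a + b - i] != h)
            costs[h] += sum(w for a, b in cannot
                            if i in (a, b) and labels[a + b - i] == h)
        labels[i] = int(np.argmin(costs))
    for h in range(len(centroids)):         # M-step: recompute centroids
        if np.any(labels == h):
            centroids[h] = X[labels == h].mean(axis=0)
    return labels, centroids

X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
labels, cents = hmrf_kmeans_step(X, centroids=X[[0, 2]].copy(),
                                 must=[(0, 1)], cannot=[(1, 2)],
                                 labels=np.zeros(4, dtype=int))
print(labels)   # the constraints pull points 0,1 together and 1,2 apart
```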
  24. Semi-supervised Clustering with Limited Background Knowledge
    [Details] [PDF]
    Sugato Basu
    In Proceedings of the Ninth AAAI/SIGART Doctoral Consortium, 979-980, San Jose, CA, July 2004.
    ML ID: 149
  25. Learnable Similarity Functions and Their Applications to Clustering and Record Linkage
    [Details] [PDF]
    Mikhail Bilenko
    In Proceedings of the Ninth AAAI/SIGART Doctoral Consortium, 981-982, San Jose, CA, July 2004.
    ML ID: 148
  26. Integrating Constraints and Metric Learning in Semi-Supervised Clustering
    [Details] [PDF]
    Mikhail Bilenko, Sugato Basu, and Raymond J. Mooney
    In Proceedings of 21st International Conference on Machine Learning (ICML-2004), 81-88, Banff, Canada, July 2004.
    Semi-supervised clustering employs a small amount of labeled data to aid unsupervised learning. Previous work in the area has utilized supervised data in one of two approaches: 1) constraint-based methods that guide the clustering algorithm towards a better grouping of the data, and 2) distance-function learning methods that adapt the underlying similarity metric used by the clustering algorithm. This paper provides new methods for the two approaches as well as presents a new semi-supervised clustering algorithm that integrates both of these techniques in a uniform, principled framework. Experimental results demonstrate that the unified approach produces better clusters than both individual approaches as well as previously proposed semi-supervised clustering algorithms.
    ML ID: 147
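The metric-learning half of this integration can be pictured as learning feature weights that shrink dimensions along which must-linked pairs differ. The update below is a simplified stand-in for the algorithm's actual per-cluster metric updates:

```python
import numpy as np

def diagonal_metric(X, must_links, eps=1e-6):
    # weight each dimension inversely to how much must-linked pairs differ
    diffs = np.array([X[a] - X[b] for a, b in must_links])
    return 1.0 / (np.mean(diffs ** 2, axis=0) + eps)

def weighted_dist2(x, y, w):
    return float(np.sum(w * (x - y) ** 2))

X = np.array([[0.0, 0.0], [0.1, 3.0], [1.0, 0.2]])
w = diagonal_metric(X, [(0, 1)])
# dimension 1 varies a lot within the must-linked pair, so it gets less weight
print(w, weighted_dist2(X[0], X[2], w))
```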
  27. A Comparison of Inference Techniques for Semi-supervised Clustering with Hidden Markov Random Fields
    [Details] [PDF]
    Mikhail Bilenko and Sugato Basu
    In Proceedings of the ICML-2004 Workshop on Statistical Relational Learning and its Connections to Other Fields (SRL-2004), Banff, Canada, July 2004.
    Recently, a number of methods have been proposed for semi-supervised clustering that employ supervision in the form of pairwise constraints. We describe a probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs) that incorporates relational supervision. The model leads to an EM-style clustering algorithm, the E-step of which requires collective assignment of instances to cluster centroids under the constraints. We evaluate three known techniques for such collective assignment: belief propagation, linear programming relaxation, and iterated conditional modes (ICM). The first two methods attempt to globally approximate the optimal assignment, while ICM is a greedy method. Experimental results indicate that global methods outperform the greedy approach when relational supervision is limited, while their benefits diminish as more pairwise constraints are provided.
    ML ID: 144
  28. Active Semi-Supervision for Pairwise Constrained Clustering
    [Details] [PDF]
    Sugato Basu, Arindam Banerjee, and Raymond J. Mooney
    In Proceedings of the 2004 SIAM International Conference on Data Mining (SDM-04), April 2004.
    Semi-supervised clustering uses a small amount of supervised data to aid unsupervised learning. One typical approach specifies a limited number of must-link and cannot-link constraints between pairs of examples. This paper presents a pairwise constrained clustering framework and a new method for actively selecting informative pairwise constraints to get improved clustering performance. The clustering and active learning methods are both easily scalable to large datasets, and can handle very high dimensional data. Experimental and theoretical results confirm that this active querying of pairwise constraints significantly improves the accuracy of clustering when given a relatively small amount of supervision.
    ML ID: 141
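The active selection strategy begins by hunting for query points that are far apart and therefore likely to lie in different clusters. A sketch of that exploration phase, with a stand-in oracle for the human (the full method adds a consolidation phase):

```python
import numpy as np

def explore(X, k, oracle):
    neighborhoods = [[0]]                   # start from an arbitrary point
    while len(neighborhoods) < k:
        used = [i for n in neighborhoods for i in n]
        # farthest-first traversal: maximize distance to all chosen points
        d = np.min(np.linalg.norm(X[:, None] - X[used], axis=2), axis=1)
        d[used] = -1
        if d.max() < 0:                     # every point already queried
            break
        x = int(np.argmax(d))
        for n in neighborhoods:             # one query per neighborhood
            if oracle(x, n[0]) == "must-link":
                n.append(x)
                break
        else:                               # cannot-link with all existing
            neighborhoods.append([x])
    return neighborhoods

X = np.array([[0, 0], [0, 1], [9, 9], [9, 8]], dtype=float)
truth = [0, 0, 1, 1]
print(explore(X, 2, lambda a, b:
              "must-link" if truth[a] == truth[b] else "cannot-link"))
```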
  29. Semisupervised Clustering for Intelligent User Management
    [Details] [PDF]
    Sugato Basu, Mikhail Bilenko, and Raymond J. Mooney
    In Proceedings of the IBM Austin Center for Advanced Studies 5th Annual Austin CAS Conference, Austin, TX, February 2004.
    Grouping users automatically based on their system usage can be beneficial in an autonomic computing environment. Clustering algorithms can generate meaningful user groups that provide important insights to system administrators about user profiles and group policies. In particular, if a small amount of supervision is provided by the administrator to the clustering process, semi-supervised clustering algorithms can use this supervision to generate clusters which are more useful for user management. In this work, we demonstrate the utility of semi-supervised clustering in intelligent user management. We collect publicly available system usage data of users in a university computing environment, and cluster the users using semi-supervised hierarchical agglomerative clustering based on the profile of the processes they run. Initial supervision is provided in the form of a few users running a specific process. Semi-supervised clustering gives us more meaningful clusters than unsupervised clustering in this domain, demonstrating that our technique can find interesting and useful groups in data with minimal user intervention.
    ML ID: 135
  30. Semi-supervised Clustering: Learning with Limited User Feedback
    [Details] [PDF]
    Sugato Basu
    Technical Report, Cornell University, 2004.
    In many machine learning domains (e.g. text processing, bioinformatics), there is a large supply of unlabeled data but limited labeled data, which can be expensive to generate. Consequently, semi-supervised learning, learning from a combination of both labeled and unlabeled data, has become a topic of significant recent interest. In the proposed thesis, our research focus is on semi-supervised clustering, which uses a small amount of supervised data in the form of class labels or pairwise constraints on some examples to aid unsupervised clustering. Semi-supervised clustering can be either search-based, i.e., changes are made to the clustering objective to satisfy user-specified labels/constraints, or similarity-based, i.e., the clustering similarity metric is trained to satisfy the given labels/constraints. Our main goal in the proposed thesis is to study search-based semi-supervised clustering algorithms and apply them to different domains.
    In our initial work, we have shown how supervision can be provided to clustering in the form of labeled data points or pairwise constraints. We have also developed an active learning framework for selecting informative constraints in the pairwise constrained semi-supervised clustering model, and proposed a method for unifying search-based and similarity-based techniques in semi-supervised clustering.
    In this thesis, we want to study other aspects of semi-supervised clustering. Some of the issues we want to investigate include: (1) the effect of noisy, probabilistic or incomplete supervision in clustering; (2) model selection techniques for automatic selection of the number of clusters in semi-supervised clustering; (3) ensemble semi-supervised clustering. In our work so far, we have mainly focused on generative clustering models, e.g., KMeans and EM, and have run experiments on clustering low-dimensional UCI datasets or high-dimensional text datasets. In the future, we want to study the effect of semi-supervision on other clustering algorithms, especially in the discriminative clustering and online clustering framework. We also want to study the effectiveness of our semi-supervised clustering algorithms on other domains, e.g., web search engines (clustering of search results), astronomy (clustering of Mars spectral images) and bioinformatics (clustering of gene microarray data).
    ML ID: 134
  31. Learnable Similarity Functions and Their Applications to Record Linkage and Clustering
    [Details] [PDF]
    Mikhail Bilenko
    2003. Doctoral Dissertation Proposal, University of Texas at Austin.
    Many machine learning tasks require similarity functions that estimate likeness between observations. Similarity computations are particularly important for clustering and record linkage algorithms that depend on accurate estimates of the distance between datapoints. However, standard measures such as string edit distance and Euclidean distance often fail to capture an appropriate notion of similarity for a particular domain or dataset. This problem can be alleviated by employing learnable similarity functions that adapt using training data. In this proposal, we introduce two adaptive string similarity measures: (1) Learnable Edit Distance with Affine Gaps, and (2) Learnable Vector-Space Similarity Based on Pairwise Classification. These similarity functions can be trained using a corpus of labeled pairs of equivalent and non-equivalent strings. We illustrate the accuracy improvements obtained with these measures using MARLIN, our system for record linkage in databases that learns to combine adaptive and static string similarity functions in a two-level learning framework.
    Obtaining useful training examples for learnable similarity functions can be problematic due to scarcity of informative similar and dissimilar object pairs. We propose two strategies, Static-Active Selection and Weakly-Labeled Negatives, that facilitate efficient training data collection for record linkage. These strategies significantly outperform random selection on real datasets without the computational cost of traditional active learning methods. Additionally, we describe a method for combining seeding with Euclidean distance learning for semi-supervised k-means clustering. Experimental evaluation demonstrates that our method outperforms unsupervised clustering and semi-supervised clustering that employs seeding or metric learning separately.
    In future research, we intend to pursue several directions in developing accurate learnable similarity functions and applying them to record linkage and clustering problems. This work will involve improving the proposed string similarity functions as well as introducing several novel approaches to adaptive string distance computation. We also plan to extend our initial work on learnable similarity functions for clustering, particularly for high-dimensional data. Finally, we will investigate the utility of various active learning strategies for learning similarity functions, as well as extend the preliminary work on static-active selection of training pairs.
    ML ID: 133
  32. Comparing and Unifying Search-Based and Similarity-Based Approaches to Semi-Supervised Clustering
    [Details] [PDF]
    Sugato Basu, Mikhail Bilenko, and Raymond J. Mooney
    In Proceedings of the ICML-2003 Workshop on the Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, 42-49, Washington, DC, 2003.
    Semi-supervised clustering employs a small amount of labeled data to aid unsupervised learning. Previous work in the area has employed one of two approaches: 1) Search-based methods that utilize supervised data to guide the search for the best clustering, and 2) Similarity-based methods that use supervised data to adapt the underlying similarity metric used by the clustering algorithm. This paper presents a unified approach based on the K-Means clustering algorithm that incorporates both of these techniques. Experimental results demonstrate that the combined approach generally produces better clusters than either of the individual approaches.
    ML ID: 125
  33. Semi-supervised Clustering by Seeding
    [Details] [PDF]
    Sugato Basu, Arindam Banerjee, and Raymond J. Mooney
    In Proceedings of 19th International Conference on Machine Learning (ICML-2002), 19-26, 2002.
    Semi-supervised clustering uses a small amount of labeled data to aid and bias the clustering of unlabeled data. This paper explores the use of labeled data to generate initial seed clusters, as well as the use of constraints generated from labeled data to guide the clustering process. It introduces two semi-supervised variants of KMeans clustering that can be viewed as instances of the EM algorithm, where labeled data provides prior information about the conditional distributions of hidden category labels. Experimental results demonstrate the advantages of these methods over standard random seeding and COP-KMeans, a previously developed semi-supervised clustering algorithm.
    ML ID: 113
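The seeding idea reduces to a few lines: initialize the centroids from the labeled seeds, then run standard KMeans. A sketch assuming scikit-learn (the paper's constrained variant additionally keeps the seed labels fixed during clustering):

```python
import numpy as np
from sklearn.cluster import KMeans

def seeded_kmeans(X, seed_idx, seed_labels, n_clusters):
    # initial centroids: per-class means of the labeled seed examples
    seeds = np.array([X[np.array(seed_idx)[seed_labels == c]].mean(axis=0)
                      for c in range(n_clusters)])
    return KMeans(n_clusters=n_clusters, init=seeds, n_init=1).fit(X)

X = np.vstack([np.random.RandomState(0).randn(50, 2),
               np.random.RandomState(1).randn(50, 2) + 4])
km = seeded_kmeans(X, seed_idx=[0, 1, 50, 51],
                   seed_labels=np.array([0, 0, 1, 1]), n_clusters=2)
print(km.labels_[:5], km.labels_[50:55])
```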