RISC: Repository of Information on Semi-supervised Clustering

In many learning tasks, unlabeled data is plentiful but labeled data is scarce, because labels can be expensive to generate. Semi-supervised learning combines labeled and unlabeled data during training to improve performance. It is applicable to both classification and clustering.

In supervised classification, there is a known, fixed set of categories and category-labeled training data is used to induce a classification function. In semi-supervised classification, training also exploits additional unlabeled data, frequently resulting in a more accurate classification function.

In unsupervised clustering, an unlabeled dataset is partitioned into groups of similar examples, typically by optimizing an objective function that characterizes good partitions. In semi-supervised clustering, some labeled data is used along with the unlabeled data to obtain a better clustering. These pages aim to provide links to datasets, software, and papers from the UT Machine Learning group that relate to the problem of semi-supervised clustering.
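As a rough illustration of how a few labeled examples can guide a clustering, here is a minimal sketch of a seeded variant of k-means. It is an assumption-laden toy, not necessarily the method used by any software listed in this repository: the function names (`seeded_kmeans`, `mean`) and the `seeds` mapping are hypothetical, labeled points initialize the centroids and keep their labels, and unlabeled points are assigned to the nearest centroid.

```python
# Minimal seeded k-means sketch (illustrative only; all names are hypothetical).
from math import dist


def mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def seeded_kmeans(points, seeds, n_iter=20):
    """Cluster `points`; `seeds` maps a point index to a known cluster label.

    Labeled points initialize the centroids and keep their labels throughout.
    Unlabeled points are assigned to the nearest centroid at each iteration.
    """
    labels = sorted(set(seeds.values()))
    # Initialize each centroid from the labeled examples of that cluster.
    centroids = {
        c: mean([points[i] for i, lab in seeds.items() if lab == c])
        for c in labels
    }
    assignment = []
    for _ in range(n_iter):
        new_assignment = [
            seeds[i] if i in seeds
            else min(labels, key=lambda c: dist(p, centroids[c]))
            for i, p in enumerate(points)
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing
        assignment = new_assignment
        # Recompute each centroid as the mean of its current members.
        # Every cluster has at least one seed, so no cluster is ever empty.
        for c in labels:
            members = [p for p, a in zip(points, assignment) if a == c]
            centroids[c] = mean(members)
    return assignment, centroids


if __name__ == "__main__":
    points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
              (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
    seeds = {0: "a", 3: "b"}  # only two of the six points are labeled
    assignment, centroids = seeded_kmeans(points, seeds)
    print(assignment)
```

In this sketch the labeled data enters in two places: it fixes the number and starting position of the clusters, and it constrains the labeled points to stay in their known clusters. Other semi-supervised clustering approaches use the supervision differently, for example as pairwise constraints or to learn a distance measure.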

The construction of this repository is an ongoing process. If you are aware of an entry that it should contain, please send email to Sugato Basu.

Thank you and please come again!

Suggestions, comments, and questions to: Sugato Basu (sugato@cs.utexas.edu)

Acknowledgment: This document was created based on the excellent home page of Misha Bilenko's RIDDLE repository.