The Data Mining Lab at UT Austin focuses on the analysis of very large data sets, especially those that arise in text mining and bioinformatics. The emphasis is on finding sound, theoretically motivated algorithms for the central tasks in data mining, such as high-dimensional clustering, classification and data visualization.
The current focus of the group is on uncovering the latent low-dimensional structure that is often inherent in high-dimensional data. In many important applications, such as text mining and face recognition, the data matrices that arise are sparse and non-negative. It is therefore natural to seek low-dimensional approximations that preserve these properties: sparsity in the approximation implies economy of representation, while non-negativity makes the result easier to interpret. Traditional methods such as SVD and PCA do not preserve these properties.
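The remark about SVD can be checked on a small example. The sketch below is only an illustration under stated assumptions: the 3x3 matrix is made up, and the helper names (`matvec`, `top_singular_triplet`, `truncated_svd`) are hypothetical. It computes a rank-2 truncated SVD by plain power iteration with deflation so that it needs nothing beyond the standard library; in practice a library routine such as `numpy.linalg.svd` would normally be used.

```python
from math import sqrt

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def top_singular_triplet(m, iters=500):
    """Leading (sigma, u, v) of m, found by power iteration on m^T m."""
    mt = transpose(m)
    v = [float(i + 1) for i in range(len(m[0]))]  # arbitrary start vector
    for _ in range(iters):
        w = matvec(mt, matvec(m, v))
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = matvec(m, v)
    sigma = sqrt(sum(x * x for x in mv))
    u = [x / sigma for x in mv]
    return sigma, u, v

def truncated_svd(m, rank):
    """Rank-`rank` approximation of m, built by repeated deflation."""
    residual = [row[:] for row in m]
    approx = [[0.0] * len(m[0]) for _ in m]
    for _ in range(rank):
        sigma, u, v = top_singular_triplet(residual)
        for i in range(len(m)):
            for j in range(len(m[0])):
                term = sigma * u[i] * v[j]
                approx[i][j] += term
                residual[i][j] -= term
    return approx

# Made-up sparse, non-negative matrix (two of its nine entries are zero).
A = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 1.0],
     [0.0, 1.0, 2.0]]

A2 = truncated_svd(A, rank=2)

# The corner entries that were exactly 0 in A come out at about -0.146 in A2,
# so the approximation is neither sparse nor non-negative.
for row in A2:
    print(["%.3f" % x for x in row])
```

For this matrix the two zero entries become small negative numbers in the rank-2 approximation, which is the loss of sparsity and non-negativity that the paragraph above refers to.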
With the above goals in mind, the lab has recently been exploring the application of information theory to data mining tasks. Information theory provides a natural way of dealing with non-negative data vectors by treating them as probability vectors. Problems such as clustering can then be posed as information-theoretic optimization problems, for example maximizing mutual information. Applied to text mining, this approach has been shown to reveal the semantic similarity of words, leading to a substantial reduction in classifier complexity and to increased accuracy in document classification when training data is sparse. Further directions currently being explored include: (a) information-theoretic clustering and approximation of higher-order non-negative tensors (which often arise in applications as multidimensional contingency tables), and (b) new algorithms for low-rank non-negative matrix factorization.
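To make the mutual-information view concrete, here is a minimal sketch, not the lab's algorithm. It assumes a tiny, made-up table of word counts per document class, normalizes it into a joint probability distribution, and compares how much mutual information between words and classes survives under two candidate word clusterings. The function names (`normalize`, `mutual_information`, `merge_rows`) and the counts are hypothetical.

```python
from math import log2

def normalize(counts):
    """Scale a non-negative count table so that its entries sum to 1."""
    total = sum(sum(row) for row in counts)
    return [[c / total for c in row] for row in counts]

def mutual_information(joint):
    """I(X;Y) in bits for a joint distribution given as a list of rows."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        p * log2(p / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0  # zero cells contribute nothing
    )

def merge_rows(joint, clusters):
    """Sum the rows in each cluster (each cluster is a list of row indices)."""
    n_cols = len(joint[0])
    return [
        [sum(joint[i][j] for i in members) for j in range(n_cols)]
        for members in clusters
    ]

# Made-up word-by-class counts (rows: words, columns: document classes).
counts = [
    [8, 0],  # word 0 appears only in class 0
    [4, 0],  # word 1 appears only in class 0
    [0, 6],  # word 2 appears only in class 1
    [0, 6],  # word 3 appears only in class 1
]

joint = normalize(counts)
full = mutual_information(joint)

# Grouping words that behave alike keeps the information about the class ...
good = mutual_information(merge_rows(joint, [[0, 1], [2, 3]]))
# ... while mixing words from different classes throws most of it away.
bad = mutual_information(merge_rows(joint, [[0, 2], [1, 3]]))

print(full, good, bad)
```

In this toy table the first clustering keeps all of the mutual information between words and classes, while the second keeps almost none. An information-theoretic clustering method would search for the grouping that loses the least.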
The Data Mining Lab has disseminated publications, software and results for document clustering, clustering of gene expression data in bioinformatics and multidimensional data visualization.