Overview
The Data Mining Lab (DML) is led by
Prof. Inderjit Dhillon.
It is closely affiliated with the
Machine Learning Research Group (MLRG)
(led by Prof. Mooney) and the
Intelligent Data Exploration and Analysis Laboratory (IDEAL)
(led by Prof. Ghosh of
ECE).
For applications in bioinformatics, the group collaborates closely with
Prof. Marcotte,
who is a faculty member in the
Chemistry/Biochemistry department and the
Center for Computational Biology and Bioinformatics (CCBB).
The Data Mining Lab at UT Austin is focused on the
analysis of very large data sets, especially those that arise in
the application areas of text mining and bioinformatics.
The emphasis is on finding sound, theoretically motivated
algorithms for the central tasks in data mining, such as high-dimensional
clustering, classification, and data visualization.
The current focus of the group is on uncovering the latent
low-dimensional structure that is often inherent in high-dimensional
data. In many important applications, such as text mining
and face recognition, the data matrices that arise are
sparse and non-negative. Thus it is natural to seek
low-dimensional approximations that preserve these properties -- sparsity
in approximations implies economy in representation while
non-negativity enhances interpretation (note that traditional
methods such as SVD and PCA do not preserve these properties).
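The remark about SVD can be checked on a small example. The sketch below uses NumPy and a hypothetical toy term-document count matrix (the matrix and the helper name truncated_svd are illustrative assumptions, not data or code from the lab). It computes a rank-2 truncated SVD and reports how many entries are nonzero before and after, and whether the second singular vectors contain negative values.

```python
import numpy as np

def truncated_svd(A, k):
    """Return the rank-k SVD factors (U_k, s_k, Vt_k) of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# Hypothetical toy term-document count matrix: sparse and non-negative.
# Rows are terms, columns are documents.
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 3],
    [1, 0, 0, 1],
], dtype=float)

U_k, s_k, Vt_k = truncated_svd(A, k=2)
A_k = U_k @ np.diag(s_k) @ Vt_k  # rank-2 approximation of A

# Sparsity: count entries that are not (numerically) zero.
print("nonzeros in A:  ", np.count_nonzero(A))
print("nonzeros in A_k:", np.count_nonzero(np.abs(A_k) > 1e-9))

# Non-negativity: the second singular vectors are orthogonal to the
# first (single-signed) ones, so they must mix positive and negative entries.
print("second left singular vector: ", np.round(U_k[:, 1], 3))
print("second right singular vector:", np.round(Vt_k[1], 3))
```

The mixed signs in the second singular vectors are what make SVD factors hard to read as "parts" of a document or word, and the filled-in zeros show why the approximation costs more to store than the original sparse matrix.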
With the above goals in mind, the lab has recently been exploring
the application of information theory to data mining tasks.
Information theory provides a natural way of dealing with
non-negative data vectors by treating them as probability vectors.
Problems such as clustering can then be posed as optimization
problems in information theory, such as maximizing mutual
information. As an application to text mining, such an approach has
been shown to reveal the semantic similarity of words, thus
leading to a substantial reduction in classifier complexity and
increased accuracy in document classification when training data is sparse.
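To make the mutual-information view concrete, here is a minimal sketch in NumPy. It normalizes a word-by-class count table into a joint distribution, computes the mutual information, and compares how much of it survives when words are merged into clusters. The counts, the cluster assignments, and the function names mutual_information and merge_rows are all hypothetical, chosen only for illustration; this is not the lab's algorithm, which would also have to search for the clustering.

```python
import numpy as np

def mutual_information(counts):
    """Mutual information I(X;Y) in bits of a non-negative contingency table."""
    p = counts / counts.sum()              # joint distribution p(x, y)
    px = p.sum(axis=1, keepdims=True)      # marginal p(x), column vector
    py = p.sum(axis=0, keepdims=True)      # marginal p(y), row vector
    mask = p > 0                           # 0 * log 0 is taken as 0
    return float(np.sum(p[mask] * np.log2(p[mask] / (px @ py)[mask])))

def merge_rows(counts, labels):
    """Sum the rows (words) that share the same cluster label."""
    labels = np.asarray(labels)
    merged = np.zeros((labels.max() + 1, counts.shape[1]))
    np.add.at(merged, labels, counts)      # unbuffered row-wise accumulation
    return merged

# Hypothetical word-by-class counts: 4 words (rows), 2 document classes (columns).
counts = np.array([
    [8, 1],
    [6, 2],
    [1, 7],
    [2, 9],
], dtype=float)

mi_full = mutual_information(counts)
mi_good = mutual_information(merge_rows(counts, [0, 0, 1, 1]))  # similar words together
mi_bad = mutual_information(merge_rows(counts, [0, 1, 0, 1]))   # dissimilar words together

print(f"I(words; classes)            = {mi_full:.4f} bits")
print(f"after grouping similar words = {mi_good:.4f} bits")
print(f"after grouping unlike words  = {mi_bad:.4f} bits")
```

Merging rows can never increase mutual information, so a good word clustering is one that loses as little of it as possible. Words whose class distributions are similar can be merged at little cost, which is the sense in which the clustering reflects semantic similarity.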
Further directions currently being explored include:
(a) information-theoretic clustering and approximation of higher-order
non-negative tensors (which often arise in applications
as multidimensional contingency tables), and
(b) new algorithms for low-rank non-negative matrix factorization.
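For readers unfamiliar with item (b), the sketch below shows what a low-rank non-negative matrix factorization looks like in practice. It uses the widely known multiplicative-update rule for the Frobenius-norm objective as a familiar baseline; it is not one of the lab's new algorithms. The function name nmf_multiplicative, its parameters, and the toy matrix are assumptions made for illustration.

```python
import numpy as np

def nmf_multiplicative(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Approximate non-negative V (n x m) as W @ H with W, H >= 0.

    Uses multiplicative updates for the squared Frobenius error.
    eps guards against division by zero.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps   # non-negative random initialization
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so signs never flip.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical non-negative data matrix with two visible blocks.
V = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 3, 6],
    [0, 0, 6, 3],
], dtype=float)

W0, H0 = nmf_multiplicative(V, rank=2, n_iter=0)   # initialization only
W, H = nmf_multiplicative(V, rank=2, n_iter=200)   # same seed, after updates

print("initial error:", np.linalg.norm(V - W0 @ H0))
print("final error:  ", np.linalg.norm(V - W @ H))
print("all factors non-negative:", bool((W >= 0).all() and (H >= 0).all()))
```

Because every factor entry stays non-negative, each column of W can be read as an additive "part" of the data, in contrast to the mixed-sign vectors produced by SVD.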
The Data Mining Lab has disseminated
publications,
software and results
for document clustering, clustering of gene expression data in bioinformatics
and multidimensional data visualization.