Description:

Netflix has released movie review data with the goal of improving movie recommendations. They have offered a $1 million prize to the first team to sufficiently improve the recommendations. The problem is a large-scale regression problem: training data, consisting of user ID, date, movie ID, and rating, is given. The test data has user ID, date, and movie ID (but no rating). The goal is to predict the ratings on the test data such that the RMSE (root mean squared error) improves 10 percent over Netflix's current system. This project would explore running large-scale regression algorithms on this data, with the goal of achieving as low an RMSE as possible. We have also generated subsets of the Netflix data for smaller-scale experiments; this file has approximately 2 million reviews. Note that in order to do regression, one needs to create appropriate features from the given data.

Description:

Another possibility is to cluster or co-cluster the data to discover similar movies or users, and to exploit that information when doing predictions. For example, by treating the data as a bipartite graph, you can co-cluster the graph to identify groups of similar movies and similar users, and then apply local regression over this set of users, and finally predict ratings of users in that cluster using this local regression. Refer to project 3 for further details on graph clustering.

References:

The Netflix prize website.

Netflix subset data.

Description:

Non-negative matrix factorization attempts to approximately factorize a matrix into the product of two non-negative matrices, and it has found applications in several areas. Several variants are possible, depending on the chosen loss function. Recently, a connection has been made between non-negative matrix factorization with KL-divergence loss, and probabilistic latent semantic indexing. This project would explore this connection in detail. How do the algorithms for pLSI and non-negative matrix approximation compare, and what does this equivalence imply about the algorithms? One can use the journal/CS dept. data (see below) for this project.

References:

D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788-791, 1999. KL NNMA Code.

T. Hofmann. Probabilistic Latent Semantic Indexing. Proceedings of the 22nd International Conference on Research and Development in Information Retrieval (SIGIR'99), 1999. pLSI Code.

E. Gaussier and C. Goutte. Relation between PLSA and NMF and Implications. Proceedings of the 28th annual international ACM SIGIR conference, 2005.

Description:

Large graphs are increasingly common---social networks, web graphs, biological networks, pixel graphs from images, and others. One of the key tasks is to partition these networks into clusters of nodes. Applications include image segmentation, social network community analysis, and gene clustering. Popular methods for graph clustering involve minimizing various cut criteria, and there have been various methods. This project would explore finding interesting applications of large-scale graph clustering. In particular, power-law graphs have been of particular recent interest, and methods for clustering these graphs have been developed. Power-law distributions arise in many situations: web graphs, social networks, word frequencies in documents, and others. Graclus is a program that minimizes various graph cut objectives over large graphs. One option is to compare Graclus with existing work on clustering power-law graphs.

References:

I. Dhillon, Y. Guan, B. Kulis. Weighted Graph Cuts without Eigenvectors: A Multilevel Approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007. Code.

K. Lang. Fixing Two Weaknesses of the Spectral Method. Advances in Neural Information Processing Systems. vol. 18, 2005. SDPLR Software.

The Graph Partitioning Archive.

Description:

We have data of the following form: the rows represent computer science departments, the columns represent major conferences and journals throughout computer science, and an entry in the resulting matrix corresponds to the number of papers written by members of the corresponding department for the corresponding conference or journal. There are several interesting problems that arise from this data. How can this data be used to rank departments or conferences? Ideally, top-rated departments should heavily publish in top-rated journals or conferences, so the ratings are dependent on each other; one would need to define an appropriate ranking algorithm that exploits this observation. Other problems such as clustering may be explored with this data as well---for example, clustering and information-theoretic clustering. The data is amenable to be interpreted as a joint probability distribution, so information-theoretic approaches would be interesting to pursue. One can also treat this data as a tensor, since publication date information is available; in this case, the third dimension would be year of publication. This is an exploratory project and students are encouraged to find novel ways to slice, dice, and analyze the data.

References:

Journal data.

J. Kleinberg. Authoritative sources in a hyperlinked environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998.

I. S. Dhillon, S. Mallela, and D. S. Modha. Information-Theoretic Co-clustering. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining(KDD), 2003. Code.

Description:

This project will explore the quality of various matrix approximations as a function of the number of parameters of the approximation. For instance, the SVD can be used to create a rank-r approximation for an n x m input matrix. The number of parameters of this approximation is (m + n)r. Similarly, non-negative matrix approximation and co-clustering can be used to find low-parameter estimations of an input matrix. Thus, one can analyze the approximation error (using various measures) as a function of the number of parameters of the approximation, and compare various matrix approximation techniques over many types of matrices.

References:

I. S. Dhillon, S. Mallela, and D. S. Modha. Information-Theoretic Co-clustering. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining(KDD), 2003. Code.

A. Banerjee, I. S. Dhillon, J. Ghosh, S. Merugu, and D. S. Modha. A Generalized Maximum Entropy Approach to Bregman Co-Clustering and Matrix Approximations. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining(KDD), 2004.

Matrix Market.

Description:

Sparse PCA attempts to find principal components of data that are sparse. The resulting optimization problem is intractable, but relaxations involving l-1 norm penalties on the principal components empirically leads to sparse solutions (analogous to the behavior of lasso). A recent formulation poses the problem as a semi-definite programming problem via a convex relaxation. This project would explore these different approaches, comparing them in terms of quality and speed, and finding applications of sparse PCA. Both the journal/CS dept. data and the Netflix data may be used in this project.

References:

H. Zou, T. Hastie, and R. Tibshirani. Sparse Principal Component Analysis., JCGS, 2004.

A. d'Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 2006. Code.

Description:

Co-clustering simultaneously clusters the rows and columns of a matrix such that the row clusters are dependent on the column clusters, and vice-versa. For example, given a word-document matrix, a co-clustering finds clusters of words and documents. Recently, co-clustering has been applied to collaborative filtering (missing-value estimation) for recommender systems. This project would explore using co-clustering for missing value estimation---for example, the Netflix data could be used for this task.

References:

T. George and S. Merugu. A Scalable Collaborative Filtering Framework based on Co-clustering. Proceedings of the 5th IEEE Conference on Data Mining (ICDM), Nov 2005.

H. Cho, I. S. Dhillon, Y. Guan, and S. Sra. Minimum Sum-Squared Residue Co-Clustering of Gene Expression Data. Proceedings of the Fourth SIAM International Conference on Data Mining, pages 114-125, April 2004.

Description:

Maximum-margin factorization is a recent technique for collaborative filtering (missing-value estimation). This project would explore using maximum-margin factorization for this task. The Netflix data would be a possible data set for testing the algorithm---one may need to consider small subsets of this data. What other applications are possible for max-margin matrix factorization?

References:

J. Rennie and N. Srebro. Fast Maximum Margin Matrix Factorization for Collaborative Prediction. ICML, 2005.

N. Srebro, J. Rennie and T. Jaakkola. Maximum Margin Matrix Factorization. Advances in Neural Information Processing Systems (NIPS) 17, 2005. Code.