Project Suggestions and References

1 The Netflix Prize---Regression

Netflix has released movie review data with the goal of improving movie recommendations. They have offered a $1 million prize to the first team to sufficiently improve the recommendations. The problem is a large-scale regression problem: training data, consisting of user ID, date, movie ID, and rating, is given. The test data has user ID, date, and movie ID (but no rating). The goal is to predict the ratings on the test data such that the RMSE (root mean squared error) improves 10 percent over Netflix's current system. This project would explore running large-scale regression algorithms on this data, with the goal of achieving as low an RMSE as possible. Note that in order to do regression, one might need to create appropriate features from the given data. One concrete goal of the project would be to test out the methods in the following papers to verify the RMSE claims on th Netflix data.

The Netflix prize website.
R. Bell and Y. Koren. Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights.ICDM, 2007.
R. Bell, Y. Koren, and C. Volinsky. The BellKor solution to the Netflix Prize. 2007.
R. Bell, Y. Koren, and C. Volinsky. Modeling relationships at multiple scales to improve accuracy of large recommender systems.KDD, 2007.
R. Bell, and Y. Koren. Improved Neighborhood-based Collaborative Filtering. KDD Cup and Workshop, 2007.

2 The Netflix Prize---Co-clustering and Local Regression

Co-clustering simultaneously clusters the rows and columns of a matrix such that the row clusters are dependent on the column clusters, and vice-versa. For example, given a word-document matrix, a co-clustering finds clusters of words and documents. Recently, co-clustering has been applied to collaborative filtering (missing-value estimation) for recommender systems. This project would explore using co-clustering for missing value estimation---for example, the Netflix data could be used for this task. One possibility is to co-cluster the data to discover similar movies or users, and to exploit that information when doing predictions. For example, by treating the data as a bipartite graph, you can co-cluster the graph to identify groups of similar movies and similar users, and then apply local regression over this set of users, and finally predict ratings of users in that cluster using this local regression. Refer to project 3 for further details on graph clustering.

The Netflix prize website.
T. George and S. Merugu. A Scalable Collaborative Filtering Framework based on Co-clustering. Proceedings of the 5th IEEE Conference on Data Mining (ICDM), Nov 2005.
H. Cho, I. S. Dhillon, Y. Guan, and S. Sra. Minimum Sum-Squared Residue Co-Clustering of Gene Expression Data. Proceedings of the Fourth SIAM International Conference on Data Mining, pages 114-125, April 2004.

3. Applications of graph clustering

Large graphs are increasingly common---social networks, web graphs, biological networks, pixel graphs from images, and others. One of the key tasks is to partition these networks into clusters of nodes. Applications include image segmentation, social network community analysis, and gene clustering. Popular methods for graph clustering involve minimizing various cut criteria, and there have been various methods. This project would explore finding interesting applications of large-scale graph clustering. In particular, power-law graphs have been of particular recent interest, and methods for clustering these graphs have been developed. Power-law distributions arise in many situations: web graphs, social networks, word frequencies in documents, and others. Graclus is a program that minimizes various graph cut objectives over large graphs. One option is to compare Graclus with existing work on clustering power-law graphs. Another possibiity is to explore a multilevel graph clustering algorithm to solve semi-supervised clustering problem.

I. Dhillon, Y. Guan, B. Kulis. Weighted Graph Cuts without Eigenvectors: A Multilevel Approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007. Code.
K. Lang. Fixing Two Weaknesses of the Spectral Method. Advances in Neural Information Processing Systems. vol. 18, 2005. SDPLR Software.
The Graph Partitioning Archive.
A. Abou-Rjeili, and G. Karypis. Multilevel Algorithms for Partitioning Power-Law Graphs. IPDPS, 2006.
B. Kulis, S. Basu, I. Dhillon, R. Mooney. Semi-supervised Graph Clustering: A Kernel Approach. Proc. 22nd International Conference on Machine Learning, 2005.

4. Dimensionality Reduction of Massive Graphs using Graph Clustering

Graph clustering using normalized cuts is strongly related to eigenvectors computation. As show by Dhillon et al., graph clustering can be performed without eigenvalue decomposition using multilevel clustering approach. This project aims to go one step further by exploring approaches to compute eigenvectors itself using multilevel clustering based method proposed by Dhillon et al.

I. Dhillon, Y. Guan, B. Kulis. Weighted Graph Cuts without Eigenvectors: A Multilevel Approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007. Code.
G. Gloub, and C. F. Van Loan. Matrix Computations. John Hopkins University Press, 1989.

5. Matrix Approximation Comparisons

This project will explore the quality of various matrix approximations as a function of the number of parameters of the approximation. For instance, the SVD can be used to create a rank-r approximation for an n x m input matrix. The number of parameters of this approximation is (m + n)r. Similarly, non-negative matrix approximation and co-clustering can be used to find low-parameter estimations of an input matrix. Thus, one can analyze the approximation error (using various measures) as a function of the number of parameters of the approximation, and compare various matrix approximation techniques over many types of matrices.

I. S. Dhillon, S. Mallela, and D. S. Modha. Information-Theoretic Co-clustering. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining(KDD), 2003. Code.
A. Banerjee, I. S. Dhillon, J. Ghosh, S. Merugu, and D. S. Modha. A Generalized Maximum Entropy Approach to Bregman Co-Clustering and Matrix Approximations. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining(KDD), 2004.
D. Kim, S. Sra, and I. S. Dhillon. Fast Newton-type Methods for the Least Squares Nonnegative Matrix Approximation Problem. SDM, 2007.
R. Bro and S. Jong. A fast non-negativity-constrained least squares algorithm. Journal of Chemometrics, 1999. Code.
Matrix Market.

6. Sparse PCA: Trace norm and LogDet Heuristic

Sparse PCA attempts to find principal components of data that are sparse. The resulting optimization problem is intractable, but relaxations involving L-1 norm penalties on the principal components empirically lead to sparse solutions (analogous to the behavior of lasso). A recent formulation poses the problem as a semi-definite programming problem via a convex relaxation. Similar relaxations can be obtained using trace-norm and log-det heuristics. This project would explore these different approaches, comparing them in terms of quality and speed, and finding applications of sparse PCA.

H. Zou, T. Hastie, and R. Tibshirani. Sparse Principal Component Analysis., JCGS, 2004.
A. d'Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 2006. Code.
M. Fazel, H. Hindi, and S. Boyd. Log-det Heuristic for Matrix Rank Minimization with Applications to Hankel and Euclidean Distance Matrices. Proc. American Control Conference, 2003.
M. Fazel, H. Hindi, and S. Boyd. A Rank Minimization Heuristic with Application to Minimum Order System Approximation. Proc. American Control Conference, Arlington, Virginia, June 2001.

7. Manifold Learning

Manifold learning involves finding a low-dimensional manifold from which the given dataset is sampled. Recently various approaches have been proposed for manifold learning, e.g. ISOMAP, LLE, Semi-definite Embedding, Laplacian Eigenmaps, Hessian Eigenmaps. Compare these methods both theoretically and empirically. For empirical evaluation try to experiment with your own real-world datasets, preferably from your research work.

J. Tenenbaum, V. De Silva, and J. C. Langford. A global geometric framework for non-linear dimensionality reduction. Science, 2000.
S. T. Roweis, and L. K. Saul. Nonlinear dimensionality reduction by Locally Linear Embedding. Science, 2000.
K. Q. Weinberger, F. Sha, and L. K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. ICML, 2004.
M. Belkin, and P. Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation, 2003.
D. Donoho, and C. Grimes. Hessian Eigenmaps: New Locally Linear Embedding Techniques For High-Dimensional Data. Proc. of National Academy of Sciences, 2003.

8. Boosting

Given a classifier, boosting can be use to find a better classifier with guaranteed generalization error. There exist various boosting methods like Ada-boost, BrownBoost, LPBoost, LogitBoost etc. Compare these methods both theoeretically and empirically.

Y. Freund, and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 1997.
Yoav Freund. An adaptive version of the boost by majority algorithm. Machine Learning, 43(3):293--318, June 2001.
A. Demiriz, K.P. Bennett, and J. Shawe-Taylor. Linear Programming Boosting via Column Generation. Machine Learning, 2002.
J. Friedman, T. Hastie, and R. Tibshirani.Additive Logistic Regression: a Statistical View of Boosting. The Annals of Statistics, 2000.

9. Mining Multi-Relational Data
Consider a set of nodes V={v_1, v_2, ..., v_n}. Now, we can define various graphs over this set of nodes by defining a distinct set of edges for each graph. This project would explore approaches to model information provided by each of the graph and combine them to infer some useful properties about the set of nodes V. This project has many real-life applications. For instance, let nodes V be a set of authors. Now, authors can be associated to each other by the area in which they publish, their university or their department. Thus, for same set of nodes multiple graphs and associations are defined. Now challenge is to infer useful information about authors using the available graphs. For example, cluster authors on the basis of their area, while taking into account the university and department information.