Class Projects

Here are a few ideas for class projects. Send me email if you are interested in one of the projects.

1. Analyzing Large Collections of Text Documents.

A large collection of text documents can be condensed to a matrix A , where each column represents a document and each row stands for a word. A_{ij} is nonzero if the word i occurs in document j. The total number of rows equals all the distinct words in all the documents (non-content bearing words such as "and", "the", "a" etc. are deleted), while the total number of columns in A equals the total number of documents. The matrix A will be sparse when the number of documents is large, since each document will typically contain only a small subset of the total number of words.

By using techniques in linear algebra, it is possible to "intelligently" process a large collection of documents. Two applications are

  1. Automatically create a conceptual hierarchy of the text documents, e.g., if there are documents about sports, politics and science, automatic techniques are able to reasonably and automatically recognize such partitions. Such clustering algorithms exploit statistical co-occurrences of words, and work surprisingly well. The ultimate goal would be to create a hierarchy like Yahoo!'s in an automated manner.
  2. Query retrieval that is better than keyword search. By exploiting statistical co-occurrences of words, linear algebra techniques can help in alleviating the problems of synonymy (many words for one concept) and polysemy (one word used in different contexts).
Some of the numerical techniques are the based on the SVD ( Latent Semantic Indexing ), and there are some newer and faster ones (ask me!).

Goal of Project: Understand the basic ideas, and implement a few in software. For the software implementations, you would need some experience in using tools such as Lex (for parsing) and implementing hash tables .

Additional Resources

  • Bibliography.
  • The SMART data sets.
  • The TREC home page.
  • Common resources .
  • 2. Computing the SVD by Sub-sampling.

    An important problem in many applications is to compute the leading singular values and singular vectors (SVD) of a matrix A . However, this computation can be too expensive when A is large. A way to approximate the SVD of A is to compute the SVD of an appropriately chosen submatrix of A . Recently, this idea has been applied to processing large collections of text documents (see Project 1 above). See some papers ( [1] & [2] ) by Ravi Kannan and colleagues.

    Goal of Project: Understand the sub-sampling strategy outlined in the above papers. Implement the strategy and apply it to the matrices that arise from handling large text collections.

    3. Spectral Graph Partitioning.

    A graph G = (V,E) is a set of vertices and edges. A classical hard (NP-complete) problem is to partition V into two parts such that the number of edges that cross the two partitions is minimum among all possible 2^|V| partitions.

    Many heuristics exist for approximating the optimal solution. One successful global heuristic partitions the graph based on eigenvectors of the associated adjacency or Laplacian matrix associated with the graph. Most large graphs that arise in applications lead to sparse matrices, so Lanczos based algorithms for computing eigenvectors are commonly used. There is a lot of existing literature on using eigenvectors for spectral graph partitioning. An overview of graph parititioning, and various algorithms is available in lectures notes ( [1] & [2] ) by Jim Demmel .

    Goal of Project: Use existing Lanczos based software for computing eigenvectors of both adjacency and Laplacian matrices. Compare with other graph partitioning algorithms. Some experience with Lanczos algorithms is essential.

    Additional Resources

  • METIS consists of multilevel graph paritioning algorithms. Papers and software is available.
  • Common resources .
  • 4. Visualizing High-Dimensional Data.

    Clustering and classification are two important problems in machine learning. Clustering is a method in unsupervised learning and is the grouping of similar objects into clusters. Classification is a technique in supervised learning, where given class labels for some data points, the problem is to deduce class labels for newly arriving data points. For example, given class labels for handwritten digit samples (0,1,2,...,9), recognize the digit given a digit sample.

    Many clustering and classification algorithms abound. On a particular data set, some algorithms perform better than others. This performance crucially depends on the data distributions, which can vastly vary depending on the application. Thus, to evaluate clusterings or to design a good classifier, a good understanding of the data is imperative.

    Visualization is an effective tool for this task. A popular scheme for understanding high dimensional data is to take projections onto 2-dimensional planes. The trick is to choose these two dimensional projections to alleviate the inevitable loss of information in projecting from high (10-1000) dimensions to just two dimensions. One scheme for cluster or class visualization is given here .

    Goal of Project: Understand existing projection schemes, such as, Kohonen's self-organizing maps, projection pursuit, techniques in support vector machines (SVMs), multidimensional scaling(MDS). Implement some of these schemes. Experience in graphics and visualization would be handy, but is not essential.

    Additional Resources

  • Bibliography.
  • 5. Visualizing Large, Sparse Matrices.

    Most matrices that arise in real-life applications are large and sparse. Efficient algorithms to process these matrices attempt to utilize their structure (position and pattern) of nonzero elements. Thus, it is helpful to ba able to visually see this structure. MATLAB has some basic commands, such as "spy" etc. Recently, new software called MatView has been developed (by Jim Kohl at Sandia National Labs) for this purpose, and is available here .

    Goal of Project: Experiment with various MATLAB and MatView techniques for visualizing large, sparse matrices. Use these tools to discover the structure of matrices arising from graph partioning (see Project 3 above), handling of large text collections (see Project 1 above), etc.

    Additional Resources

  • Matrix Market contains many sparse matrices from various applications.
  • Common resources .
  • Common Resources

  • Here is a very simple of representing sparse matrices. Please use this format for this course.
  • More general sparse matrix formats are given here.