A large collection of text documents can be condensed to a matrix
*A*, where each column represents a document and each row stands for
a word. *A_{ij}* is nonzero if word *i* occurs in document *j*. The
total number of rows equals the number of distinct words across
*all* the documents (non-content-bearing stop words such as "and",
"the", and "a" are deleted), while the total number of columns in
*A* equals the total number of documents. The matrix *A* will be
sparse when the number of documents is large, since each document
typically contains only a small subset of all the words.
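
As a concrete starting point, here is a minimal sketch of building such a matrix in Python with SciPy; the toy documents and stop-word list are made up for illustration, and neither the language nor the library is prescribed by the project:

```python
# A minimal sketch of building a sparse term-document matrix.
# The toy documents and stop-word list are made up for illustration.
from collections import Counter
from scipy.sparse import lil_matrix

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a dog and a cat",
]
stop_words = {"the", "a", "an", "and", "on"}

# Vocabulary: all distinct content-bearing words across all documents.
vocab = sorted({w for d in docs for w in d.split() if w not in stop_words})
row = {w: i for i, w in enumerate(vocab)}

# A_{ij} is nonzero iff word i occurs in document j (here: its count).
A = lil_matrix((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w, c in Counter(d.split()).items():
        if w not in stop_words:
            A[row[w], j] = c

print(vocab)          # the rows
print(A.toarray())    # dense view of the (small) sparse matrix
```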

By using techniques from linear algebra, it is possible to "intelligently" process a large collection of documents. Two applications are:

- Automatically create a conceptual hierarchy of the text documents.
For example, if there are documents about sports, politics, and
science, automatic techniques are able to recognize such partitions
reasonably well. Such clustering algorithms exploit statistical
co-occurrences of words, and work surprisingly well. The ultimate
goal would be to create a hierarchy like *Yahoo!*'s in an automated
manner.
- Query retrieval that is better than keyword search. By exploiting
statistical co-occurrences of words, linear algebra techniques can
help alleviate the problems of synonymy (many words for one concept)
and polysemy (one word used in different contexts); a sketch of this
idea appears after the list.
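
One standard technique of this kind is latent semantic indexing (LSI), which ranks documents against a query in a low-rank SVD subspace rather than by raw keyword overlap. Below is a minimal, self-contained sketch; the tiny term-document matrix, vocabulary, and rank *k* = 2 are made up for illustration:

```python
# Sketch of latent semantic indexing (LSI): rank documents against a
# query in a rank-k SVD subspace rather than by raw keyword overlap.
# The tiny term-document matrix and vocabulary are illustrative.
import numpy as np

vocab = ["cat", "chased", "dog", "mat", "sat"]
row = {w: i for i, w in enumerate(vocab)}
A = np.array([  # columns = documents, rows = words
    [1.0, 1.0, 1.0],   # cat
    [0.0, 1.0, 0.0],   # chased
    [0.0, 1.0, 1.0],   # dog
    [1.0, 0.0, 0.0],   # mat
    [1.0, 0.0, 0.0],   # sat
])

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk = U[:, :k], s[:k]
D = Vt[:k].T * sk                      # documents in the k-dim concept space

def query_scores(words):
    q = np.zeros(len(vocab))
    for w in words:
        if w in row:
            q[row[w]] = 1.0
    qk = (q @ Uk) / sk                 # project the query into concept space
    # cosine similarity between the query and each document
    return (D @ qk) / (np.linalg.norm(D, axis=1) * np.linalg.norm(qk) + 1e-12)

print(query_scores(["dog"]))           # the dog-containing documents score highest
```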

**Additional Resources**

An important problem in many applications is to compute the leading
singular values and singular vectors (the SVD) of a matrix *A*.
However, this computation can be too expensive when *A* is large. A
way to approximate the SVD of *A* is to compute the SVD of an
appropriately chosen submatrix of *A*. Recently, this idea has been
applied to processing large collections of text documents (see
Project 1 above). See some papers ([1] & [2]) by Ravi Kannan and
colleagues.
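
A minimal sketch of the submatrix idea follows, under simplified assumptions: columns of *A* are sampled with probability proportional to their squared norms and rescaled, in the style of the Frieze-Kannan-Vempala analysis, and the SVD of the small submatrix supplies approximate left singular vectors. The matrix sizes and sample count here are arbitrary:

```python
# Sketch: approximate the leading left singular vectors of a large A
# by the SVD of a sampled, rescaled column submatrix. Sampling with
# probability proportional to squared column norms follows the
# Frieze-Kannan-Vempala style of analysis; details are simplified.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 500))     # stand-in for a large matrix

c = 50                                   # number of sampled columns
p = (A ** 2).sum(axis=0)
p /= p.sum()                             # squared-column-norm probabilities
idx = rng.choice(A.shape[1], size=c, p=p)
C = A[:, idx] / np.sqrt(c * p[idx])      # rescale so C C^T estimates A A^T

U, s, _ = np.linalg.svd(C, full_matrices=False)
k = 10
Uk = U[:, :k]                            # approximate top-k left singular vectors

# Compare captured energy against the exact top-k singular values.
exact = np.linalg.svd(A, compute_uv=False)
print(np.linalg.norm(Uk.T @ A), np.linalg.norm(exact[:k]))
```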

A graph *G = (V, E)* is a set of vertices together with a set of edges. A classical hard (NP-complete) problem is to partition *V* into two parts such that the number of edges crossing between the two parts is minimum over all *2^{|V|}* possible partitions.

Many heuristics exist for approximating the optimal solution.
One successful global heuristic partitions the graph based on
eigenvectors of its associated *adjacency* or *Laplacian* matrix.
Most large graphs that arise in applications lead to sparse
matrices, so Lanczos-based algorithms for computing eigenvectors are
commonly used. There is a large existing literature on using
eigenvectors for spectral graph partitioning. An overview of graph
partitioning and various algorithms is available in lecture notes
([1] & [2]) by Jim Demmel.

**Goal of Project:** Use existing Lanczos-based software for
computing eigenvectors of both adjacency and Laplacian matrices.
Compare with other graph partitioning algorithms. Some experience
with Lanczos algorithms is essential.
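
As a starting point, here is a minimal sketch in Python; SciPy's `eigsh` wraps ARPACK's implicitly restarted Lanczos method. The random test graph and the median split are illustrative choices, not prescribed by the project:

```python
# Sketch of spectral bisection with Lanczos-based software: compute the
# eigenvector for the second-smallest eigenvalue (Fiedler vector) of the
# graph Laplacian via SciPy's eigsh (ARPACK, implicitly restarted
# Lanczos), then split the vertices. The random graph is illustrative.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

rng = np.random.default_rng(1)
n = 200
A = sp.random(n, n, density=0.02, random_state=1)
A = ((A + A.T) > 0).astype(float)     # symmetric 0/1 adjacency matrix
A.setdiag(0)

deg = np.asarray(A.sum(axis=1)).ravel()
L = sp.diags(deg) - A.tocsr()         # combinatorial Laplacian L = D - A

# Ask Lanczos for the two smallest eigenpairs of L; the eigenvector of
# the second-smallest eigenvalue is the Fiedler vector.
vals, vecs = eigsh(L, k=2, which="SM")
fiedler = vecs[:, np.argsort(vals)[1]]

part = fiedler >= np.median(fiedler)  # split vertices into two halves
Ad = A.toarray()
cut = Ad[np.ix_(part, ~part)].sum()   # edges crossing the partition
print(f"cut edges: {int(cut)}")
```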

**Additional Resources**

Clustering and classification are two important problems in
machine learning. Clustering is a method of *unsupervised
learning*: the grouping of similar objects into clusters.
Classification is a technique of *supervised learning*:
given class labels for some data points, the problem is to
deduce class labels for newly arriving data points. For example,
given labeled samples of handwritten digits (0, 1, 2, ..., 9),
recognize the digit in a new sample.
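
For concreteness, here is a short sketch of the supervised setting on exactly this digit task; scikit-learn's bundled 8x8 digit samples and a nearest-neighbor classifier are illustrative choices, not prescribed by the project:

```python
# Sketch of supervised classification: learn from labeled handwritten
# digits (0-9) and predict labels for unseen samples. scikit-learn and
# the nearest-neighbor classifier are illustrative choices.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)          # 8x8 digit images, flattened
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```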

Clustering and classification algorithms abound. On a particular data set, some algorithms perform better than others; this performance crucially depends on the data distribution, which can vary vastly across applications. Thus, to evaluate clusterings or to design a good classifier, a good understanding of the data is imperative.

Visualization is an effective tool for this task. A popular scheme for understanding high-dimensional data is to take projections onto 2-dimensional planes. The trick is to choose these two-dimensional projections so as to alleviate the inevitable loss of information in projecting from high (10-1000) dimensions down to just two. One scheme for cluster or class visualization is given here.
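
One common choice of projection plane is the span of the top two principal components. Here is a minimal sketch via the SVD; the synthetic 50-dimensional clustered data stands in for real data:

```python
# Sketch: project high-dimensional data onto the plane of its top two
# principal components (via the SVD) and scatter-plot the result.
# The synthetic 50-dimensional clustered data is made up for illustration.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
centers = rng.standard_normal((3, 50)) * 5
X = np.vstack([c + rng.standard_normal((100, 50)) for c in centers])
labels = np.repeat([0, 1, 2], 100)

Xc = X - X.mean(axis=0)                  # center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Xc @ Vt[:2].T                        # coordinates in the top-2 PC plane

plt.scatter(P[:, 0], P[:, 1], c=labels, s=10)
plt.xlabel("PC 1"); plt.ylabel("PC 2")
plt.show()
```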

**Additional Resources**

Most matrices that arise in real-life applications are large and sparse. Efficient algorithms for processing these matrices attempt to exploit the structure (position and pattern) of their nonzero elements. Thus, it is helpful to be able to see this structure visually. MATLAB has some basic commands for this, such as "spy". Recently, new software called MatView has been developed (by Jim Kohl at Oak Ridge National Laboratory) for this purpose, and is available here.
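
For a quick look without extra tools, matplotlib mirrors MATLAB's "spy" command; a minimal sketch follows, with a random sparse matrix standing in for a real application matrix:

```python
# Sketch: visualize the nonzero structure of a sparse matrix.
# matplotlib's spy mirrors MATLAB's "spy"; the random matrix is
# only a stand-in for a real application matrix.
import matplotlib.pyplot as plt
import scipy.sparse as sp

A = sp.random(300, 300, density=0.01, random_state=0)
plt.spy(A, markersize=1)
plt.title("nonzero structure of A")
plt.show()
```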

**Additional Resources**