A large collection of text documents can be condensed to a matrix *A*, where each column represents a document and each row stands for a word. The entry *A_{ij}* is nonzero if word *i* occurs in document *j*. The number of rows equals the number of distinct words across *all* the documents (non-content-bearing words such as "and", "the", "a", etc. are deleted), while the number of columns of *A* equals the total number of documents. The matrix *A* will be sparse when the number of documents is large, since each document typically contains only a small subset of all the words.
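As a concrete illustration, the sparse structure of *A* can be held in a dictionary of dictionaries; a minimal sketch, assuming whitespace tokenization and a tiny hand-picked stopword list:

```python
from collections import defaultdict

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "and", "the", "on"}

def term_document_matrix(docs):
    """Rows are words, columns are documents; stored sparsely as
    word -> {document index: count}."""
    A = defaultdict(dict)
    for j, text in enumerate(docs):
        for word in text.lower().split():
            if word not in STOPWORDS:
                A[word][j] = A[word].get(j, 0) + 1
    return dict(A)

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "a dog and a bone"]
A = term_document_matrix(docs)
# Row "cat" is nonzero only in documents 0 and 1.
```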

This project will involve the use of the SVD (Singular Value Decomposition) and a more recent decomposition, the concept decomposition, in order to do query retrieval that is better than keyword search. By exploiting statistical co-occurrences of words, these techniques can help alleviate the problems of synonymy (many words for one concept) and polysemy (one word used in different contexts).
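A minimal sketch of SVD-based retrieval (latent semantic indexing) on an invented toy matrix; the vocabulary, counts, and rank k = 2 are illustrative assumptions, not part of the project specification:

```python
import numpy as np

# Invented toy term-document matrix (rows = words, columns = documents).
words = ["car", "auto", "truck", "flower", "petal"]
A = np.array([
    [2., 0., 1., 0.],   # car
    [0., 2., 1., 0.],   # auto
    [1., 1., 0., 0.],   # truck
    [0., 0., 0., 2.],   # flower
    [0., 0., 0., 1.],   # petal
])

k = 2                                   # number of retained "concepts"
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T

def similarities(query_words):
    """Cosine similarity of each document to the query, in concept space."""
    q = np.array([1.0 if w in query_words else 0.0 for w in words])
    q_hat = (Uk.T @ q) / sk             # fold the query into concept space
    sims = Vk @ q_hat
    return sims / (np.linalg.norm(Vk, axis=1) * np.linalg.norm(q_hat) + 1e-12)

sims = similarities({"car"})
# Document 1 contains "auto" and "truck" but never "car"; co-occurrence
# still gives it high similarity, while the flower document scores near 0.
```

Keyword search would miss document 1 entirely; the rank-k projection merges the co-occurring "car"/"auto"/"truck" rows into one concept, which is exactly the synonymy effect described above.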

**Resources**

Vector-space models for text are frequently used; see, for example, the description of Project P1. Very large collections of text documents are becoming increasingly common, and distributed or parallel computing is essential to handle these large data sets.

This project would involve parallelizing the creation of a vector-space model. The input is a large set of text documents (more than can be handled on one processor), and the output is the term-document matrix (see Project P1). Data structures, such as distributed hash tables, will need to be carefully implemented for this purpose.
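A map-reduce-style sketch of the parallel build. A thread pool stands in here for genuinely distributed workers, and a single local dict stands in for the distributed hash table; both substitutions are simplifications for illustration:

```python
from collections import Counter
from multiprocessing.pool import ThreadPool

STOPWORDS = {"a", "and", "the"}

def local_counts(job):
    """Map step: each worker tokenizes and counts its own documents."""
    j, text = job
    counts = Counter(w for w in text.lower().split() if w not in STOPWORDS)
    return [(w, j, c) for w, c in counts.items()]

def build_matrix(docs, nworkers=4):
    # ThreadPool stands in for distributed workers; the real project would
    # partition the documents across machines (e.g. with MPI).
    with ThreadPool(nworkers) as pool:
        chunks = pool.map(local_counts, enumerate(docs))
    # Reduce step: merge (word, doc, count) triples into one sparse matrix.
    # At scale this merge would go through a distributed hash table keyed
    # on the word, so that each word's row lives on exactly one node.
    A = {}
    for chunk in chunks:
        for w, j, c in chunk:
            A.setdefault(w, {})[j] = c
    return A

A = build_matrix(["the cat sat", "a cat and a dog"])
```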

Given a large unstructured collection of text documents, an important problem is to cluster the documents into disjoint conceptual categories. An example of such a categorization is the Yahoo! hierarchy of documents, which is created manually. This project will explore automatic methods for clustering.
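One simple automatic method is spherical k-means, which clusters unit-normalized document vectors by cosine similarity; a sketch on toy data, where the counts and the choice k = 2 are invented for illustration:

```python
import numpy as np

def spherical_kmeans(X, k, iters=20, seed=0):
    """Cluster unit-normalized document vectors (rows of X) by cosine similarity."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmax(X @ centers.T, axis=1)     # assign to nearest concept
        for c in range(k):
            members = X[labels == c]
            if len(members):
                m = members.sum(axis=0)
                centers[c] = m / np.linalg.norm(m)    # re-normalize concept vector
    return labels

# Two obvious topics hidden in four toy documents (word counts invented).
X = np.array([[2., 1., 0., 0.],
              [1., 2., 0., 0.],
              [0., 0., 2., 1.],
              [0., 0., 1., 2.]])
labels = spherical_kmeans(X, 2)
# Documents 0-1 land in one cluster, documents 2-3 in the other.
```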

**Resources**

Take newsgroup text data, and do content analysis (LSI, clustering) and link analysis (hubs and authorities, PageRank) in order to summarize the contents of the newsgroup and identify the important postings and posters. See Projects P1 and P3.
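For the link-analysis half, PageRank can be computed by power iteration on the reply graph; a sketch on a tiny invented graph (the HITS hub/authority scores come from an analogous iteration with A Aᵀ and Aᵀ A):

```python
import numpy as np

def pagerank(adj, d=0.85, iters=50):
    """Power iteration for PageRank; adj[i][j] = 1 if posting i links to j."""
    n = len(adj)
    A = np.array(adj, dtype=float)
    out = A.sum(axis=1)
    out[out == 0] = 1.0                  # avoid division by zero for sinks
    P = A / out[:, None]                 # row-stochastic transition matrix
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (P.T @ r)
    return r / r.sum()

# Toy reply graph: postings 1 and 2 both link to posting 0,
# and posting 0 links back to posting 1.
adj = [[0, 1, 0],
       [1, 0, 0],
       [1, 0, 0]]
r = pagerank(adj)
# Posting 0 receives the most links, so it gets the highest rank.
```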

Automatically sort email into folders (by rules specified by the owner, by content, etc.). Use the metadata (email address, organization, etc.) to enhance the classifier. Create a link graph (who sends email to whom) to identify clusters or communities. Identify important emailers (hubs & authorities, PageRank).

Analyze traffic on the CS (www.cs.utexas.edu) domain. Come up with a page ranking scheme for each page on the CS domain based on the web traffic (it should be possible to get data logs for traffic on the CS machines).

**Resources**

Although you can search for information on the World-Wide Web using search engines such as AltaVista, Google, HotBot etc., often you may desire to create a website that contains exhaustive information about one particular topic. For example, Disney might want to create a website that contains educational and entertainment links for kids. Or as a researcher in data mining, I might be interested in maintaining a repository of documents containing papers and web pages just about data mining. For these purposes, only a small fraction of the web needs to be considered. Focused crawling combines data structures (databases?), efficient web crawling, and classification algorithms in order to enable efficient sifting of the entire Internet. Once you have the subset of the web you are interested in, you may want to organize it into categories as Yahoo! does, enable special-purpose querying, automatically scan the relevant websites to detect new material, etc.

**Resources**

A graph G = (V,E) is a set of vertices and edges. A classical hard (NP-complete) problem is to partition V into two parts of equal size such that the number of edges that cross between the two parts is minimum among the exponentially many possible balanced partitions. (Without the balance constraint, the minimum cut can be found in polynomial time; it is the size restriction that makes the problem hard.)

Many heuristics exist for approximating the optimal solution. One successful global heuristic partitions the graph based on eigenvectors of the adjacency or Laplacian matrix associated with the graph. Most large graphs that arise in applications lead to sparse matrices, so Lanczos-based algorithms for computing eigenvectors are commonly used. There is a large existing literature on using eigenvectors for spectral graph partitioning. An overview of graph partitioning and various algorithms is available in lecture notes ([1] & [2]) by Jim Demmel.

**Goal of Project:** Use existing Lanczos-based software for computing eigenvectors of both adjacency and Laplacian matrices. Compare with other graph partitioning algorithms. Some experience with Lanczos algorithms is essential.
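A sketch of spectral bisection by the sign of the Fiedler vector. `np.linalg.eigh` is used on a small dense example; the project itself would substitute a Lanczos solver (e.g. `scipy.sparse.linalg.eigsh`) for large sparse matrices:

```python
import numpy as np

def spectral_bisection(adj):
    """Partition vertices by the sign of the Fiedler vector: the eigenvector
    for the second-smallest eigenvalue of the Laplacian L = D - A."""
    A = np.array(adj, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    vals, vecs = np.linalg.eigh(L)       # eigenvalues in ascending order
    fiedler = vecs[:, 1]
    return fiedler < 0                   # boolean side assignment per vertex

# Two triangles joined by a single edge: the optimal cut removes that edge.
adj = [[0, 1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 1, 0, 0],
       [0, 0, 1, 0, 1, 1],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1, 0]]
part = spectral_bisection(adj)
# The two triangles end up on opposite sides of the partition.
```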

**Additional Resources**

Clustering and classification are two important problems in machine learning. Clustering is a method in *unsupervised learning*: the grouping of similar objects into clusters. Classification is a technique in *supervised learning*: given class labels for some data points, the problem is to deduce class labels for newly arriving data points. For example, given labeled samples of handwritten digits (0,1,2,...,9), recognize the digit in a new sample.

Clustering and classification algorithms abound. On a particular data set, some algorithms perform better than others, and this performance depends crucially on the data distribution, which can vary vastly depending on the application. Thus, to evaluate clusterings or to design a good classifier, a good understanding of the data is imperative.

Visualization is an effective tool for this task. A popular scheme for understanding high-dimensional data is to take projections onto 2-dimensional planes. The trick is to choose these two-dimensional projections so as to alleviate the inevitable loss of information in projecting from high (10-1000) dimensions down to just two. One scheme for cluster or class visualization is given here.
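One standard choice of projection plane is the span of the top two principal directions (PCA via the SVD), the plane that preserves the most variance; a sketch on synthetic data that happens to lie near a 2-D plane. Class-aware schemes like the one referenced choose the plane differently:

```python
import numpy as np

def project_2d(X):
    """Project mean-centered data onto its top two principal directions,
    i.e. the 2-D plane that preserves the most variance."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                 # n x 2 coordinates, ready to plot

# Synthetic 100-dimensional points lying (noisily) near a 2-D plane.
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 2))        # true 2-D coordinates
basis = rng.normal(size=(2, 100))        # embedding into 100 dimensions
X = latent @ basis + 0.01 * rng.normal(size=(50, 100))
Y = project_2d(X)
# Y recovers almost all of the variance of X in just two coordinates.
```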

**Additional Resources**

This project would produce a visualization tool in Java that enables visual browsing and navigation of a text data set. Ideas from Project P11 will be used to visualize text as high-dimensional data.

This project would use the SVD (and other techniques?) for the problem of face recognition: given a database of face images and a query face, which face in the database is most similar to the query?
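A sketch of the classical SVD approach ("eigenfaces"): project the centered face images onto the top-k left singular vectors and match by nearest neighbor in that low-dimensional space. The "faces" here are random synthetic vectors, not real images:

```python
import numpy as np

def eigenface_match(faces, query, k=4):
    """Nearest-neighbor matching in a k-dimensional 'eigenface' space.
    faces: n_pixels x n_faces matrix, one flattened image per column."""
    mean = faces.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(faces - mean, full_matrices=False)
    Uk = U[:, :k]                        # top-k eigenfaces
    coords = Uk.T @ (faces - mean)       # each database face as a k-vector
    q = Uk.T @ (query.reshape(-1, 1) - mean)
    return int(np.argmin(np.linalg.norm(coords - q, axis=0)))

# Synthetic 16-pixel "faces"; the query is face 2 plus a little noise.
rng = np.random.default_rng(1)
faces = rng.normal(size=(16, 5))
query = faces[:, 2] + 0.05 * rng.normal(size=16)
best = eigenface_match(faces, query)
```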

**Resources**