CS 395T - Large-scale Data Mining

Homework 4

Clustering

The main goal of this homework is to experiment with some clustering techniques.

Read this paper to understand the spherical k-means algorithm.
Download the spkmeans code and compile it.
Run spkmeans on the classic3 matrix you generated with MC (l=0.2, u=15, t=tfn). Use k=3.
Download this Java cluster browser program, which generates a sequence of web pages illustrating your clustering results (see a sample browser).
Download Metis and install it.
Here is a C program that converts the CCS matrix into the input format for Metis.
Run Metis on the same matrix used for spkmeans (note that this corresponds to a bipartite graph between words and documents).
Run Metis on the graph of documents, where an edge between two documents has weight equal to their cosine similarity.
Write a hierarchical agglomerative clustering program in C/C++/Java and then run it on the same matrix.
Download the spmeans code and compile it. Read its documentation.
Run spmeans on the classic3 matrix.
Choose 100 documents from each category (cisi, med, cran) of classic3, and generate a matrix with MC for these 300 documents.
Run the 5 clustering techniques on the matrix of 300 documents.
Run the 5 clustering techniques on cmu.news 20_cleaned.
Try clustering your email or any other text collection.
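
As a reading aid for the spkmeans steps above, here is a minimal sketch of the spherical k-means iteration as it is commonly described: documents are scaled to unit length, each cluster is summarized by a unit-length concept vector (the normalized sum of its members), and each document moves to the concept vector with which it has the largest cosine similarity. This is not the spkmeans code itself. The function names, the dense list-of-lists input, and seeding from an initial labeling (`init_labels`) are assumptions made for illustration only.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    """Inner product; for unit vectors this equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

def concept_vectors(docs, labels, k):
    """Concept vector of each cluster: the normalized sum of its member documents."""
    dim = len(docs[0])
    sums = [[0.0] * dim for _ in range(k)]
    for d, c in zip(docs, labels):
        for j, x in enumerate(d):
            sums[c][j] += x
    return [normalize(s) for s in sums]

def spherical_kmeans(docs, k, init_labels, max_iter=100):
    """Hypothetical helper: returns (labels, concepts, objective).

    The objective is the sum over documents of the cosine similarity
    to the concept vector of the cluster the document belongs to.
    """
    docs = [normalize(d) for d in docs]
    labels = list(init_labels)  # assumed starting partition
    for _ in range(max_iter):
        concepts = concept_vectors(docs, labels, k)
        # Reassign every document to its most similar concept vector.
        new_labels = [max(range(k), key=lambda c: dot(d, concepts[c])) for d in docs]
        if new_labels == labels:
            break  # partition is stable
        labels = new_labels
    # Recompute so the concept vectors always match the final labels.
    concepts = concept_vectors(docs, labels, k)
    objective = sum(dot(d, concepts[c]) for d, c in zip(docs, labels))
    return labels, concepts, objective

# Toy data: two documents dominated by word 0, two by word 1.
toy_docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels, toy_concepts, toy_objective = spherical_kmeans(toy_docs, 2, [0, 0, 1, 0])
print(toy_labels, round(toy_objective, 4))
```

Because every document has unit length, the objective is bounded above by the number of documents, which gives a rough scale for comparing runs on the same matrix.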

Answer the following questions:

    1. Report your clustering results using the 5 techniques on classic3, the 300-document set and cmu.news 20_cleaned. For each clustering, submit the confusion matrix and the objective function value (if available).
    2. What is the number of clusters output by spmeans for each of the data sets? Is it 3 for the classic3 data?
    3. How did the clustering programs perform on your email (or other text collection)?
    4. Are your clustering results good? If not, explain why.
    5. In your opinion, which of the clustering techniques is the best? Why?
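
For question 1, a confusion matrix can be tabulated along the following lines. This is a minimal sketch, assuming you have one true category label and one cluster id per document. The function name and the row/column orientation (rows are clusters, columns are true categories) are illustrative choices, not a required format.

```python
from collections import Counter

def confusion_matrix(true_labels, cluster_ids):
    """Count documents for each (cluster, true category) pair.

    Returns (classes, clusters, matrix), where matrix[i][j] is the number of
    documents placed in clusters[i] whose true category is classes[j].
    """
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    counts = Counter(zip(cluster_ids, true_labels))
    matrix = [[counts[(c, t)] for t in classes] for c in clusters]
    return classes, clusters, matrix

# Made-up example: six documents, three clusters.
example_true = ["cisi", "cisi", "med", "med", "cran", "cran"]
example_clusters = [0, 0, 1, 1, 2, 1]
classes, clusters, matrix = confusion_matrix(example_true, example_clusters)

print("cluster", *classes)
for c, row in zip(clusters, matrix):
    print(c, *row)
```

A clustering that matches the categories well has one dominant entry in each row and each column; mass spread across a row indicates a cluster that mixes categories.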

Due date: Oct. 30, 2001