CS 395T Large-Scale Data Mining
Homework 4
Clustering
The main goal of this homework is to experiment with
some clustering techniques.

Read this paper to understand the spherical k-means algorithm.
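
As a rough orientation before reading the paper, the idea of spherical k-means can be sketched as follows: documents are normalized to unit length, each document is assigned to the centroid with the highest cosine similarity, and each centroid is the renormalized sum of its members. This is only a minimal sketch and not the spkmeans code; the function names, the random initialization from k documents, and the fixed iteration cap are assumptions made for illustration.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    """Inner product of two dense vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))

def spherical_kmeans(docs, k, iters=20, seed=0):
    """Cluster document vectors by cosine similarity.

    Returns (assignments, centroids, objective), where the objective is the
    sum of cosine similarities between each document and its centroid.
    """
    rng = random.Random(seed)
    docs = [normalize(d) for d in docs]
    # Assumption: centroids start as k randomly chosen documents.
    centroids = [list(d) for d in rng.sample(docs, k)]
    assign = [0] * len(docs)
    for _ in range(iters):
        # Assignment step: pick the centroid with the highest cosine similarity.
        new_assign = [max(range(k), key=lambda j: dot(d, centroids[j]))
                      for d in docs]
        # Update step: sum the members of each cluster, then renormalize.
        for j in range(k):
            members = [d for d, a in zip(docs, new_assign) if a == j]
            if members:
                centroids[j] = normalize([sum(col) for col in zip(*members)])
        if new_assign == assign:
            break
        assign = new_assign
    objective = sum(dot(d, centroids[a]) for d, a in zip(docs, assign))
    return assign, centroids, objective
```

Because every vector has unit length, the dot product in the assignment step is exactly the cosine similarity, and the objective can never exceed the number of documents.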

Download the spkmeans code and compile it.

Run spkmeans on the classic3 matrix you generated with MC
(l=0.2, u=15, t=tfn). Use k=3.

Download this Java cluster browser program, which generates a sequence of
web pages illustrating your clustering results. (See a sample browser.)

Download Metis and install it.

Here is a C program that converts the CCS (compressed column storage)
matrix into the input format for Metis.

Run Metis on the same matrix used for spkmeans (note that
this corresponds to a bipartite graph between words and documents).
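
To make the bipartite view concrete, here is a minimal sketch of how a CCS matrix maps to a word-document graph: every nonzero entry becomes one weighted edge between a word vertex and a document vertex. It is an illustration only and does not reproduce the provided C program or the Metis file format. The function name, the 0-based indexing, and the assumption that rows are words and columns are documents are all hypothetical.

```python
def ccs_to_bipartite(n_rows, col_ptr, row_idx, values):
    """Build adjacency lists for the word-document bipartite graph.

    Assumes rows are words and columns are documents, with 0-based indices.
    Word i becomes vertex i; document j becomes vertex n_rows + j.
    Each adjacency entry is a (neighbor_vertex, weight) pair.
    """
    n_cols = len(col_ptr) - 1
    adj = [[] for _ in range(n_rows + n_cols)]
    for j in range(n_cols):
        doc_vertex = n_rows + j
        # Entries col_ptr[j] .. col_ptr[j+1]-1 hold the nonzeros of column j.
        for p in range(col_ptr[j], col_ptr[j + 1]):
            word_vertex = row_idx[p]
            weight = values[p]
            # Store the edge in both directions (undirected graph).
            adj[word_vertex].append((doc_vertex, weight))
            adj[doc_vertex].append((word_vertex, weight))
    return adj
```

No edge ever connects two words or two documents, which is what makes the graph bipartite.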

Run Metis on the graph of documents, where an edge between two documents
has weight equal to their cosine similarity.
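
The document graph can be sketched as below: compute the cosine similarity for every pair of documents and keep an edge wherever it is positive. This is a minimal sketch under assumptions: documents are represented as hypothetical {term: weight} dictionaries, the function names and the `threshold` parameter are made up for illustration, and the all-pairs loop is quadratic in the number of documents. Check the Metis manual for the edge weight type it expects; if it requires integers, the similarities would need to be scaled and rounded first.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    num = sum(w * b.get(t, 0.0) for t, w in a.items())
    den = (math.sqrt(sum(w * w for w in a.values()))
           * math.sqrt(sum(w * w for w in b.values())))
    return num / den if den > 0 else 0.0

def document_graph(docs, threshold=0.0):
    """Return weighted edges {(i, j): similarity} for i < j with similarity > threshold."""
    edges = {}
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            s = cosine(docs[i], docs[j])
            if s > threshold:
                edges[(i, j)] = s
    return edges
```

Documents that share no terms have similarity zero and therefore get no edge, so the graph is usually much sparser than the full set of document pairs.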

Write a hierarchical agglomerative clustering program in C/C++/Java and
then run it on the same matrix.
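
For orientation, the agglomerative procedure can be sketched as follows: start with every item in its own cluster and repeatedly merge the two most similar clusters until k remain. The assignment does not prescribe a linkage criterion, so the average-link rule used here is an assumption, as are the function name and the dense similarity-matrix input. The sketch is deliberately naive and far too slow for the full classic3 matrix; it is in Python for illustration only, whereas your program should be written in C, C++, or Java.

```python
def hac(sim, k):
    """Average-link agglomerative clustering on a dense similarity matrix.

    sim[i][j] is the similarity between items i and j. Merging stops when
    k clusters remain. Returns a list of clusters (lists of item indices).
    """
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise similarity between the two clusters.
                total = sum(sim[i][j] for i in clusters[a] for j in clusters[b])
                avg = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or avg > best[0]:
                    best = (avg, a, b)
        _, a, b = best
        # Merge cluster b into cluster a and drop b.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Replacing the average with a maximum or a minimum over the pairwise similarities would give single-link or complete-link clustering, respectively.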

Download the spmeans code and compile it. Read its documentation.

Run spmeans on the classic3 matrix.

Choose 100 documents from each category (cisi, med, cran) of classic3, and
generate a matrix with MC for these 300 documents.

Run the 5 clustering techniques on the matrix of 300 documents.

Run the 5 clustering techniques on cmu.news 20_cleaned.

Try clustering your email or any other text collection.
Answer the following questions:
1. Report your clustering results using the 5 techniques
on classic3, the 300-document set, and cmu.news 20_cleaned.
For each clustering, submit the confusion matrix and the objective function
value (if available).
2. What is the number of clusters output by spmeans for each of the
data sets? Is it 3 for the classic3 data?
3. How did the clustering programs perform on your email (or other
text collection)?
4. Are your clustering results good? If not, explain why.
5. In your opinion, which of the clustering techniques is the best? Why?
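
For the confusion matrices requested in question 1, one possible way to tabulate them is sketched below: count how many documents of each true category fall into each cluster. This is a minimal sketch; the function name and the input layout (two parallel lists of category labels and cluster ids) are assumptions, and the output files of the clustering programs would have to be parsed into that form first.

```python
from collections import Counter

def confusion_matrix(true_labels, cluster_ids):
    """Count documents per (cluster, true category) pair.

    Returns (clusters, categories, matrix), where matrix[r][c] is the number
    of documents in clusters[r] whose true category is categories[c].
    """
    counts = Counter(zip(cluster_ids, true_labels))
    clusters = sorted(set(cluster_ids))
    categories = sorted(set(true_labels))
    matrix = [[counts[(c, t)] for t in categories] for c in clusters]
    return clusters, categories, matrix
```

A clustering that matches the categories well has one dominant entry in each row and each column; cluster numbering is arbitrary, so the dominant entries need not lie on the diagonal.
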
Due date: Oct. 30, 2001