CS 395T - Large-scale Data Mining

Homework 3

Clustering

    The main goal of this homework is to experiment with some clustering techniques.
    Before you start the experiments,
  1. Read this paper to understand the spherical k-means algorithm.
  2. Read this paper to understand the first variation technique.
  3. Read this paper to understand k-means using Kullback-Leibler divergences with prior and first variation.
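    As a reading aid, here is a minimal sketch of the basic spherical k-means loop: documents are normalized to unit length, each document is assigned to the centroid with the highest cosine similarity, and each centroid is recomputed as the normalized sum of its members. This is only an illustration and not the gmeans implementation; the function names, the `init` parameter, and the stopping rule are assumptions made for the example, and first variation and the KL-divergence variant are not shown.

```python
import math
import random

def normalize(v):
    """Scale v to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(u, v):
    """Inner product; equals cosine similarity when both vectors have unit length."""
    return sum(a * b for a, b in zip(u, v))

def spherical_kmeans(docs, k, init=None, iters=20, seed=0):
    """Cluster document vectors by cosine similarity.

    `init` (hypothetical parameter) is a list of k document indices used as
    starting centroids; if omitted, k documents are sampled at random.
    Returns (labels, centroids, objective), where objective is the sum of
    cosine similarities between each document and its assigned centroid.
    """
    rng = random.Random(seed)
    docs = [normalize(d) for d in docs]
    if init is None:
        init = rng.sample(range(len(docs)), k)
    centroids = [list(docs[i]) for i in init]
    labels = [-1] * len(docs)
    for _ in range(iters):
        # Assignment step: pick the most similar centroid for each document.
        new_labels = [max(range(k), key=lambda j: dot(d, centroids[j])) for d in docs]
        if new_labels == labels:
            break  # no document moved, so the partition is stable
        labels = new_labels
        # Update step: each centroid becomes the normalized sum of its members.
        for j in range(k):
            members = [d for d, l in zip(docs, labels) if l == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = normalize([sum(col) for col in zip(*members)])
    objective = sum(dot(d, centroids[l]) for d, l in zip(docs, labels))
    return labels, centroids, objective
```

    Calling `spherical_kmeans(vectors, k=3)` on a list of term-frequency vectors returns the cluster labels together with the objective value. Like any k-means variant, the result depends on the starting centroids, so several runs with different seeds may give different partitions.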
    After you've understood the papers,
  1. Download the MC program and compile it.
  2. Download the gmeans code and compile it. Read the README to understand all the options.
  3. Download this Java cluster browser program to generate web pages illustrating your clustering results (see a sample browser on clusters of 114,000 NSF award abstracts and a sample browser on clusters of UTCS professors' web pages).
  4. Implement the non-negative matrix factorization algorithm from Lee & Seung's paper, "Algorithms for non-negative matrix factorization". Use the factor matrices to cluster the documents (as described in class and in this paper).
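
    For orientation, the multiplicative update rules for the squared-error (Frobenius) objective from Lee & Seung's paper can be sketched as below, where V is approximated by W H. The sketch is an assumption-laden starting point, not a reference solution: the small constant `eps`, the random initialization, the fixed iteration count, and the rule of assigning each document to the factor with the largest coefficient in its column of H are all choices made for this example. The clustering rule described in class may differ.

```python
import random

def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (terms x documents) as V ~ W H.

    Uses multiplicative updates for the squared-error objective:
        H <- H * (W^T V) / (W^T W H)
        W <- W * (V H^T) / (W H H^T)
    eps is added to the denominators only to avoid division by zero.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h

def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - W H (the objective being minimized)."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))

def cluster_from_h(h):
    """Assumed rule: document j goes to the factor with the largest entry in column j of H."""
    return [max(range(len(h)), key=lambda a: h[a][j]) for j in range(len(h[0]))]
```

    A typical call would be `w, h = nmf(term_doc_matrix, k)` followed by `cluster_from_h(h)`. Pure-Python loops are slow on real collections; a sparse matrix library would be the practical choice for the data sets in this homework.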

    Answer the following questions:

    1. Report your clustering results on classic3, the 300-document set, and cmu.news 20_cleaned. For each clustering, submit the confusion matrix (or a summary for the cmu.news 20_cleaned data set) and the objective function value.
    2. Test the clustering programs on your email (or other text collection). How did the clustering programs perform?
    3. Are your clustering results good? If not, explain why.
    4. In your opinion, which of the clustering techniques is the best? Why?
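
    The confusion matrix requested in question 1 can be tabulated from the true class labels and the cluster assignments, for example as sketched below. The function names are hypothetical, and purity is shown only as one possible one-number summary for larger data sets; the homework does not prescribe a particular summary.

```python
from collections import Counter

def confusion_matrix(true_labels, cluster_ids):
    """Count documents per (cluster, true class) pair.

    Returns (classes, clusters, matrix), where matrix[r][c] is the number of
    documents in clusters[r] whose true class is classes[c].
    """
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    counts = Counter(zip(cluster_ids, true_labels))
    matrix = [[counts[(c, t)] for t in classes] for c in clusters]
    return classes, clusters, matrix

def purity(matrix):
    """Fraction of documents that belong to the majority class of their cluster."""
    total = sum(sum(row) for row in matrix)
    return sum(max(row) for row in matrix) / total

# Toy example with made-up labels (not results from any data set).
true_labels = ["A", "A", "B", "B", "C", "C"]
cluster_ids = [0, 0, 1, 1, 2, 1]
classes, clusters, matrix = confusion_matrix(true_labels, cluster_ids)
for cluster, row in zip(clusters, matrix):
    print(cluster, dict(zip(classes, row)))
```

    A clustering that matches the true classes yields one dominant entry in each row and each column; large off-diagonal counts show which classes are being merged or split.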

Due date: Nov. 17, 2003