- Read this
paper to understand the
*spherical k-means*algorithm. - Read this paper to understand the first variation technique.
- Read this paper to understand k-means using Kullback-Leibler divergences with prior and first variation.

- Download the
*MC*program and compile it.- Run MC on the Classic3 data set
(/stage/projects1/inderjit/cs395t_lsdm/Data/classic3/)
with options: l=0.2, u=15, t=tfn (make sure you understand these options).
MC will generate a sparse matrix in CCS
format, which is stored in several files. All the files should share the
same prefix, classic3, in their
names.

- Choose 100 documents from each category (
*cisi, med, cran*) of*classic3,*and generate a matrix with*MC*for these 300 documents, with options: l=0.2, u=15, t=tfn. - Run MC on the
*cmu.news 20_cleaned*data set (/stage/projects1/inderjit/cs395t_lsdm/Data/cmu.news20_cleaned/) with the same options.

- Run MC on the Classic3 data set
(/stage/projects1/inderjit/cs395t_lsdm/Data/classic3/)
with options: l=0.2, u=15, t=tfn (make sure you understand these options).
MC will generate a sparse matrix in CCS
format, which is stored in several files. All the files should share the
same prefix, classic3, in their
names.
- Download the gmeans
code and compile it. Read the README
to understand all the options.

- Run gmeans with option '-a s' (option for spherical k-means algorithm)
on the
*Classic3*and the 300-document matrices, with and without first varition '-f'. A true label file for Classic3 is /stage/projects1/inderjit/cs395t_lsdm/Data/Data.moreinfo/classic3/classic3_truelabels.3; A true label file for the 300-document data set is /stage/projects1/inderjit/cs395t_lsdm/Data/info.classic300/classic300_truelabel

- Run gmeans with option '-a k' (option for information-theoretic
clustering algorithm) on the
*Classic3*and the 300-document matrices, with and without the prior option '-L c', and with and without first varition. - Repeat the above experiments with the
*cmu.news 20_cleaned*matrix. A true label file is /stage/projects1/inderjit/cs395t_lsdm/Data/info.cmu20/cmu.news20_cleaned_truelabels.20

- Run gmeans with option '-a s' (option for spherical k-means algorithm)
on the
- Download this Java
*cluster browser*program to generate web pages illustrating your clustering results. (see a sample browser on clusters of 114,000 NSF award abstracts and a sample browser on clusters of UTCS professors' web pages ) - Code up the non-negative matrix factorization algorithm in Lee &
Seung's paper :"Algorithms for
non-negative matrix factorization". Use the factor matrices to cluster (as
decribed in class and this
paper).

__Answer the following questions:__

1. Report your clustering results on *classic3,* the
300 document set and *cmu.news 20_cleaned*. For each clustering, submit the
confusion matrix (or a summary for the *cmu.news 20_cleaned *data set)*
*and objective function value.

2. Test the clustering programs on your email (or other text collection). How did the
clustering programs perform?

3. Are your clustering results good? If not, explain why.

4. In your opinion which of the clustering techniques is
the best? Why?

*Due date: Nov. 17, 2003*