Co-clustering Software (Version 1.0)


The program

Co-cluster (Version 1.0) is a C++ program written by Hyuk Cho, Yuqiang Guan and Suvrit Sra. It implements three co-clustering algorithms: the information-theoretic co-clustering algorithm and two types of minimum sum-squared residue co-clustering algorithms (see the papers for details). In our implementation, all the algorithms have a ping-pong structure, i.e., a batch algorithm followed by a corresponding chain of first variations. Each algorithm also has five variations, which differ in the order in which the row and column centroids are updated.
    1. The input matrix to be co-clustered can be either a sparse matrix in CCS (compressed column storage) format or a dense matrix. For a sparse matrix, the sample input files are example1_col_ccs, example1_dim, example1_row_ccs and example1_txx_nz. For a dense matrix, the sample files are example1_dim and example1. The '-F' option controls the format of the input matrix.
    2. The initial seeding file may be in one of two formats. The simple format has only two lines: the first line contains the cluster ID of each row of the input matrix, and the second line contains the cluster ID of each column. A more complicated format describes each co-cluster by giving the number of rows and columns in that co-cluster and the IDs of those rows and columns. The clustering output also uses these two formats. The '-i' option controls initial seeding.
    3. Sometimes a true label file exists for the columns, the rows, or both. The '-T' option controls this.
    4. The output file formats are the same as those of the initial seeding files, because we may sometimes want to initialize the clustering with the output of a previous run. The '-O' option controls this.
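As an illustration of the simple seeding format described above, here is a minimal C++ sketch of a reader for the two-line layout. It is not taken from the Co-cluster source: the names SimpleSeeding, parseIdLine and readSimpleSeeding are hypothetical, and the sketch assumes the cluster IDs on each line are whitespace-separated integers, which this description does not confirm.

```cpp
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container: one cluster ID per row and one per column.
struct SimpleSeeding {
    std::vector<int> rowCluster;  // rowCluster[i] = cluster ID of row i
    std::vector<int> colCluster;  // colCluster[j] = cluster ID of column j
};

// Parse one line of integers. Whitespace separation is an assumption.
static std::vector<int> parseIdLine(const std::string& line) {
    std::istringstream in(line);
    std::vector<int> ids;
    int id;
    while (in >> id) ids.push_back(id);
    return ids;
}

// Read the two-line simple format: row IDs first, then column IDs.
SimpleSeeding readSimpleSeeding(std::istream& in) {
    std::string rowLine, colLine;
    if (!std::getline(in, rowLine) || !std::getline(in, colLine))
        throw std::runtime_error("simple seeding needs two lines");
    return SimpleSeeding{parseIdLine(rowLine), parseIdLine(colLine)};
}
```

Because the output files share the seeding formats, a reader of this kind could in principle also load the result of a previous run, provided that run wrote the simple format.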
    1. '-a' chooses the co-clustering algorithm: information-theoretic co-clustering (default) or one of the minimum sum-squared residue co-clusterings
    2. '-c' gives number of column clusters and '-r' gives number of row clusters
    3. '-F' specifies input matrix format.
    4. '-t' specifies the scaling method (CCS format only)
    5. '-T' specifies true label files
    6. '-O' specifies output file names
    7. '-l' gives the first variation chain length (default 0)
    8. '-R' gives the number of random runs; this produces the average objective function value and the variance
    9. '-p' gives different prior options
    10. '-e' gives the threshold for the batch loop and the first variation
    11. '-d' selects the level of dump information
    12. '-I' allows the rows of the input matrix to be negated
    13. '-V' gives the variation of the selected algorithm (see the '-a' option)
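
For readers unfamiliar with the CCS input format selected by '-F', the following C++ sketch shows the generic compressed column storage layout: a column pointer array, a row index array and a value array. It is an illustration only. The names CcsMatrix and denseToCcs are hypothetical, and how Co-cluster splits this information across its _dim, _col_ccs, _row_ccs and _nz files is not described here, so no mapping to those files is assumed.

```cpp
#include <cstddef>
#include <vector>

// Generic compressed column storage (CCS); field names are illustrative.
struct CcsMatrix {
    std::size_t rows = 0, cols = 0;
    // colPtr has cols + 1 entries; the nonzeros of column j occupy
    // positions [colPtr[j], colPtr[j + 1]) of rowIdx and values.
    std::vector<std::size_t> colPtr;
    std::vector<std::size_t> rowIdx;  // row index of each stored nonzero
    std::vector<double> values;       // value of each stored nonzero
};

// Convert a dense matrix (a vector of rows) into CCS, skipping exact zeros.
CcsMatrix denseToCcs(const std::vector<std::vector<double>>& dense) {
    CcsMatrix m;
    m.rows = dense.size();
    m.cols = dense.empty() ? 0 : dense[0].size();
    m.colPtr.push_back(0);
    for (std::size_t j = 0; j < m.cols; ++j) {      // walk column by column
        for (std::size_t i = 0; i < m.rows; ++i) {
            if (dense[i][j] != 0.0) {
                m.rowIdx.push_back(i);
                m.values.push_back(dense[i][j]);
            }
        }
        m.colPtr.push_back(m.values.size());        // end of column j
    }
    return m;
}
```

Storing by column means an empty column costs only one repeated pointer entry, which is why the layout suits very sparse input matrices.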



You are welcome to use the code under the terms of the GNU General Public License (GPL); however, please acknowledge its use with a citation: