The `MC' Toolkit

MC: A Toolkit for Creating Vector Models from Text Documents

MC is a C++ program that creates vector-space models from text documents for use in text mining applications. It provides an efficient multi-threaded implementation that can process very large document collections. For example, MC took 1,189 seconds and used only 17.5 MBytes of main memory to process a sample collection of about 114,000 documents (the experiment was run on a Sun Ultra10 workstation). More details on MC and its use in a fast clustering algorithm are available in this paper.
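To make the idea of a vector-space model concrete, here is a minimal sketch in C++. It is not MC's actual code or interface: the names VectorSpaceModel, SparseVector, tokenize and add_document are hypothetical, the tokenizer is deliberately simplistic (lowercased runs of letters), and the sketch is single-threaded with no stop-word removal or term weighting. It only shows the core step of mapping each document to a sparse vector of word counts over a shared vocabulary.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical names throughout; this is an illustration, not MC's API.
using SparseVector = std::map<std::size_t, int>;  // word id -> term count

struct VectorSpaceModel {
    std::map<std::string, std::size_t> vocabulary;  // word -> column id
    std::vector<SparseVector> documents;            // one sparse row per document
};

// Split text into lowercase alphabetic tokens (an assumed, simplistic tokenizer).
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalpha(c)) {
            current += static_cast<char>(std::tolower(c));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

// Add one document: assign ids to unseen words and count term occurrences.
void add_document(VectorSpaceModel& model, const std::string& text) {
    SparseVector row;
    for (const std::string& word : tokenize(text)) {
        // emplace leaves an existing entry untouched, so ids stay stable.
        auto it = model.vocabulary.emplace(word, model.vocabulary.size()).first;
        ++row[it->second];
    }
    model.documents.push_back(row);
}
```

In this sketch, add_document would be called once per input file; the sparse rows store only the words that actually occur in a document, which is what keeps memory use small when the vocabulary is large.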

About the program

The MC program:

The application does not:

MC was developed on the Sun Solaris operating system. It is known to compile on Linux platforms. Most UNIX systems should be compatible with MC.

The code is released under the GNU General Public License (GPL).

Citation

You are welcome to use the code under the terms of the license for research or commercial purposes; however, please acknowledge its use with a citation:

   Dhillon, I. S. and Modha, D. S., "Concept Decompositions for Large
   Sparse Text Data using Clustering", Machine Learning,
   42:1, pages 143-175, Jan, 2001.

   Dhillon, I. S. and Fan, J. and Guan, Y., "Efficient Clustering of
   Very Large Document Collections", invited book chapter in Data Mining
   for Scientific and Engineering Applications, Kluwer Academic Publishers, 2001.

Here are the BibTeX entries:

@ARTICLE{dhillon:modha:mlj01,
      AUTHOR = {Dhillon, I. S. and Modha, D. S.},
      TITLE = {Concept decompositions for large sparse text data using clustering},
      JOURNAL = {Machine Learning},
      YEAR = {2001},
      MONTH = {Jan},
      VOLUME = {42},
      NUMBER = {1},
      PAGES = {143--175} }

@INCOLLECTION{dhillon:fan:guan00,
      AUTHOR = {Dhillon, I. S. and Fan, J. and Guan, Y.},
      TITLE = {Efficient Clustering of Very Large Document Collections},
      BOOKTITLE = {Data Mining for Scientific and Engineering Applications},
      PUBLISHER = {Kluwer Academic Publishers},
      EDITOR = {R. Grossman and C. Kamath and V. Kumar and R. Namburu},
      YEAR = {2001},
      PAGES = {},
      NOTE = {Invited book chapter}
}

Obtaining the Source

The latest source code for the program can be downloaded from here.

Unfortunately, we do not have time to help users with all of their compilation and usage problems. Feel free to send email asking for help or giving feedback, but please do not expect a reply in every case. Bug reports accompanied by fixes are most appreciated.

Usage

See README.

Credits


Last updated: 13 March 2001.