The `MC' Toolkit

MC: A Toolkit for Creating Vector Models from Text Documents

** UPDATE: Thanks to Razvan Surdulescu for sending a patch that makes MC compile with the latest gcc. MC (v 2.29) is now compatible with gcc 3.3.4. **

** UPDATE: Here's a patch from Stefanie Tellex that she had to apply to get MC to build. Some users may find it useful. **

MC is a C++ program that creates vector-space models from text documents for use in text mining applications. MC provides an efficient multi-threaded implementation that can process very large document collections. For example, MC took 1,189 seconds and 17.5 MB of main memory to process a sample collection of about 120,000 documents (the experiment ran on a Sun Ultra10 workstation). More details on MC and its use in a faster clustering algorithm are available in this paper.
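
For readers new to the term, the sketch below illustrates what a vector-space model is: each document becomes a row of term counts over a shared vocabulary. It is not taken from the MC source and does not reflect MC's file formats, options, or threading. The names normalize, VectorModel, and buildModel are hypothetical, and the tokenization (split on whitespace, keep letters only, lowercase) is an assumption made for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of MC): lowercase a token and drop non-letters.
std::string normalize(const std::string& token) {
    std::string out;
    for (unsigned char c : token) {
        if (std::isalpha(c)) out.push_back(static_cast<char>(std::tolower(c)));
    }
    return out;
}

// Toy vector-space model: a sorted vocabulary plus one count row per document.
struct VectorModel {
    std::vector<std::string> vocabulary;     // column labels, in sorted order
    std::vector<std::vector<int> > counts;   // counts[d][t]: occurrences of term t in document d
};

// Build raw term-frequency vectors for a small in-memory collection.
// Assumption: no stop-word removal, stemming, or term weighting is applied.
VectorModel buildModel(const std::vector<std::string>& docs) {
    std::map<std::string, std::size_t> column;                   // word -> column index
    std::vector<std::map<std::string, int> > perDoc(docs.size());

    // Pass 1: count words per document and collect the vocabulary.
    for (std::size_t d = 0; d < docs.size(); ++d) {
        std::istringstream in(docs[d]);
        std::string token;
        while (in >> token) {
            const std::string word = normalize(token);
            if (word.empty()) continue;
            ++perDoc[d][word];
            column[word] = 0;                                    // real index assigned below
        }
    }

    // Assign column indices in sorted (std::map) order.
    VectorModel model;
    for (auto& entry : column) {
        entry.second = model.vocabulary.size();
        model.vocabulary.push_back(entry.first);
    }

    // Pass 2: fill one dense row per document.
    model.counts.assign(docs.size(), std::vector<int>(model.vocabulary.size(), 0));
    for (std::size_t d = 0; d < docs.size(); ++d) {
        for (const auto& entry : perDoc[d]) {
            model.counts[d][column[entry.first]] = entry.second;
        }
    }

    assert(model.counts.size() == docs.size());                  // sanity check: one row per document
    return model;
}
```

Calling buildModel on the two strings "The cat sat." and "the cat and the dog" yields the vocabulary and, cat, dog, sat, the, with one count row per document. A tool meant for large collections would store rows sparsely rather than densely as done here.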

About the program

The MC program:

The application does not:

MC was developed on a SunOS system and is known to compile on Linux. Most UNIX systems should be compatible with MC.

The code is released under the GNU General Public License (GPL).

Citation

You are welcome to use the code under the terms of the license for research or commercial purposes. However, please acknowledge its use with a citation:

   Dhillon, I. S. and Fan, J. and Guan, Y. 
   "Efficient Clustering of Very Large Document Collections" 
   Data Mining for Scientific and Engineering Applications, Kluwer Academic Publishers, 2001

Here is a BibTeX entry:

@INCOLLECTION{dhillon:fan:guan00,
      AUTHOR = {Dhillon, I. S. and Fan, J. and Guan, Y.},
      TITLE = {Efficient Clustering of Very Large Document Collections},
      BOOKTITLE = {Data Mining for Scientific and Engineering Applications},
      PUBLISHER = {Kluwer Academic Publishers},
      EDITOR = {R. Grossman and C. Kamath and V. Kumar and R. Namburu},
      YEAR = {2001},
      PAGES = {},
      NOTE = {Invited Book Chapter}
}

Obtaining the Source

The latest source code for the program can be downloaded from here.

A Linux binary of the latest code, built with the single-threaded model, can be downloaded from here.

Unfortunately, I do not have time to help users with all their compilation and usage problems. Feel free to send me mail asking for help, but please do not expect that I will always have time to respond. Bug reports accompanied by fixes are most appreciated.

Usage

See README.

Last updated: 9 November 2004, jfan@cs.utexas.edu