The `MC' Toolkit

MC: A Toolkit for Creating Vector Models from Text Documents

MC is a C++ program that creates vector-space models from text documents for use in text mining applications. MC provides an efficient multi-threaded implementation that can process very large document collections. For example, MC took 1,189 seconds and only 17.5 MBytes of main memory to process a sample collection of about 114,000 documents (the experiment was run on a Sun Ultra10 workstation). More details on MC and its use in a fast clustering algorithm are available in this paper.

About the program

The MC program:

Recursively descends directories, finding text files.
Processes files selectively through full regular expression matching of file names.
Builds a sparse matrix of word/token counts. The particular sparse matrix format used is given here.
Processes any user-specified text formats (such as email addresses or URLs) as single tokens, through regular expression matching or Flex definitions.
Prunes vocabulary by word length and frequency.
Excludes user-specified stop words.
Sets word vector weights according to any of the txx, txn, tfn, tfx, lxx, lxn, lfn, lfx scaling schemes.
Writes all data structures to disk in the Compressed Column Storage format.
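
The steps listed above can be illustrated with a minimal sketch. It is not taken from the MC source: the names CcsMatrix, build_counts and normalize_columns are hypothetical, tokens are split on whitespace only, and the three-array layout is the generic Compressed Column Storage scheme, which may differ in detail from the files MC writes. The sketch also assumes that the trailing "n" in schemes such as txn denotes scaling each document vector to unit Euclidean length, which this page does not define.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for a word-by-document matrix in Compressed
// Column Storage (CCS): one column per document, one row per word.
struct CcsMatrix {
    std::vector<std::string> vocabulary;  // row index -> word
    std::vector<std::size_t> col_ptr;     // column j spans [col_ptr[j], col_ptr[j+1])
    std::vector<std::size_t> row_ind;     // row index of each stored entry
    std::vector<double> values;           // stored entry (count or weight)
};

// Build raw word counts. Words shorter than min_len or listed in
// stop_words are dropped (a simplified stand-in for vocabulary pruning).
CcsMatrix build_counts(const std::vector<std::string>& docs,
                       const std::set<std::string>& stop_words,
                       std::size_t min_len) {
    // First pass: count words per document and collect the vocabulary.
    std::vector<std::map<std::string, double>> per_doc(docs.size());
    std::map<std::string, std::size_t> row_of;
    for (std::size_t j = 0; j < docs.size(); ++j) {
        std::istringstream in(docs[j]);
        std::string word;
        while (in >> word) {
            if (word.size() < min_len || stop_words.count(word)) continue;
            per_doc[j][word] += 1.0;
            row_of[word] = 0;  // placeholder, numbered below
        }
    }

    // Number the words in sorted order so row indices are reproducible.
    CcsMatrix m;
    for (auto& entry : row_of) {
        entry.second = m.vocabulary.size();
        m.vocabulary.push_back(entry.first);
    }

    // Second pass: emit one column per document. Each per-document map is
    // sorted by word, so row indices within a column are ascending.
    m.col_ptr.push_back(0);
    for (const auto& counts : per_doc) {
        for (const auto& wc : counts) {
            m.row_ind.push_back(row_of.at(wc.first));
            m.values.push_back(wc.second);
        }
        m.col_ptr.push_back(m.values.size());
    }
    return m;
}

// Scale every document column to unit Euclidean length (the assumed
// meaning of the "n" normalization component).
void normalize_columns(CcsMatrix& m) {
    assert(!m.col_ptr.empty());  // a built matrix always has col_ptr[0] == 0
    for (std::size_t j = 0; j + 1 < m.col_ptr.size(); ++j) {
        double sum_sq = 0.0;
        for (std::size_t k = m.col_ptr[j]; k < m.col_ptr[j + 1]; ++k)
            sum_sq += m.values[k] * m.values[k];
        if (sum_sq == 0.0) continue;  // empty document: nothing to scale
        const double norm = std::sqrt(sum_sq);
        for (std::size_t k = m.col_ptr[j]; k < m.col_ptr[j + 1]; ++k)
            m.values[k] /= norm;
    }
}
```

Calling build_counts on a list of document strings and then normalize_columns on the result gives one unit-length sparse column per document. Only the nonzero entries are stored, which is what keeps memory use small for large collections with large vocabularies.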

The application does not:

Have English parsing or part-of-speech tagging facilities.
Have complete documentation.
Claim to be bug-free.

MC was developed on the Sun Solaris operating system. It is known to compile on Linux platforms. Most UNIX systems should be compatible with MC.

The code is released under the GNU General Public License (GPL).

Citation

You are welcome to use the code under the terms of the license for research or commercial purposes; however, please acknowledge its use with a citation:

   Dhillon, I. S. and Modha, D. S., "Concept Decompositions for Large
   Sparse Text Data using Clustering", Machine Learning,
   42:1, pages 143-175, Jan, 2001.

   Dhillon, I. S. and Fan, J. and Guan, Y., "Efficient Clustering of
   Very Large Document Collections", invited book chapter in Data Mining
   for Scientific and Engineering Applications, Kluwer Academic Publishers, 2001.

Here are the BibTeX entries:

@ARTICLE{dhillon:modha:mlj01,
      AUTHOR = {Dhillon, I. S. and Modha, D. S.},
      TITLE = {Concept decompositions for large sparse text data using clustering},
      JOURNAL = {Machine Learning},
      YEAR = {2001},
      MONTH = {Jan},
      VOLUME = {42},
      NUMBER = {1},
      PAGES = {143--175} }

@INCOLLECTION{dhillon:fan:guan00,
      AUTHOR = {Dhillon, I. S. and Fan, J. and Guan, Y.},
      TITLE = {Efficient Clustering of Very Large Document Collections},
      BOOKTITLE = {Data Mining for Scientific and Engineering Applications},
      PUBLISHER = {Kluwer Academic Publishers},
      EDITOR = {R. Grossman and C. Kamath and V. Kumar and R. Namburu},
      YEAR = {2001},
      PAGES = {},
      NOTE = {Invited book chapter}
}

Obtaining the Source

The latest source code for the program can be downloaded from here.

Unfortunately, we do not have time to help users with all of their compilation and usage problems. Feel free to send email asking for help or giving feedback, but please do not expect a reply in every case. Bug reports accompanied by fixes are most appreciated.

Usage

See README.

Credits

This code grew out of a class project for the course Large-Scale Data Mining (Spring 2000).
The main developer of the code is James Fan.

Last updated: 13 March 2001.