* fixed bug on -un option
- How to compile?
* To compile the single-processor version of MC, type "make".
* To compile the single-processor, multi-threaded version, add "-DMULTI_THREAD" to CFLAGS in the Makefile and rebuild everything.
- How to use?
a> Single processor version: type "./mctester [options] [directory name]"
b> Options
For the single-processor version, specify options on the command line. The same options can also be given in the "mpimcconfig" file, with one switch per line and its argument on the following line. The last line of the configuration file is the name of the directory that holds the data to be processed.
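As a rough illustration of the configuration file layout described above, the following Python sketch lays out a command-line option string one token per line. The helper name "to_config_text" is hypothetical and not part of MC. The sketch only uses switches that take an argument, because the description above does not say how a switch without an argument is written in the file.

```python
import shlex

def to_config_text(option_string):
    """Lay out command-line options one token per line.

    Hypothetical helper, not part of MC. Each switch lands on its own
    line with its argument on the following line, and the directory
    name (the last token) ends up on the last line.
    """
    tokens = shlex.split(option_string)
    return "\n".join(tokens) + "\n"

# Options taken from the sample usage in this file (without -v).
config_text = to_config_text("-l 2.5 -u 99.5 -s english.stop -t lxx ./mydataset")
print(config_text, end="")
```

The printed text could then be saved under the name "mpimcconfig".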
Here's a list of switches:
-l n -- lower bound on the percentage of files in which a word must appear to be included in the final matrix. Defaults to 0.0
-u n -- upper bound on the percentage of files in which a word must appear to be included in the final matrix. Defaults to 100.0
-ln n -- lower bound on the number of files in which a word must appear to be included in the final matrix. Defaults to 0.0
-un n -- upper bound on the number of files in which a word must appear to be included in the final matrix. Defaults to the total number of input files
-s filename -- specify a stop word file
-t scalingtype -- specify the type of scaling. The scaling type is a pattern of the form tfn, where:
    t -- the term frequency option: t (term frequency) or l (log term frequency)
    f -- the global frequency scaling factor: x (no scaling), f (inverse global frequency scaling), e (entropy), or 1 (1-norm)
    n -- the normalization scaling factor: x (no normalization), n (normalization), or 1 (1-norm)
Sample scaling patterns could be: txx txn tfn tfx lxx lxn lfn lfx tx1 tf1 lx1 lf1 t1x t1n t11 l1x l1n l11 txxi txni tfni tfxi lxxi lxni lfni lfxi tx1i tf1i lx1i lf1i t1xi t1ni t11i l1xi l1ni l11i, where the i at the end of the pattern indicates that word normalization and document normalization should be performed independently.
-h -- print hash statistics for evaluating the hash function. Do NOT use this option on a large data set: it may slow processing and use more resources.
-v -- verify the matrix after it is generated.
-V -- verify only. Assumes a matrix has already been generated and resides in the same directory. All other options are ignored.
-m int -- specify the minimum word length; any word shorter than this is discarded
-M int -- specify the maximum word length; any word longer than this is discarded
-f regexp -- specifies the file name mask
-k filename -- specifies the keyword file
-i regexp -- include regexp as tokens
-x regexp -- exclude regexp as tokens
-p threadnum -- the number of processing threads
-w parsertype -- specify the word parser type: SGML, HTML, or NORMAL. NORMAL is the default.
-d -- debug/verbose mode output
-o -- include position information in the output
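The scaling patterns accepted by -t can be decoded mechanically. The Python sketch below is only an illustration of the pattern grammar described in the switch list, not MC's own parsing code; the function name "parse_scaling" and the dictionary layout are hypothetical.

```python
# Letter meanings copied from the description of the -t switch.
TERM = {"t": "term frequency", "l": "log term frequency"}
GLOBAL = {"x": "no scaling", "f": "inverse global frequency",
          "e": "entropy", "1": "1-norm"}
NORM = {"x": "no normalization", "n": "normalization", "1": "1-norm"}

def parse_scaling(pattern):
    """Decode a pattern such as 'lxx' or 'tfni' (hypothetical helper)."""
    # A fourth character 'i' requests independent word and document
    # normalization.
    independent = len(pattern) == 4 and pattern.endswith("i")
    core = pattern[:3] if independent else pattern
    if (len(core) != 3 or core[0] not in TERM
            or core[1] not in GLOBAL or core[2] not in NORM):
        raise ValueError("unrecognized scaling pattern: %r" % pattern)
    return {
        "term": TERM[core[0]],
        "global": GLOBAL[core[1]],
        "norm": NORM[core[2]],
        "independent": independent,
    }

print(parse_scaling("lxx"))
print(parse_scaling("tfni"))
```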
d> Sample usages:
To process all the files in the ./mydataset directory and its subdirectories, excluding any words listed in the "english.stop" stop word file, any words that occur in fewer than 2.5% of the files, and any words that occur in more than 99.5% of the files, using the lxx scaling option and verifying the matrix afterwards (-v), type the following command:
%> ./mctester -l 2.5 -u 99.5 -s english.stop -v -t lxx ./mydataset
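The effect of the two percentage bounds in this command can be sketched as a simple filter. This is a minimal Python illustration rather than MC's actual code; the function name "keep_word" and its parameters are hypothetical, and the defaults mirror the documented defaults of -l and -u.

```python
def keep_word(doc_count, total_docs, lower_pct=0.0, upper_pct=100.0):
    """Return True if a word survives the -l / -u percentage bounds.

    doc_count is the number of files the word appears in. A word is
    dropped when it occurs in fewer than lower_pct percent or in more
    than upper_pct percent of the files.
    """
    pct = 100.0 * doc_count / total_docs
    return lower_pct <= pct <= upper_pct

# With 200 input files, -l 2.5 and -u 99.5:
for count in (4, 5, 199, 200):
    print(count, keep_word(count, 200, 2.5, 99.5))
```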
The above command produces 7 output files:
mydataset_docs -- lists all the files that have been processed and their file id
mydataset_words -- lists all the words that appear in those files, are not in the stop word list, and fall within the 2.5% and 99.5% bounds specified by the user.
mydataset_dim -- lists the dimensions of the output vector space model
mydataset_col_ccs -- all the columns in the CCS format
mydataset_row_ccs -- all the rows in the CCS format
mydataset_txx_nz -- all the non-zero elements of the output vector space model with no scaling (txx) in CCS format
mydataset_lxx_nz -- all the non-zero elements of the output vector space model with lxx scaling in CCS format. This file is only produced when the lxx scaling option is specified.
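For readers unfamiliar with CCS (compressed column storage), the Python sketch below expands a small CCS matrix into a dense one. It assumes the usual CCS layout of a column-pointer array, a row-index array, and a value array. How these arrays map onto the _col_ccs, _row_ccs, and _nz files, and how the files are laid out on disk, is not specified here, so the sketch works on in-memory lists and the function name "ccs_to_dense" is hypothetical.

```python
def ccs_to_dense(n_rows, col_ptr, row_idx, values):
    """Expand a CCS matrix into a list of rows (hypothetical helper).

    col_ptr has one entry per column plus one; the entries of column c
    are stored at positions col_ptr[c] .. col_ptr[c + 1] - 1 of
    row_idx and values.
    """
    n_cols = len(col_ptr) - 1
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for col in range(n_cols):
        for k in range(col_ptr[col], col_ptr[col + 1]):
            dense[row_idx[k]][col] = values[k]
    return dense

# A made-up 3 x 2 matrix with three non-zero entries.
col_ptr = [0, 2, 3]
row_idx = [0, 2, 1]
values = [1.0, 2.0, 3.0]
dense = ccs_to_dense(3, col_ptr, row_idx, values)
for row in dense:
    print(row)
```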