CS 395T - Large-Scale Data Mining

Homework 3

Query retrieval with vector-space model

The main goal of this homework is to use the vector-space model for query retrieval and evaluate its effectiveness.

Familiarize yourself with the med, cisi, cran & TREC data sets in /stage/projects4/cs395t_lsdm/Data. Note that each data set comes with queries, relevance judgements, and a sample matrix in compressed column storage (CCS) format (see /stage/projects4/cs395t_lsdm/Data/Data.moreinfo/).
Suppose A is a matrix in CCS format and x is a dense vector. Write two C/C++/Java subroutines, A_times_x and Atranspose_times_x, that compute the product of A with x and the product of the transpose of A with x, respectively.
Using the above subroutines, perform query retrieval for queries GroupID*3+1, GroupID*3+2, GroupID*3+3 for each of the data sets: med, cisi, cran & TREC.
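To make the CCS layout concrete, here is a minimal C++ sketch that builds the three CCS arrays from a small dense matrix. The struct name CcsMatrix, its field names, the function dense_to_ccs, and the use of 0-based indices are all assumptions made for illustration; the on-disk layout of the course files is described in Data.moreinfo and may differ (for example, in index base or file naming).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical in-memory CCS container (names are illustrative, not from the course files).
struct CcsMatrix {
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<std::size_t> col_ptr;  // size cols + 1; column j occupies [col_ptr[j], col_ptr[j + 1])
    std::vector<std::size_t> row_ind;  // row index of each stored nonzero (0-based here, by assumption)
    std::vector<double> values;        // nonzero values, stored column by column
};

// Build a CCS matrix from a dense, rectangular matrix given as dense[row][col].
CcsMatrix dense_to_ccs(const std::vector<std::vector<double>>& dense) {
    CcsMatrix a;
    a.rows = dense.size();
    a.cols = a.rows ? dense[0].size() : 0;
    a.col_ptr.assign(a.cols + 1, 0);
    for (std::size_t j = 0; j < a.cols; ++j) {
        for (std::size_t i = 0; i < a.rows; ++i) {
            assert(dense[i].size() == a.cols);  // every row must have the same length
            if (dense[i][j] != 0.0) {
                a.row_ind.push_back(i);
                a.values.push_back(dense[i][j]);
            }
        }
        // Everything stored so far belongs to columns 0..j.
        a.col_ptr[j + 1] = a.values.size();
    }
    return a;
}
```

For the 3 x 3 matrix with rows (1, 0, 2), (0, 0, 3), (4, 5, 0), this yields col_ptr = {0, 2, 3, 5}, row_ind = {0, 2, 2, 0, 1} and values = {1, 4, 5, 2, 3}. Because the nonzeros of one column are contiguous, a routine that walks the matrix column by column touches each stored nonzero exactly once.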

Note: Query vectors in sparse format are stored in

Try different scaling schemes for both the documents and the queries (e.g. txn, tfn, lfn).
Read the paper appendixa.ps.gz at http://trec.nist.gov/pubs/trec3/appendices/A/ to see how to generate recall-precision graphs.
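As an illustration of one of the scaling schemes named above, the following C++ sketch computes tfn weights on a small dense term-document count matrix. It assumes the usual SMART-style reading of the three letters: t = raw term frequency, f = inverse document frequency log(N / df), n = scaling each document vector to unit Euclidean length. The function name tfn_weights and the dense counts[term][doc] layout are hypothetical, chosen only to keep the example short; real data would be in CCS format.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// counts[i][j] = raw frequency of term i in document j (hypothetical dense layout).
// Returns tfn weights under the assumed reading: tf * log(N / df), then unit-length columns.
std::vector<std::vector<double>> tfn_weights(const std::vector<std::vector<double>>& counts) {
    const std::size_t n_terms = counts.size();
    const std::size_t n_docs = n_terms ? counts[0].size() : 0;
    std::vector<std::vector<double>> w(n_terms, std::vector<double>(n_docs, 0.0));

    // "t" and "f": raw term frequency times the inverse document frequency of the term.
    for (std::size_t i = 0; i < n_terms; ++i) {
        assert(counts[i].size() == n_docs);  // matrix must be rectangular
        std::size_t df = 0;                  // number of documents containing term i
        for (std::size_t j = 0; j < n_docs; ++j)
            if (counts[i][j] > 0.0) ++df;
        if (df == 0) continue;               // unused term: its row stays zero
        const double idf = std::log(static_cast<double>(n_docs) / static_cast<double>(df));
        for (std::size_t j = 0; j < n_docs; ++j)
            w[i][j] = counts[i][j] * idf;
    }

    // "n": scale each document (column) to unit Euclidean length.
    for (std::size_t j = 0; j < n_docs; ++j) {
        double sq = 0.0;
        for (std::size_t i = 0; i < n_terms; ++i) sq += w[i][j] * w[i][j];
        if (sq == 0.0) continue;             // all-zero document: leave as is
        const double inv = 1.0 / std::sqrt(sq);
        for (std::size_t i = 0; i < n_terms; ++i) w[i][j] *= inv;
    }
    return w;
}
```

Under this reading, a term that occurs in every document receives weight zero, and the base of the logarithm does not matter because the final normalization removes any constant factor. The txn scheme would skip the idf factor, and lfn would replace the raw frequency with a logarithmic one; the exact form of that logarithm varies between implementations.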

Answer the following questions:

    1. Submit the code for your subroutines.
    2. What is the time complexity of your subroutines? (Hint: they should take time proportional to the number of nonzeros in A.) Give exact operation counts.
    3. What is the R-precision for each of your query retrieval results?
    4. Plot the average precision-recall curves for the queries assigned to you (use scaling scheme tfn.tfn, i.e. tfn for both the documents and the queries).
    5. Are you satisfied with the output of your query retrieval programs?
    6. What scaling scheme worked best in your results?
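
For questions 3 and 4, the two evaluation measures can be sketched in C++ as follows. This is a minimal sketch, assuming the standard TREC definitions: R-precision is the precision after R documents have been retrieved, where R is the number of documents relevant to the query, and the interpolated precision at a recall level is the maximum precision observed at any point whose recall is at least that level. The function names and the use of integer document ids are hypothetical, and the ranked list is assumed to contain no duplicate ids.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// ranked: document ids in decreasing score order; relevant: ids judged relevant for the query.
double r_precision(const std::vector<int>& ranked, const std::set<int>& relevant) {
    const std::size_t R = relevant.size();
    if (R == 0) return 0.0;
    std::size_t hits = 0;
    for (std::size_t k = 0; k < R && k < ranked.size(); ++k)
        if (relevant.count(ranked[k])) ++hits;
    return static_cast<double>(hits) / static_cast<double>(R);
}

// Interpolated precision at the 11 recall levels 0.0, 0.1, ..., 1.0.
std::vector<double> interpolated_precision_11pt(const std::vector<int>& ranked,
                                                const std::set<int>& relevant) {
    std::vector<double> out(11, 0.0);
    const std::size_t R = relevant.size();
    if (R == 0) return out;
    std::size_t hits = 0;
    for (std::size_t k = 0; k < ranked.size(); ++k) {
        if (!relevant.count(ranked[k])) continue;
        ++hits;
        assert(hits <= R);  // holds when ranked has no duplicate ids
        const double recall = static_cast<double>(hits) / static_cast<double>(R);
        const double precision = static_cast<double>(hits) / static_cast<double>(k + 1);
        // This point counts toward every recall level it reaches or exceeds.
        for (std::size_t level = 0; level <= 10; ++level)
            if (level / 10.0 <= recall + 1e-12)
                out[level] = std::max(out[level], precision);
    }
    return out;
}
```

For example, with the ranking {3, 7, 1, 9, 4} and relevant set {1, 3, 5}, R is 3 and two of the top three documents are relevant, so the R-precision is 2/3. The interpolated precision is 1 at recall levels 0.0 to 0.3, 2/3 at levels 0.4 to 0.6, and 0 from 0.7 upward, since the third relevant document is never retrieved. An average curve over several queries can then be obtained by averaging the 11 values level by level.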

Due date: Oct. 11, 2001