CS 395T Large-Scale Data Mining
Homework 3
Query retrieval with the vector-space model
The main goal of this homework is to use the vector-space
model for query retrieval and to evaluate its effectiveness.

Familiarize yourself with the med, cisi, cran & TREC
data sets in /stage/projects4/cs395t_lsdm/Data. Note that each data set
comes with queries, relevance judgements and a sample matrix in CCS
(compressed column storage) format; see /stage/projects4/cs395t_lsdm/Data/Data.moreinfo/.

Suppose A is a matrix in CCS format, and x is a dense vector.
Write two C/C++/Java subroutines, A_times_x and Atranspose_times_x,
that multiply A with x and the transpose of A with x, respectively.
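
A minimal C++ sketch of the two products is given here to make the CCS layout concrete. The struct CcsMatrix and its field names (col_ptr, row_idx, val) are assumptions made for illustration, with 0-based indices; the files in the Data directory may use a different layout or 1-based indices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CCS container (field names are assumptions, 0-based indices).
struct CcsMatrix {
    std::size_t nrows;
    std::size_t ncols;
    std::vector<std::size_t> col_ptr;  // size ncols + 1; column j is [col_ptr[j], col_ptr[j+1])
    std::vector<std::size_t> row_idx;  // row index of each stored nonzero
    std::vector<double> val;           // value of each stored nonzero
};

// y = A * x: scatter each column j, scaled by x[j], into y.
std::vector<double> A_times_x(const CcsMatrix& A, const std::vector<double>& x) {
    assert(x.size() == A.ncols);
    std::vector<double> y(A.nrows, 0.0);
    for (std::size_t j = 0; j < A.ncols; ++j) {
        for (std::size_t k = A.col_ptr[j]; k < A.col_ptr[j + 1]; ++k) {
            y[A.row_idx[k]] += A.val[k] * x[j];
        }
    }
    return y;
}

// y = A^T * x: entry j is the dot product of column j of A with x.
std::vector<double> Atranspose_times_x(const CcsMatrix& A, const std::vector<double>& x) {
    assert(x.size() == A.nrows);
    std::vector<double> y(A.ncols, 0.0);
    for (std::size_t j = 0; j < A.ncols; ++j) {
        double sum = 0.0;
        for (std::size_t k = A.col_ptr[j]; k < A.col_ptr[j + 1]; ++k) {
            sum += A.val[k] * x[A.row_idx[k]];
        }
        y[j] = sum;
    }
    return y;
}
```

Both loops visit each stored nonzero exactly once, which is the access pattern CCS is designed for.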

Using the above subroutines, perform query retrieval for queries GroupID*3+1,
GroupID*3+2, GroupID*3+3 for each of the data sets: med, cisi,
cran & TREC.
Note: Query vectors in sparse format are stored in
/stage/projects4/cs395t_lsdm/Data/Data.moreinfo/{med,cisi,cran,TREC_SGML}/query_vector/.
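
As a sketch of the retrieval step: assuming the sample matrices store terms as rows and documents as columns (an assumption worth checking against Data.moreinfo), the product of the transpose of A with a dense query vector gives one score per document, and sorting those scores yields the ranked list. The helper name rank_by_score is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Returns document indices ordered from highest to lowest score.
// stable_sort keeps tied documents in their original index order,
// so the ranking is deterministic.
std::vector<std::size_t> rank_by_score(const std::vector<double>& scores) {
    std::vector<std::size_t> order(scores.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&scores](std::size_t a, std::size_t b) {
                         return scores[a] > scores[b];
                     });
    return order;
}
```

If both the document columns and the query are normalized to unit length, these scores are cosine similarities.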

Try different scaling schemes for both the documents and the queries
(e.g. txn, tfn, lfn).
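
The sketch below illustrates one reading of the tfn scheme, assuming the usual SMART-style convention: the first letter is the local weight (t = raw term frequency, l = logarithmic), the second the global weight (f = inverse document frequency, x = none), and the third the normalization (n = unit Euclidean length, x = none). That reading, the natural-log form log(N/df) of the inverse document frequency, the terms-as-rows layout, and the function name apply_tfn are all assumptions; check them against the course notes.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Rescales the stored values of a CCS term-by-document matrix in place.
// Assumes every stored entry is a genuine nonzero term frequency.
void apply_tfn(std::size_t nrows,
               const std::vector<std::size_t>& col_ptr,
               const std::vector<std::size_t>& row_idx,
               std::vector<double>& val) {
    assert(!col_ptr.empty() && row_idx.size() == val.size());
    const std::size_t ncols = col_ptr.size() - 1;

    // df[i] = number of documents (columns) in which term i occurs.
    std::vector<std::size_t> df(nrows, 0);
    for (std::size_t k = 0; k < row_idx.size(); ++k) ++df[row_idx[k]];

    for (std::size_t j = 0; j < ncols; ++j) {
        double norm_sq = 0.0;
        for (std::size_t k = col_ptr[j]; k < col_ptr[j + 1]; ++k) {
            // 't' keeps the raw frequency, 'f' multiplies by log(N / df).
            const double idf = std::log(static_cast<double>(ncols) /
                                        static_cast<double>(df[row_idx[k]]));
            val[k] *= idf;
            norm_sq += val[k] * val[k];
        }
        if (norm_sq > 0.0) {
            // 'n' scales the column to unit Euclidean length.
            const double inv_norm = 1.0 / std::sqrt(norm_sq);
            for (std::size_t k = col_ptr[j]; k < col_ptr[j + 1]; ++k) val[k] *= inv_norm;
        }
    }
}
```

Under this reading, txn would skip the idf factor and lfn would replace the raw frequency with a logarithmic local weight before the same global and normalization steps.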

Read the paper appendixa.ps.gz at http://trec.nist.gov/pubs/trec3/appendices/A/
to see how to generate recall-precision graphs.
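
As a rough illustration of the data behind such a graph, the sketch below computes interpolated precision at the 11 standard recall levels (0.0, 0.1, ..., 1.0) for a single query, taking the interpolated precision at a level to be the highest precision observed at any recall greater than or equal to that level. The function name and the representation of the relevance judgements as a set of document indices are assumptions; the appendix remains the authoritative description.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <unordered_set>
#include <vector>

// ranking: document indices, best first, with no duplicates.
// relevant: indices of the documents judged relevant for this query.
std::array<double, 11> interpolated_precision(
        const std::vector<std::size_t>& ranking,
        const std::unordered_set<std::size_t>& relevant) {
    std::array<double, 11> levels{};  // all zeros
    const std::size_t R = relevant.size();
    if (R == 0) return levels;

    std::size_t hits = 0;
    for (std::size_t i = 0; i < ranking.size(); ++i) {
        if (relevant.count(ranking[i]) == 0) continue;
        ++hits;
        const double precision =
            static_cast<double>(hits) / static_cast<double>(i + 1);
        // Recall here is hits / R. It covers level l / 10 when
        // hits * 10 >= l * R (integer test avoids rounding issues).
        for (std::size_t l = 0; l <= 10 && l * R <= hits * 10; ++l) {
            levels[l] = std::max(levels[l], precision);
        }
    }
    return levels;
}
```

An average curve over several queries can then be formed by averaging these arrays element by element.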
Answer the following questions:
1. Submit the code for your subroutines.
2. What is the time complexity of your subroutines?
(Hint: they should take time proportional to the number of nonzeros in A.)
Give exact operation counts.
3. What is the R-precision for each of your query
retrieval results?
4. Plot the average precision-recall curves for
the queries assigned to you (scaling scheme tfn.tfn).
5. Are you satisfied with the output of your query
retrieval programs?
6. What scaling scheme worked best in your results?
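
For question 3, R-precision is the precision after R documents have been retrieved, where R is the number of documents judged relevant for the query. A minimal sketch, with a hypothetical function name and the relevance judgements assumed to be available as a set of document indices:

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_set>
#include <vector>

// ranking: document indices, best first.
// relevant: indices of the documents judged relevant for this query.
double r_precision(const std::vector<std::size_t>& ranking,
                   const std::unordered_set<std::size_t>& relevant) {
    const std::size_t R = relevant.size();
    if (R == 0) return 0.0;  // undefined without relevant documents; 0 by convention here
    const std::size_t cutoff = std::min(R, ranking.size());
    std::size_t hits = 0;
    for (std::size_t i = 0; i < cutoff; ++i) {
        if (relevant.count(ranking[i]) != 0) ++hits;
    }
    return static_cast<double>(hits) / static_cast<double>(R);
}
```

For example, with three relevant documents of which two appear in the top three ranks, the R-precision is 2/3.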
Due date: Oct. 11, 2001