CS 395T - Large-Scale Data Mining
Query retrieval with vector-space model
The main goal of this homework is to use the vector-space
model for query retrieval and evaluate its effectiveness.
Answer the following questions:
Familiarize yourself with the med, cisi, cran & TREC
data sets in /stage/projects4/cs395t_lsdm/Data. Note that each data set
comes with queries, relevance judgements and a sample matrix in CCS
format (see /stage/projects4/cs395t_lsdm/Data/Data.moreinfo/).
Suppose A is a matrix in CCS format, and x
is a dense vector. Write two C/C++/Java subroutines, A_times_x and
a companion routine, that multiply A with x, and the transpose of
A with x, respectively.
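The two products can be sketched as follows. This is only an illustrative Python sketch (the assignment itself asks for C/C++/Java), and it assumes the usual three-array CCS layout with 0-based indices: col_ptr (column start offsets, length n_cols + 1), row_ind (row index of each nonzero) and val (the nonzero values). The array names and the function name a_transpose_times_x are hypothetical; the course data files may use a different layout or 1-based indices.

```python
def a_times_x(n_rows, col_ptr, row_ind, val, x):
    """Return y = A x for A in CCS form (assumed 0-based indices)."""
    y = [0.0] * n_rows
    n_cols = len(col_ptr) - 1
    for j in range(n_cols):
        xj = x[j]
        # Scatter column j, scaled by x[j], into the result.
        for k in range(col_ptr[j], col_ptr[j + 1]):
            y[row_ind[k]] += val[k] * xj
    return y


def a_transpose_times_x(col_ptr, row_ind, val, x):
    """Return y = A^T x; entry j is the dot product of column j with x."""
    n_cols = len(col_ptr) - 1
    y = [0.0] * n_cols
    for j in range(n_cols):
        s = 0.0
        for k in range(col_ptr[j], col_ptr[j + 1]):
            s += val[k] * x[row_ind[k]]
        y[j] = s
    return y


# A = [[1, 0, 2],
#      [0, 3, 0]]  stored column by column
col_ptr = [0, 1, 2, 3]
row_ind = [0, 1, 0]
val = [1.0, 3.0, 2.0]

y1 = a_times_x(2, col_ptr, row_ind, val, [1.0, 1.0, 1.0])
y2 = a_transpose_times_x(col_ptr, row_ind, val, [1.0, 2.0])
```

Both loops visit each stored nonzero exactly once, which is the property the complexity question below is about.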
Using the above subroutines, perform query retrieval for queries GroupID*3+1,
GroupID*3+2, GroupID*3+3 for each of the data sets: med, cisi,
cran & TREC.
Note: Query vectors in sparse format are stored in /stage/projects4/cs395t_lsdm/Data/Data.moreinfo/med,
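One way to picture the retrieval step: if the matrix is term-by-document (an assumption here, with each column holding one document), then the transposed product of A with a query vector gives one score per document, and sorting those scores gives the ranking. The Python sketch below is illustrative only; the array names are hypothetical, indices are assumed 0-based, and the scores are cosines only when both the columns and the query have unit length.

```python
def rank_documents(col_ptr, row_ind, val, query):
    """Score every document (column) against a dense query vector.

    Returns (ranking, scores): document indices sorted by decreasing
    score (ties broken by index), and the raw score list.
    """
    n_docs = len(col_ptr) - 1
    scores = []
    for j in range(n_docs):
        s = 0.0
        # Dot product of column j with the query: one entry of A^T q.
        for k in range(col_ptr[j], col_ptr[j + 1]):
            s += val[k] * query[row_ind[k]]
        scores.append(s)
    ranking = sorted(range(n_docs), key=lambda j: (-scores[j], j))
    return ranking, scores


# Three unit-length documents over three terms (toy data).
col_ptr = [0, 1, 3, 4]
row_ind = [0, 0, 1, 2]
val = [1.0, 0.6, 0.8, 1.0]

ranking, scores = rank_documents(col_ptr, row_ind, val, [0.0, 1.0, 0.0])
```

In this toy example only the second document contains the query term, so it is ranked first.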
Try different scaling schemes for both the documents and the queries (e.g.,
tfn, lfn).
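As a rough illustration of what such a scheme does, the Python sketch below reads the three letters in the common SMART-style way: a local weight (t = raw term frequency, l = logarithmic), a global weight (f = inverse document frequency) and n = normalization of each column to unit Euclidean length. The exact formulas used for the course data are not given here, so log(1 + tf) and log(N / df) are assumptions, as are the function and array names.

```python
import math


def scale_columns(n_terms, col_ptr, row_ind, val, local="t"):
    """Return tfn- or lfn-style weights for a CCS term-by-document matrix.

    local="t" keeps the raw term frequency, local="l" uses log(1 + tf).
    Assumes each (term, document) pair is stored at most once.
    """
    n_docs = len(col_ptr) - 1
    df = [0] * n_terms  # number of documents containing each term
    for i in row_ind:
        df[i] += 1
    idf = [math.log(n_docs / d) if d > 0 else 0.0 for d in df]

    scaled = [0.0] * len(val)
    for j in range(n_docs):
        lo, hi = col_ptr[j], col_ptr[j + 1]
        for k in range(lo, hi):
            tf = val[k] if local == "t" else math.log(1.0 + val[k])
            scaled[k] = tf * idf[row_ind[k]]
        norm = math.sqrt(sum(w * w for w in scaled[lo:hi]))
        if norm > 0.0:  # leave all-zero columns untouched
            for k in range(lo, hi):
                scaled[k] /= norm
    return scaled


# Toy term-frequency matrix: 3 terms, 3 documents.
col_ptr = [0, 2, 4, 5]
row_ind = [0, 1, 0, 2, 1]
val = [2.0, 1.0, 1.0, 3.0, 1.0]

w_tfn = scale_columns(3, col_ptr, row_ind, val, local="t")
w_lfn = scale_columns(3, col_ptr, row_ind, val, local="l")
```

A query vector can be scaled the same way by treating it as a one-column matrix and reusing the document frequencies of the collection; that variant is not shown here.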
Read the paper appendixa.ps.gz at http://trec.nist.gov/pubs/trec3/appendices/A/
to see how to generate recall-precision graphs.
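A recall-precision graph of this kind is commonly drawn from interpolated precision at 11 standard recall levels (0.0, 0.1, ..., 1.0), where the interpolated precision at a level is the highest precision observed at any recall greater than or equal to that level. The Python sketch below follows that convention; the function names are hypothetical, and the appendix remains the authority on the exact procedure.

```python
def interpolated_precision(ranking, relevant, levels=11):
    """Interpolated precision at evenly spaced recall levels for one query."""
    hits = 0
    points = []  # (recall, precision) measured at each relevant document
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    curve = []
    for i in range(levels):
        level = i / (levels - 1)
        # Best precision at any recall >= this level (small tolerance
        # guards against floating-point round-off in the comparison).
        candidates = [p for rec, p in points if rec >= level - 1e-12]
        curve.append(max(candidates) if candidates else 0.0)
    return curve


def average_curves(curves):
    """Average several per-query curves level by level."""
    return [sum(col) / len(col) for col in zip(*curves)]


curve = interpolated_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
mean_curve = average_curves([[1.0, 0.0], [0.0, 1.0]])
```

In the example, the first relevant document is found at rank 1 and the second at rank 3, so the curve is 1.0 up to recall 0.5 and 2/3 beyond it.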
1. Submit the code for your subroutines.
2. What is the time complexity of your subroutines?
(Hint: they should take time proportional to the number of nonzeros in A.)
Give exact operation counts.
3. What is the R-precision for each of your queries?
4. Plot the average precision-recall curves for
the queries assigned to you (scaling scheme tfn.tfn)
5. Are you satisfied with the output of your query retrieval?
6. What scaling scheme worked best in your results?
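For question 3, R-precision is usually defined as the precision after R documents have been retrieved, where R is the number of documents judged relevant to the query. A minimal Python sketch, with a hypothetical function name and toy document identifiers:

```python
def r_precision(ranking, relevant):
    """Precision within the top R results, R = number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0  # convention chosen here for queries with no relevant documents
    found = sum(1 for doc in ranking[:r] if doc in relevant)
    return found / r


# Two relevant documents, so only the top two results are inspected.
rp = r_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})
```

Here one of the top two results is relevant, so the R-precision is 0.5.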
Due date: Oct. 11, 2001