CS378 Homework #1
Due: In class, Thursday September 28
Show work for all problems. The coding problems may be implemented in
the language of your choice. Print out all code written and
attach to your homework solutions. Code should be clearly
written and well-commented to receive full credit.
All questions are from the class textbook ("Introduction to Data
Mining" by Tan, Steinbach, and Kumar).
Questions
 Let x = [.3, .4, .1] and let y = [.5, .3, .1]. Compute the
dot product of these vectors. Also compute the
L1, L2, and L3 norms of each of these vectors.
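These computations can be checked by hand, but a minimal Python sketch may serve as a sanity check (the helper names `dot` and `lp_norm` are my own, not part of the assignment):

```python
def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def lp_norm(x, p):
    """L_p norm: (sum_i |x_i|^p)^(1/p)."""
    return sum(abs(a) ** p for a in x) ** (1.0 / p)

# Vectors from the problem statement above.
x = [0.3, 0.4, 0.1]
y = [0.5, 0.3, 0.1]

print(dot(x, y))  # dot product (approximately 0.28)
for p in (1, 2, 3):
    # L1, L2, and L3 norms of each vector
    print(p, lp_norm(x, p), lp_norm(y, p))
```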
 Question #2, p. 89
 Question #13, p. 91

Implement the above K nearest neighbor
algorithm using both Euclidean Distance and also cosine
similarity. Remember that cosine similarity is a similarity
measure, so your K nearest neighbor implementation will need to return the K
instances with largest cosine similarity to the
candidate instance.
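One possible shape for such an implementation, sketched in Python (all names here are illustrative, not prescribed by the assignment; the key point is the reversed sort order for cosine similarity):

```python
import math

def euclidean(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_sim(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def knn(data, query, k, measure="euclidean"):
    """Return the k nearest rows of `data` to `query` as (row_index, score) pairs.

    With Euclidean distance the k *smallest* distances are nearest;
    with cosine similarity the k *largest* similarities are nearest.
    """
    if measure == "euclidean":
        scored = [(i, euclidean(row, query)) for i, row in enumerate(data)]
        scored.sort(key=lambda t: t[1])                 # ascending: small distance = near
    else:
        scored = [(i, cosine_sim(row, query)) for i, row in enumerate(data)]
        scored.sort(key=lambda t: t[1], reverse=True)   # descending: large similarity = near

    return scored[:k]

# Example: three tiny 2-D "documents", queried with [1, 0].
neighbors = knn([[1, 0], [0, 1], [1, 1]], [1, 0], k=2)
```

Note that the two branches sort in opposite directions, which is exactly the similarity-versus-distance distinction mentioned above.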
Test your implementation on a set of documents, each of which
is a scientific abstract
(download here, extract using
the command tar zxf classic3.tar.gz). The file
"classic3_mtx" is a data matrix. Each row of this matrix
corresponds to a document. Row i corresponds to
file i in the 'classic3' directory. Each column of this
matrix corresponds to a particular word in the set of
documents. The set of words in the
data matrix can
be found in the file "classic3_words". Note that rare words
(those that appear fewer than 3 times in the entire corpus)
have been omitted.
Answer/complete the following:
 How has the data been normalized? TF-IDF?
 Compute the term frequency vector for each of the
following phrases: "fatty acids", "lexical indicators for
information retrieval", "supersonic aircraft",
"library". These vectors should be expressed in sparse
format, in terms of nonzero entries. If the vector is
[.2, 0, 0, 0, .3, .8], then the sparse format should be
(0, .2), (4, .3), (5, .8).
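A small helper along these lines can produce the sparse representation (a sketch; it assumes zero-based indices, consistent with the positions of .3 and .8 in the example):

```python
def to_sparse(vec):
    """Convert a dense vector to a list of (index, value) pairs
    for its nonzero entries, using zero-based indices."""
    return [(i, v) for i, v in enumerate(vec) if v != 0]

print(to_sparse([0.2, 0, 0, 0, 0.3, 0.8]))
# -> [(0, 0.2), (4, 0.3), (5, 0.8)]
```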
 Compute the K=3 nearest neighbors for each of these
phrases. Do this for both cosine similarity and Euclidean
distance. Provide both the document row number, as well
as the distance (similarity) values to the given vectors.
 Do your results look reasonable? Why or why not?
 Compare the results between the two distance
measures.
 Provide a mathematical justification for
your answer to the previous question. More specifically,
is there a relationship between cosine similarity and
Euclidean distance? Hint: the normalization of the data is
important.
 Analyze the complexity of your implementation in terms
of the number of floating point operations needed. This
running time should be expressed as a function of the
number of documents in the data matrix and also possibly in terms
of k (this will depend on your implementation). It should
be accurate up to the largest term. For example, if the
exact number of floating point operations is 4n^2 + 2n +
3, you would only need to specify 4n^2.
 Question #16, p. 92
 Question #18, p. 92
 Question #20: parts a, b, c, p. 93