CS378 Homework #1
Due: In class, Thursday September 28
Show work for all problems. The coding problems may be implemented in
the language of your choice. Print out all code written and
attach to your homework solutions. Code should be clearly
written and well-commented to receive full credit.
All questions are from the class textbook ("Introduction to Data
Mining" by Tan, Steinbach, and Kumar).
Questions
 Let x = [.3, .4, .1] and let y = [.5, .3, .1]. Compute the
dot product of these vectors. Also compute the
L1, L2, and L3 norms of each of these vectors.
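These computations can be checked by hand, but a minimal Python sketch may serve as a sanity check (the helper names `dot` and `lp_norm` are my own, not part of the assignment):

```python
def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def lp_norm(x, p):
    """L_p norm: (sum_i |x_i|^p)^(1/p)."""
    return sum(abs(a) ** p for a in x) ** (1.0 / p)

# Vectors from the problem statement above.
x = [0.3, 0.4, 0.1]
y = [0.5, 0.3, 0.1]

print(dot(x, y))  # dot product (approximately 0.28)
for p in (1, 2, 3):
    # L1, L2, and L3 norms of each vector
    print(p, lp_norm(x, p), lp_norm(y, p))
```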
 Question #2, p. 89
 Question #13, p. 91

Implement the above K nearest neighbor
algorithm using both Euclidean Distance and also cosine
similarity. Remember that cosine similarity is a similarity
measure, so your K nearest neighbor implementation will need to return the K
instances with largest cosine similarity to the
candidate instance.
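One possible shape for such an implementation, sketched in Python (all names here are illustrative, not prescribed by the assignment; the key point is the reversed sort order for cosine similarity):

```python
import math

def euclidean(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_sim(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def knn(data, query, k, measure="euclidean"):
    """Return the k nearest rows of `data` to `query` as (row_index, score) pairs.

    With Euclidean distance the k *smallest* distances are nearest;
    with cosine similarity the k *largest* similarities are nearest.
    """
    if measure == "euclidean":
        scored = [(i, euclidean(row, query)) for i, row in enumerate(data)]
        scored.sort(key=lambda t: t[1])                 # ascending: small distance = near
    else:
        scored = [(i, cosine_sim(row, query)) for i, row in enumerate(data)]
        scored.sort(key=lambda t: t[1], reverse=True)   # descending: large similarity = near

    return scored[:k]

# Example: three tiny 2-D "documents", queried with [1, 0].
neighbors = knn([[1, 0], [0, 1], [1, 1]], [1, 0], k=2)
```

Note that the two branches sort in opposite directions, which is exactly the similarity-versus-distance distinction mentioned above.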
Test your implementation on a set of documents, each of which
is a scientific abstract
(download here, extract using
the command tar zxf classic3.tar.gz). The file
"classic3_mtx" is a data matrix. Each row of this matrix
corresponds to a document. Row i corresponds to
file i in the 'classic3' directory. Each column of this
matrix corresponds to a particular word in the set of
documents. The set of words in the
data matrix can
be found in the file "classic3_words". Note that rare words
(those that appear fewer than 3 times in the entire corpus)
have been omitted.
Answer/complete the following:
 How has the data been normalized? TF-IDF?
 Compute the term frequency vector for each of the
following phrases: "fatty acids", "lexical indicators for
information retrieval", "supersonic aircraft",
"library". These vectors should be expressed in sparse
format, in terms of nonzero entries. If the vector is
[.2, 0, 0, 0, .3, .8], then the sparse format should be
(0, .2), (4, .3), (5, .8).
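A small helper along these lines can produce the sparse representation (a sketch; it assumes zero-based indices, consistent with the positions of .3 and .8 in the example):

```python
def to_sparse(vec):
    """Convert a dense vector to a list of (index, value) pairs
    for its nonzero entries, using zero-based indices."""
    return [(i, v) for i, v in enumerate(vec) if v != 0]

print(to_sparse([0.2, 0, 0, 0, 0.3, 0.8]))
# -> [(0, 0.2), (4, 0.3), (5, 0.8)]
```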
 Compute the K=3 nearest neighbors for each of these
phrases. Do this for both cosine similarity and Euclidean
distance. Provide both the document row number, as well
as the distance (similarity) values to the given vectors.
 Do your results look reasonable? Why or why not?
 Compare the results between the two distance
measures.
 Provide a mathematical justification for
your answer to the previous question. More specifically,
is there a relationship between cosine similarity and
Euclidean distance? Hint: the normalization of the data is
important.
 Analyze the complexity of your implementation in terms
of the number of floating point operations needed. This
running time should be expressed as a function of the
number of documents in the data matrix and also possibly in terms
of k (this will depend on your implementation). It should
be accurate up to the largest term. For example, if the
exact number of floating point operations is 4n^2 + 2n +
3, you would only need to specify 4n^2.
 Question #16, p. 92
 Question #18, p. 92
 Question #20: parts a, b, c, p. 93