CS378 Homework #1


Due: In class, Thursday, September 28
Show your work for all problems. The coding problems may be implemented in the language of your choice. Print all code you write and attach it to your homework solutions. Code must be clearly written and well commented to receive full credit.

All questions are from the class textbook ("Introduction to Data Mining" by Tan, Steinbach, and Kumar).

Questions

  1. Let x = [0.3, 0.4, 0.1] and let y = [0.5, 0.3, 0.1]. Compute the dot product of these vectors. Also compute the L1, L2, and L3 norms of each vector.
  2. Question #2, p. 89
  3. Question #13, p. 91
  4. Implement the above K nearest neighbor algorithm using both Euclidean distance and cosine similarity. Remember that cosine similarity is a similarity measure, not a distance, so for cosine your K nearest neighbor implementation must return the K instances with the largest cosine similarity to the candidate instance. Test your implementation on a set of documents, each of which is a scientific abstract (download here, extract using the command tar -zxf classic3.tar.gz). The file "classic3_mtx" is a data matrix. Each row of this matrix corresponds to a document: row i corresponds to file i in the 'classic3' directory. Each column corresponds to a particular word in the set of documents. The words used in the data matrix are listed in the file "classic3_words". Note that rare words (those that appear fewer than 3 times in the entire corpus) have been omitted.
    Answer/complete the following:
  5. Question #16, p. 92
  6. Question #18, p. 92
  7. Question #20: parts a, b, c, p. 93
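
Illustrative sketch for questions 1 and 4. The Python code below is one possible way to organize the dot product, the Lp norm, and a brute-force K nearest neighbor search under both measures; it is not a reference solution. The function names, the list-of-rows input, the tie-breaking rule, and the toy data are all assumptions made for illustration. Loading "classic3_mtx" is not shown, because its exact file format is not described here.

```python
import math

def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def p_norm(x, p):
    """Lp norm: (sum of |x_i|^p) raised to the power 1/p."""
    return sum(abs(a) ** p for a in x) ** (1.0 / p)

def euclidean_distance(x, y):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    """Cosine of the angle between x and y; larger means more similar."""
    denom = p_norm(x, 2) * p_norm(y, 2)
    # Assumption: an all-zero vector is treated as similarity 0.
    return dot(x, y) / denom if denom else 0.0

def k_nearest(rows, query, k, metric="euclidean"):
    """Return the indices of the k rows closest to query.

    Euclidean keeps the k smallest distances; cosine keeps the
    k largest similarities. Ties go to the lower row index
    (an arbitrary choice made for this sketch).
    """
    if metric == "euclidean":
        scored = [(euclidean_distance(r, query), i) for i, r in enumerate(rows)]
        scored.sort()
    elif metric == "cosine":
        scored = [(cosine_similarity(r, query), i) for i, r in enumerate(rows)]
        scored.sort(key=lambda t: -t[0])  # stable sort keeps lower index first
    else:
        raise ValueError("unknown metric: " + metric)
    return [i for _, i in scored[:k]]

# Toy term-count rows (hypothetical data, not taken from classic3).
rows = [[1, 0, 0], [0, 1, 0], [2, 0, 0], [1, 1, 0]]
query = [3, 0, 0]
print(k_nearest(rows, query, 2, "euclidean"))
print(k_nearest(rows, query, 2, "cosine"))
```

On the toy data the two measures rank the rows differently, because cosine similarity ignores vector length while Euclidean distance does not. This matters for documents of different lengths.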