Ans 1) Zipf's law: p(r) = A/r = 0.1/r Cumulative Sum p(1) = 10% 10% p(2) = 5% 15% p(3) = 3.33% 18.33% p(4) = 2.5% 20.83% p(5) = 2% 22.83% p(6) = 1.67% 24.5% p(7) = 1.43% 25.93% p(8) = 1.25% 27.18% p(9) = 1.11% 28.29% p(10) = 1% 29.29% p(11) = 0.91% 30.20% Minimum m is 11. Ans 2) and dollars fifty five hundred six sixty thousand (a) 1/3 1/3 1/3 3/3 1/3 0 0 1/3 (b) 0 1/2 1/2 2/2 1/2 1/2 1/2 1/2 length(a) = sqrt(1/9 * (1^2 + 1^2 + 1^2 + 3^2 + 1^2 + 0^2 + 0^2 + 1^2)) = 1/3 * sqrt(14) = 1.25 length(b) = sqrt(1/4 * (0^2 + 1^2 + 1^2 + 2^2 + 1^2 + 1^2 + 1^2 + 1^2)) = 1/2 * sqrt(10) = 1.58 cos(a,b) = (1/3 * 0 + 1/3 * 1/2 + 1/3 * 1/2 + 3/3 * 2/2 + 1/3 * 1/2 + 0 * 1/2 + 0 * 1/2 + 1/3 * 1/2) / (1.25 * 1.58) = 1/6 * (0 + 1 + 1 + 6 + 1 + 0 + 0 + 1) / (1.25 * 1.58) = 0.85 Ans 3) Doc1: school business Doc2: school social work Doc3: school public affairs affairs -> (Doc3, 1) business -> (Doc1, 1) public -> (Doc3, 1) school -> (Doc1, 1) -> (Doc2, 1) -> (Doc3, 1) social -> (Doc2, 1) work -> (Doc2, 1) Ans 4) Web page visit order Status -------------------- ------ A (initial) Indexed B (from A) Indexed D (from A) Indexed E (from A) Indexed C (from B) Indexed E (from B) Already visited G (from D) Indexed F (from C) Indexed G (from C) Already visited E (from G) Already visited Indexing order in BFS is: A, B, D, E, C, G, F Ans 5) This tag specifies whether a robot may index this page or follow the links on it. noindex: indexing of this page is disallowed follow: following links on this page is allowed Ans 6) ir-course> java ir.webutils.HITS tgraph 3 D A C B E Iteration 1: A = [0.0, 0.0, 2.0, 2.0, 1.0] H = [4.0, 5.0, 0.0, 0.0, 0.0] Norm A = [0.0, 0.0, 0.6666666666666666, 0.6666666666666666, 0.3333333333333333] Norm H = [0.6246950475544243, 0.7808688094430304, 0.0, 0.0, 0.0] Iteration 2: A = [0.0, 0.0, 1.4055638569974547, 1.4055638569974547, 0.7808688094430304] H = [2.8111277139949093, 3.5919965234379396, 0.0, 0.0, 0.0] Norm A = [0.0, 0.0, 0.6581451817144176, 0.6581451817144176, 0.36563621206356534] Norm H = [0.6163082616581106, 0.7875050010075858, 0.0, 0.0, 0.0] Iteration 3: A = [0.0, 0.0, 1.4038132626656963, 1.4038132626656963, 0.7875050010075858] H = [2.8076265253313926, 3.5951315263389785, 0.0, 0.0, 0.0] Norm A = [0.0, 0.0, 0.6572842735788008, 0.6572842735788008, 0.36872044615396143] Norm H = [0.615498370759936, 0.7881381576804058, 0.0, 0.0, 0.0] Ans 7) ir-course> java ir.webutils.PageRank tgraph2 .15 3 A C B R = [0.3333333333333333, 0.3333333333333333, 0.3333333333333333] E = [0.049999999999999996, 0.049999999999999996, 0.049999999999999996] Iteration 1: R' = [0.049999999999999996, 0.475, 0.19166666666666665] Norm R = [0.06976744186046512, 0.6627906976744186, 0.26744186046511625] Iteration 2: R' = [0.049999999999999996, 0.3069767441860465, 0.07965116279069767] Norm R = [0.11451398135818909, 0.7030625832223701, 0.18242343541944075] Iteration 3: R' = [0.049999999999999996, 0.253728362183755, 0.09866844207723036] Norm R = [0.12425545996029119, 0.6305426869622767, 0.24520185307743217] Ans 8) Test document: turkey soda Similarity to "turkey stuffing" = 1/2 Similarity to "buffalo wings" = 0 Similarity to "cream soda" = 1/2 Similarity to "orange soda" = 1/2 The 3 nearest neighbors are: "turkey stuffing" (Food), "cream soda" (Beverage), and "orange soda" (Beverage). So the final category is Beverage. With 1-NN, the result could either be Food or Beverage, depending on which of the 3 documents having equal similarity score is arbitrarily selected. So the result would not be guaranteed to be the same with 1-NN. Ans 9) a) the carbon atom is the foundation of life on earth P(Physics) = 0.35 x 0.2 x 0.01 x 0.9 x 0.001 x 0.005 / Z = 3.15e-9 / Z P(Biology) = 0.4 x 0.01 x 0.1 x 0.999 x 0.2 x 0.008 / Z = 6.39e-7 / Z P(Chemistry) = 0.25 x 0.2 x 0.05 x 0.95 x 0.005 x 0.01 / Z = 1.19e-7 / Z Z = 3.15e-9 + 6.39e-7 + 1.19e-7 = 7.61e-7 P(Physics) = 0.004 P(Biology) = 0.840 P(Chemistry) = 0.156 b) the carbon atom contains 12 protons P(Physics) = 0.35 x 0.2 x 0.01 x 0.1 x 0.999 x 0.995 / Z = 6.96e-5 P(Biology) = 0.4 x 0.01 x 0.1 x 0.001 x 0.8 x 0.992 / Z = 3.17e-7 P(Chemistry) = 0.25 x 0.2 x 0.05 x 0.05 x 0.995 x 0.99 / Z = 1.23e-4 Z = 6.96e-5 + 3.17e-7 + 1.23e-4 = 1.93e-4 P(Physics) = 0.360 P(Biolody) = 0.002 P(Chemistry) = 0.638 Ans 10) go karting monster Doc1 2 0 1 length: 2.24 Doc2 1 1 0 1.41 Doc3 0 1 1 1.41 Doc4 0 0 2 2 Iteration 1: go karting monster Mean1 2 0 1 length: 2.24 Mean2 0 1 1 1.41 cos(Doc2, Mean1) = 2/(1.41*2.24) = 0.63 cos(Doc2, Mean2) = 1/(1.41*1.41) = 0.5 Doc2 is assigned to Cluster1 cos(Doc4, Mean1) = 2/(2*2.24) = 0.45 cos(Doc4, Mean2) = 2/(2*1.41) = 0.71 Doc4 is assigned to Cluster2 Iteration 2: go karting monster Mean1 1.5 0.5 0.5 length: 1.66 Mean2 0 0.5 1.5 1.58 cos(Doc1, Mean1) = 3.5/(2.24*1.66) = 0.94 cos(Doc1, Mean2) = 1.5/(2.24*1.58) = 0.42 Doc1 is assigned to Cluster1 cos(Doc2, Mean1) = 2/(1.41*1.66) = 0.85 cos(Doc2, Mean2) = 0.5/(1.41*1.58) = 0.22 Doc2 is assigned to Cluster1 cos(Doc3, Mean1) = 1/(1.41*1.66) = 0.43 cos(Doc3, Mean2) = 2/(1.41*1.58) = 0.89 Doc3 is assigned to Cluster2 cos(Doc4, Mean1) = 1/(2*1.66) = 0.30 cos(Doc4, Mean2) = 3/(2*1.58) = 0.95 Doc4 is assigned to Cluster2 -----> Convergence Therefore, the final clustering is: Cluster1: Doc1, Doc2 Cluster2: Doc3, Doc4 Ans 11) w(a,1) = s(a,1) * c(a,1) = 3/50 * 0.88 = 0.0528 w(a,2) = s(a,2) * c(a,2) = 3/50 * -0.21 = -0.0126 w(a,3) = s(a,3) * c(a,3) = 2/50 * 1 = 0.04 w(a,4) = s(a,4) * c(a,4) = 2/50 * -1 = -0.04 The 2 most similar neighbors are users 1 and 3. p(a,D) = 14/3 + [(0.0528 * (9-26/5)) + (0.04 * (6-21/4))] / (0.0528+0.04) = 14/3 + 0.23064 / 0.0928 = 7.15 p(a,E) = 14/3 + [(0.0528 * (1-26/5)) + (0.04 * (3-21/4))] / (0.0528+0.04) = 14/3 - 0.31176 / 0.0928 = 1.31 Ans 12) a) IDF stands for "inverse document frequency" and it increases as the number of documents in which a term appears decreases. Its purpose is to give more weight to rare words when computing similarity of documents. b) In the VSR model, cosine similarity is used to determine the most similar document to a query. Since cosine depends on the angle and is independent of the length of the vectors, length is irrelevant. c) As the number of documents retrieved increases, the number of relevant documents retrieved increases, so recall increases; but the number of irrelevant documents also tends to increase, thereby decreasing precision. d) Pseudo relevance feedback assumes that top m retrieved documents are relevant, and uses them to reformulate the query using a relevance feedback algorithm like Ide or Rocchio, but without any user interaction. This allows for query expansion that includes terms that are correlated with the query terms. e) Zipf's law says that most words are rare and do not occur in most documents, therefore, when an inverted index is used to return all documents in which a query word appears, it typically narrows the search to a small number of documents in which the word appears. f) The biggest delay in spidering is the latency in transmitting documents over the Internet. A multi-threaded spider is able to have multiple parallel threads waiting on the retrieval of multiple documents at once while continuing to make new requests instead of stalling waiting on the retireval of each document in turn, thereby increasing throughput. Also, if properly mangaged, a multi-threaded spider is able to dsitribute its requests acorss multiple hosts, thereby increasing throughput further and avoiding hammering a single host which could be misinterpretted as a denial of service attack. g) A robots.txt file in the top level directory of a web site gives information on the "robot exclusion" convention specifying what directories spiders should not traverse. h) By clustering results into coherent groups, the results of an ambiguous query are grouped into sets which refer to the same meaning, making it easier for the user to find the relevant set of results that refer to the desired meaning. i) Model underlying web-browsing behavior, which says that a person browsing the web starts from a page and then moves probabilistically to one of the outlinks of the current page, or to a random page with a small probability. This model assumption helps analysis of link-based web-search algorithms like PageRank. j) Link-analysis by the PageRank algorithm k) Rocchio is a prototype-based model, and hence can have problems with polymorphic (disjunctive) categories. Nearest Neighbour handles such categories better. l) Smoothing is a regularization technique in probability estimation, often done by adding "shadow" counts to actual observed counts when estimating probabilities. Without smoothing, a word unobserved in the training set would have a 0 probability of occurrence, which would give a 0 probability to any test document having that word. m) KMeans runs in time O(nki) [n points, k clusters, i iterations], and is very fast in practice. HAC runs in time O(n^2), and is slower in practice. n) Semi-supervised learning uses small amounts of labeled training data, aided by large amounts of unlabeled data, in categorization of documents. o) 1. Cold start - Need enough other users in the system to find a match 2. First rater - Cannot recommend an item that has not been previously recommended/ Other possible answers: Sparsity, Popularity Bias. p) Significance weighting weights the correlation measure based on the number of co-rated items, thus reducing the chance of spurious correlations based on a small number of co-rated items. q) Writing rules and patterns that identify topics or substrings in text is a dififcult time-consuming engineering task and people can easily overlook important regularities. It is much easier for humans to just label documents and allow a learning algorithm to automatically identify a complete and effective set of patterns. r) Record linkage is needed to match product listings from different sites to the correct underlying items, thereby allowing price comparisons. s) A game related to the phenomenon of "six degrees of separation" in social networks. Given the name of an actor or actress, the goal is to find the shortest path from this person to Kevin Bacon in a graph where nodes are individuals and there is an egde beteween two individuals iff they have acted in some movie together. t) He is Director of the World Wide Web Consortium (W3C) headquartered at MIT.