Project 3 for CS 371R: Information Retrieval and Web Search
Web Spidering and PageRanking

Due: Thursday, April 10
(Making web pages: Part 1 due April 1; Part 2 due April 3)

Existing Spiders

As discussed in class, a basic system for spidering the web is available in /u/mooney/ir-code/ir/webutils/. See the Javadoc for this code. Use the main method of the Spider class to start from a particular URL, spider the web breadth-first, and save the documents in a specified directory for subsequent indexing and searching with VSR. Also see the specializations SiteSpider and DirectorySpider, which restrict their crawling to a particular site (host) or directory, respectively.

See a sample trace of running SiteSpider on the UT CS department faculty page to collect 100 pages related to CS faculty.

This assignment will not require using the "-safe" spidering flag that invokes restrictions according to the Robot Exclusion Policy since we will be sticking to spidering within the department. Therefore the Safe* and Robot* classes can be ignored for now. However, if you spider outside the department, be sure to use "-safe".

A collection of 800 department pages SiteSpidered from http://www.cs.utexas.edu/users/mooney/faculty.html is cached in /u/mooney/ir-code/corpora/cs-faculty/. Like yahoo-science, this directory can be indexed and searched using VSR, as in Project 1. This database can also be searched using the Simple Search Engine servlet demo. The code for this servlet is available at /u/mooney/ir-code/irs/ ("irs" is for "information retrieval servlets").

Your Task

Your assignment is to make a specialization of the Spider class called PageRankSpider that computes the PageRanks of the spidered pages based on their link structure, and make a specialization of the InvertedIndex class called PageRankInvertedIndex that utilizes the PageRanks to compute the relevance of documents. You should also create further specializations PageRankSiteSpider and PageRankDirectorySpider that restrict their spidering accordingly. 

While crawling, PageRankSpider should form a graph based on the incoming and outgoing links. Only those pages that are actually saved should be included in the graph as nodes. You may find the ir.webutils.Graph and ir.webutils.Node data structures helpful for building and manipulating the graph. After the crawl, the spider should run the PageRank algorithm on the graph and store all the PageRanks in a file. For the PageRank algorithm's parameters, use 0.15 for alpha and 50 for the number of iterations.
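The iteration described above can be sketched as follows. This is only an illustration under stated assumptions, not the required design: it uses a plain map of out-links instead of ir.webutils.Graph, and it assumes one common PageRank formulation (a uniform rank source of alpha/N plus (1 - alpha) times the rank flowing in over links, followed by normalization so the ranks sum to 1). The class name PageRankSketch and the method pageRank are hypothetical. Check the formulation against the one given in class before relying on it.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not part of the ir.webutils package.
class PageRankSketch {

    /**
     * Computes PageRanks for a link graph.
     * The keys of 'links' are the saved pages (the graph nodes); each value
     * lists the pages that key links to. Targets that are not keys were not
     * saved, so they are ignored rather than added as nodes.
     */
    static Map<String, Double> pageRank(Map<String, List<String>> links,
                                        double alpha, int iterations) {
        int n = links.size();
        Map<String, Double> rank = new HashMap<>();
        if (n == 0) {
            return rank;
        }

        // Count only out-links that stay inside the graph; start ranks uniform.
        Map<String, Integer> outDegree = new HashMap<>();
        for (Map.Entry<String, List<String>> e : links.entrySet()) {
            int count = 0;
            for (String target : e.getValue()) {
                if (links.containsKey(target)) {
                    count++;
                }
            }
            outDegree.put(e.getKey(), count);
            rank.put(e.getKey(), 1.0 / n);
        }

        for (int i = 0; i < iterations; i++) {
            // Every page receives the uniform rank source alpha / N.
            Map<String, Double> next = new HashMap<>();
            for (String page : links.keySet()) {
                next.put(page, alpha / n);
            }
            // Each page splits (1 - alpha) of its rank evenly over its out-links.
            for (Map.Entry<String, List<String>> e : links.entrySet()) {
                int degree = outDegree.get(e.getKey());
                if (degree == 0) {
                    continue; // no in-graph out-links: nothing to distribute
                }
                double share = (1.0 - alpha) * rank.get(e.getKey()) / degree;
                for (String target : e.getValue()) {
                    if (links.containsKey(target)) {
                        next.merge(target, share, Double::sum);
                    }
                }
            }
            // Normalize so the ranks sum to 1 (this also absorbs rank lost
            // at pages with no out-links).
            double total = 0.0;
            for (double value : next.values()) {
                total += value;
            }
            for (Map.Entry<String, Double> e : next.entrySet()) {
                e.setValue(e.getValue() / total);
            }
            rank = next;
        }
        return rank;
    }
}
```

With the parameters given above, the call would be pageRank(links, 0.15, 50). In this sketch a page that links to the same target twice contributes two shares to it; whether duplicate links should count once or twice is a design choice the assignment text does not settle.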

In addition to indexing the documents, PageRankInvertedIndex should read the PageRanks of the documents. When computing the relevance of a document for a query, it should add the document's PageRank, scaled by a weight parameter, to the score. The weight parameter should be a command-line argument for PageRankInvertedIndex, specified with "-weight value".
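As a rough illustration of the scoring rule above, the sketch below parses a "-weight value" pair from the argument list and adds the weighted PageRank to a retrieval score. The class and method names (PageRankScoringSketch, parseWeight, combinedScore) and the fallback default weight are hypothetical; how the weight is threaded into the actual InvertedIndex retrieval code is left to your design.

```java
// Hypothetical sketch; not part of the course code.
class PageRankScoringSketch {

    /**
     * Looks for "-weight value" in the argument list and returns the value.
     * Returns defaultWeight (an assumption of this sketch) if the flag is absent.
     */
    static double parseWeight(String[] args, double defaultWeight) {
        for (int i = 0; i < args.length - 1; i++) {
            if (args[i].equals("-weight")) {
                return Double.parseDouble(args[i + 1]);
            }
        }
        return defaultWeight;
    }

    /** Adds the document's PageRank, scaled by weight, to its retrieval score. */
    static double combinedScore(double retrievalScore, double pageRank, double weight) {
        return retrievalScore + weight * pageRank;
    }
}
```

A weight of 0 leaves the original ranking unchanged, which gives a convenient baseline when comparing against the original InvertedIndex.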

Making Web Pages

As discussed in class, in order to create test data for this assignment, everyone should create a special personal page for this class at http://www.cs.utexas.edu/users/login/ir-course.html by adding an "ir-course.html" file to the /u/login/public_html/ web directory for your login. You might need to set the Linux permissions to make the file readable with the command chmod +r ir-course.html. Make sure that clicking on your link pulls up your new web page.

Part 1 (2.5 points): due April 1
You should include links to at least 5 web pages of courses that you have enjoyed from the list of CS courses located at http://www.cs.utexas.edu/users/mooney/ir-course/proj3/course-list.html. Please include the links exactly as they are given in this list, and don't worry if your favorite class is not included; we are just creating a toy link structure. For example, your "ir-course.html" may look like this.
This simple part must be completed by April 1 and counts for 2.5% of the project grade. All student pages will be linked from http://www.cs.utexas.edu/users/mooney/ir-course/students.html .

Part 2 (2.5 points): due April 3
Once all ir-course.html pages are in place by April 1, select the ir-course.html pages of at least 3 other students and link to them from your page. You may use criteria such as how much you agree with their favorite courses. Again, the bottom line is to make an interesting toy link structure. This part is due by April 3 and carries 2.5% of the project grade. Do not change your page after April 3.

Use your PageRankSiteSpider to crawl from http://www.cs.utexas.edu/users/mooney/ir-course/students.html and index all student course pages. A limit of 100 pages should suffice. Index and search the resulting directory of pages using PageRankInvertedIndex and compare the search results to those obtained with the original InvertedIndex for several different values of weight. Try the following queries:

In your report, analyze the effects of varying weight.

This is a sample solution trace in which PageRankSiteSpider is run on the students' pages. This script file was used to run the code.

This is a sample trace from running PageRankInvertedIndex with a couple of different weights on the query 'favorite classes'. This script file was used to make multiple code executions easier. Here is another example trace with some other queries.

Submission

In submitting your solution, follow the general course instructions on submitting projects on the course homepage.

Along with that, follow these specific instructions for Project 3: