As discussed in class, a basic system for spidering the web is available in
/u/mooney/ir-code/ir/webutils/. See the Javadoc for this code. Use the main
method for the Spider class to start from a particular URL and spider the web
breadth-first and save the documents in a specified directory for subsequent
indexing and searching with VSR. Also see the specializations SiteSpider and
DirectorySpider, which restrict their crawling to a particular site (host) or
directory, respectively.
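To make the breadth-first, site-restricted crawl order concrete, here is a minimal sketch. It does not use the actual Spider or SiteSpider API: the class name BreadthFirstCrawlSketch, the crawl method, and the in-memory link map (standing in for links extracted from fetched pages) are all assumptions made for illustration.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

/** Hypothetical illustration; not part of ir.webutils. */
class BreadthFirstCrawlSketch {

    /**
     * Visits pages in breadth-first order starting from start.
     * links is an in-memory stand-in for "the links found on each page".
     * If sameHostOnly is true, links to other hosts are skipped,
     * roughly the restriction a site-limited spider applies.
     */
    static List<String> crawl(String start, Map<String, List<String>> links,
                              int maxPages, boolean sameHostOnly) {
        String startHost = URI.create(start).getHost();
        Deque<String> queue = new ArrayDeque<>();   // FIFO queue gives breadth-first order
        Set<String> seen = new HashSet<>();         // URLs already queued
        List<String> visited = new ArrayList<>();   // pages "saved", in visit order

        queue.add(start);
        seen.add(start);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.remove();
            visited.add(url);
            for (String next : links.getOrDefault(url, Collections.emptyList())) {
                boolean sameHost = Objects.equals(startHost, URI.create(next).getHost());
                if (sameHostOnly && !sameHost) {
                    continue;                       // outside the starting site
                }
                if (seen.add(next)) {               // add() is false for URLs seen before
                    queue.add(next);
                }
            }
        }
        return visited;
    }
}
```

A real spider would download each page and extract its links instead of looking them up in a map; the queue and the seen set are the parts that carry over.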
See a sample trace of running
SiteSpider on the UT CS department faculty page to collect
100 pages related to CS faculty.
This assignment will not require using the "-safe" spidering flag that invokes
restrictions according to the Robot Exclusion Policy since we will be sticking
to spidering within the department. Therefore the Safe* and Robot* classes can
be ignored for now. However, if you spider outside the department, be sure to
use "-safe".
A collection of 800 department pages SiteSpidered from
http://www.cs.utexas.edu/users/mooney/faculty.html is cached in
/u/mooney/ir-code/corpora/cs-faculty/. Like yahoo-science, this directory can
be indexed and searched using VSR, as in Project 1. This database can also be
searched using the Simple Search Engine servlet demo. The code for this
servlet is available at /u/mooney/ir-code/irs/ ("irs" is for "information
retrieval servlets").
Your assignment is to make a specialization of the Spider class called
PageRankSpider that computes the PageRanks of the spidered pages based on
their link structure, and make a specialization of the InvertedIndex class
called PageRankInvertedIndex that utilizes the PageRanks to compute the
relevance of documents. You should also create further specializations
PageRankSiteSpider and PageRankDirectorySpider that restrict their spidering
accordingly.
While crawling, PageRankSpider should form a graph based on the incoming and
outgoing links. Only those pages which are actually saved should be included
in the graph as nodes. You may find the ir.webutils.Graph and
ir.webutils.Node data structures helpful for building and manipulating the
graph. Once crawling is done, it should run the PageRank algorithm on the
graph and store all the PageRanks in a file. With respect to the PageRank
algorithm's parameters, use 0.15 for alpha and 50 for the number of
iterations.
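As a rough illustration of the iteration described above, here is a minimal sketch of one common PageRank formulation, in which alpha is the weight of a uniform rank source and ranks are renormalized after every pass. It is an assumption that this matches the variant presented in class. The class PageRankSketch and its plain Map-based graph are hypothetical and do not use ir.webutils.Graph or ir.webutils.Node.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; not part of ir.webutils. */
class PageRankSketch {

    /**
     * outLinks maps each saved page to the pages it links to.
     * Links to pages that are not keys of the map (not saved) are ignored.
     */
    static Map<String, Double> compute(Map<String, List<String>> outLinks,
                                       double alpha, int iterations) {
        int n = outLinks.size();

        // Keep only links whose target is itself a saved page (a graph node).
        Map<String, List<String>> edges = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : outLinks.entrySet()) {
            List<String> kept = new ArrayList<>();
            for (String target : e.getValue()) {
                if (outLinks.containsKey(target)) {
                    kept.add(target);
                }
            }
            edges.put(e.getKey(), kept);
        }

        // Start from a uniform distribution over the nodes.
        Map<String, Double> rank = new LinkedHashMap<>();
        for (String page : edges.keySet()) {
            rank.put(page, 1.0 / n);
        }

        for (int i = 0; i < iterations; i++) {
            // Every page receives the rank-source term alpha / n.
            Map<String, Double> next = new LinkedHashMap<>();
            for (String page : edges.keySet()) {
                next.put(page, alpha / n);
            }
            // Each page splits (1 - alpha) of its rank evenly among its targets.
            for (Map.Entry<String, List<String>> e : edges.entrySet()) {
                List<String> targets = e.getValue();
                if (targets.isEmpty()) {
                    continue;   // a page with no out-links passes nothing on
                }
                double share = (1.0 - alpha) * rank.get(e.getKey()) / targets.size();
                for (String target : targets) {
                    next.merge(target, share, Double::sum);
                }
            }
            // Normalize so the ranks sum to 1.
            double total = 0.0;
            for (double r : next.values()) {
                total += r;
            }
            for (Map.Entry<String, Double> e : next.entrySet()) {
                e.setValue(e.getValue() / total);
            }
            rank = next;
        }
        return rank;
    }
}
```

With the assignment's parameters the call would be PageRankSketch.compute(graph, 0.15, 50). The normalization step matters mainly when some pages have no out-links, since their rank would otherwise leak out of the graph.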
In addition to indexing the documents, PageRankInvertedIndex should read the
PageRanks of the documents. When computing the relevance of a document for a
query, it should add the document's PageRank, scaled by a weight parameter, to
the score. The weight parameter should be a command-line argument for
PageRankInvertedIndex, specified with "-weight value".
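The scoring rule above amounts to score + weight * PageRank. The sketch below shows that arithmetic and one simple way to pick "-weight value" out of an argument array. The class WeightedScoreSketch, its method names, and the default-weight parameter are hypothetical, and the real InvertedIndex argument handling may be organized differently.

```java
/** Hypothetical illustration; not part of ir.vsr. */
class WeightedScoreSketch {

    /**
     * Scans args for "-weight" followed by a number.
     * Returns defaultWeight if the flag is absent.
     */
    static double parseWeight(String[] args, double defaultWeight) {
        for (int i = 0; i < args.length - 1; i++) {
            if (args[i].equals("-weight")) {
                return Double.parseDouble(args[i + 1]);
            }
        }
        return defaultWeight;
    }

    /** Adds the document's PageRank, scaled by weight, to its retrieval score. */
    static double combinedScore(double retrievalScore, double pageRank, double weight) {
        return retrievalScore + weight * pageRank;
    }
}
```

With a weight of 0 the combined score equals the original retrieval score, which gives a convenient baseline when comparing against the plain InvertedIndex.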
Making Web Pages
As discussed in class, in order to create test data for this assignment,
everyone should create a special personal page for this class at
http://www.cs.utexas.edu/users/login/ir-course.html
by adding an "ir-course.html" file to the /u/login/public_html/ web directory
for your login. You might need to set the Linux permissions to make the file
readable with the command chmod +r ir-course.html. Make sure that when you
click on your link it pulls up your new web page.
Part 1 (2.5 points): due April 1
You should include links to at least 5 webpages of the courses that you have
enjoyed from the list of CS courses located at
http://www.cs.utexas.edu/users/mooney/ir-course/proj3/course-list.html.
Please include the links exactly as they are given in this list, and don't
worry if your favorite class is not included - we are just creating a toy
link structure. For example, your "ir-course.html" may look like this.
This simple part must be completed by April 1 and counts for 2.5% of the
project grade. All of these pages will be linked from
http://www.cs.utexas.edu/users/mooney/ir-course/students.html.
Part 2 (2.5 points): due April 3
Look at the ir-course.html pages of other students and link them from your
page. You may use some criteria like how much you agree with their favorite
courses. Again, the bottom line is to make an interesting toy link structure.
This part is due by April 3 and carries 2.5% of the project grade. Do not
change your page after April 3.
Use your PageRankSiteSpider to crawl from
http://www.cs.utexas.edu/users/mooney/ir-course/students.html
and index all student course pages. A limit of 100 pages should suffice.
Index and search the resulting directory of pages using PageRankInvertedIndex
and compare the search results to those obtained with the original
InvertedIndex for several different values of weight. Try the following
queries:
In your report, analyze the effects of varying weight.
This is a sample solution trace in which PageRankSiteSpider is run on the
students' pages. This script file was used to run the code.
This is a sample trace from running the PageRankInvertedIndex with a couple of
different weights on the query 'favorite classes'. This script file was used
to make multiple code executions easier. Here is another example trace with
some other queries.
In submitting your solution, follow the general course instructions on submitting projects on the course homepage.
Along with that, follow these specific instructions for Project 3:
pandora.cs.utexas.edu$ turnin -list jcooper proj3
12027712 4 drwx------ 3 jcooper grad 4096 Nov 6 03:38 ./
12027712 4 drwxr-xr-- 3 jcooper grad 4096 Nov 6 03:38 ./proj3
12027723 4 drwxr-xr-x 4 jcooper grad 4096 Nov 6 03:37 ./proj3/ir
12027724 4 drwxr-xr-x 2 jcooper grad 4096 Nov 6 03:37 ./proj3/ir/webutils
12027725 2 -rw-r--r-- 1 jcooper grad 1471 Oct 23 23:36 ./proj3/ir/webutils/PageRankSiteSpider.java
12027726 6 -rw-r--r-- 1 jcooper grad 6076 Nov 6 03:22 ./proj3/ir/webutils/PageRankSpider.java
12027727 3 -rw-r--r-- 1 jcooper grad 2219 Nov 6 03:35 ./proj3/ir/webutils/PageRankDirectorySpider.java
12027728 4 drwxr-xr-x 2 jcooper grad 4096 Nov 6 03:37 ./proj3/ir/vsr
12027729 9 -rw-r--r-- 1 jcooper grad 8347 Nov 6 02:45 ./proj3/ir/vsr/PageRankInvertedIndex.java
12027730 6 -rw-r--r-- 1 jcooper grad 5199 Oct 10 10:05 ./proj3/REPORT.txt
12027731 18 -rw-r--r-- 1 jcooper grad 18368 Nov 6 03:27 ./proj3/soln-trace