As discussed in class, a basic system for
spidering the web is available in the
ir.webutils package. See the Javadoc for this
code. Use the
main method of the
Spider class to start from a particular URL, spider
breadth-first, and save the documents in a specified directory for
indexing and searching with VSR. Also see the specializations
SiteSpider and DirectorySpider, which restrict their crawling to a
site (host) or directory, respectively.
See a sample trace of running
SiteSpider on the UT CS department faculty page to collect
75 pages related to CS faculty.
This assignment will not require using the option
that invokes restrictions according to the Robots Exclusion
Policy, since we will be sticking to spidering within the
department; therefore, the Robot* classes can be ignored
for now. However, if you spider outside the department, be sure to use
this option.
A collection of 800 department pages spidered from
http://www.cs.utexas.edu/faculty is cached in a local directory. Like
yahoo-science, this directory can be indexed and searched using
VSR, as in Project 1. This database can also be
searched using the Simple Search Engine servlet
demo. The code for this servlet is available at
/u/mooney/ir-code/irs/ ("irs" is for "information retrieval
servlet").
Your assignment is to make a specialization of the
Spider class called PageRankSpider that computes the PageRanks
of the spidered pages based on their link structure, and to make a
specialization of the InvertedIndex class
called PageRankInvertedIndex that utilizes the PageRanks
to compute the relevance of documents. Make sure to override only the
methods you change. You should also create a
PageRankSiteSpider that restricts its crawling to a
particular site, like SiteSpider.
The PageRankSpider should form a graph
based on the incoming and outgoing links among the pages it visits. When computing PageRank, only those pages which are actually
indexed (saved to disk) should be included in the graph as nodes. You may find the
ir.webutils.Node data structure helpful for building and manipulating the graph. The spider should then run the PageRank algorithm on the graph and store all the PageRanks in a text file named
pageRanks in the same directory as the crawled pages. The format of
pageRanks should be similar to this example:

P001.html 0.006494458532952974
P002.html 0.009569125239295519
P003.html 0.006569776377162855

Each line contains a file name, some amount of whitespace, and the computed PageRank for that document.
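The iterative PageRank computation described above can be sketched as follows. This is a minimal illustration on a toy adjacency-list graph, not the VSR API: the class name PageRankSketch, the map-based graph representation, and the damping factor of 0.85 are all assumptions for the example. Note how a dangling page (one with no outgoing links) redistributes its rank over all pages so that the ranks continue to sum to one.

```java
import java.util.*;

/**
 * A minimal sketch of iterative PageRank on a small graph, assuming an
 * adjacency list keyed by page file name. Illustrative only; in the
 * assignment you would build the graph from the spider's link data.
 */
public class PageRankSketch {

    /** Iteratively compute PageRank with damping factor d. */
    public static Map<String, Double> pageRank(Map<String, List<String>> outLinks,
                                               double d, int iterations) {
        Set<String> pages = outLinks.keySet();
        int n = pages.size();
        Map<String, Double> rank = new HashMap<>();
        for (String p : pages) rank.put(p, 1.0 / n);        // uniform start

        for (int it = 0; it < iterations; it++) {
            Map<String, Double> next = new HashMap<>();
            for (String p : pages) next.put(p, (1 - d) / n); // teleport term
            for (String p : pages) {
                List<String> targets = outLinks.get(p);
                if (targets.isEmpty()) {
                    // Dangling page: spread its rank over all pages.
                    for (String q : pages)
                        next.put(q, next.get(q) + d * rank.get(p) / n);
                } else {
                    double share = d * rank.get(p) / targets.size();
                    for (String q : targets)
                        next.put(q, next.get(q) + share);
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Toy graph: P001 -> {P002, P003}, P002 -> {P003}, P003 -> {P001}
        Map<String, List<String>> g = new HashMap<>();
        g.put("P001.html", Arrays.asList("P002.html", "P003.html"));
        g.put("P002.html", Arrays.asList("P003.html"));
        g.put("P003.html", Arrays.asList("P001.html"));

        // Print in the "name whitespace rank" format of the pageRanks file.
        for (Map.Entry<String, Double> e :
                 new TreeMap<>(pageRank(g, 0.85, 50)).entrySet())
            System.out.println(e.getKey() + " " + e.getValue());
    }
}
```

Only indexed pages should appear as keys of the graph, per the requirement above; links to non-indexed pages are simply dropped before running the iteration.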
You can crawl the following URL to help you verify that your PageRank algorithm works:
In addition to indexing the documents,
PageRankInvertedIndex should read the
PageRanks of the documents from the pageRanks file. When computing the relevance of a document for a
query, it should add the document's PageRank, scaled by a weight parameter, to the score. The weight parameter should be specified as a command-line
argument to PageRankInvertedIndex.
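A sketch of the two pieces this requires — parsing the pageRanks file and folding the weighted PageRank into a retrieval score — might look like the following. The class and method names here are illustrative assumptions, not the VSR API; in your PageRankInvertedIndex you would instead override the appropriate scoring method.

```java
import java.io.*;
import java.util.*;

/**
 * Sketch (not the actual VSR API) of loading a pageRanks file and
 * combining PageRank with a vector-space retrieval score.
 */
public class PageRankScoring {

    /** Parse "fileName whitespace rank" lines from a pageRanks file. */
    public static Map<String, Double> loadPageRanks(File file) throws IOException {
        Map<String, Double> ranks = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue;
                String[] parts = line.split("\\s+");  // name, then rank
                ranks.put(parts[0], Double.parseDouble(parts[1]));
            }
        }
        return ranks;
    }

    /** Final score = vector-space score + weight * PageRank. */
    public static double score(double vsrScore, double pageRank, double weight) {
        return vsrScore + weight * pageRank;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: rescore one document given a pageRanks file.
        Map<String, Double> ranks = loadPageRanks(new File(args[0]));
        double adjusted = score(0.42, ranks.getOrDefault("P001.html", 0.0), 10.0);
        System.out.println("Adjusted score: " + adjusted);
    }
}
```

With this additive scheme, a weight of 0 reproduces the original cosine-similarity ranking, and larger weights let high-PageRank pages overtake lexically better matches.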
Making Web Pages
As discussed in class, in order to create test data for this
project, you should create a special personal page for this class
by adding an
"ir-course.html" file to the web
directory for your CS
login. You might need to set the Linux permissions to make the file readable with the command
chmod a+r ir-course.html. Make sure that when you click on
your link, it pulls up your new web page.
Part 1 (2.5 points): due Nov. 10
You should include links to at least 5 web pages of courses that you have enjoyed, chosen from the list of CS courses located at http://www.cs.utexas.edu/users/mooney/ir-course/proj3/course-list.html. Please include the links exactly as they are given in this list, and don't worry if your favorite class is not included; we are just creating a toy link structure. For example, your "
ir-course.html" page may look
like the provided sample. This simple part counts for 2.5% of the project grade. All of the student pages will be linked from http://www.cs.utexas.edu/users/mooney/ir-course/students.html .
Look at the "ir-course.html" pages of other students and link some of them from your page. You may use criteria such as how much you agree with their favorite courses. Again, the bottom line is to make an interesting toy link structure. This part counts for 2.5% of the project grade. Do not change your page after Nov. 12.
Use your PageRankSiteSpider to crawl from http://www.cs.utexas.edu/users/mooney/ir-course/students.html
and index all of the
student course pages. A limit of 200 pages should
suffice. Index and search the resulting directory of pages using
PageRankInvertedIndex, and compare the search results to those obtained
with the original
InvertedIndex for several different
values of the weight parameter. Try the provided queries.
In your report, describe the PageRank algorithm as you have implemented it, and describe how you changed the retrieval process to incorporate it. Additionally, try several queries to get a feel for the effects of PageRank and then answer at least these two questions:
This is a sample pageRanks output file produced by running
PageRankSiteSpider on the students'
pages.
This is a sample retrieval trace from running
PageRankInvertedIndex with a couple of different weights on the queries above (see also the filename-to-URL mapping for this trace).
In submitting your solution, follow the general course instructions on submitting projects on the course homepage. Note that this assignment requires two traces, as described below:
Along with that, follow these specific instructions for Project 3:
Submit your code for PageRankSpider, PageRankSiteSpider, and
PageRankInvertedIndex. You may also wish to create a standalone
PageRank class to compute PageRank.
pandora.cs.utexas.edu$ turnin -list jiho cs371r-proj3
12027712 4 drwx------ 3 jiho grad 4096 Nov 6 03:38 ./
12027712 4 drwxr-xr-- 3 jiho grad 4096 Nov 6 03:38 ./proj3
12027723 4 drwxr-xr-x 4 jiho grad 4096 Nov 6 03:37 ./proj3/ir
12027724 4 drwxr-xr-x 2 jiho grad 4096 Nov 6 03:37 ./proj3/ir/webutils
12027725 2 -rw-r--r-- 1 jiho grad 1471 Oct 23 23:36 ./proj3/ir/webutils/PageRankSiteSpider.java
12027726 6 -rw-r--r-- 1 jiho grad 6076 Nov 6 03:22 ./proj3/ir/webutils/PageRankSpider.java
12027727 3 -rw-r--r-- 1 jiho grad 2219 Nov 6 03:35 ./proj3/ir/webutils/PageRank.java
12027728 4 drwxr-xr-x 2 jiho grad 4096 Nov 6 03:37 ./proj3/ir/vsr
12027729 9 -rw-r--r-- 1 jiho grad 8347 Nov 6 02:45 ./proj3/ir/vsr/PageRankInvertedIndex.java
12027730 6 -rw-r--r-- 1 jiho grad 5199 Oct 10 10:05 ./proj3/REPORT.pdf
12027731 18 -rw-r--r-- 1 jiho grad 18368 Nov 6 03:27 ./proj3/spider-trace
12027731 18 -rw-r--r-- 1 jiho grad 15323 Nov 6 03:29 ./proj3/retrieve-trace