As discussed in class, a basic system for
spidering the web is available in /u/mooney/ir-code/ir/webutils/
See the Javadoc
for this
code. Use the
main method for the
Spider class to start from a particular URL and spider
the web
breadth-first and save the documents in a specified directory for
subsequent
indexing and searching with VSR. Also see the specializations SiteSpider
and
DirectorySpider, which restrict their crawling to a
particular
site (host) or directory, respectively.
See a sample trace of running
SiteSpider on the UT CS department faculty page to collect
75 pages related to CS faculty.
This assignment will not require using the "-safe"
spidering flag
that invokes restrictions according to the
Robot Exclusion
Policy since we will be sticking to spidering within the
department. Therefore
the Safe* and Robot* classes can be ignored
for now. However, if you spider outside the department, be sure to use
"-safe".
A collection of 1000 department pages SiteSpidered from
http://www.cs.utexas.edu/faculty are cached in
/u/mooney/ir-code/corpora/cs-faculty/. Like
curlie-science-2021, this directory can be indexed and searched using
VSR, as in Project 1. This database can also be
searched using the Simple Search Engine servlet
demo. The code for this servlet is available at
/u/mooney/ir-code/irs/ ("irs" is for "information retrieval
servlets").
Your assignment is to make a specialization of the Spider
class called PageRankSpider that computes the PageRanks
of the spidered pages based on their link structure, and make a
specialization of the InvertedIndex class
called PageRankInvertedIndex that utilizes the PageRanks
to compute the relevance of documents. Make sure to override only the
methods you change. You should also create a further
specialization, PageRankSiteSpider, which restricts its
spidering accordingly.
While crawling, PageRankSpider should form a graph
based on the incoming and outgoing links. When computing PageRank, only those pages which are actually
indexed (saved to disk) should be included in the graph as nodes. You may find the ir.webutils.Graph
and ir.webutils.Node data structures helpful for building and manipulating the graph.
Then it should run the PageRank algorithm on the graph and store all the PageRanks in a text file named
page_ranks.txt in the same directory as the crawled pages. The format of page_ranks.txt should be like this example:
P001.html 0.006494458532952974
P002.html 0.009569125239295519
P003.html 0.006569776377162855

Each line contains a file name, a single space, and the computed PageRank for that document.
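Producing that file is a simple write. A minimal sketch in plain Java (the helper class, method names, and save-directory handling here are illustrative assumptions, not part of the VSR code):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

/** Sketch: write computed PageRanks as "filename<space>rank", one document per line. */
public class PageRankWriterSketch {

    public static void writeRanks(Map<String, Double> ranks, Path saveDir) throws IOException {
        List<String> lines = new ArrayList<>();
        // TreeMap gives a stable, sorted order for the output file.
        for (Map.Entry<String, Double> e : new TreeMap<>(ranks).entrySet())
            lines.add(e.getKey() + " " + e.getValue());     // e.g. "P001.html 0.0064..."
        Files.write(saveDir.resolve("page_ranks.txt"), lines);
    }

    public static void main(String[] args) throws IOException {
        Map<String, Double> ranks = new HashMap<>();
        ranks.put("P001.html", 0.006494458532952974);
        ranks.put("P002.html", 0.009569125239295519);
        writeRanks(ranks, Files.createTempDirectory("crawl"));
    }
}
```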
You can crawl the following URL to help you verify that your PageRank algorithm works:
https://www.cs.utexas.edu/~mooney/ir-course/proj3/a.html
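The PageRank iteration itself can be sketched as follows. The adjacency-map graph representation here is a stand-in assumption for illustration, not the actual ir.webutils.Graph API:

```java
import java.util.*;

/** Minimal PageRank sketch: rank = (1-d)/N + d * sum(inboundRank / outDegree). */
public class PageRankSketch {

    /** links maps each page to its outgoing links; every node must appear as a key. */
    public static Map<String, Double> pageRank(Map<String, List<String>> links,
                                               double damping, int iterations) {
        int n = links.size();
        Map<String, Double> rank = new HashMap<>();
        for (String page : links.keySet())
            rank.put(page, 1.0 / n);                       // uniform initial ranks
        for (int i = 0; i < iterations; i++) {
            Map<String, Double> next = new HashMap<>();
            for (String page : links.keySet())
                next.put(page, (1 - damping) / n);         // "random jump" term
            for (Map.Entry<String, List<String>> e : links.entrySet()) {
                List<String> out = e.getValue();
                if (out.isEmpty()) continue;               // dangling pages distribute nothing in this sketch
                double share = damping * rank.get(e.getKey()) / out.size();
                for (String target : out)
                    next.merge(target, share, Double::sum);
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("a.html", Arrays.asList("b.html", "c.html"));
        links.put("b.html", Collections.singletonList("c.html"));
        links.put("c.html", Collections.singletonList("a.html"));
        pageRank(links, 0.85, 50).forEach((p, r) -> System.out.println(p + " " + r));
    }
}
```

In the toy three-page graph above, c.html ends up with the highest rank because it receives links from both other pages.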
In addition to indexing the documents, PageRankInvertedIndex should read the
PageRanks of the documents from the page_ranks.txt file described above. When computing the relevance of a document for a
query, it should add the document's PageRank, scaled by a weight parameter, to the score. The weight parameter should be a command-line
argument for PageRankInvertedIndex, specified with "-weight value".
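A hedged sketch of the two pieces PageRankInvertedIndex needs — parsing page_ranks.txt and adding the weighted PageRank to a document's base score. The class and method names here are illustrative assumptions, not the actual InvertedIndex internals:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

/** Sketch: load page_ranks.txt and add weight * PageRank to a document's base score. */
public class PageRankScoringSketch {
    private final Map<String, Double> pageRanks;
    private final double weight;     // value of the "-weight" command-line argument

    public PageRankScoringSketch(Map<String, Double> pageRanks, double weight) {
        this.pageRanks = pageRanks;
        this.weight = weight;
    }

    /** Parses "filename<space>rank" lines from page_ranks.txt. */
    public static Map<String, Double> readRanks(Path file) throws IOException {
        Map<String, Double> ranks = new HashMap<>();
        for (String line : Files.readAllLines(file)) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2)
                ranks.put(parts[0], Double.parseDouble(parts[1]));
        }
        return ranks;
    }

    /** With weight = 0 this reduces to the unmodified vector-space score. */
    public double adjustedScore(String docName, double vsrScore) {
        return vsrScore + weight * pageRanks.getOrDefault(docName, 0.0);
    }

    public static void main(String[] args) {
        Map<String, Double> ranks = new HashMap<>();
        ranks.put("P001.html", 0.0065);
        PageRankScoringSketch scorer = new PageRankScoringSketch(ranks, 10.0);
        System.out.println(scorer.adjustedScore("P001.html", 0.3));  // 0.3 + 10 * 0.0065
    }
}
```

Note that a weight of 0 leaves the score unchanged, which matches the expectation below that weight 0 behaves like the original InvertedIndex.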
Making Web Pages (Parts 1 and 2)
As discussed in class, in order to create test data for this
assignment, everyone
should create a special personal page for this class at
http://www.cs.utexas.edu/users/login/ir-course.html
by adding an
"ir-course.html" file to the /u/login/public_html/
web
directory for your login. You might need to set the Linux permissions to make the
file readable with the command chmod a+r ir-course.html. Make sure that when you paste your
URL into your web browser, it pulls up your new web page.
Part 1 (2.5 points): due Oct. 29
You should include links to the pages of at least 5 courses
that you have enjoyed from the list of CS courses
located at https://www.cs.utexas.edu/users/mooney/ir-course/proj3/course-list.html.
Please include the links exactly as they are given in this
list, and don't
worry if your favorite class is not included - we are just creating a
toy link
structure. For example, your "ir-course.html" may look
like this example (Or the example as a webpage).
This simple part counts for 2.5% of
the project grade. All of the pages will be linked from
https://www.cs.utexas.edu/users/mooney/ir-course/students.html.
Submission instructions for Part 1: Submit a single file on Canvas under the assignment
"Project 3 - Part 1 (Webpage)". The file should be named [PREFIX]_link.txt (e.g. proj3_jd1234_link.txt)
and only have your URL on the first line like so:
http://www.cs.utexas.edu/users/jd1234/ir-course.html

Part 2 (2.5 points): due Nov. 1
Explore the ir-course.html pages of other students and link to some of them
from your page. You may use
criteria such as how much you agree with their favorite courses.
Again, the bottom line is to make an interesting toy link structure.
This part counts for 2.5% of the project grade. Do
not change your page after Nov. 1.

Crawling, indexing, and searching the webpages (Part 3)
Use your PageRankSiteSpider to crawl from https://www.cs.utexas.edu/users/mooney/ir-course/students.html
and index all student course pages. Use a limit of 200 pages.
Index and search the resulting directory of pages using PageRankInvertedIndex
and compare the search results for the following
values of weight: {0, 3, 10}.
Note that a weight of 0 should be the same as the original InvertedIndex. Try the
following queries (TBA):
Your PageRankInvertedIndex trace file should use the same weights and queries as this solution trace file.
You should submit two trace files of the spidering and the queries as described in the submission section below.
In your report, describe the PageRank algorithm as you have implemented it, and explain how you changed the retrieval process to incorporate it. Additionally, try several queries to get a feel for the effects of PageRank, and then answer at least these two questions.
In submitting your solution, follow the general course instructions on submitting projects on the course homepage. Along with that, follow these specific instructions for Project 3:
Your code should consist of:
PageRankSpider, which extends Spider
PageRankSiteSpider, which extends PageRankSpider
PageRankInvertedIndex, which extends InvertedIndex

Submit the following files:
[PREFIX]_code.zip - Your code in a zip file (*.java and *.class files). Please do not modify the original Java files; extend each class and override the appropriate methods.
[PREFIX]_report.pdf - A PDF report of your experiment as described above.
[PREFIX]_spider_trace.txt
- Trace file of running PageRankSiteSpider as described above.
[PREFIX]_page_ranks.txt - Page ranks file produced by PageRankSiteSpider as described above.
[PREFIX]_retrieve_trace.txt
- Trace file of running PageRankInvertedIndex as described above.
For example, the submitted files for student jd1234 would be named:
proj3_jd1234_code.zip
proj3_jd1234_report.pdf
proj3_jd1234_spider_trace.txt
proj3_jd1234_page_ranks.txt
proj3_jd1234_retrieve_trace.txt
$ unzip -l proj3_jd1234_code.zip
Archive: proj3_jd1234_code.zip
Length Date Time Name
--------- ---------- ----- ----
21067 2015-09-14 12:57 ir/webutils/PageRankSpider.java
10049 2015-09-14 17:26 ir/webutils/PageRankSpider.class
21067 2015-09-14 12:57 ir/webutils/PageRankSiteSpider.java
10049 2015-09-14 17:26 ir/webutils/PageRankSiteSpider.class
21067 2015-09-14 12:57 ir/vsr/PageRankInvertedIndex.java
10049 2015-09-14 17:26 ir/vsr/PageRankInvertedIndex.class
--------- -------
91106 6 files