CS 371R Information Retrieval and Web Search: Project 3

Project 3 for CS 371R: Information Retrieval and Web Search
Web Spidering and PageRanking

Due: 11:59pm, Nov 3, 2025
(Making web page: Part 1 due: Oct. 24; Part 2 due: Oct. 27)

IMPORTANT: This assignment has THREE parts, each with a different due date. Please note that late submissions will NOT be accepted for the first two parts.

Existing Spiders

As discussed in class, a basic system for spidering the web is available in /u/mooney/ir-code/ir/webutils/. See the Javadoc for this code. Use the main method of the Spider class to start from a particular URL, spider the web breadth-first, and save the documents in a specified directory for subsequent indexing and searching with VSR. Also see the specializations SiteSpider and DirectorySpider, which restrict their crawling to a particular site (host) or directory, respectively.

See a sample trace of running SiteSpider on the UT CS department faculty page to collect 75 pages related to CS faculty.

This assignment will not require using the "-safe" spidering flag that invokes restrictions according to the Robot Exclusion Policy since we will be sticking to spidering within the department. Therefore the Safe* and Robot* classes can be ignored for now. However, if you spider outside the department, be sure to use "-safe".

A collection of 1000 department pages SiteSpidered from http://www.cs.utexas.edu/people is cached in /u/mooney/ir-code/corpora/cs-faculty/. Like curlie-science, this directory can be indexed and searched using VSR, as in Project 1.

Your Task

Your assignment is to make a specialization of the Spider class called PageRankSpider that computes the PageRanks of the spidered pages based on their link structure, and a specialization of the InvertedIndex class called PageRankInvertedIndex that uses the PageRanks when computing the relevance of documents. Make sure to override only the methods you change. You should also create a further specialization, PageRankSiteSpider, that restricts its spidering to a single site, as SiteSpider does.

While crawling, PageRankSpider should form a graph based on the incoming and outgoing links. When computing PageRank, only those pages which are actually indexed (saved to disk) should be included in the graph as nodes. You may find ir.webutils.Graph and ir.webutils.Node data structures helpful for building and manipulating the graph. Then it should run the PageRank algorithm on the graph and store all the PageRanks in a text file named page_ranks.txt in the same directory as the crawled pages. The format of page_ranks.txt should be like this example:

P001.html 0.006494458532952974
P002.html 0.009569125239295519
P003.html 0.006569776377162855

Each line contains a file name, a single space, and then the computed PageRank for that document.
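The line format above can be sketched as follows. This is only an illustration, assuming the ranks are already held in a map from saved file name to PageRank; the class PageRankFileWriter and its methods are hypothetical names and are not part of ir.webutils.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

// Hypothetical helper, not part of the provided ir.webutils code.
class PageRankFileWriter {
    /** Formats one line: file name, a single space, then the PageRank. */
    static String formatLine(String fileName, double rank) {
        return fileName + " " + rank;
    }

    /** Writes one line per page to page_ranks.txt inside the crawl directory. */
    static void write(Path crawlDir, Map<String, Double> ranks) throws IOException {
        Path out = crawlDir.resolve("page_ranks.txt");
        try (PrintWriter writer = new PrintWriter(
                Files.newBufferedWriter(out, StandardCharsets.UTF_8))) {
            for (Map.Entry<String, Double> entry : ranks.entrySet()) {
                writer.println(formatLine(entry.getKey(), entry.getValue()));
            }
        }
    }
}
```

Note that Java's default double-to-string conversion switches to scientific notation (for example 9.0E-4) for values below 0.001. Double.parseDouble reads both forms, so a reader written in Java is unaffected.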

With respect to the PageRank algorithm's parameters, use 0.15 for alpha and 50 for the number of iterations.
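As a rough illustration of these parameters, here is a minimal sketch of one common iterative formulation, in which alpha is the weight of the uniform rank source and ranks are renormalized after every iteration. It is an assumption that this matches the version presented in class, so check it against the lecture notes. The class PageRankSketch and the plain map representation are hypothetical; the sketch does not use ir.webutils.Graph or ir.webutils.Node.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-alone sketch; assumes alpha > 0.
class PageRankSketch {
    /**
     * outLinks maps each indexed page to the pages it links to.
     * Links to pages that are not keys of the map (not indexed) are ignored.
     */
    static Map<String, Double> compute(Map<String, List<String>> outLinks,
                                       double alpha, int iterations) {
        int n = outLinks.size();
        Map<String, Double> rank = new LinkedHashMap<>();
        if (n == 0) return rank;
        for (String page : outLinks.keySet()) rank.put(page, 1.0 / n);

        for (int iter = 0; iter < iterations; iter++) {
            // Every page starts with the uniform rank-source share alpha / n.
            Map<String, Double> next = new LinkedHashMap<>();
            for (String page : outLinks.keySet()) next.put(page, alpha / n);

            // Each page passes (1 - alpha) of its rank, split evenly over
            // its distinct out-links to indexed pages.
            for (Map.Entry<String, List<String>> entry : outLinks.entrySet()) {
                Set<String> targets = new LinkedHashSet<>(entry.getValue());
                targets.retainAll(outLinks.keySet());
                if (targets.isEmpty()) continue;
                double share = (1.0 - alpha) * rank.get(entry.getKey()) / targets.size();
                for (String target : targets) next.merge(target, share, Double::sum);
            }

            // Normalize so the ranks sum to 1.
            double total = 0.0;
            for (double value : next.values()) total += value;
            for (Map.Entry<String, Double> entry : next.entrySet()) {
                entry.setValue(entry.getValue() / total);
            }
            rank = next;
        }
        return rank;
    }
}
```

Calling PageRankSketch.compute(graph, 0.15, 50) corresponds to the parameter values given above. Counting duplicate links once and keeping self-links are choices made for this sketch only.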

You can crawl the following URL to help you verify that your PageRank algorithm works:

 https://www.cs.utexas.edu/~mooney/ir-course/proj3/a.html

In addition to indexing the documents, PageRankInvertedIndex should read the PageRanks of the documents from the page_ranks.txt file described above. When computing the relevance of a document for a query, it should add the document's PageRank, scaled by a weight parameter, to the score. The weight parameter should be a command line argument for PageRankInvertedIndex, specified with "-weight value".
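The three pieces described here (reading page_ranks.txt, reading the -weight argument, and adding the weighted PageRank to the score) might look roughly like the sketch below. The class PageRankScoring and its method names are hypothetical, and the sketch deliberately does not show where in InvertedIndex's retrieval code the combined score would be applied.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper methods, not part of the provided ir.vsr code.
class PageRankScoring {
    /** Parses lines of the form "P001.html 0.0064..." into a name-to-rank map. */
    static Map<String, Double> parseRanks(List<String> lines) {
        Map<String, Double> ranks = new HashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) continue;
            int space = trimmed.lastIndexOf(' ');
            if (space < 0) continue; // skip malformed lines (a choice made for this sketch)
            ranks.put(trimmed.substring(0, space),
                      Double.parseDouble(trimmed.substring(space + 1)));
        }
        return ranks;
    }

    /** Returns the value following "-weight", or defaultWeight if the flag is absent. */
    static double parseWeight(String[] args, double defaultWeight) {
        for (int i = 0; i + 1 < args.length; i++) {
            if (args[i].equals("-weight")) return Double.parseDouble(args[i + 1]);
        }
        return defaultWeight;
    }

    /** Adds the PageRank, scaled by the weight, to the content-based score. */
    static double combinedScore(double baseScore, double pageRank, double weight) {
        return baseScore + weight * pageRank;
    }
}
```

The lines passed to parseRanks could come from java.nio.file.Files.readAllLines on the page_ranks.txt path. With a weight of 0.0, combinedScore returns the base score unchanged, so rankings match the unweighted retrieval.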

Making Web Pages (Parts 1 and 2)

As discussed in class, in order to create test data for this assignment, everyone should create a special personal page for this class and submit it to Canvas.

Important: Parts 1 and 2 of this assignment cannot be turned in late!

Part 1 (2.5 points): due Oct. 24
You should include links to at least 5 webpages of the courses that you have enjoyed from the list of CS courses located at http://www.cs.utexas.edu/users/mooney/ir-course/proj3/course-list.html. Please include the links exactly as they are given in this list, and don't worry if your favorite class is not included - we are just creating a toy link structure. For example, your webpage may look like this example (Or the example as a webpage).
This simple part counts for 2.5% of the project grade. After the deadline, all pages will be linked from http://www.cs.utexas.edu/users/mooney/ir-course/favorite_classes_2025.html.
Please do not include any personal information on your webpage. We will post your webpages using anonymized URLs to comply with FERPA.

Submission instructions for Part 1: Submit a single file on Canvas under the assignment "Project 3 - Part 1 (Webpage)". The file should be named [PREFIX]_favorite_classes.html (e.g. proj3_jd1234_favorite_classes.html) and will have the HTML for your webpage.

Part 2 (2.5 points): due Oct. 27
Once all favorite classes pages are in place by Oct. 24, select at least 3 favorite classes pages of other students and link to them from your page. You may use a criterion such as how much you agree with their favorite courses. Again, the bottom line is to make an interesting toy link structure. Here is an example of the updated web page. This part counts for 2.5% of the project grade.

Submission instructions for Part 2: Submit a single file on Canvas under the assignment "Project 3 - Part 2 (Webpage)". The file should be named [PREFIX]_favorite_classes.html (e.g. proj3_jd1234_favorite_classes.html) and will have the updated HTML for your webpage.

Crawling, indexing and searching the webpages (Part 3): due Nov. 3

Step 1: Spider and compute PageRanks
Use your PageRankSiteSpider to crawl from http://www.cs.utexas.edu/users/mooney/ir-course/favorite_classes_2025.html and index all student course pages. Use a limit of 200 pages.

Run the following command:

java ir.webutils.PageRankSiteSpider -u https://www.cs.utexas.edu/~mooney/ir-course/favorite_classes_2025.html -d indexed -c 200

This should create a folder called indexed/ containing approximately 200 crawled HTML pages and a file called page_ranks.txt with the computed PageRanks. You can compare your spider trace with this spider solution trace file.

Step 2: Search with PageRank-weighted retrieval
Index and search the resulting directory of pages using PageRankInvertedIndex with weight values: 0.0, 1.0, and 5.0. Note that a weight of 0 should be the same as the original InvertedIndex.

Run the following commands to generate your retrieval traces:

java ir.vsr.PageRankInvertedIndex -weight 0.0 -html indexed
java ir.vsr.PageRankInvertedIndex -weight 1.0 -html indexed
java ir.vsr.PageRankInvertedIndex -weight 5.0 -html indexed

For each command above, try the following queries:

computer
networks
programming

Note: You should submit trace files for all three weight values (w=0.0, w=1.0, w=5.0). Each trace file should include results for the three queries listed above (computer, networks, programming).

You can compare your results with these sample solution trace files: spider trace (with URL normalization), spider trace (w/o URL normalization); page_ranks.txt (with URL normalization), page_ranks.txt (w/o URL normalization); retrieval: w=0.0 (with URL normalization), w=0.0 (w/o URL normalization), w=1.0 (with URL normalization), w=1.0 (w/o URL normalization), w=5.0 (with URL normalization), w=5.0 (w/o URL normalization).

Your results do not need to be identical to the sample traces - some variation in page discovery order and exact PageRank values is expected. However, the overall behavior should be similar (rankings should change as weight increases, high-PageRank pages should move up with higher weights).

You should submit the trace files of the spidering and the retrieval queries as described in the submission section below.

Report and Trace Files

In your report, briefly describe the PageRank algorithm as you have implemented it, describe how you changed the retrieval process to incorporate it, and give instructions for running the code. Additionally, try several queries to get a feel for the effects of PageRank and then answer at least these two questions:

Does PageRank seem to have an effect on the quality of your results, as compared to the original retrieval code? Why or why not?
How does varying weight change your results? Why do you think it changes in this way?

(You should explicitly answer these questions, i.e. in your report put "Q1. Does..?" and then give the answer underneath so that we do not have to search for your answers. Make sure to explain your answer rather than just state it.)

Submission

You should submit your work on Gradescope. In submitting your solution, follow the general course instructions on submitting projects on the course homepage. Along with that, follow these specific instructions for Project 3:

See Part 1 and 2 submission instructions above!
Create at least the following new classes described above:
1. PageRankSpider which extends Spider
2. PageRankSiteSpider which extends PageRankSpider
3. PageRankInvertedIndex which extends InvertedIndex
For this assignment, you need to submit the following files:

code/ - A folder containing all your code (*.java and *.class files). Please do not modify the original Java files; instead, extend each class and override the appropriate methods.

vsr/ - vsr sub-folder containing modified vsr java and class files
webutils/ - webutils sub-folder containing modified webutils java and class files

report.pdf - A PDF report of your experiment as described above.
trace/ - A folder containing all your trace files as described below.

spider.txt - Trace file of running PageRankSiteSpider (with URL normalization, w/o URL normalization)
page_ranks.txt - Page ranks file produced by PageRankSiteSpider (with URL normalization, w/o URL normalization)
retrieve_w0.txt - Trace file of running PageRankInvertedIndex with weight=0.0 (with URL normalization, w/o URL normalization)
retrieve_w1.txt - Trace file of running PageRankInvertedIndex with weight=1.0 (with URL normalization, w/o URL normalization)
retrieve_w5.txt - Trace file of running PageRankInvertedIndex with weight=5.0 (with URL normalization, w/o URL normalization)

You may optionally submit your indexed/ directory (as described below) to the Gradescope assignment "Project 3: Indexed Folder *Optional*". If included, it may be used to assign partial credit when your implementation is slightly off.

*** You will need to ensure that the following commands run successfully on the lab machines: ***

java ir.webutils.PageRankSiteSpider -u https://www.cs.utexas.edu/~mooney/ir-course/favorite_classes_2025.html -d indexed -c 200

This should create a folder called indexed/ in the directory where it is run.
indexed/ should contain approximately 200 crawled HTML pages and a file called page_ranks.txt.

java ir.vsr.PageRankInvertedIndex -weight 0.0 -html indexed
java ir.vsr.PageRankInvertedIndex -weight 1.0 -html indexed
java ir.vsr.PageRankInvertedIndex -weight 5.0 -html indexed

Automated Testing

Your submission will be tested automatically on Gradescope. The autograder runs functional tests on small-scale test cases to verify that your implementation is correct. Test results are displayed as informational feedback showing which functionality is working. For the full-scale experiment, you will be graded manually based on the traces you submit. The spidering may take a few minutes to execute, so you should start early and give yourself enough time to run it and debug.

The autograder runs the following tests:

PageRank Algorithm Tests (3 tests): Your PageRank implementation is tested on three different graph structures with 3, 5, and 15 nodes respectively. These tests validate that:
- PageRanks sum to 1.0 (within a small tolerance for floating-point precision)
- All PageRank values are in the range [0, 1]
- The page_ranks.txt file is correctly formatted
- No invalid values (NaN, infinity, or negative numbers)
Retrieval Integration Tests (3 tests): Your PageRankInvertedIndex is tested with three different weight values (0.0, 1.0, and 5.0) on a small document corpus:
- The program correctly accepts the -weight parameter
- Weight = 0.0 produces content-based rankings (similar to base InvertedIndex)
- Weight = 5.0 elevates high-PageRank documents in the results
Ranking Validation Tests (2 tests): The autograder verifies that rankings change appropriately as the weight parameter increases. Documents with higher PageRank should move up in the rankings as weight increases.
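Before submitting, the PageRank properties listed above can be checked locally with something like the following sketch. It is not the autograder's code; the class PageRankSanityCheck is hypothetical, and the tolerance is left as a parameter because the autograder's actual tolerance is not stated here.

```java
import java.util.Collection;

// Hypothetical local check; mirrors the listed properties, not the autograder itself.
class PageRankSanityCheck {
    /** Returns true if every rank is a finite value in [0, 1] and the ranks sum to about 1. */
    static boolean looksValid(Collection<Double> ranks, double tolerance) {
        double sum = 0.0;
        for (double rank : ranks) {
            if (Double.isNaN(rank) || Double.isInfinite(rank)) return false;
            if (rank < 0.0 || rank > 1.0) return false;
            sum += rank;
        }
        return Math.abs(sum - 1.0) <= tolerance;
    }
}
```

For example, looksValid(ranks.values(), 1e-6) could be called on the map of computed ranks just before page_ranks.txt is written.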

Grading Rubrics

5%: Parts 1 and 2
10%: Your program compiles successfully and runs normally without throwing any exceptions.
10%: Your implementation is efficient. For example, it shouldn't change the overall time complexity. Also, it shouldn't significantly increase the average time it takes to respond to a query.
40%: Working code, correct PageRank and retrieval traces
10%: Good programming style, with necessary comments, intuitive variable/function names, and appropriate indentation.
25%: Quality of report, description of approach, good analysis & discussion answering all of the questions.