So, you want to be the next Google?
Assignment Handout: prog7.pdf
Project Files: prog7.zip
Sample Webs:
SuperSpoof.com mirror: superspoof.zip
(4.3MB uncompressed, 205 files)
The entire rec.humor.funny archive from netfunny.com: rhf.zip
(79MB uncompressed, 9406 files)
Due Wednesday December 6th at 4:00pm
You will be implementing an entire web crawler and search engine. You'll crawl a portion of the web, build an index, and then use that index to support various web searches. By the time you finish the assignment, you will have your very own mini-Google.
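If the "build an index, then search it" part sounds abstract, here is a minimal sketch of one common approach, an inverted index that maps each word to the pages containing it. This is only an illustration, not the design we expect: the names tokenize, build_index, and search are made up for this example, and the pages are hard-coded strings rather than anything crawled.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page names that contain it.

    `pages` is a hypothetical dict of {page name: page text}.
    """
    index = defaultdict(set)
    for name, text in pages.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the pages that contain every word of the query (AND search)."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Toy data standing in for crawled pages (an assumption for this sketch).
pages = {
    "a.html": "Funny jokes about cats",
    "b.html": "Serious news about cats and dogs",
}
index = build_index(pages)
print(search(index, "funny cats"))
```

A real search engine would also rank the results and handle other query types; this sketch only answers "which pages contain all of these words?"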
You must not crawl the real web. Following the robots exclusion protocol properly is way too much trouble for this assignment. Also, system administrators tend to get very irate when remote users carelessly crawl their sites and eat up bandwidth. Some of them will do bad things to you if you are impolite. Crawl your own personal web server, or better yet, crawl your hard drive (it's much faster this way).
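To give a rough idea of what crawling your hard drive can look like, here is a small sketch that starts at one HTML file, follows relative links breadth-first, and never leaves a given directory. It assumes the site is a folder of .html/.htm files on disk; LinkExtractor and crawl_local are hypothetical names for this example, not part of any required interface.

```python
import os
from collections import deque
from html.parser import HTMLParser
from urllib.parse import unquote, urldefrag, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.links.append(value)


def crawl_local(start_path, root_dir):
    """Breadth-first crawl of HTML files on disk, never leaving root_dir.

    Returns the visited file paths in the order they were crawled.
    """
    root_dir = os.path.realpath(root_dir)
    start = os.path.realpath(start_path)
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        path = queue.popleft()
        order.append(path)
        parser = LinkExtractor()
        with open(path, encoding="utf-8", errors="replace") as f:
            parser.feed(f.read())
        for href in parser.links:
            href, _fragment = urldefrag(href)
            parsed = urlparse(href)
            # Skip http://, mailto:, and links that were only a #fragment.
            if parsed.scheme or parsed.netloc or not parsed.path:
                continue
            target = os.path.realpath(
                os.path.join(os.path.dirname(path), unquote(parsed.path))
            )
            # Only follow links to existing HTML files under root_dir.
            if os.path.commonpath([root_dir, target]) != root_dir:
                continue
            if not target.lower().endswith((".html", ".htm")):
                continue
            if os.path.isfile(target) and target not in seen:
                seen.add(target)
                queue.append(target)
    return order
```

Calling crawl_local with the path to a site's index.html and the folder that contains the site would return every page reachable from that file. The seen set is what keeps the crawler from looping forever when pages link back to each other.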
Did you remember to:
Start early. Now would be good. Time is precious, and there are so many fun things you can do with this assignment.
Viva la revolucion! We give you tremendous freedom in this assignment to do everything your way. There are no required interfaces and only two required methods that we will test. Everything else is up to you. With great freedom comes, uh, great responsibility. If you do not write adequate documentation, we will not know what you have done or how you did it. If your design is poor and your code incomprehensible, we will be unhappy. You can imagine the consequences of that!
You can use a tool like httrack to copy websites to your hard drive for crawling. For example, the command I used to produce the largest site was:
httrack -s0 -A30000 -c4 -%c4 -O /tmp/large -%k -D -a -x -n -%P -*p3 http://www.netfunny.com/rhf
See the httrack documentation for how to use it. Remember to be respectful of other people's bandwidth!