CS 343 Artificial Intelligence
Project 1: Heuristic Search for the Web
Robosurfer


Due: Sept 29, 2010

Existing Robosurfer

A Java program that uses heuristics to search the web (also called directed spidering) is on the department networked file system in the file /u/mooney/cs343-code/ir/webutils/BeamSearchSpider.java. This directory and its parent directory contain a variety of support files. The version BeamSearchSiteSpider is a heuristic spider that restricts itself to a particular web host. This code is an extension of a Java web-spidering library that was developed for another course on Information Retrieval and Web Search. First copy the complete /u/mooney/cs343-code/ir subdirectory to your own directory and follow the directions on using the Java course code. You do not need to understand all of the code to do the assignment, only enough to properly update the search heuristic, LinkHeuristic.

See a trace of running the program. Code and traces print more cleanly with the Unix command "enscript -f Courier8" (less ugly line wrapping).

The program conducts a search through a space of hyperlinks. Each hyperlink is represented by a ScoredAnchoredLink object. An AnchoredLink is a representation of a link to another URL together with the text of the link (the anchor text). A ScoredAnchoredLink also contains heuristic scores for the link itself and for the page to which the link points. The successors of a link are all of the links on the page to which it points. Searching a space of links allows the system to choose which links to pursue rather than having to immediately download all referenced pages, as in a state space in which a web page's successors are all the pages to which it points. The program maintains search paths via backLink pointers from links to the links that generated them as successor states. The code also prevents revisiting pages that have already been explored (by storing all visited URLs in a HashSet) and prints a trace while running that shows when states (links) are expanded.
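To make this representation concrete, here is a rough sketch of the kind of state objects involved. The field names are illustrative only; consult the actual AnchoredLink and ScoredAnchoredLink sources in ir/webutils for the real API.

    // Illustrative sketch only -- see the real AnchoredLink and
    // ScoredAnchoredLink classes in ir/webutils for the actual API.
    class AnchoredLinkSketch {
        java.net.URL url;     // the URL the hyperlink points to
        String anchorText;    // the text of the link itself
    }

    class ScoredAnchoredLinkSketch extends AnchoredLinkSketch {
        double linkScore;                   // heuristic score of the link text
        double pageScore;                   // heuristic score of the pointed-to page
        ScoredAnchoredLinkSketch backLink;  // link that generated this one,
                                            // used to reconstruct search paths
    }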

Starting at a particular web page, the program's goal is to find a page that contains all the members of a set of desired search strings called "want strings" (specified with the -w command-line option). To help guide the search, you can also provide a list of "help strings" (specified with the -h command-line option), which are related strings that indicate promising links. For example, starting from the UTCS department home page with the want-strings "Mooney", "343", "heuristic search", and "robosurfer" and the help-strings "people" and "faculty", the trace shows the system successfully finding this assignment page.

The robosurfer-trace file shows running some sample problems. See the JavaDoc for the main method of BeamSearchSiteSpider to interpret the command-line arguments.

If a character in a search string is capitalized, case matters (i.e., it must match an uppercase character); otherwise case does not matter (i.e., a lowercase character matches either upper or lower case in the text). If an individual search string contains multiple words, it matches whether the words are separated by a space or by any other whitespace character (e.g., a newline) in the text.
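One way to realize these matching rules is to compile each search string into a regular expression in which uppercase characters match literally, lowercase characters match either case, and whitespace matches any run of whitespace. The following is only a sketch of that idea, not the actual course code:

    import java.util.regex.Pattern;

    // Sketch of the matching rules described above (not the actual course
    // code): uppercase letters must match exactly, lowercase letters match
    // either case, and whitespace matches any run of whitespace.
    public class MatchSketch {
        public static Pattern compile(String searchString) {
            StringBuilder regex = new StringBuilder();
            for (char c : searchString.toCharArray()) {
                if (Character.isWhitespace(c))
                    regex.append("\\s+");
                else if (Character.isLowerCase(c))
                    regex.append('[').append(c)
                         .append(Character.toUpperCase(c)).append(']');
                else
                    regex.append(Pattern.quote(String.valueOf(c)));
            }
            return Pattern.compile(regex.toString());
        }

        public static void main(String[] args) {
            // "heuristic search" matches across a newline; "Mooney" requires
            // the capital M.
            System.out.println(compile("heuristic search")
                    .matcher("Heuristic\nSearch").find());  // true
            System.out.println(compile("Mooney")
                    .matcher("mooney").find());             // false
        }
    }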

The existing search heuristic considers four factors in order of importance:

  1. (WC) The number of the want-strings that are found
  2. (WT) The total number of times a want-string is found
  3. (HC) The number of the help-strings that are found
  4. (HT) The total number of times a help-string is found
A piece of text is scored as

S = 1000*WC + 100*WT + 10*HC + HT

A link is scored partly based on the text appearing directly in the link and partly based on the surrounding page. If L is the S score for the text in the link and P is the S score for the overall page, then a link is scored as

L/2 + P/2

getting half its score from its own text and half from its surrounding page.
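Putting the pieces together, the existing heuristic amounts to something like the following sketch. The names are illustrative, and countMatches here uses a simplified, fully case-insensitive match rather than the exact matching rules described earlier.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch of the existing scoring scheme (illustrative, not the actual
    // course code).
    public class ScoreSketch {

        // Simplified occurrence counter: fully case-insensitive literal
        // match (the real matcher also enforces the capitalization rule).
        static int countMatches(String text, String s) {
            Matcher m = Pattern.compile(Pattern.quote(s),
                    Pattern.CASE_INSENSITIVE).matcher(text);
            int n = 0;
            while (m.find()) n++;
            return n;
        }

        // S = 1000*WC + 100*WT + 10*HC + HT
        static double scoreText(String text, String[] want, String[] help) {
            int wc = 0, wt = 0, hc = 0, ht = 0;
            for (String w : want) {
                int n = countMatches(text, w);
                if (n > 0) wc++;   // WC: distinct want-strings found
                wt += n;           // WT: total want-string occurrences
            }
            for (String h : help) {
                int n = countMatches(text, h);
                if (n > 0) hc++;   // HC: distinct help-strings found
                ht += n;           // HT: total help-string occurrences
            }
            return 1000 * wc + 100 * wt + 10 * hc + ht;
        }

        // A link gets half its score from its own anchor text (L) and half
        // from the text of the page it appears on (P).
        static double scoreLink(String anchorText, String pageText,
                                String[] want, String[] help) {
            return scoreText(anchorText, want, help) / 2
                 + scoreText(pageText, want, help) / 2;
        }
    }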

You might first want to experiment with the existing system to get a feel for how it works. You should usually use BeamSearchSiteSpider instead of plain BeamSearchSpider, and always be prepared to kill a Robosurfer run with "control-c" if it starts surfing aimlessly out of control. Several command-line flags are useful:

  -safe : makes the spider obey the Robot Exclusion Policy and prevents it from surfing where robots are forbidden by convention (important if you do use BeamSearchSpider).
  -slow : pauses before every expansion, slowing the program down so that you can easily read the trace as it crawls.
  -b : specifies the beam width for the search; the default is 100.
  -c : specifies the maximum number of pages downloaded before the search is forced to terminate; the default is 10,000.

Problem with the Existing Heuristic

Of course, the existing heuristic is lacking in many cases and does not adequately direct the search. Consider the problems in robosurfer-fail-trace, which presents three problems that are not solved as effectively as they could be.

Starting from my home page and trying to find the want strings "course" and "situation calculus" (a topic we will discuss later in the course, and one that appears on the course syllabus), the system gets quite lost: it does not go directly to my course web pages because the word "course" appears in the heading above the course links but not in the link text itself. It eventually finds its way to the goal, but only after exploring over 120 pages.

Trying to find the phrase "A* search" using the help strings "textbook" and "contents" finds a solution fairly quickly, but takes longer than it should because "Textbook" appears in the section heading immediately before the link to the textbook but not in the link itself. The spider follows each link on the course page sequentially until it fortunately stumbles on the textbook link, which happens to be fairly near the top of the page. NOTE: This example requires using BeamSearchSpider -safe instead of BeamSearchSiteSpider to allow searching outside the cs.utexas.edu site.

Finally, trying to find the discussion of the Robosurfer program from the course homepage, using "robosurfer" and "program" as want strings and no help strings, gets completely lost because the link to this assignment contains neither of these terms. Although the phrase "Programming Project" appears in the section heading above the link to this assignment, it does not appear in the link itself and so does not help. The search fails to solve the problem even after exhausting the limit of 250 pages.

Ideally, the heuristic should use the relevant section headings preceding a link to guide it more or less directly to the correct page in all of these cases.

Changing the Heuristic

Your assignment is to improve the heuristic to avoid this general problem. Of course, a hack that works for these specific cases but does not solve the general problem of using information in the current section headings to rank links is unacceptable. Anything that exploits the specific layout of the course homepage or of other particular pages is inadequate.

A trace for my version of an improved heuristic is in robosurfer-soln-trace.

You should create a specialization of the class BeamSearchSpider called HeaderBeamSearchSpider and a specialization of the class BeamSearchSiteSpider called HeaderBeamSearchSiteSpider. In both cases, the constructLinkHeuristic() method should be redefined to produce an instance of a new heuristic class, HeaderLinkHeuristic, that is a specialization of the LinkHeuristic class.
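A skeleton of the required structure might look like the following. The exact signature of constructLinkHeuristic() is defined in the BeamSearchSpider source; this sketch assumes it simply installs the heuristic object in an inherited linkHeuristic field.

    // Skeleton only -- check the real BeamSearchSpider source for the
    // actual signature of constructLinkHeuristic(); this assumes it just
    // installs the heuristic object in an inherited linkHeuristic field.

    // In HeaderBeamSearchSpider.java:
    public class HeaderBeamSearchSpider extends BeamSearchSpider {
        @Override
        protected void constructLinkHeuristic() {
            linkHeuristic = new HeaderLinkHeuristic();
        }
    }

    // In HeaderBeamSearchSiteSpider.java:
    public class HeaderBeamSearchSiteSpider extends BeamSearchSiteSpider {
        @Override
        protected void constructLinkHeuristic() {
            linkHeuristic = new HeaderLinkHeuristic();
        }
    }

    // HeaderLinkHeuristic extends LinkHeuristic and overrides its scoring
    // to add a section-heading bonus, as discussed in the hint below.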

Hint: In my solution, I looked for the closest preceding section headings relevant to the link being scored. If a want-string or help-string appears in the text of these section headings, then the score calculated by the existing heuristic described above is boosted. Note that in HTML there are section headings of different "depths", from the highest heading H1 to the lowest H6. My heuristic looks for section headings at each level that cover the current link and awards bonus points for current section headings at each level that contain any of the want strings or help strings. More points are awarded for deeper (more specific) headings; e.g., matching the current H3 heading counts more than matching the current H2 or H1 heading. The ScoredAnchoredLink class used to represent links in Robosurfer already extracts and retains the start and end character positions of links in the HTML file, which should help you implement such a solution easily. This is only a hint of how I did it; other approaches are certainly possible in order to get the same general effect of having section headings influence link scoring. Your trace for the sample problems in robosurfer-soln-trace should be at least as direct and efficient as mine.
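For concreteness, here is a rough sketch of this heading-bonus idea; it is not my actual solution. It assumes pageHTML is the raw HTML of the page containing the link, linkStart is the link's start position as recorded by ScoredAnchoredLink, the string matching is simplified to a case-insensitive substring test, and the bonus weights are arbitrary.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch of the heading-bonus idea (not the actual solution code).
    // For each heading level H1..H6, find the closest heading that opens
    // before the link; if it contains a want- or help-string, award a
    // bonus weighted toward deeper (more specific) headings.
    public class HeadingBonusSketch {

        static double headingBonus(String pageHTML, int linkStart,
                                   String[] want, String[] help) {
            double bonus = 0;
            for (int level = 1; level <= 6; level++) {
                // Match e.g. <h3 ...>...</h3>, case-insensitively.
                Pattern p = Pattern.compile(
                        "<h" + level + "[^>]*>(.*?)</h" + level + ">",
                        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
                Matcher m = p.matcher(pageHTML);
                String current = null;
                while (m.find() && m.start() < linkStart)
                    current = m.group(1);  // closest heading preceding the link
                if (current == null) continue;
                for (String s : concat(want, help))
                    // Simplified case-insensitive substring test; a real
                    // heuristic should use the proper matching rules.
                    if (current.toLowerCase().contains(s.toLowerCase()))
                        bonus += 50 * level;  // arbitrary weight: deeper
                                              // headings count for more
            }
            return bonus;
        }

        static String[] concat(String[] a, String[] b) {
            String[] r = new String[a.length + b.length];
            System.arraycopy(a, 0, r, 0, a.length);
            System.arraycopy(b, 0, r, a.length, b.length);
            return r;
        }
    }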

You should submit the final commented code you write electronically using turnin by 15 minutes before the start of class on the due date. Submit a directory containing an appropriately named set of files. Include a file robosurfer-soln-trace containing a trace of your system working on the three sample problems.

You should also include a one or two page report in a plain ASCII file report.txt or a PDF file report.pdf. Describe the details of your new heuristic and explain why it solves the problem discussed above. Try some other problems of your own. Summarize your experience with testing the new heuristic, both successes and failures. Discuss the remaining limitations of the heuristic (and/or the overall system) and present any additional ideas you have for resolving them.

Follow the submission instructions for submitting your project.

Good luck and have fun robosurfing!