A Java program for using heuristics to search the web (also called
directed spidering) is on the department networked file system in the file
/u/mooney/cs343-code/ir/webutils/BeamSearchSpider.java. This
directory and its parent directory contain a variety of support files. The
version BeamSearchSiteSpider is a heuristic spider that restricts
itself to a particular web host. This code is an extension of a Java
web-spidering library that was developed for another course on Information Retrieval and Web Search .
First copy the complete /u/mooney/cs343-code/ir subdirectory to
your own directory and follow the directions
on using the Java course code. You do not need to understand all of the
code to do the assignment, just enough to properly update the search heuristic,
LinkHeuristic.
See a trace of running the program
. Code and traces print out a little nicer using the Unix command
"enscript -f Courier8" (less ugly wrap-around).
The program conducts a search through a space of hyperlinks. Each hyperlink is
represented by a ScoredAnchoredLink object. An
AnchoredLink is a representation of a link to another URL together
with the text of the link (the anchor text). A ScoredAnchoredLink
also contains heuristic scores for the link itself and for the page to which
the link points. The successor of a link is all of the links on the page to
which the link points. Searching a space of links allows the system to choose
which links to pursue rather than having to immediately download all referenced
pages, as in a state-space where a web-page has as successors all the pages to
which the page points. The program maintains search paths by maintaining
backLink pointers from links to the links that generated them as
successor states. The code also prevents revisiting pages that have already
been explored (by storing all visited URLs in a HashSet) and
prints a trace while it is running showing when states (links) are expanded.
Starting at a particular web page, the program's goal is to find a page that
contains all the members of a set of desired search strings called "want
strings" (specified with the -w command line option). In order to
guide the search, you can also provide a list of "help strings," (specified
with the -h command line option) which are other related strings
that can help guide the search. For example, starting from the UTCS department
home page, I did a search with the list of want-strings:
"mooney", "343", "project" and "robosurfer"
using the help-strings:
"people" and "faculty",
it successfully found this assignment page. The search is a little quicker
if "Mooney" is capitalized to avoid dead-in searches.
The robosurfer-trace file
shows running some sample problems. See the
JavaDoc for the main method of
BeamSearchSiteSpider to interpret the command-line arguments.
If a character in a search string is capitalized, case matters (i.e. it must match an upper case character), otherwise case does not matter (i.e. a lower-case character matches either upper or lower case in the text). If an individual search string contains multiple words, it still matches whether they are separated by a space or any other whitespace character (e.g. newline) in the text.
The existing search heuristic considers four factors in order of importance:
You might first want to experiment with the existing system to get a feel for
how it works. You should usually use BeamSearchSiteSpider
instead of just BeamSearchSpider. Always be prepared to kill
Robosurfer execution by hitting
"control-c" if it seems to start surfing aimlessly out of control. If you do use
BeamSearchSpider, the
-safe command line flag makes the spider obey the Robot Exclusion
Policy and prevents it from surfing where robots are forbidden by convention.
The -slow flag is also
useful; it slows down the program by pausing before every expansion in order to
allow you to easily read the trace as it crawls. The -b flag
allows you to specify the beam width for the search, which defaults to 100.
The -c flag allows you to specify the maximum number of pages
downloaded before the search is forced to terminate, the default is 10,000.
Of course the existing heuristic is lacking in many cases and does not adequately direct the search.
Consider the problems in
robosurfer-fail-trace. Trying to find the discussion about
the robosurfer program from the course homepage using "robosurfer" and
"program" as want strings and no help strings gets lost because the link to
this assignment contains neither of these terms. Although the phrase
"Programming Project" appears right before the link to this assignment it does
not appear in the link itself so does not help. Trying to find the phrase "A*
search" using the help strings "textbook" and "contents" gets lost also since
"Textbook" appears before the link to the textbook but not in it, and the word
"content" appears right before the link to the full list of contents for the
textbook on the textbook homepage (where "A* search" appears) but not in the link
itself.
Ideally, the heuristic should use nearby text preceding the relevant link to guide it more or less directly to the correct page in both of these cases.
Your assignment is to improve the heuristic to avoid this general problem. Of course, a hack that works for these specific cases and doesn't solve the general problem of using information from nearby text to rank links is unacceptable. Anything exploiting the specific layout of the course homepage or other specific pages is inadequate.
A trace for my version of an improved heuristic is in robosurfer-soln-trace.
You should create a specialization of the class BeamSearchSpider
called ProximityBeamSearchSpider, and a specialization of the class
BeamSearchSiteSpider called
ProximityBeamSearchSiteSpider, in both cases, the
constructLinkHeuristic() method should be redefined to produce an
instance of a new heuristic ProximityLinkHeuristic that is a
specialization of the LinkHeuristic class.
Hint: In my solution I looked for the closest preceding instance of a
want-string or help-string that did not appear inside of another link. The
closer this was, the higher the link is rated (the score calculated by the
existing heuristic described above is boosted some). The
ScoredAnchoredLink class used to represent links in Robosurfer
already extracts and retains the start and end character positions of links in
the HTML file that will help you to easily implement such solutions. This is
only a hint of how I did it, other approaches are presumably possible in order
to get the same general effect of having nearby surrounding text influence link
scoring. Your trace for the sample problems in
robosurfer-soln-trace should be at least as direct as mine.
You should submit the final commented code you write electronically using
turnin by the start of class on the due date.
You should submit a directory with an appropriate set of named files. Include
a file robosurfer-soln-trace containing a trace of your system
working on the two sample problems.
You should also include a one or two page report in a plain ASCII file called
report.txt. Describe the details of your new heuristic and
explain why it solves the problem discussed above. Try some other
problems of your own. Summarize your experience with testing the new
heuristic, both successes and failures. Discuss the remaining limitations of
the heuristic (and/or the overall system) and present any additional ideas you
have for resolving them.
Follow the submission instructions for submitting your project.
Good luck and have fun robosurfing!