ir.webutils
Class BeamSearchSpider

java.lang.Object
  extended by ir.webutils.Spider
      extended by ir.webutils.BeamSearchSpider
Direct Known Subclasses:
BeamSearchSiteSpider

public class BeamSearchSpider
extends Spider

A spider that uses heuristic beam search to find a web page containing a set of "want strings", using a set of "help strings" to guide the search. The search runs through a space of ScoredAnchoredLinks until it reaches a page that satisfies the goal, i.e. one that contains all of the "want strings".
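
The search strategy described above can be illustrated with a minimal, self-contained sketch. It does not use the ir.webutils classes: the toy link graph, the `score` function, the `isGoal` predicate, and the names `BeamSearchSketch` and `search` are all hypothetical stand-ins (for the web, the LinkHeuristic, and the PageGoal respectively), and the actual crawl loop may differ.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;
import java.util.function.ToDoubleFunction;

// Hypothetical illustration only; not the ir.webutils implementation.
final class BeamSearchSketch {

    /**
     * Beam search over a toy link graph: repeatedly expand the best-scoring
     * entry, add its unseen successors, and keep only the best beamSize entries.
     * Returns the first page satisfying isGoal, or null if none is reached.
     */
    static String search(String start,
                         Map<String, List<String>> links,
                         Predicate<String> isGoal,
                         ToDoubleFunction<String> score,
                         int beamSize,
                         int maxCount) {
        List<String> queue = new ArrayList<>(List.of(start));
        Set<String> visited = new HashSet<>();
        int count = 0;
        while (!queue.isEmpty() && count < maxCount) {
            String page = queue.remove(0);            // best-scoring entry
            if (!visited.add(page)) continue;         // already expanded
            count++;
            if (isGoal.test(page)) return page;
            for (String next : links.getOrDefault(page, List.of())) {
                if (!visited.contains(next) && !queue.contains(next)) {
                    queue.add(next);
                }
            }
            // Highest score first, then prune the queue to the beam width.
            queue.sort(Comparator.comparingDouble(score).reversed());
            if (queue.size() > beamSize) {
                queue.subList(beamSize, queue.size()).clear();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "start", List.of("a", "b", "c"),
            "a", List.of("goal"),
            "b", List.of(),
            "c", List.of());
        // Toy heuristic: "a" and "goal" look most promising.
        ToDoubleFunction<String> score =
            p -> p.equals("a") || p.equals("goal") ? 1.0 : 0.0;
        System.out.println(search("start", links, "goal"::equals, score, 2, 100));
    }
}
```

With the toy graph in `main`, a beam width of 2 keeps "a" in the queue, so the search reaches "goal". A narrower beam combined with a misleading heuristic can prune the only path to the goal, which is the usual trade-off of beam search.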


Field Summary
protected  int beamSize
          The beam width to use.
protected  PageGoal goal
          Defines the goal predicate over HTMLPage objects that is to be satisfied.
protected  HTMLPage goalPage
          The page found that satisfies the goal.
protected  LinkHeuristic heuristic
          Defines the heuristic that is used to sort ScoredAnchoredLinks in the queue.
 
Fields inherited from class ir.webutils.Spider
count, linksToVisit, maxCount, retriever, saveDir, slow, visited
 
Constructor Summary
BeamSearchSpider()
           
 
Method Summary
protected  LinkHeuristic constructLinkHeuristic()
          Return default LinkHeuristic.
 void doCrawl()
          Crawls the web using beam search with given heuristic to find a page that satisfies goal.
protected  java.util.List<Link> getNewLinks(HTMLPage page)
          Returns a list of scored links to follow from a given page.
 void go(java.lang.String[] args)
          Interprets command line arguments and performs the crawl.
protected  void handleBCommandLineOption(java.lang.String value)
          Called when "-b" is passed in on the command line to set the beam width.
protected  void handleHCommandLineOption(java.lang.String value)
          Called when "-h" is passed in on the command line to set help strings.
protected  void handleUCommandLineOption(java.lang.String value)
          Called when "-u" is passed in on the command line.
protected  void handleWCommandLineOption(java.lang.String value)
          Called when "-w" is passed in on the command line to set "want strings".
static void main(java.lang.String[] args)
          Search the web using beam search according to the following command options:
            -safe : Check for and obey robots.txt and robots META tag directives.
            -c <maxCount> : Download at most <maxCount> pages (default is 10,000).
            -u <url> : Start at <url>.
            -w <strings> : <strings> should be a list of "want strings" separated by ";".
            -h <strings> : <strings> should be a list of "help strings" separated by ";".
            -b <size> : Use a beam width of the given <size> (default is 100).
            -slow : Pause briefly before getting a page.
 void processArgs(java.lang.String[] args)
          Processes command-line arguments.
protected  void scoreLinks(java.util.List<Link> links, HTMLPage page)
          Use the heuristic to score each of the new links on a given page that was expanded.
 
Methods inherited from class ir.webutils.Spider
handleCCommandLineOption, handleDCommandLineOption, handleSafeCommandLineOption, handleSlowCommandLineOption, indexPage, linkToHTMLPage
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

goal

protected PageGoal goal
Defines the goal predicate over HTMLPage objects that is to be satisfied.


heuristic

protected LinkHeuristic heuristic
Defines the heuristic that is used to sort ScoredAnchoredLinks in the queue.


beamSize

protected int beamSize
The beam width to use. Size of queue is kept to the best beamSize links.


goalPage

protected HTMLPage goalPage
The page found that satisfies the goal.

Constructor Detail

BeamSearchSpider

public BeamSearchSpider()
Method Detail

go

public void go(java.lang.String[] args)
Interprets command line arguments and performs the crawl. If a goal page was found, displays it using Browser and prints the path from the start URL to the goal page.

Overrides:
go in class Spider
Parameters:
args - Command line arguments.

processArgs

public void processArgs(java.lang.String[] args)
Processes command-line arguments.

The following options are handled by this function:

Each option has a corresponding handleXXXCommandLineOption function that will be called when the option is found. Subclasses may find it convenient to change how options are handled by overriding those methods instead of this one. Only the above options will be dealt with by this function, and the input array will remain unchanged. Note that if the flag for an option appears in the input array, any value associated with that option will be assumed to follow. Thus if a "-c" flag appears in args, the next value in args will be blindly treated as the count.
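
The handler-per-option pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the Spider source: the class names `OptionDispatchSketch` and `CappedBeamSketch`, the shortened handler names, and the bounds check before reading a value are all inventions of this sketch, and only two options are modelled.

```java
// Hypothetical illustration only; not the ir.webutils implementation.
class OptionDispatchSketch {
    protected int beamSize = 100;    // default beam width listed for "-b"
    protected boolean slow = false;

    /** Walks args left to right; the input array is never modified. */
    public void processArgs(String[] args) {
        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "-b":
                    // The element after the flag is taken as its value without
                    // validation. The bounds check is an addition of this sketch.
                    if (i + 1 < args.length) handleBOption(args[++i]);
                    break;
                case "-slow":
                    handleSlowOption();
                    break;
                default:
                    break;           // options unknown to this sketch are skipped
            }
        }
    }

    protected void handleBOption(String value) { beamSize = Integer.parseInt(value); }

    protected void handleSlowOption() { slow = true; }

    public static void main(String[] args) {
        OptionDispatchSketch sketch = new CappedBeamSketch();
        sketch.processArgs(new String[] {"-b", "50", "-slow"});
        System.out.println(sketch.beamSize + " " + sketch.slow);
    }
}

// A subclass changes how one option is handled without overriding processArgs.
class CappedBeamSketch extends OptionDispatchSketch {
    @Override
    protected void handleBOption(String value) {
        beamSize = Math.min(Integer.parseInt(value), 10);
    }
}
```

Overriding a single handler keeps the argument-walking logic in one place, which is the convenience the description above points to.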

Overrides:
processArgs in class Spider
Parameters:
args - Array of arguments as passed in from the command line.

handleUCommandLineOption

protected void handleUCommandLineOption(java.lang.String value)
Called when "-u" is passed in on the command line.

This implementation adds value to the list of links to visit as an initial ScoredAnchoredLink.

Overrides:
handleUCommandLineOption in class Spider
Parameters:
value - The value associated with the "-u" option.

handleWCommandLineOption

protected void handleWCommandLineOption(java.lang.String value)
Called when "-w" is passed in on the command line to set the "want strings". Parses the value into an array of want strings using ";" as a separator, and uses the result to initialize goal and heuristic.
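
As a rough illustration of the parsing step and of the goal test it feeds, the sketch below splits a ";"-separated value and checks whether a text contains every resulting string. The names `WantStringsSketch`, `parseStrings`, and `containsAll` are hypothetical, and the trimming of whitespace, the dropping of empty entries, and the case-sensitive match are assumptions of this sketch rather than documented behavior of PageGoal.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration only; not the ir.webutils implementation.
final class WantStringsSketch {

    /** Splits a ";"-separated option value into its individual strings. */
    static List<String> parseStrings(String value) {
        return Arrays.stream(value.split(";"))
                     .map(String::trim)              // assumption: ignore padding
                     .filter(s -> !s.isEmpty())      // assumption: drop empty entries
                     .collect(Collectors.toList());
    }

    /** Goal test: true only if the text contains every want string. */
    static boolean containsAll(String pageText, List<String> wantStrings) {
        return wantStrings.stream().allMatch(pageText::contains);
    }

    public static void main(String[] args) {
        List<String> want = parseStrings("beam search; spider");
        System.out.println(want);
        System.out.println(containsAll("A spider that uses beam search", want));
    }
}
```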


handleHCommandLineOption

protected void handleHCommandLineOption(java.lang.String value)
Called when "-h" is passed in on the command line to set the "help strings". Parses the value into an array of help strings using ";" as a separator, and uses the result to initialize heuristic.


constructLinkHeuristic

protected LinkHeuristic constructLinkHeuristic()
Return default LinkHeuristic. Specializations can override this method to utilize alternate link heuristics.


handleBCommandLineOption

protected void handleBCommandLineOption(java.lang.String value)
Called when "-b" is passed in on the command line to set the beam width.


doCrawl

public void doCrawl()
Crawls the web using beam search with given heuristic to find a page that satisfies goal. Sets goalPage if successful.

Overrides:
doCrawl in class Spider

getNewLinks

protected java.util.List<Link> getNewLinks(HTMLPage page)
Returns a list of scored links to follow from a given page.

Overrides:
getNewLinks in class Spider
Parameters:
page - The current page.
Returns:
Links to be visited from this page

scoreLinks

protected void scoreLinks(java.util.List<Link> links,
                          HTMLPage page)
Use the heuristic to score each of the new links on a given page that was expanded.
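
The scoring rule itself is defined by LinkHeuristic and is not described on this page. Purely as an illustration of what scoring a list of links in place might look like, the sketch below assumes a heuristic that counts how many help strings occur in a link's anchor text. The `ScoredLink` class, its fields, and the counting rule are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the ir.webutils implementation.
final class ScoreLinksSketch {

    /** Stand-in for a scored, anchored link. */
    static final class ScoredLink {
        final String url;
        final String anchorText;
        double score;

        ScoredLink(String url, String anchorText) {
            this.url = url;
            this.anchorText = anchorText;
        }
    }

    /** Assumed heuristic: score = number of help strings found in the anchor text. */
    static void scoreLinks(List<ScoredLink> links, List<String> helpStrings) {
        for (ScoredLink link : links) {
            String text = link.anchorText.toLowerCase();
            long hits = helpStrings.stream()
                                   .filter(h -> text.contains(h.toLowerCase()))
                                   .count();
            link.score = hits;   // scores are written onto the links in place
        }
    }

    public static void main(String[] args) {
        List<ScoredLink> links = new ArrayList<>();
        links.add(new ScoredLink("http://example.com/ir", "Information Retrieval course"));
        links.add(new ScoredLink("http://example.com/news", "Campus news"));
        scoreLinks(links, List.of("retrieval", "course"));
        for (ScoredLink link : links) {
            System.out.println(link.url + " " + link.score);
        }
    }
}
```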


main

public static void main(java.lang.String[] args)
Search the web using beam search according to the following command options: