ir.webutils
Class Spider

java.lang.Object
  extended by ir.webutils.Spider
Direct Known Subclasses:
BeamSearchSpider, DirectorySpider, SiteSpider

public class Spider
extends java.lang.Object

Spider defines a framework for writing a web crawler. Users can change the behavior of the spider by overriding methods. The default spider does a breadth-first crawl starting from a given URL, up to a specified maximum number of pages, saving (caching) the pages in a given directory. It also adds a "BASE" HTML tag to each cached page so that relative links can be followed from the cached version.
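The "BASE" insertion mentioned above can be sketched as follows. This is an illustrative stand-in, not Spider's actual code; the class and method names (BaseTagSketch, addBaseTag) are invented for the example:

```java
// Sketch: insert a <base href="..."> element right after <head> so that
// relative links in a cached copy still resolve against the live site.
// This mimics what Spider's caching step does; the details are assumed.
public class BaseTagSketch {
    public static String addBaseTag(String html, String pageUrl) {
        int head = html.indexOf("<head>");
        if (head < 0) return html;               // no <head>: leave page unchanged
        int insertAt = head + "<head>".length();
        return html.substring(0, insertAt)
             + "<base href=\"" + pageUrl + "\">"
             + html.substring(insertAt);
    }
}
```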


Field Summary
protected  int count
          The number of pages indexed.
protected  java.util.List<Link> linksToVisit
          The queue of links maintained by the spider.
protected  int maxCount
          The maximum number of pages to be indexed.
protected  HTMLPageRetriever retriever
          The object to be used to retrieve pages.
protected  java.io.File saveDir
          The directory to save the downloaded files to.
protected  boolean slow
          Flag to purposely slow the crawl for debugging purposes.
protected  java.util.HashSet<Link> visited
          The URLs that have already been visited.
 
Constructor Summary
Spider()
           
 
Method Summary
 void doCrawl()
          Performs the crawl.
protected  java.util.List<Link> getNewLinks(HTMLPage page)
          Returns a list of links to follow from a given page.
 void go(java.lang.String[] args)
          Checks command line arguments and performs the crawl.
protected  void handleCCommandLineOption(java.lang.String value)
          Called when "-c" is passed in on the command line.
protected  void handleDCommandLineOption(java.lang.String value)
          Called when "-d" is passed in on the command line.
protected  void handleSafeCommandLineOption()
          Called when "-safe" is passed in on the command line.
protected  void handleSlowCommandLineOption()
          Called when "-slow" is passed in on the command line.
protected  void handleUCommandLineOption(java.lang.String value)
          Called when "-u" is passed in on the command line.
protected  void indexPage(HTMLPage page)
          "Indexes" an HTMLPage.
protected  boolean linkToHTMLPage(Link link)
          Check if this is a link to an HTML page.
static void main(java.lang.String[] args)
          Spider the web according to the following command options:
          -safe : Check for and obey robots.txt and robots META tag directives.
          -d <directory> : Store indexed files in <directory>.
          -c <maxCount> : Store at most <maxCount> files (default is 10,000).
          -u <url> : Start at <url>.
          -slow : Pause briefly before getting a page.
 void processArgs(java.lang.String[] args)
          Processes command-line arguments.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

linksToVisit

protected java.util.List<Link> linksToVisit
The queue of links maintained by the spider.


slow

protected boolean slow
Flag to purposely slow the crawl for debugging purposes.


retriever

protected HTMLPageRetriever retriever
The object to be used to retrieve pages.


saveDir

protected java.io.File saveDir
The directory to save the downloaded files to.


count

protected int count
The number of pages indexed. In the default implementation a page is considered to be indexed only if it is written to a file.


maxCount

protected int maxCount
The maximum number of pages to be indexed.


visited

protected java.util.HashSet<Link> visited
The URLs that have already been visited.

Constructor Detail

Spider

public Spider()
Method Detail

go

public void go(java.lang.String[] args)
Checks command line arguments and performs the crawl.

This implementation calls processArgs and doCrawl.

Parameters:
args - Command line arguments.

processArgs

public void processArgs(java.lang.String[] args)
Processes command-line arguments.

The following options are handled by this function:

-safe : Check for and obey robots.txt and robots META tag directives.
-d <directory> : Store indexed files in <directory>.
-c <maxCount> : Store at most <maxCount> files (default is 10,000).
-u <url> : Start at <url>.
-slow : Pause briefly before getting a page.

Each option has a corresponding handleXXXCommandLineOption function that will be called when the option is found. Subclasses may find it convenient to change how options are handled by overriding those methods instead of this one. Only the above options will be dealt with by this function, and the input array will remain unchanged. Note that if the flag for an option appears in the input array, any value associated with that option is assumed to follow it. Thus, if a "-c" flag appears in args, the next value in args will be blindly treated as the count.

Parameters:
args - Array of arguments as passed in from the command line.
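The flag/value contract described above (a flag's value is blindly taken from the next array slot) can be sketched with a simple index walk. This is a simplified stand-in, not Spider's actual parsing code; the names ArgsSketch and scan are invented for the example:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the flag-scanning contract: when a flag that takes a value
// (-d, -c, -u) is seen, the NEXT array element is consumed as that value
// with no validation ("blindly", as the docs put it). A value-taking flag
// in the last slot would therefore throw, matching that contract.
public class ArgsSketch {
    public static Map<String, String> scan(String[] args) {
        Map<String, String> opts = new HashMap<>();
        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "-d": case "-c": case "-u":
                    opts.put(args[i], args[++i]);   // value follows the flag
                    break;
                case "-safe": case "-slow":
                    opts.put(args[i], "");          // boolean flags take no value
                    break;
                default:
                    break;                          // unrecognized args are ignored
            }
        }
        return opts;
    }
}
```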

handleSafeCommandLineOption

protected void handleSafeCommandLineOption()
Called when "-safe" is passed in on the command line.

This implementation sets retriever to a SafeHTMLPageRetriever.


handleDCommandLineOption

protected void handleDCommandLineOption(java.lang.String value)
Called when "-d" is passed in on the command line.

This implementation sets saveDir to value.

Parameters:
value - The value associated with the "-d" option.

handleCCommandLineOption

protected void handleCCommandLineOption(java.lang.String value)
Called when "-c" is passed in on the command line.

This implementation sets maxCount to the integer represented by value.

Parameters:
value - The value associated with the "-c" option.

handleUCommandLineOption

protected void handleUCommandLineOption(java.lang.String value)
Called when "-u" is passed in on the command line.

This implementation adds value to the list of links to visit.

Parameters:
value - The value associated with the "-u" option.

handleSlowCommandLineOption

protected void handleSlowCommandLineOption()
Called when "-slow" is passed in on the command line.

This implementation sets a flag that will be used in go to pause briefly before downloading each page.


doCrawl

public void doCrawl()
Performs the crawl. Should be called after processArgs has been called. Assumes that the starting URL has been set.

This implementation iterates through a list of links to visit. For each link, a check is performed using visited to make sure the link has not already been visited. If it has not, the link is added to visited and the page is retrieved. If access to the page has been disallowed by a robots.txt file or a robots META tag, or if there is some other problem retrieving the page, then the page is skipped. If the page is downloaded successfully, indexPage and getNewLinks are called if allowed. The crawl terminates when there are no more links to visit or count >= maxCount.
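The loop just described can be sketched with plain strings standing in for Link and HTMLPage objects. The "web" map fakes page retrieval, and all names here (CrawlSketch, crawl) are invented for the example:

```java
import java.util.*;

// Skeleton of the breadth-first loop doCrawl describes: pop a link,
// skip it if already visited, otherwise mark it visited, "retrieve" it,
// index it, and enqueue its out-links, stopping at maxCount pages.
public class CrawlSketch {
    public static List<String> crawl(String start,
                                     Map<String, List<String>> web,
                                     int maxCount) {
        List<String> indexed = new ArrayList<>();
        Deque<String> linksToVisit = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        linksToVisit.add(start);
        while (!linksToVisit.isEmpty() && indexed.size() < maxCount) {
            String link = linksToVisit.removeFirst();  // FIFO = breadth-first
            if (!visited.add(link)) continue;          // already visited: skip
            List<String> outLinks = web.get(link);
            if (outLinks == null) continue;            // retrieval failed: skip
            indexed.add(link);                         // stand-in for indexPage
            linksToVisit.addAll(outLinks);             // stand-in for getNewLinks
        }
        return indexed;
    }
}
```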


linkToHTMLPage

protected boolean linkToHTMLPage(Link link)
Check if this is a link to an HTML page.

Returns:
true if the link points to a directory or is clearly an HTML page
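Spider's exact test is not documented here, but one plausible heuristic matching the description above (directory, no extension, or an .html/.htm extension) looks like this; the names HtmlLinkSketch and looksLikeHtmlPage, and the exact extension list, are assumptions:

```java
import java.util.Locale;

// Heuristic sketch: treat a URL path as an HTML page if it ends in a
// directory slash, its last segment has no file extension, or the
// extension is .html/.htm. Anything else (e.g. .jpg) is rejected.
public class HtmlLinkSketch {
    public static boolean looksLikeHtmlPage(String path) {
        if (path.isEmpty() || path.endsWith("/")) return true;  // directory
        int slash = path.lastIndexOf('/');
        String name = path.substring(slash + 1);
        int dot = name.lastIndexOf('.');
        if (dot < 0) return true;                               // no extension
        String ext = name.substring(dot + 1).toLowerCase(Locale.ROOT);
        return ext.equals("html") || ext.equals("htm");
    }
}
```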

getNewLinks

protected java.util.List<Link> getNewLinks(HTMLPage page)
Returns a list of links to follow from a given page. Subclasses can use this method to direct the spider's path over the web by returning a subset of the links on the page.

Parameters:
page - The current page.
Returns:
Links to be visited from this page

indexPage

protected void indexPage(HTMLPage page)
"Indexes" an HTMLPage. This version just writes it out to a file in the specified directory with a "P<count>.html" file name.

Parameters:
page - An HTMLPage that contains the page to index.
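A self-contained sketch of this behavior, assuming a zero-padded "P<count>.html" naming scheme (the exact padding is an assumption, not necessarily Spider's scheme) and using a Path and String in place of the saveDir field and the HTMLPage object:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the default indexPage behavior: write the page text into the
// save directory under a generated "P<count>.html" name. The three-digit
// zero padding below is illustrative only.
public class IndexPageSketch {
    public static Path savePage(Path saveDir, int count, String pageText)
            throws IOException {
        Path out = saveDir.resolve(String.format("P%03d.html", count));
        Files.writeString(out, pageText);
        return out;
    }
}
```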

main

public static void main(java.lang.String[] args)
Spider the web according to the following command options:

-safe : Check for and obey robots.txt and robots META tag directives.
-d <directory> : Store indexed files in <directory>.
-c <maxCount> : Store at most <maxCount> files (default is 10,000).
-u <url> : Start at <url>.
-slow : Pause briefly before getting a page.