ir.webutils
Class SafeHTMLPageRetriever

java.lang.Object
  extended by ir.webutils.HTMLPageRetriever
      extended by ir.webutils.SafeHTMLPageRetriever

public final class SafeHTMLPageRetriever
extends HTMLPageRetriever

An HTMLPageRetriever that keeps track of Robots Exclusion information. Clients can use this class to ensure that they do not access pages prohibited by either the Robots Exclusion Protocol (robots.txt) or Robots META tags.
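
As an illustration of the kind of check the Robots Exclusion Protocol implies, the following is a minimal sketch and not the implementation used by this class. The class RobotsRulesSketch and its methods parse and isAllowed are hypothetical names introduced here. The sketch assumes that only the "User-agent: *" group matters and that a path is disallowed when it starts with any listed Disallow prefix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of ir.webutils. */
final class RobotsRulesSketch {
    // Path prefixes listed under "Disallow:" in the wildcard group.
    private final List<String> disallowed = new ArrayList<>();

    /** Collects Disallow prefixes from the "User-agent: *" group of a robots.txt body. */
    static RobotsRulesSketch parse(String robotsTxt) {
        RobotsRulesSketch rules = new RobotsRulesSketch();
        boolean inWildcardGroup = false;
        for (String rawLine : robotsTxt.split("\\R")) {
            // Drop trailing comments, then surrounding whitespace.
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // Simplification: each User-agent line starts a new group.
                inWildcardGroup = value.equals("*");
            } else if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                rules.disallowed.add(value);
            }
        }
        return rules;
    }

    /** Returns false if the path starts with any disallowed prefix. */
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

A crawler built along these lines would call parse once per host and consult isAllowed before each request. SafeHTMLPageRetriever is meant to spare clients from doing this bookkeeping themselves.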


Constructor Summary
SafeHTMLPageRetriever()
           
 
Method Summary
 HTMLPage getHTMLPage(Link link)
          Tries to download the given web page.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SafeHTMLPageRetriever

public SafeHTMLPageRetriever()

Method Detail

getHTMLPage

public HTMLPage getHTMLPage(Link link)
                     throws PathDisallowedException
Tries to download the given web page. Throws PathDisallowedException if access to the page is prohibited. Also updates Robots Exclusion information based on the new page.

Overrides:
getHTMLPage in class HTMLPageRetriever
Parameters:
link - The Link to follow and download.
Returns:
The web page that the given link points to.
Throws:
PathDisallowedException - If the URL of link is disallowed by a robots.txt file or a Robots META tag.
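
For readers unfamiliar with Robots META tags, the following is a rough sketch of how the directives in such a tag might be read from a downloaded page. It does not describe how this class processes them. The class RobotsMetaSketch and its method directives are hypothetical, and the regular-expression matching is a simplification that assumes quoted attribute values.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of ir.webutils. */
final class RobotsMetaSketch {
    private static final Pattern META_TAG =
        Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern NAME_ATTR =
        Pattern.compile("\\bname\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);
    private static final Pattern CONTENT_ATTR =
        Pattern.compile("\\bcontent\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns the lower-cased directives (e.g. "noindex", "nofollow") of any robots META tag. */
    static Set<String> directives(String html) {
        Set<String> result = new HashSet<>();
        Matcher tag = META_TAG.matcher(html);
        while (tag.find()) {
            Matcher name = NAME_ATTR.matcher(tag.group());
            Matcher content = CONTENT_ATTR.matcher(tag.group());
            // Only META tags whose name attribute is "robots" carry directives.
            if (name.find()
                    && name.group(1).trim().equalsIgnoreCase("robots")
                    && content.find()) {
                for (String part : content.group(1).split(",")) {
                    String token = part.trim().toLowerCase(Locale.ROOT);
                    if (!token.isEmpty()) {
                        result.add(token);
                    }
                }
            }
        }
        return result;
    }
}
```

In the conventional reading of these directives, "noindex" asks robots not to index the page and "nofollow" asks them not to follow its links.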