ir.webutils
Class HTMLPage

java.lang.Object
  extended by ir.webutils.HTMLPage
Direct Known Subclasses:
SafeHTMLPage

public class HTMLPage
extends java.lang.Object

HTMLPage is a representation of information about a web page.


Field Summary
protected  Link link
          The original link to this page
protected  java.util.List<Link> outLinks
          The links on this page
protected  java.lang.String text
          The text of the page
 
Constructor Summary
HTMLPage(Link link, java.lang.String text)
          Constructs an HTMLPage with the given link and text.
 
Method Summary
protected static java.net.URL addEndSlash(java.net.URL url)
          If the URL looks like a directory rather than a file, adds a "/" at the end so that it acts as a proper base URL for resolving relative URLs in this page.
 boolean empty()
          Returns true if the page is empty or a 404 error.
 Link getLink()
          Returns the Link object that was used to access this page.
 java.util.List<Link> getOutLinks()
          Get the list of out links from this page.
 java.lang.String getText()
          Returns the full text of this page.
 boolean indexAllowed()
          Clients should always call this method before indexing an HTML page if they want to obey the "NOINDEX" directive in the Robots META tag.
 void setOutLinks(java.util.List<Link> links)
          Sets the outLinks for this page to the given list.
 void write(java.io.File dir, java.lang.String name)
          Writes web page to a file with a BASE HTML element with the original URL.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

link

protected final Link link
The original link to this page


text

protected final java.lang.String text
The text of the page


outLinks

protected java.util.List<Link> outLinks
The links on this page

Constructor Detail

HTMLPage

public HTMLPage(Link link,
                java.lang.String text)
Constructs an HTMLPage with the given link and text.

Parameters:
link - The Link object that points to this page.
text - The text of the page.

Method Detail

getText

public java.lang.String getText()
Returns the full text of this page. None of the HTML is stripped out.

Returns:
The full text of this page, including all HTML markup.

getLink

public Link getLink()
Returns the Link object that was used to access this page.

Returns:
The Link object that was used to access this page.

setOutLinks

public void setOutLinks(java.util.List<Link> links)
Sets the outLinks for this page to the given list.


getOutLinks

public java.util.List<Link> getOutLinks()
Get the list of out links from this page.


indexAllowed

public boolean indexAllowed()
Clients should always call this method before indexing an HTML page if they want to obey the "NOINDEX" directive in the Robots META tag. Always returns true in the default implementation.

Returns:
true if and only if the page can be indexed.
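
As an illustration of what a subclass override might check, the following is a minimal sketch of detecting a "NOINDEX" directive in a Robots META tag. The class name RobotsMetaSketch and the regex-based scan are assumptions made for this example; they are not taken from ir.webutils, and a production crawler would more likely use a real HTML parser.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ir.webutils.
class RobotsMetaSketch {

    // Matches any <meta ...> tag, regardless of case.
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Matches name=robots, name="robots", or name='robots' inside a tag.
    private static final Pattern ROBOTS_NAME =
            Pattern.compile("name\\s*=\\s*[\"']?robots\\b", Pattern.CASE_INSENSITIVE);

    // Returns false only when a Robots META tag mentions "noindex".
    static boolean indexAllowed(String html) {
        Matcher tags = META_TAG.matcher(html);
        while (tags.find()) {
            String tag = tags.group();
            boolean isRobotsTag = ROBOTS_NAME.matcher(tag).find();
            if (isRobotsTag && tag.toLowerCase(Locale.ROOT).contains("noindex")) {
                return false;
            }
        }
        return true;
    }
}
```

A caller would apply such a check to the string returned by getText() and skip indexing when it returns false.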

empty

public boolean empty()
Returns true if the page is empty or a 404 error.


write

public void write(java.io.File dir,
                  java.lang.String name)
Writes web page to a file with a BASE HTML element with the original URL.

Parameters:
dir - The directory to store the file in.
name - The name of the file.
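
One way to picture the BASE element handling is the sketch below, which inserts a BASE tag right after the opening HEAD tag (or at the very start when there is no HEAD) before saving the file. The class BaseWriterSketch, the method withBase, and the extra html and url parameters are hypothetical names chosen for this example; the actual write method may place the element differently.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ir.webutils.
class BaseWriterSketch {

    // Matches an opening <head> tag (but not <header>), regardless of case.
    private static final Pattern HEAD_TAG =
            Pattern.compile("<head\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the page text with a BASE element pointing at the original URL.
    static String withBase(String html, String url) {
        String base = "<base href=\"" + url + "\">";
        Matcher head = HEAD_TAG.matcher(html);
        if (head.find()) {
            // Insert immediately after the opening <head> tag.
            return html.substring(0, head.end()) + base + html.substring(head.end());
        }
        // No <head> found: assume prepending is acceptable.
        return base + html;
    }

    // Writes the modified page to dir/name as UTF-8 (requires Java 11+).
    static void write(File dir, String name, String html, String url) {
        try {
            Files.writeString(new File(dir, name).toPath(),
                    withBase(html, url), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the BASE element in place, relative links in the saved copy still resolve against the original site when the file is opened locally.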

addEndSlash

protected static java.net.URL addEndSlash(java.net.URL url)
If the URL looks like a directory rather than a file, adds a "/" at the end so that it acts as a proper base URL for resolving relative URLs in this page.
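
The description does not say how "looks like a directory" is decided, so the following is only a sketch under an assumed heuristic: a URL is treated as a directory when it has no query or fragment and its last path segment contains no dot. The class EndSlashSketch and the parse helper are hypothetical and are not the ir.webutils implementation.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical stand-in, not the ir.webutils implementation.
class EndSlashSketch {

    // Convenience parser that converts the checked exception to an unchecked one.
    static URL parse(String spec) {
        try {
            return URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Assumed heuristic: no trailing slash, no query or fragment, and no dot
    // in the last path segment means the URL names a directory.
    static URL addEndSlash(URL url) {
        String path = url.getPath();
        if (path.endsWith("/") || url.getQuery() != null || url.getRef() != null) {
            return url;
        }
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        if (lastSegment.contains(".")) {
            return url; // looks like a file such as index.html
        }
        return parse(url.toExternalForm() + "/");
    }
}
```

The trailing slash matters because relative links resolve against the base URL's directory: "page.html" resolved against "http://example.com/docs" yields "http://example.com/page.html", whereas against "http://example.com/docs/" it yields "http://example.com/docs/page.html".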