ir.webutils
Class AnchoredLinkExtractor

java.lang.Object
  extended by javax.swing.text.html.HTMLEditorKit.ParserCallback
      extended by ir.webutils.LinkExtractor
          extended by ir.webutils.AnchoredLinkExtractor
Direct Known Subclasses:
ScoredAnchoredLinkExtractor

public class AnchoredLinkExtractor
extends LinkExtractor

Extractor for AnchoredLink objects. Overrides the HTML parser callback routines so that anchor text is also extracted and stored for every link.
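
The callback pattern described above can be illustrated with a minimal sketch that uses only JDK classes. It is not the ir.webutils implementation: the class name AnchorTextSketch, the links list of {href, text} pairs, and the extract helper are hypothetical names chosen for illustration, and relative URLs are left unresolved.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical stand-in for an anchored link extractor, built only on JDK classes.
class AnchorTextSketch extends HTMLEditorKit.ParserCallback {
    // Each entry is {href, anchor text}; the real class stores AnchoredLink objects.
    final List<String[]> links = new ArrayList<>();
    private StringBuilder anchorText; // non-null only while inside an "a" element
    private String currentHref;

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
        if (tag == HTML.Tag.A) {
            Object href = attributes.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                currentHref = href.toString();
                anchorText = new StringBuilder(); // start a fresh buffer for this link
            }
        }
    }

    @Override
    public void handleText(char[] text, int position) {
        if (anchorText != null) {
            anchorText.append(text); // only text inside an anchor is kept
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int position) {
        if (tag == HTML.Tag.A && anchorText != null) {
            links.add(new String[] {currentHref, anchorText.toString()});
            anchorText = null;
        }
    }

    // Parses the given HTML and returns the collected {href, anchor text} pairs.
    static List<String[]> extract(String html) {
        AnchorTextSketch callback = new AnchorTextSketch();
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.links;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"a.html\">First page</a></body></html>";
        for (String[] link : extract(html)) {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```

The sketch keeps the buffer null outside anchors so that handleText needs only one check to decide whether text belongs to a link.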


Field Summary
protected  java.lang.StringBuffer anchorText
          Buffer to store anchor text encountered between an "a" start tag and end tag.
protected  AnchoredLink currentLink
          The current link being processed.
 
Fields inherited from class ir.webutils.LinkExtractor
links, page, url
 
Fields inherited from class javax.swing.text.html.HTMLEditorKit.ParserCallback
IMPLIED
 
Constructor Summary
AnchoredLinkExtractor(HTMLPage page)
          Creates an anchored link extractor for the given page.
 
Method Summary
protected  void addLink(javax.swing.text.MutableAttributeSet attributes, javax.swing.text.html.HTML.Attribute attr)
          Retrieves a link from an attribute set and completes it against the base URL.
static void appendTag(java.lang.StringBuffer buffer, javax.swing.text.html.HTML.Tag tag, javax.swing.text.MutableAttributeSet attributes)
          Writes the given tag and its attributes to the buffer.
 void handleEndTag(javax.swing.text.html.HTML.Tag tag, int position)
          Executed when a closing HTML tag is found in the document.
 void handleSimpleTag(javax.swing.text.html.HTML.Tag tag, javax.swing.text.MutableAttributeSet attributes, int position)
          Executed when an HTML tag that has no closing tag is found in the document.
 void handleStartTag(javax.swing.text.html.HTML.Tag tag, javax.swing.text.MutableAttributeSet attributes, int position)
          Executed when an opening HTML tag is found in the document.
 void handleText(char[] text, int position)
          Executed when a block of text is encountered.
static void main(java.lang.String[] args)
           
 
Methods inherited from class ir.webutils.LinkExtractor
extractLinks
 
Methods inherited from class javax.swing.text.html.HTMLEditorKit.ParserCallback
flush, handleComment, handleEndOfLineString, handleError
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

anchorText

protected java.lang.StringBuffer anchorText
Buffer to store anchor text encountered between an "a" start tag and end tag.


currentLink

protected AnchoredLink currentLink
The current link being processed.

Constructor Detail

AnchoredLinkExtractor

public AnchoredLinkExtractor(HTMLPage page)
Creates an anchored link extractor for the given page.

Method Detail

handleText

public void handleText(char[] text,
                       int position)
Executed when a block of text is encountered. If the text is inside an anchor tag, it is stored in anchorText.

Overrides:
handleText in class LinkExtractor
Parameters:
text - A char array representation of the text.
position - The position of the text in the document.

handleStartTag

public void handleStartTag(javax.swing.text.html.HTML.Tag tag,
                           javax.swing.text.MutableAttributeSet attributes,
                           int position)
Executed when an opening HTML tag is found in the document. Note that this method only handles tags that also have a closing tag. An "a" tag starts a new anchorText buffer. If the parser is already inside an "a" tag, the tag information is stored in anchorText.

Overrides:
handleStartTag in class LinkExtractor
Parameters:
tag - The tag that caused this function to be executed.
attributes - The attributes of tag.
position - The start of the tag in the document. If the tag is implied (filled in by the parser but not actually present in the document) then position will correspond to that of the next encountered tag.

appendTag

public static void appendTag(java.lang.StringBuffer buffer,
                             javax.swing.text.html.HTML.Tag tag,
                             javax.swing.text.MutableAttributeSet attributes)
Writes the given tag and its attributes to the buffer.
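
As a rough illustration of writing a tag with its attributes to a buffer, the sketch below uses only JDK classes. The output format (<name key="value">) and the class name AppendTagSketch are assumptions; the actual method may format tags differently.

```java
import java.util.Enumeration;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.html.HTML;

// Hypothetical sketch; not the ir.webutils implementation.
class AppendTagSketch {
    // Appends the tag as <name key="value" ...>. This format is an assumption.
    static void appendTag(StringBuffer buffer, HTML.Tag tag, MutableAttributeSet attributes) {
        buffer.append('<').append(tag); // HTML.Tag.toString() yields the tag name
        Enumeration<?> names = attributes.getAttributeNames();
        while (names.hasMoreElements()) {
            Object name = names.nextElement();
            buffer.append(' ').append(name)
                  .append("=\"").append(attributes.getAttribute(name)).append('"');
        }
        buffer.append('>');
    }

    public static void main(String[] args) {
        SimpleAttributeSet attributes = new SimpleAttributeSet();
        attributes.addAttribute(HTML.Attribute.SRC, "logo.png");
        StringBuffer buffer = new StringBuffer();
        appendTag(buffer, HTML.Tag.IMG, attributes);
        System.out.println(buffer); // prints <img src="logo.png">
    }
}
```

Attribute values are written without escaping here, which is acceptable for a sketch but would need attention if the buffer were later re-parsed as HTML.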


handleEndTag

public void handleEndTag(javax.swing.text.html.HTML.Tag tag,
                         int position)
Executed when a closing HTML tag is found in the document. Note that the parser may add "implied" closing tags. For example, the default parser adds closing tags for <p> elements. At the end of an "a" tag, the accumulated anchorText is added to the current link (the last one added to links). If the parser is already inside an "a" tag, the tag information is stored in anchorText.

Overrides:
handleEndTag in class LinkExtractor
Parameters:
tag - The tag found.
position - The position of the tag in the document.

handleSimpleTag

public void handleSimpleTag(javax.swing.text.html.HTML.Tag tag,
                            javax.swing.text.MutableAttributeSet attributes,
                            int position)
Executed when an HTML tag that has no closing tag is found in the document. If the parser is already inside an "a" tag, the tag information is stored in anchorText.

Overrides:
handleSimpleTag in class LinkExtractor
Parameters:
tag - The tag that caused this function to be executed.
attributes - The attributes of tag.
position - The start of the tag in the document. If the tag is implied (filled in by the parser but not actually present in the document) then position will correspond to that of the next encountered tag.
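
The way handleStartTag, handleEndTag, and handleSimpleTag store nested tag information while inside an "a" tag might look roughly like the sketch below. It relies only on JDK classes; the class name NestedAnchorMarkupSketch, the inAnchor flag, and the anchorMarkup helper are hypothetical, and attributes of nested tags are omitted for brevity.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical sketch: keeps markup that appears inside an "a" element.
class NestedAnchorMarkupSketch extends HTMLEditorKit.ParserCallback {
    private final StringBuffer anchorText = new StringBuffer();
    private boolean inAnchor = false;

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
        if (tag == HTML.Tag.A) {
            inAnchor = true;
            anchorText.setLength(0); // a new anchor starts a new buffer
        } else if (inAnchor) {
            anchorText.append('<').append(tag).append('>'); // nested opening tag
        }
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
        if (inAnchor) {
            anchorText.append('<').append(tag).append('>'); // e.g. an image inside a link
        }
    }

    @Override
    public void handleText(char[] text, int position) {
        if (inAnchor) {
            anchorText.append(text);
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int position) {
        if (tag == HTML.Tag.A) {
            inAnchor = false; // buffer now holds the complete anchor content
        } else if (inAnchor) {
            anchorText.append("</").append(tag).append('>'); // nested closing tag
        }
    }

    // Returns the text and markup collected for the last anchor in the document.
    static String anchorMarkup(String html) {
        NestedAnchorMarkupSketch callback = new NestedAnchorMarkupSketch();
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.anchorText.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"x.html\"><b>Bold</b> text"
                + "<img src=\"i.png\"></a></body></html>";
        System.out.println(anchorMarkup(html));
    }
}
```

Keeping nested markup in the buffer preserves cues such as emphasis or images inside a link, at the cost of anchor text that is no longer plain text.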

addLink

protected void addLink(javax.swing.text.MutableAttributeSet attributes,
                       javax.swing.text.html.HTML.Attribute attr)
Retrieves a link from an attribute set and completes it against the base URL. This version creates AnchoredLink objects.

Overrides:
addLink in class LinkExtractor
Parameters:
attributes - The attribute set.
attr - The attribute that should be treated as a URL. For example, attr should be HTML.Attribute.HREF if attributes is from an anchor tag.
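
Completing a link against a base URL can be sketched with java.net.URI, as below. This is an assumption about the general technique rather than the code of addLink itself; the class name CompleteLinkSketch and the complete helper are hypothetical, and the real method stores the result as an AnchoredLink instead of returning a string.

```java
import java.net.URI;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.html.HTML;

// Hypothetical sketch; not the ir.webutils implementation.
class CompleteLinkSketch {
    // Reads attr from the attribute set and resolves it against the base URL.
    // Returns null when the attribute is absent. URI.resolve throws
    // IllegalArgumentException if the value is not a valid URI reference.
    static String complete(URI base, MutableAttributeSet attributes, HTML.Attribute attr) {
        Object value = attributes.getAttribute(attr);
        if (value == null) {
            return null;
        }
        return base.resolve(value.toString().trim()).toString();
    }

    public static void main(String[] args) {
        SimpleAttributeSet attributes = new SimpleAttributeSet();
        attributes.addAttribute(HTML.Attribute.HREF, "page2.html");
        URI base = URI.create("http://example.com/docs/index.html");
        // prints http://example.com/docs/page2.html
        System.out.println(complete(base, attributes, HTML.Attribute.HREF));
    }
}
```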

main

public static void main(java.lang.String[] args)
                 throws java.lang.Exception
Throws:
java.lang.Exception