ir.webutils
Class YahooSiteLinkExtractor

java.lang.Object
  extended by javax.swing.text.html.HTMLEditorKit.ParserCallback
      extended by ir.webutils.YahooSiteLinkExtractor

public class YahooSiteLinkExtractor
extends javax.swing.text.html.HTMLEditorKit.ParserCallback

YahooSiteLinkExtractor defines a callback that extracts site links from a Yahoo directory page and provides functionality to parse the document. It uses the HTML parser in Java Swing to find links and converts relative links to absolute URLs, so every extracted link is absolute.


Field Summary
protected  boolean inSiteSection
          Flag that is true during parsing while the HTML parser is in the section of the web page that lists site links
protected  java.util.List<Link> links
          The current list of extracted site links
protected  java.lang.String moreURL
          Non-null during parsing while the HTML parser is inside the anchor text of a Yahoo link that refers to more sites not listed on the current page. Stores the URL for this link while in its anchor text; null otherwise.
protected  HTMLPage page
          The page from which to extract links
protected  java.net.URL url
          The URL for this page
 
Fields inherited from class javax.swing.text.html.HTMLEditorKit.ParserCallback
IMPLIED
 
Constructor Summary
YahooSiteLinkExtractor(HTMLPage page)
          Creates a link extractor for the given page
 
Method Summary
protected  void addLink(javax.swing.text.MutableAttributeSet attributes, javax.swing.text.html.HTML.Attribute attr)
          Retrieves a link from an attribute set and completes it against the base URL.
 java.util.List<Link> extractLinks()
          Extracts site links from the given Yahoo page.
 void handleEndTag(javax.swing.text.html.HTML.Tag tag, int position)
          Executed when a closing HTML tag is found in the document.
 void handleSimpleTag(javax.swing.text.html.HTML.Tag tag, javax.swing.text.MutableAttributeSet attributes, int position)
          Executed when an HTML tag that has no closing tag is found in the document.
 void handleStartTag(javax.swing.text.html.HTML.Tag tag, javax.swing.text.MutableAttributeSet attributes, int position)
          Executed when an opening HTML tag is found in the document.
 void handleText(char[] text, int position)
          Executed when a block of text is encountered.
static void main(java.lang.String[] args)
          Given a Yahoo directory URL as the single argument, tests extraction of site links from that page.
 
Methods inherited from class javax.swing.text.html.HTMLEditorKit.ParserCallback
flush, handleComment, handleEndOfLineString, handleError
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

links

protected java.util.List<Link> links
The current list of extracted site links


page

protected HTMLPage page
The page from which to extract links


url

protected java.net.URL url
The URL for this page


inSiteSection

protected boolean inSiteSection
Flag that is true during parsing while the HTML parser is in the section of the web page that lists site links


moreURL

protected java.lang.String moreURL
Non-null during parsing while the HTML parser is inside the anchor text of a Yahoo link that refers to more sites not listed on the current page. Stores the URL for this link while in its anchor text; null otherwise.

Constructor Detail

YahooSiteLinkExtractor

public YahooSiteLinkExtractor(HTMLPage page)
Creates a link extractor for the given page

Method Detail

handleText

public void handleText(char[] text,
                       int position)
Executed when a block of text is encountered. If the text indicates entry into the site listing part of a Yahoo directory page, sets the inSiteSection flag to true. If the parser is in the anchor text of a link to more site links, and that text explicitly refers to the "Next" set of results, recursively extracts the site links on that page of additional results by creating a YahooSiteLinkExtractor for it and adding the extracted links to the links for this category.

Overrides:
handleText in class javax.swing.text.html.HTMLEditorKit.ParserCallback
Parameters:
text - A char array representation of the text.
position - The position of the text in the document.
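
The recursive handling of "Next" links described above can be sketched as follows. This is only an illustration under stated assumptions, not the class's actual code: the class name NextPageSketch, the in-memory map standing in for the web, the visited set, and the rule that any anchor whose text contains "Next" is a link to more results are all hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class NextPageSketch {

    /** Hypothetical callback: collects anchors and follows the one whose text contains "Next". */
    static class Collector extends HTMLEditorKit.ParserCallback {
        final Map<String, String> pages;  // hypothetical in-memory "web": URL -> HTML
        final Set<String> visited;        // guards against cycles between result pages
        final List<String> links = new ArrayList<>();
        String currentHref;               // non-null only while inside an anchor's text

        Collector(Map<String, String> pages, Set<String> visited) {
            this.pages = pages;
            this.visited = visited;
        }

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
            if (tag == HTML.Tag.A) {
                Object href = attributes.getAttribute(HTML.Attribute.HREF);
                currentHref = (href == null) ? null : href.toString();
            }
        }

        @Override
        public void handleText(char[] text, int position) {
            if (currentHref == null) {
                return; // not inside an anchor
            }
            if (new String(text).contains("Next")) {
                // Recurse into the page of additional results and merge its links.
                links.addAll(extract(currentHref, pages, visited));
            } else {
                links.add(currentHref);
            }
            currentHref = null; // handle each anchor at most once
        }

        @Override
        public void handleEndTag(HTML.Tag tag, int position) {
            if (tag == HTML.Tag.A) {
                currentHref = null; // left the anchor text
            }
        }
    }

    /** Extracts links from the page at url, following "Next" links; already visited pages yield nothing. */
    static List<String> extract(String url, Map<String, String> pages, Set<String> visited) {
        String html = pages.get(url);
        if (html == null || !visited.add(url)) {
            return new ArrayList<>();
        }
        Collector callback = new Collector(pages, visited);
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.links;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new HashMap<>();
        pages.put("p1", "<html><body><a href=\"s1.html\">Site 1</a> <a href=\"p2\">Next 20</a></body></html>");
        pages.put("p2", "<html><body><a href=\"s2.html\">Site 2</a></body></html>");
        System.out.println(extract("p1", pages, new HashSet<>()));
    }
}
```

The visited set is an addition of this sketch; the documentation above does not say whether the class itself protects against result pages that link back to each other.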

handleStartTag

public void handleStartTag(javax.swing.text.html.HTML.Tag tag,
                           javax.swing.text.MutableAttributeSet attributes,
                           int position)
Executed when an opening HTML tag is found in the document. Note that this method only handles tags that also have a closing tag. If the parser is currently in the site listing section, saves any link in the set of extracted links. If the tag is an anchor link to more Yahoo site results, saves its URL in the moreURL field.

Overrides:
handleStartTag in class javax.swing.text.html.HTMLEditorKit.ParserCallback
Parameters:
tag - The tag that caused this function to be executed.
attributes - The attributes of tag.
position - The start of the tag in the document. If the tag is implied (filled in by the parser but not actually present in the document) then position will correspond to that of the next encountered tag.

handleEndTag

public void handleEndTag(javax.swing.text.html.HTML.Tag tag,
                         int position)
Executed when a closing HTML tag is found in the document. Note that the parser may add "implied" closing tags. For example, the default parser adds closing <p> tags. If the end of a TABLE tag is encountered while in the site listing section of the Yahoo page, this marks the end of that section and sets the inSiteSection flag to false. If the tag ends the anchor text of a link to more results, sets moreURL to null to indicate that the parser is no longer in such a link.

Overrides:
handleEndTag in class javax.swing.text.html.HTMLEditorKit.ParserCallback
Parameters:
tag - The tag found.
position - The position of the tag in the document.
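
The interplay of handleText, handleStartTag, and handleEndTag around the inSiteSection flag can be sketched as a small state machine. This is a minimal sketch, not the class's implementation: the class name SiteSectionSketch and the marker text "SITE LISTINGS" are hypothetical, since the documentation above does not state which text marks the start of the site listing section.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class SiteSectionSketch {

    // Hypothetical marker; the real page text that opens the section is not documented here.
    static final String SECTION_MARKER = "SITE LISTINGS";

    static class SectionCallback extends HTMLEditorKit.ParserCallback {
        boolean inSiteSection = false;
        final List<String> hrefs = new ArrayList<>();

        @Override
        public void handleText(char[] text, int position) {
            // Entering the section: the marker text switches the flag on.
            if (new String(text).contains(SECTION_MARKER)) {
                inSiteSection = true;
            }
        }

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
            // Only anchors seen while the flag is set are treated as site links.
            if (inSiteSection && tag == HTML.Tag.A) {
                Object href = attributes.getAttribute(HTML.Attribute.HREF);
                if (href != null) {
                    hrefs.add(href.toString());
                }
            }
        }

        @Override
        public void handleEndTag(HTML.Tag tag, int position) {
            // Leaving the section: the closing TABLE tag switches the flag off.
            if (inSiteSection && tag == HTML.Tag.TABLE) {
                inSiteSection = false;
            }
        }
    }

    /** Returns the HREFs of anchors that appear between the marker text and the next closing TABLE tag. */
    static List<String> siteLinks(String html) {
        SectionCallback callback = new SectionCallback();
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.hrefs;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"nav.html\">Nav</a><h2>SITE LISTINGS</h2>"
                + "<table><tr><td><a href=\"site1.html\">Site 1</a></td></tr></table>"
                + "<a href=\"footer.html\">Footer</a></body></html>";
        System.out.println(siteLinks(html));
    }
}
```

Anchors before the marker and after the closing TABLE tag are skipped because the flag is false at those points.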

handleSimpleTag

public void handleSimpleTag(javax.swing.text.html.HTML.Tag tag,
                            javax.swing.text.MutableAttributeSet attributes,
                            int position)
Executed when an HTML tag that has no closing tag is found in the document. This implementation does nothing.

Overrides:
handleSimpleTag in class javax.swing.text.html.HTMLEditorKit.ParserCallback
Parameters:
tag - The tag that caused this function to be executed.
attributes - The attributes of tag.
position - The start of the tag in the document. If the tag is implied (filled in by the parser but not actually present in the document) then position will correspond to that of the next encountered tag.

extractLinks

public java.util.List<Link> extractLinks()
Extracts site links from the given Yahoo page. This method constructs a parser and registers this object as the callback.

Returns:
A list of Link objects containing the links found on this page. The links will all be absolute links.
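
Constructing a Swing HTML parser and registering a callback with it might look like the sketch below. It assumes ParserDelegator as the parser and reads the HTML from a string; how extractLinks obtains the text from the HTMLPage is not documented here, and the names CallbackParseSketch and HrefCollector are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class CallbackParseSketch {

    /** Hypothetical callback that records the HREF of every anchor tag. */
    static class HrefCollector extends HTMLEditorKit.ParserCallback {
        final List<String> hrefs = new ArrayList<>();

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
            if (tag == HTML.Tag.A) {
                Object href = attributes.getAttribute(HTML.Attribute.HREF);
                if (href != null) {
                    hrefs.add(href.toString());
                }
            }
        }
    }

    /** Parses the given HTML text with the Swing parser and returns the collected HREFs. */
    static List<String> collectHrefs(String html) {
        HrefCollector callback = new HrefCollector();
        try {
            // The third argument tells the parser to ignore any charset declared in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.hrefs;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"a.html\">A</a> "
                + "<a href=\"http://example.com/b\">B</a></body></html>";
        System.out.println(collectHrefs(html));
    }
}
```

Unlike extractLinks, this sketch returns the HREF values as written in the document and does not make them absolute.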

addLink

protected void addLink(javax.swing.text.MutableAttributeSet attributes,
                       javax.swing.text.html.HTML.Attribute attr)
Retrieves a link from an attribute set and completes it against the base URL.

Parameters:
attributes - The attribute set.
attr - The attribute that should be treated as a URL. For example, attr should be HTML.Attribute.HREF if attributes is from an anchor tag.
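
Completing a link against a base URL can be illustrated as follows. This is a sketch under assumptions: it resolves with java.net.URI rather than the java.net.URL field the class holds, it returns the result instead of appending it to the links list, and the names ResolveLinkSketch and resolve are hypothetical.

```java
import java.net.URI;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.html.HTML;

public class ResolveLinkSketch {

    /**
     * Reads the given attribute from the attribute set and resolves it against base.
     * Returns null if the attribute is absent or is not a syntactically valid URI.
     */
    static URI resolve(URI base, MutableAttributeSet attributes, HTML.Attribute attr) {
        Object value = attributes.getAttribute(attr);
        if (value == null) {
            return null;
        }
        try {
            // Relative references are completed against base; absolute ones are returned unchanged.
            return base.resolve(value.toString().trim());
        } catch (IllegalArgumentException e) {
            return null; // malformed link text
        }
    }

    public static void main(String[] args) {
        URI base = URI.create("http://example.com/dir/index.html");
        SimpleAttributeSet attrs = new SimpleAttributeSet();
        attrs.addAttribute(HTML.Attribute.HREF, "../other/page.html");
        System.out.println(resolve(base, attrs, HTML.Attribute.HREF));
    }
}
```

Passing the attribute as a parameter, as addLink does, lets the same routine serve tags whose URL lives in an attribute other than HREF.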

main

public static void main(java.lang.String[] args)
Given a Yahoo directory URL as the single argument, tests extraction of site links from that page.