LinkExtractor

ir.webutils
Class LinkExtractor

java.lang.Object
  javax.swing.text.html.HTMLEditorKit.ParserCallback
      ir.webutils.LinkExtractor

Direct Known Subclasses:: AnchoredLinkExtractor

public class LinkExtractor
extends javax.swing.text.html.HTMLEditorKit.ParserCallback

LinkExtractor defines a parser callback that extracts the links from an HTML document, and it also provides the functionality to parse that document. It uses the HTML parser in Java Swing to parse the document, find the links, and translate relative links into absolute URLs, so every extracted link is absolute.

Field Summary
`protected java.util.List<Link>`	`links` The current list of extracted links
`protected HTMLPage`	`page` The page from which to extract links
`protected java.net.URL`	`url` The URL for this page

Fields inherited from class javax.swing.text.html.HTMLEditorKit.ParserCallback
`IMPLIED`

Constructor Summary
`LinkExtractor(HTMLPage page)` Creates a link extractor for the given page.

Method Summary
`protected void`	`addLink(javax.swing.text.MutableAttributeSet attributes, javax.swing.text.html.HTML.Attribute attr)` Retrieves a link from an attribute set and completes it against the base URL.
`java.util.List<Link>`	`extractLinks()` Extracts links from the given page.
`void`	`handleEndTag(javax.swing.text.html.HTML.Tag tag, int position)` Executed when a closing HTML tag is found in the document.
`void`	`handleSimpleTag(javax.swing.text.html.HTML.Tag tag, javax.swing.text.MutableAttributeSet attributes, int position)` Executed when an HTML tag that has no closing tag is found in the document.
`void`	`handleStartTag(javax.swing.text.html.HTML.Tag tag, javax.swing.text.MutableAttributeSet attributes, int position)` Executed when an opening HTML tag is found in the document.
`void`	`handleText(char[] text, int position)` Executed when a block of text is encountered.

Methods inherited from class javax.swing.text.html.HTMLEditorKit.ParserCallback
`flush, handleComment, handleEndOfLineString, handleError`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

links

protected java.util.List<Link> links

The current list of extracted links

page

protected HTMLPage page

The page from which to extract links

url

protected java.net.URL url

The URL for this page

Constructor Detail

LinkExtractor

public LinkExtractor(HTMLPage page)

Creates a link extractor for the given page.

Method Detail

handleText

public void handleText(char[] text,
                       int position)

Executed when a block of text is encountered. This version ignores text.

Overrides:: handleText in class javax.swing.text.html.HTMLEditorKit.ParserCallback

Parameters:: text - A char array representation of the text.; position - The position of the text in the document.

handleStartTag

public void handleStartTag(javax.swing.text.html.HTML.Tag tag,
                           javax.swing.text.MutableAttributeSet attributes,
                           int position)

Executed when an opening HTML tag is found in the document. Note that this method only handles tags that also have a closing tag. This version catches "a" tags and adds a link for each one after completing it against the base URL.

Overrides:: handleStartTag in class javax.swing.text.html.HTMLEditorKit.ParserCallback

Parameters:: tag - The tag that caused this function to be executed.; attributes - The attributes of tag.; position - The start of the tag in the document. If the tag is implied (filled in by the parser but not actually present in the document) then position will correspond to that of the next encountered tag.

handleEndTag

public void handleEndTag(javax.swing.text.html.HTML.Tag tag,
                         int position)

Executed when a closing HTML tag is found in the document. Note that the parser may add "implied" closing tags. For example, the default parser adds closing <p> tags. This version just ignores end tags.

Overrides:: handleEndTag in class javax.swing.text.html.HTMLEditorKit.ParserCallback

Parameters:: tag - The tag found.; position - The position of the tag in the document.
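
The note about implied closing tags can be observed with a small counting callback. The following is a minimal sketch, not part of ir.webutils: the class name EndTagCounter and the method countParagraphEnds are hypothetical. It feeds the Swing parser a fragment that contains no literal closing paragraph tag and counts how often handleEndTag is invoked for P.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: counts the end-tag events reported for <p>.
class EndTagCounter extends HTMLEditorKit.ParserCallback {
    private int paragraphEnds = 0;

    @Override
    public void handleEndTag(HTML.Tag tag, int position) {
        // Called for explicit end tags and for end tags the parser implies.
        if (tag == HTML.Tag.P) {
            paragraphEnds++;
        }
    }

    // Parses the given HTML and returns the number of </p> events seen.
    static int countParagraphEnds(String html) {
        EndTagCounter counter = new EndTagCounter();
        try {
            // true = ignore charset declarations inside the document
            new ParserDelegator().parse(new StringReader(html), counter, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counter.paragraphEnds;
    }

    public static void main(String[] args) {
        // The input never closes its paragraphs explicitly.
        String html = "<html><body><p>one<p>two</body></html>";
        System.out.println("end-tag events for P: " + countParagraphEnds(html));
    }
}
```

Because handleEndTag receives no attribute set, a callback cannot tell an implied end tag from an explicit one, which is one reason a link extractor can safely ignore end tags altogether.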

handleSimpleTag

public void handleSimpleTag(javax.swing.text.html.HTML.Tag tag,
                            javax.swing.text.MutableAttributeSet attributes,
                            int position)

Executed when an HTML tag that has no closing tag is found in the document. This version adds links for FRAME tags.

Overrides:: handleSimpleTag in class javax.swing.text.html.HTMLEditorKit.ParserCallback

Parameters:: tag - The tag that caused this function to be executed.; attributes - The attributes of tag.; position - The start of the tag in the document. If the tag is implied (filled in by the parser but not actually present in the document) then position will correspond to that of the next encountered tag.

extractLinks

public java.util.List<Link> extractLinks()

Extracts links from the given page. This method constructs a parser and registers this as the callback.

Returns:: A list of Link objects containing the links found on this page. The links will all be absolute links.
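
The pattern of constructing a parser and registering the callback can be sketched with the Swing classes alone. This is an illustrative stand-in, not the actual LinkExtractor source: the class AnchorCollector and its extract method are hypothetical, it returns plain strings rather than Link objects, and it takes the HTML and base address directly instead of an HTMLPage. It also resolves links with java.net.URI, whereas the documented class holds a java.net.URL.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical stand-in for LinkExtractor: collects absolute link targets.
class AnchorCollector extends HTMLEditorKit.ParserCallback {
    private final URI base;                        // assumed address of the page
    private final List<String> links = new ArrayList<>();

    AnchorCollector(URI base) {
        this.base = base;
    }

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attributes, int position) {
        if (tag != HTML.Tag.A) {
            return;                                // only anchors are of interest here
        }
        Object href = attributes.getAttribute(HTML.Attribute.HREF);
        if (href == null) {
            return;                                // e.g. <a name="..."> has no target
        }
        try {
            // Complete a relative link against the page address.
            links.add(base.resolve(href.toString()).toString());
        } catch (IllegalArgumentException e) {
            // Malformed link text: skip it in this sketch.
        }
    }

    // Builds a Swing parser, registers this object as the callback, returns the links.
    List<String> extract(String html) {
        links.clear();
        try {
            new ParserDelegator().parse(new StringReader(html), this, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new ArrayList<>(links);
    }

    public static void main(String[] args) {
        AnchorCollector collector =
            new AnchorCollector(URI.create("http://example.com/docs/index.html"));
        String html = "<html><body><a href=\"page2.html\">Next</a>"
                    + " <a href=\"http://other.org/x\">Other</a></body></html>";
        for (String link : collector.extract(html)) {
            System.out.println(link);
        }
    }
}
```

Returning a copy of the internal list keeps later parses from changing a result the caller already holds. Whether the real extractLinks does the same is not stated here.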

addLink

protected void addLink(javax.swing.text.MutableAttributeSet attributes,
                       javax.swing.text.html.HTML.Attribute attr)

Retrieves a link from an attribute set and completes it against the base URL.

Parameters:: attributes - The attribute set.; attr - The attribute that should be treated as a URL. For example, attr should be HTML.Attribute.HREF if attributes is from an anchor tag.
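
Completing a link against a base address can be sketched as follows. This is an assumption-laden illustration rather than the documented implementation: LinkCompletion and complete are hypothetical names, the base is passed in explicitly, and the result is returned instead of being appended to the links field. Resolution uses java.net.URI.resolve.

```java
import java.net.URI;
import javax.swing.text.AttributeSet;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.html.HTML;

// Hypothetical helper showing how an attribute value becomes an absolute link.
class LinkCompletion {

    // Returns the absolute form of the URL stored under attr,
    // or null when the attribute is absent or its value is malformed.
    static URI complete(URI base, AttributeSet attributes, HTML.Attribute attr) {
        Object value = attributes.getAttribute(attr);
        if (value == null) {
            return null;
        }
        try {
            return base.resolve(value.toString().trim());
        } catch (IllegalArgumentException e) {
            return null;   // not a syntactically valid URI reference
        }
    }

    public static void main(String[] args) {
        URI base = URI.create("http://example.com/docs/index.html");

        // Attribute set as it might arrive from an anchor tag.
        SimpleAttributeSet anchor = new SimpleAttributeSet();
        anchor.addAttribute(HTML.Attribute.HREF, "../a.html");

        System.out.println(complete(base, anchor, HTML.Attribute.HREF));
        // No SRC attribute is present, so this prints "null".
        System.out.println(complete(base, anchor, HTML.Attribute.SRC));
    }
}
```

Passing the attribute key as a parameter lets one routine serve several tag types: HTML.Attribute.HREF for anchors and, for example, HTML.Attribute.SRC for frames.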
