ir.webutils

Package ir.webutils

Provides web utilities for downloading web pages and spidering the web.


Class Summary
AnchoredLink	Link with included anchor text
AnchoredLinkExtractor	Extractor for AnchoredLink objects.
BeamSearchSiteSpider	A BeamSearchSpider that limits itself to a given site (web host).
BeamSearchSpider	A spider that uses heuristic beam search to find a web page that contains a set of "want strings" using a set of "help strings" to guide the search.
DirectorySpider	Spider that limits itself to the directory it started in.
Graph	Graph data structure.
HTMLPage	HTMLPage is a representation of information about a web page.
HTMLPageRetriever	HTMLPageRetriever allows clients to download web pages from URLs.
HTMLParserMaker	HTMLParserMaker allows clients to retrieve an HTMLEditorKit.Parser instance.
Link	Link is a class that contains a URL.
LinkExtractor	LinkExtractor defines a callback that extracts the links from an HTML document and provides functionality to parse a document.
LinkHeuristic	Evaluates a web link (ScoredAnchoredLink) based on satisfying a set of "want strings" and "help strings".
Node	Node in the Graph data structure.
PageGoal	Object for defining the goal in a heuristic web search.
RobotExclusionSet	RobotExclusionSet provides support for the Robots Exclusion Protocol.
RobotsMetaTagParser	Parser callback that extracts robots META tag information.
SafeHTMLPage	SafeHTMLPage is an immutable representation of information about a web page that includes information about whether or not this page can be indexed.
SafeHTMLPageRetriever	Keeps track of Robot Exclusion information.
ScoredAnchoredLink	An AnchoredLink that can be used in heuristic web search where links are scored for their promise.
ScoredAnchoredLinkExtractor	An AnchoredLinkExtractor that extracts ScoredAnchoredLink objects, which can be scored and used in heuristic web search.
SiteSpider	A spider that limits itself to a given site.
Spider	Spider defines a framework for writing a web crawler.
StringSearchResult	Lightweight object for storing both the number of DIFFERENT strings in a set of search strings that are found in a text as well as the total number of occurrences in the text of ANY of the strings in the set.
URLChecker	URLChecker tries to clean up some URLs that do not conform to the standard and cause confusion.
WebPage	WebPage is a static utility class that provides operations for downloading web pages.
WebPageViewer	WebPageViewer contains utilities to download and display HTML pages.
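
The two counts described for StringSearchResult (how many different search strings occur, and how many occurrences there are in total) can be sketched as follows. This is an illustration only, not the package's source: the class name StringCountSketch, its fields, and the count method are hypothetical, and the sketch assumes case-sensitive, non-overlapping matching, which the summary above does not specify.

```java
import java.util.List;

/** Hypothetical illustration; not the ir.webutils.StringSearchResult source. */
final class StringCountSketch {
    /** Number of different search strings found at least once. */
    final int distinctFound;
    /** Total occurrences of any of the search strings. */
    final int totalOccurrences;

    StringCountSketch(int distinctFound, int totalOccurrences) {
        this.distinctFound = distinctFound;
        this.totalOccurrences = totalOccurrences;
    }

    /** Counts case-sensitive, non-overlapping matches (an assumption). */
    static StringCountSketch count(String text, List<String> searchStrings) {
        int distinct = 0;
        int total = 0;
        for (String s : searchStrings) {
            if (s.isEmpty()) {
                continue; // an empty string would match at every position
            }
            int occurrences = 0;
            int from = text.indexOf(s);
            while (from >= 0) {
                occurrences++;
                from = text.indexOf(s, from + s.length());
            }
            if (occurrences > 0) {
                distinct++;
            }
            total += occurrences;
        }
        return new StringCountSketch(distinct, total);
    }
}
```

Keeping the two numbers separate lets a heuristic prefer text that covers many different search strings over text that merely repeats one of them.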

Exception Summary
PathDisallowedException	PathDisallowedException is thrown to indicate that a client program tried to access a path that was disallowed by either a robots.txt file or a robots META tag.
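
Under the Robots Exclusion Protocol, each Disallow value in robots.txt is a path prefix, and an empty value excludes nothing. The kind of check that would precede a PathDisallowedException can be sketched as below. The class RobotsDisallowSketch and its isAllowed method are hypothetical and do not reflect the actual RobotExclusionSet API; user-agent matching and META tag handling are left out.

```java
import java.util.List;

/** Hypothetical illustration of a robots.txt prefix check. */
final class RobotsDisallowSketch {
    /**
     * Returns false when the path starts with any non-empty Disallow prefix.
     * The prefixes are assumed to be already parsed from robots.txt.
     */
    static boolean isAllowed(String path, List<String> disallowedPrefixes) {
        for (String prefix : disallowedPrefixes) {
            // An empty Disallow value excludes nothing.
            if (!prefix.isEmpty() && path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```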

Package ir.webutils Description

Provides web utilities for downloading web pages and spidering the web.

For command line interfaces see the main methods of the following classes:

Spider
SiteSpider
DirectorySpider
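
The class summary says that SiteSpider stays on one site (web host) and DirectorySpider stays in the directory it started in. The difference can be pictured with the sketch below. It is not the package's implementation: SpiderScopeSketch and both method names are hypothetical, and treating subdirectories as part of the start directory is an assumption.

```java
import java.net.URI;

/** Hypothetical illustration of site scope versus directory scope. */
final class SpiderScopeSketch {
    /** Site scope: the candidate has the same host as the start page. */
    static boolean inSameSite(URI start, URI candidate) {
        String startHost = start.getHost();
        return startHost != null && startHost.equalsIgnoreCase(candidate.getHost());
    }

    /**
     * Directory scope: same host, and the candidate's path lies in or below
     * the start page's directory (including subdirectories is an assumption).
     */
    static boolean inSameDirectory(URI start, URI candidate) {
        if (!inSameSite(start, candidate)) {
            return false;
        }
        String startPath = start.getPath();
        String candidatePath = candidate.getPath();
        if (startPath == null || candidatePath == null) {
            return false;
        }
        // Keep everything up to and including the last slash of the start path.
        String directory = startPath.substring(0, startPath.lastIndexOf('/') + 1);
        return candidatePath.startsWith(directory);
    }
}
```

For a start page at http://example.edu/courses/ir/index.html, a link to http://example.edu/people/x.html passes the site check but fails the directory check.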
