| Class Summary | |
|---|---|
| AnchoredLink | Link with included anchor text. |
| AnchoredLinkExtractor | Extractor for AnchoredLink objects. |
| BeamSearchSiteSpider | A BeamSearchSpider that limits itself to a given site (web host). |
| BeamSearchSpider | A spider that uses heuristic beam search to find a web page that contains a set of "want strings" using a set of "help strings" to guide the search. |
| DirectorySpider | Spider that limits itself to the directory it started in. |
| Graph | Graph data structure. |
| HTMLPage | HTMLPage is a representation of information about a web page. |
| HTMLPageRetriever | HTMLPageRetriever allows clients to download web pages from URLs. |
| HTMLParserMaker | HTMLParserMaker allows clients to retrieve an HTMLEditorKit.Parser instance. |
| Link | Link is a class that contains a URL. |
| LinkExtractor | LinkExtractor defines a callback that extracts the links from an HTML document and provides functionality to parse a document (see the parsing sketch after this table). |
| LinkHeuristic | Evaluates a web link (ScoredAnchoredLink) based on satisfying a set of "want strings" and "help strings". |
| Node | Node in the Graph data structure. |
| PageGoal | Object for defining the goal in a heuristic web search. |
| RobotExclusionSet | RobotExclusionSet provides support for the Robots Exclusion Protocol (a simplified protocol sketch follows the Exception Summary). |
| RobotsMetaTagParser | Parser callback that extracts robots META tag information. |
| SafeHTMLPage | SafeHTMLPage is an immutable representation of information about a web page that includes information about whether or not this page can be indexed. |
| SafeHTMLPageRetriever | Keeps track of Robot Exclusion information. |
| ScoredAnchoredLink | An AnchoredLink that can be used in heuristic web search where links are scored for their promise. |
| ScoredAnchoredLinkExtractor | An AnchoredLinkExtractor that extracts ScoredAnchoredLink objects that can be scored and used in heuristic web search. |
| SiteSpider | A spider that limits itself to a given site. |
| Spider | Spider defines a framework for writing a web crawler. |
| StringSearchResult | Lightweight object that stores both the number of DIFFERENT strings from a set of search strings that are found in a text and the total number of occurrences in the text of ANY of the strings in the set. |
| URLChecker | URLChecker tries to clean up some URLs that do not conform to the standard and cause confusion. |
| WebPage | WebPage is a static utility class that provides operations for downloading web pages. |
| WebPageViewer | WebPageViewer contains utilities to download and display HTML pages. |
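
HTMLParserMaker and LinkExtractor are described in terms of the JDK's HTMLEditorKit.Parser callback model. The sketch below illustrates that general model using only standard javax.swing.text.html classes; it is a minimal example of the callback approach, not the actual API of the extractor classes in this package.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/**
 * Minimal illustration of the HTMLEditorKit.Parser callback model that the
 * extractor classes in this package are described as using.  This is NOT the
 * package's own API; all classes and methods used here are standard JDK ones.
 */
public class LinkExtractionSketch {

    public static void main(String[] args) throws IOException {
        String html = "<html><body>"
                + "<a href=\"http://example.com/a\">First anchor</a>"
                + "<a href=\"http://example.com/b\">Second anchor</a>"
                + "</body></html>";

        final List<String> hrefs = new ArrayList<String>();
        final StringBuilder anchorText = new StringBuilder();

        // The callback receives start tags, text, and end tags as the parser
        // walks the document; here we record every <a href="..."> it sees.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean inAnchor = false;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        hrefs.add(href.toString());
                    }
                    inAnchor = true;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (inAnchor) {
                    anchorText.append(data).append(' ');
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.A) {
                    inAnchor = false;
                }
            }
        };

        // ParserDelegator is the JDK's concrete HTMLEditorKit.Parser implementation.
        new ParserDelegator().parse(new StringReader(html), callback, true);

        System.out.println("Links found: " + hrefs);
        System.out.println("Anchor text: " + anchorText.toString().trim());
    }
}
```

In this package, LinkExtractor and AnchoredLinkExtractor presumably play the role of the callback, with HTMLParserMaker supplying the HTMLEditorKit.Parser instance.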
| Exception Summary | |
|---|---|
| PathDisallowedException | PathDisallowedException is thrown to indicate that a client program tried to access a path that was disallowed by either a robots.txt file or a robots META tag. |
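
RobotExclusionSet and PathDisallowedException cover the Robots Exclusion Protocol. As a point of reference, the sketch below shows the core of that protocol in plain Java: collect the Disallow rules that a robots.txt file addresses to all user agents and reject paths that match one of them by prefix. It is a deliberately simplified illustration (single "User-agent: *" record, prefix matching only), not the parsing logic of RobotExclusionSet.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/**
 * Simplified illustration of the Robots Exclusion Protocol: collect the
 * Disallow rules in the "User-agent: *" record of a robots.txt file and
 * test request paths against them by prefix match.  This is a sketch, not
 * the behaviour of RobotExclusionSet.
 */
public class RobotsTxtSketch {

    private final List<String> disallowed = new ArrayList<String>();

    /** Parse a robots.txt body, keeping only rules addressed to all agents. */
    public RobotsTxtSketch(String robotsTxt) throws IOException {
        BufferedReader in = new BufferedReader(new StringReader(robotsTxt));
        boolean appliesToUs = false;
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String lower = line.toLowerCase();
            if (lower.startsWith("user-agent:")) {
                appliesToUs = line.substring("user-agent:".length()).trim().equals("*");
            } else if (appliesToUs && lower.startsWith("disallow:")) {
                String path = line.substring("disallow:".length()).trim();
                if (!path.isEmpty()) {      // an empty Disallow means "allow everything"
                    disallowed.add(path);
                }
            }
        }
    }

    /** A path is disallowed if it starts with any recorded Disallow prefix. */
    public boolean isDisallowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        String robots = "User-agent: *\nDisallow: /private/\nDisallow: /tmp/\n";
        RobotsTxtSketch rules = new RobotsTxtSketch(robots);
        System.out.println(rules.isDisallowed("/private/data.html")); // true
        System.out.println(rules.isDisallowed("/public/index.html")); // false
    }
}
```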
Provides web utilities for downloading web pages and spidering the web.
For command-line interfaces, see the main methods of the following classes:
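
For orientation, the following is a minimal sketch of the kind of page download that WebPage and HTMLPageRetriever wrap, written against the standard java.net API. The DownloadSketch class and its fetch method are hypothetical names used only for this example; they are not part of the package.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

/** Hypothetical stand-alone downloader, not a class from this package. */
public class DownloadSketch {

    /** Fetch a URL over HTTP and return the response body as a string. */
    public static String fetch(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestProperty("User-Agent", "example-spider");  // identify the crawler
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        StringBuilder body = new StringBuilder();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            in.close();
            conn.disconnect();
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fetch("http://example.com/"));
    }
}
```

A crawler built from the classes in this package would, as SafeHTMLPageRetriever and PathDisallowedException indicate, also consult the robots exclusion rules before fetching a page.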