ir.webutils
Class YahooSpider

java.lang.Object
  extended by ir.webutils.YahooSpider

public class YahooSpider
extends java.lang.Object

Spider specialized for extracting and saving a given number of randomly chosen pages from a particular topic category in the Yahoo directory. Starts from the category's directory page and repeatedly follows randomly chosen subcategory and site links until it reaches a site, which it then saves.


Field Summary
protected  java.util.List<Link> categoryLinks
          List of category links found for the current directory page
 java.util.Map<Link,java.util.List<Link>> categoryLinksMap
          The HashMap for storing categoryLinks for already downloaded Links
protected  int count
          The number of pages indexed.
protected  java.lang.String filePrefix
          Prefix to add to the name of all saved files for the current category
protected  int maxCount
          The number of pages to be found and indexed.
protected  java.util.Random random
          Random number generator to use
protected  HTMLPageRetriever retriever
          The object to be used to retrieve pages
protected  java.io.File saveDir
          The directory to save the downloaded files to.
protected  java.util.List<Link> siteLinks
          List of site links found for the current directory page
 java.util.Map<Link,java.util.List<Link>> siteLinksMap
          The HashMap for storing siteLinks for already downloaded Links
protected  boolean slow
          Flag to purposely slow the crawl for debugging purposes
protected  Link topCategoryLink
          Link for the main topic Yahoo category
protected  java.util.HashSet<Link> visitedSites
          The sites that have already been indexed.
 
Constructor Summary
YahooSpider()
           
 
Method Summary
 void doCrawl()
          Performs the crawl.
protected  Link getRandomLink(java.util.List<Link> links)
          Pick a random link from a list of links
 void go(java.lang.String[] args)
          Checks command line arguments and performs the crawl.
protected  void handleCCommandLineOption(java.lang.String value)
          Called when "-c" is passed in on the command line.
protected  void handleDCommandLineOption(java.lang.String value)
          Called when "-d" is passed in on the command line.
protected  void handlePCommandLineOption(java.lang.String value)
          Called when "-p" is passed on the command line.
protected  void handleSlowCommandLineOption()
          Called when "-slow" is passed in on the command line.
protected  void handleUCommandLineOption(java.lang.String value)
          Called when "-u" is passed in on the command line.
protected  void indexPage(HTMLPage page)
          "Indexes" an HTMLPage.
protected  boolean linkToHTMLPage(Link link)
          Check if this is a link to an HTML page.
static void main(java.lang.String[] args)
          Spider Yahoo category to randomly collect pages according to the following command options:
          -d <directory> : Store indexed files in <directory>.
          -c <maxCount> : Find <maxCount> files (default is 10,000).
          -u <url> : Start at Yahoo directory page given by <url>.
          -p <prefix> : Prefix saved file names with <prefix>.
          -slow : Pause briefly before getting a page.
 void processArgs(java.lang.String[] args)
          Processes command-line arguments.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

topCategoryLink

protected Link topCategoryLink
Link for the main topic Yahoo category


filePrefix

protected java.lang.String filePrefix
Prefix to add to the name of all saved files for the current category


categoryLinks

protected java.util.List<Link> categoryLinks
List of category links found for the current directory page


siteLinks

protected java.util.List<Link> siteLinks
List of site links found for the current directory page


slow

protected boolean slow
Flag to purposely slow the crawl for debugging purposes


retriever

protected HTMLPageRetriever retriever
The object to be used to retrieve pages


saveDir

protected java.io.File saveDir
The directory to save the downloaded files to.


count

protected int count
The number of pages indexed. In the default implementation a page is considered to be indexed only if it is written to a file.


maxCount

protected int maxCount
The number of pages to be found and indexed.


categoryLinksMap

public java.util.Map<Link,java.util.List<Link>> categoryLinksMap
The HashMap for storing categoryLinks for already downloaded Links


siteLinksMap

public java.util.Map<Link,java.util.List<Link>> siteLinksMap
The HashMap for storing siteLinks for already downloaded Links


visitedSites

protected java.util.HashSet<Link> visitedSites
The sites that have already been indexed.


random

protected java.util.Random random
Random number generator to use

Constructor Detail

YahooSpider

public YahooSpider()
Method Detail

go

public void go(java.lang.String[] args)
Checks command line arguments and performs the crawl.

This implementation calls processArgs and doCrawl.

Parameters:
args - Command line arguments.

processArgs

public void processArgs(java.lang.String[] args)
Processes command-line arguments.

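The following options are handled by this function:

-d <directory> : Store indexed files in <directory>.
-c <maxCount> : Find <maxCount> files (default is 10,000).
-u <url> : Start at Yahoo directory page given by <url>.
-p <prefix> : Prefix saved file names with <prefix>.
-slow : Pause briefly before getting a page.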

Each option has a corresponding handleXXXCommandLineOption function that will be called when the option is found. Subclasses may find it convenient to change how options are handled by overriding those methods instead of this one. Only the above options will be dealt with by this function, and the input array will remain unchanged. Note that if the flag for an option appears in the input array, any value associated with that option will be assumed to follow. Thus if a "-c" flag appears in args, the next value in args will be blindly treated as the count.

Parameters:
args - Array of arguments as passed in from the command line.
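
The flag-then-value dispatch described for processArgs can be sketched as follows. This is a minimal illustration and not the YahooSpider source: the class name ArgDispatchSketch and the calls list are hypothetical, and the handler bodies only record that they were invoked.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the flag dispatch described above; not the YahooSpider source. */
class ArgDispatchSketch {
    // Records handler invocations so the dispatch order is easy to inspect.
    final List<String> calls = new ArrayList<>();

    void processArgs(String[] args) {
        int i = 0;
        while (i < args.length) {
            String flag = args[i];
            // For value-taking flags, the next element is consumed without validation.
            // In this sketch a trailing flag with no value throws ArrayIndexOutOfBoundsException.
            if (flag.equals("-d")) {
                handleDCommandLineOption(args[++i]);
            } else if (flag.equals("-c")) {
                handleCCommandLineOption(args[++i]);
            } else if (flag.equals("-u")) {
                handleUCommandLineOption(args[++i]);
            } else if (flag.equals("-p")) {
                handlePCommandLineOption(args[++i]);
            } else if (flag.equals("-slow")) {
                handleSlowCommandLineOption();
            }
            // Unrecognized arguments are skipped; the array itself is never modified.
            i++;
        }
    }

    // Subclasses can override individual handlers instead of processArgs.
    protected void handleDCommandLineOption(String value) { calls.add("d=" + value); }
    protected void handleCCommandLineOption(String value) { calls.add("c=" + value); }
    protected void handleUCommandLineOption(String value) { calls.add("u=" + value); }
    protected void handlePCommandLineOption(String value) { calls.add("p=" + value); }
    protected void handleSlowCommandLineOption() { calls.add("slow"); }
}
```

Because the value is taken blindly, an input such as "-c -slow" would pass "-slow" to the "-c" handler in this sketch rather than setting the slow flag.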

handleDCommandLineOption

protected void handleDCommandLineOption(java.lang.String value)
Called when "-d" is passed in on the command line.

This implementation sets saveDir to value.

Parameters:
value - The value associated with the "-d" option.

handleCCommandLineOption

protected void handleCCommandLineOption(java.lang.String value)
Called when "-c" is passed in on the command line.

This implementation sets maxCount to the integer represented by value.

Parameters:
value - The value associated with the "-c" option.

handleUCommandLineOption

protected void handleUCommandLineOption(java.lang.String value)
Called when "-u" is passed in on the command line.

This implementation sets the top level Yahoo directory category link to value.

Parameters:
value - The value associated with the "-u" option.

handlePCommandLineOption

protected void handlePCommandLineOption(java.lang.String value)
Called when "-p" is passed on the command line. Sets file name prefix for saved files.


handleSlowCommandLineOption

protected void handleSlowCommandLineOption()
Called when "-slow" is passed in on the command line.

This implementation sets a flag that will be used in go to pause briefly before downloading each page.


doCrawl

public void doCrawl()
Performs the crawl. Should be called after processArgs has been called. Assumes that the starting URL has been set.

This implementation iterates until count >= maxCount.
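
One way to picture the loop is the in-memory model below. It is only a sketch under stated assumptions: page retrieval is replaced by map lookups keyed by plain strings, the uniform choice among all category and site links is a guess at how the random walk might be weighted, and the depth cap and maxAttempts parameter are added safeguards that the documentation does not mention. The class name RandomWalkSketch and the method findRandomSite are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

/** Hypothetical in-memory model of the random walk; not the YahooSpider source. */
class RandomWalkSketch {
    final Map<String, List<String>> categoryLinksMap = new HashMap<>();
    final Map<String, List<String>> siteLinksMap = new HashMap<>();
    final Set<String> visitedSites = new HashSet<>();
    final Random random = new Random(42);
    int count = 0;
    int maxCount = 10;

    /** Walks down from the given category until a site link is picked; null on a dead end. */
    String findRandomSite(String category) {
        String current = category;
        for (int depth = 0; depth < 50; depth++) { // depth cap: assumption, guards against cycles
            List<String> categories = categoryLinksMap.getOrDefault(current, new ArrayList<>());
            List<String> sites = siteLinksMap.getOrDefault(current, new ArrayList<>());
            int total = categories.size() + sites.size();
            if (total == 0) {
                return null; // no outgoing links at all
            }
            int pick = random.nextInt(total); // assumption: uniform over all links on the page
            if (pick >= categories.size()) {
                return sites.get(pick - categories.size());
            }
            current = categories.get(pick); // descend into the chosen subcategory
        }
        return null;
    }

    /** Repeats the walk until count >= maxCount or the attempt budget runs out. */
    List<String> doCrawl(String topCategory, int maxAttempts) {
        List<String> saved = new ArrayList<>();
        for (int attempt = 0; attempt < maxAttempts && count < maxCount; attempt++) {
            String site = findRandomSite(topCategory);
            // Set.add returns false for a site that was already visited.
            if (site != null && visitedSites.add(site)) {
                saved.add(site);
                count++;
            }
        }
        return saved;
    }
}
```

The attempt budget matters in this model because a category with fewer distinct sites than maxCount would otherwise never satisfy the stopping condition.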


getRandomLink

protected Link getRandomLink(java.util.List<Link> links)
Pick a random link from a list of links
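
A uniform random pick from a list can be written with java.util.Random as in the sketch below. The generic helper pickRandom and the class name RandomPickSketch are hypothetical; the actual method works on Link objects and its handling of an empty list is not documented here.

```java
import java.util.List;
import java.util.Random;

/** Hypothetical helper illustrating a uniform random choice from a list. */
class RandomPickSketch {
    static <T> T pickRandom(List<T> items, Random random) {
        // nextInt(n) returns a value in [0, n); it throws IllegalArgumentException
        // when n is 0, so callers of this sketch must pass a non-empty list.
        return items.get(random.nextInt(items.size()));
    }
}
```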


linkToHTMLPage

protected boolean linkToHTMLPage(Link link)
Check if this is a link to an HTML page.

Returns:
true if a directory or clearly an HTML page
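
The documentation does not say how "clearly an HTML page" is decided. The sketch below shows one plausible heuristic based on the URL path, assuming that a trailing slash or an empty path marks a directory and that only the .html and .htm extensions count as HTML. The class and method names are hypothetical, and the real check may differ.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical path-based heuristic; the real linkToHTMLPage may use different rules. */
class HtmlLinkCheckSketch {
    static boolean looksLikeHtmlPage(String url) {
        // URI.create throws IllegalArgumentException for malformed input.
        String path = URI.create(url).getPath();
        if (path == null) {
            return false; // opaque URIs such as mailto: have no path
        }
        if (path.isEmpty() || path.endsWith("/")) {
            return true; // site root or directory
        }
        String lower = path.toLowerCase(Locale.ROOT);
        return lower.endsWith(".html") || lower.endsWith(".htm");
    }
}
```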

indexPage

protected void indexPage(HTMLPage page)
"Indexes" an HTMLPage. This version just writes it to a file named filePrefix<count>.html in the specified directory.

Parameters:
page - An HTMLPage that contains the page to index.
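
The filePrefix<count>.html naming can be illustrated with the sketch below. It takes the page content as a plain string because the HTMLPage API is not shown here. The helper savePage, the class name SavePageSketch, and the UTF-8 encoding are assumptions, as is whether count is read before or after it is incremented.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

/** Hypothetical helper showing the filePrefix<count>.html naming scheme. */
class SavePageSketch {
    static File savePage(File saveDir, String filePrefix, int count, String html) {
        // Build the name as prefix + count + ".html" inside the save directory.
        File out = new File(saveDir, filePrefix + count + ".html");
        try {
            Files.write(out.toPath(), html.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

For example, a prefix of "sports" and a count of 7 would produce a file named sports7.html in the save directory.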

main

public static void main(java.lang.String[] args)
Spider Yahoo category to randomly collect pages according to the following command options: