org.semanticdesktop.aperture.crawler.base
Class CrawlerBase

java.lang.Object
  extended by org.semanticdesktop.aperture.crawler.base.CrawlerBase
All Implemented Interfaces:
Crawler
Direct Known Subclasses:
AbstractJavaMailCrawler, AbstractTagCrawler, AddressbookCrawler, BibsonomyCrawler, FileSystemCrawler, FlickrCrawler, IcalCrawler, OutlookCrawler, SambaCrawler, WebCrawler

public abstract class CrawlerBase
extends Object
implements Crawler

An implementation of the Crawler interface that offers generic implementations for some of its methods.


Field Summary
protected  AccessData accessData
          The current AccessData instance.
protected  DataAccessorRegistry accessorRegistry
          The current DataAccessorRegistry.
protected  File crawlReportFile
          The file for persistent storage of CrawlReports.
protected  DataSource source
          The DataSource representing the physical source of information.
protected  boolean stopRequested
          Flag indicating that this Crawler should stop scanning or clearing as soon as possible.
 
Constructor Summary
CrawlerBase()
          The default constructor
 
Method Summary
 void clear()
          Reports all IDs stored in the AccessData to the CrawlerHandler as cleared, and then disposes of the AccessData instance.
protected  void clear(String url)
           
 void crawl()
          Starts crawling the domain defined in the DataSource of this Crawler.
protected abstract  ExitCode crawlObjects()
          Method called by crawl() that should implement the actual crawling of the DataSource.
 AccessData getAccessData()
          Returns the AccessData instance used by the crawler
 CrawlerHandler getCrawlerHandler()
          Returns the crawler handler
 CrawlReport getCrawlReport()
          Gets the CrawlReport of the last performed crawl, or the current crawl when it is in progress.
 File getCrawlReportFile()
          Returns the file where the crawl report is to be saved
 DataAccessorRegistry getDataAccessorRegistry()
          Returns the data accessor registry
 DataSource getDataSource()
          Returns the data source
protected  RDFContainerFactory getRDFContainerFactory(String url)
           
protected  boolean inDomain(String uri)
           
 boolean isStopRequested()
          Returns true if the crawler is currently stopping, false otherwise
protected  void reportAccessingObject(String url)
           
protected  void reportDeletedDataObject(String url)
           
protected  ExitCode reportFatalErrorCause(String msg)
          Reports the cause of the fatal error.
protected  ExitCode reportFatalErrorCause(String msg, Throwable cause)
          Reports the cause of the fatal error.
protected  ExitCode reportFatalErrorCause(Throwable t)
          Reports the cause of the fatal error.
protected  void reportModifiedDataObject(DataObject object)
           
protected  void reportNewDataObject(DataObject object)
           
protected  void reportUnmodifiedDataObject(String url)
           
protected  void reportUntouched()
           
 void runSubCrawler(SubCrawler localSubCrawler, DataObject object, InputStream stream, Charset charset, String mimeType)
          Runs the given SubCrawler on the given stream.
 void setAccessData(AccessData accessData)
          Sets the AccessData instance to be used by the crawler
 void setCrawlerHandler(CrawlerHandler handler)
          Sets the crawler handler
 void setCrawlReportFile(File file)
          Sets the file where the crawl report is to be saved
 void setDataAccessorRegistry(DataAccessorRegistry accessorRegistry)
          Sets the data accessor registry
 void setDataSource(DataSource source)
          Sets the data source
 void stop()
          Stops a running crawl or clear operation as fast as possible.
protected  void storeCrawlReport()
          Stores the current CrawlReport, if any, to the crawl report file, if one is set.
protected  void touchObject(String string)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

source

protected DataSource source
The DataSource representing the physical source of information.


accessorRegistry

protected DataAccessorRegistry accessorRegistry
The current DataAccessorRegistry.


accessData

protected AccessData accessData
The current AccessData instance.


crawlReportFile

protected File crawlReportFile
The file for persistent storage of CrawlReports.


stopRequested

protected boolean stopRequested
Flag indicating that this Crawler should stop scanning or clearing as soon as possible.

Constructor Detail

CrawlerBase

public CrawlerBase()
The default constructor

Method Detail

setDataSource

public void setDataSource(DataSource source)
Sets the data source

Parameters:
source - the new data source

getDataSource

public DataSource getDataSource()
Returns the data source

Specified by:
getDataSource in interface Crawler
Returns:
the data source
See Also:
Crawler.getDataSource()

setDataAccessorRegistry

public void setDataAccessorRegistry(DataAccessorRegistry accessorRegistry)
Sets the data accessor registry

Specified by:
setDataAccessorRegistry in interface Crawler
Parameters:
accessorRegistry - the new data accessor registry
See Also:
Crawler.setDataAccessorRegistry(DataAccessorRegistry)

getDataAccessorRegistry

public DataAccessorRegistry getDataAccessorRegistry()
Returns the data accessor registry

Specified by:
getDataAccessorRegistry in interface Crawler
Returns:
the data accessor registry
See Also:
Crawler.getDataAccessorRegistry()

setAccessData

public void setAccessData(AccessData accessData)
Sets the AccessData instance to be used by the crawler

Specified by:
setAccessData in interface Crawler
Parameters:
accessData - the AccessData instance to be used by the crawler
See Also:
Crawler.setAccessData(AccessData)

getAccessData

public AccessData getAccessData()
Returns the AccessData instance used by the crawler

Specified by:
getAccessData in interface Crawler
Returns:
the AccessData instance used by the crawler
See Also:
Crawler.getAccessData()

setCrawlerHandler

public void setCrawlerHandler(CrawlerHandler handler)
Sets the crawler handler

Specified by:
setCrawlerHandler in interface Crawler
Parameters:
handler - the crawler handler
See Also:
Crawler.setCrawlerHandler(CrawlerHandler)

getCrawlerHandler

public CrawlerHandler getCrawlerHandler()
Returns the crawler handler

Specified by:
getCrawlerHandler in interface Crawler
Returns:
the crawler handler
See Also:
Crawler.getCrawlerHandler()

crawl

public void crawl()
Description copied from interface: Crawler
Starts crawling the domain defined in the DataSource of this Crawler. If this is a subsequent run of this method, it will only report the differences with the previous run, unless the previous scan results have been cleared. Any CrawlerListeners registered on this Crawler will get notified about the crawling progress.

Specified by:
crawl in interface Crawler
See Also:
Crawler.crawl()
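The setup required before calling crawl() can be sketched as follows. This is a hedged usage sketch, not code from the Aperture distribution: FileSystemCrawler is one of the subclasses listed above, and the dataSource, accessorRegistry, accessData, and handler variables are assumed to be configured elsewhere.

```java
// Usage sketch (assumed setup): wire the collaborators, then crawl.
CrawlerBase crawler = new FileSystemCrawler();
crawler.setDataSource(dataSource);                  // the physical source to crawl (assumed configured)
crawler.setDataAccessorRegistry(accessorRegistry);  // registry used to obtain DataAccessors (assumed)
crawler.setAccessData(accessData);                  // optional; enables incremental (diff-only) recrawls
crawler.setCrawlerHandler(handler);                 // receives progress and DataObject callbacks
crawler.crawl();  // a subsequent call reports only differences with the previous run
```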

crawlObjects

protected abstract ExitCode crawlObjects()
Method called by crawl() that should implement the actual crawling of the DataSource. The return value of this method should indicate whether the scanning completed successfully (i.e., whether it ran to completion without being interrupted). This method is also expected to update the deprecatedUrls set, as any URLs remaining in that set will be reported as removed after this method completes.

Returns:
An ExitCode indicating how the crawl procedure terminated.
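A subclass implementation of crawlObjects() might follow the pattern below. This is a minimal sketch, assuming a hypothetical in-memory backing store; the ExitCode constant names COMPLETED and STOP_REQUESTED are assumptions based on the ExitCode type referenced in this class, and the report* methods are the protected helpers documented above.

```java
// Hypothetical minimal subclass; the backing map is invented for illustration.
public class InMemoryCrawler extends CrawlerBase {

    private final java.util.Map<String, DataObject> objects;

    public InMemoryCrawler(java.util.Map<String, DataObject> objects) {
        this.objects = objects;
    }

    @Override
    protected ExitCode crawlObjects() {
        for (java.util.Map.Entry<String, DataObject> entry : objects.entrySet()) {
            if (stopRequested) {
                // honor stop() as soon as possible
                return ExitCode.STOP_REQUESTED;  // assumed constant name
            }
            reportAccessingObject(entry.getKey());
            reportNewDataObject(entry.getValue());
        }
        return ExitCode.COMPLETED;  // assumed constant name
    }
}
```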

stop

public void stop()
Description copied from interface: Crawler
Stops a running crawl or clear operation as fast as possible. This method may return before the crawling has actually stopped.

Specified by:
stop in interface Crawler
See Also:
Crawler.stop()
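Because stop() may return before crawling has actually stopped, a controlling thread typically joins the crawling thread afterwards. A sketch, with the threading arrangement assumed:

```java
// Sketch: stopping a long-running crawl from another thread.
Thread crawlThread = new Thread(crawler::crawl);
crawlThread.start();
// ... later, from the controlling thread:
crawler.stop();       // requests the stop; may return before the crawl has halted
crawlThread.join();   // wait for the crawl to actually wind down
```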

isStopRequested

public boolean isStopRequested()
Returns true if the crawler is currently stopping, false otherwise

Returns:
true if the crawler is currently stopping, false otherwise

clear

public void clear()
Reports all IDs stored in the AccessData as being cleared to the CrawlerHandler and then gets rid of the AccessData instance.

Specified by:
clear in interface Crawler

clear

protected void clear(String url)

setCrawlReportFile

public void setCrawlReportFile(File file)
Sets the file where the crawl report is to be saved

Parameters:
file - the file where the crawl report is to be saved

getCrawlReportFile

public File getCrawlReportFile()
Returns the file where the crawl report is to be saved

Returns:
the file where the crawl report is to be saved

getCrawlReport

public CrawlReport getCrawlReport()
Description copied from interface: Crawler
Gets the CrawlReport of the last performed crawl, or the current crawl when it is in progress. Returns null when no crawl has been performed in this application's session yet and there is no report available from the previous session.

Specified by:
getCrawlReport in interface Crawler
Returns:
The CrawlReport of the last run, or null when this is not available.
See Also:
Crawler.getCrawlReport()

reportAccessingObject

protected void reportAccessingObject(String url)

reportNewDataObject

protected void reportNewDataObject(DataObject object)

touchObject

protected void touchObject(String string)

reportModifiedDataObject

protected void reportModifiedDataObject(DataObject object)

reportUnmodifiedDataObject

protected void reportUnmodifiedDataObject(String url)

reportDeletedDataObject

protected void reportDeletedDataObject(String url)

reportUntouched

protected void reportUntouched()

reportFatalErrorCause

protected ExitCode reportFatalErrorCause(String msg)
Reports the cause of the fatal error. Subclasses should always call this method when crawlObjects() is to return ExitCode.FATAL_ERROR.

Parameters:
msg - the message describing the fatal error
Returns:
ExitCode.FATAL_ERROR

reportFatalErrorCause

protected ExitCode reportFatalErrorCause(String msg,
                                         Throwable cause)
Reports the cause of the fatal error. Subclasses should always call this method when crawlObjects() is to return ExitCode.FATAL_ERROR.

Parameters:
msg - the message describing the fatal error
cause - the Throwable that caused the fatal error
Returns:
ExitCode.FATAL_ERROR

reportFatalErrorCause

protected ExitCode reportFatalErrorCause(Throwable t)
Reports the cause of the fatal error. Subclasses should always call this method when crawlObjects() is to return ExitCode.FATAL_ERROR.

Parameters:
t - the Throwable that caused the fatal error
Returns:
ExitCode.FATAL_ERROR
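The intended call pattern is to report the cause and return its result in one step from within crawlObjects(). A sketch, with the exception type and ExitCode.COMPLETED constant assumed:

```java
// Sketch: reporting a fatal error from a crawlObjects() implementation.
@Override
protected ExitCode crawlObjects() {
    try {
        // ... actual crawling of the DataSource ...
        return ExitCode.COMPLETED;  // assumed constant name
    }
    catch (IOException e) {
        // records the cause for the CrawlReport and returns ExitCode.FATAL_ERROR
        return reportFatalErrorCause("I/O error while crawling", e);
    }
}
```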

getRDFContainerFactory

protected RDFContainerFactory getRDFContainerFactory(String url)

storeCrawlReport

protected void storeCrawlReport()
Stores the current CrawlReport, if any, to the crawl report file, if one is set.


inDomain

protected boolean inDomain(String uri)

runSubCrawler

public void runSubCrawler(SubCrawler localSubCrawler,
                          DataObject object,
                          InputStream stream,
                          Charset charset,
                          String mimeType)
                   throws SubCrawlerException
Description copied from interface: Crawler
Runs the given SubCrawler on the given stream.

This method uses the information stored within the crawler to provide appropriate arguments to the SubCrawler.subCrawl(...) method. DataObjects found by the SubCrawler will be reported to the CrawlerHandler registered with this crawler with the Crawler.setCrawlerHandler(CrawlerHandler) method. The AccessData and the internal data structures of this crawler will be updated correctly. The SubCrawler will be stopped if the Crawler.stop() method is invoked on this crawler.

In most cases, a SubCrawler is used to extract additional information from a DataObject found by a normal Crawler. In such cases, using this method is strongly recommended over invoking SubCrawler.subCrawl(...) directly; bypassing it may cause the Crawler to behave unpredictably.

IMPORTANT

There are two important issues to take care about when calling this method.

First, if this method is called from a CrawlerHandler method (e.g. CrawlerHandler.objectNew(Crawler, DataObject) or CrawlerHandler.objectChanged(Crawler, DataObject)) that has been invoked by a running crawler, it SHOULD be run on the same thread that called the CrawlerHandler method (i.e. the crawling thread). Running this method in a new thread may result in unpredictable behavior of the Crawler.

Second, after this method is called, the crawler will report new or modified objects before this method returns, so implementations of CrawlerHandler methods must be reentrant. It is therefore recommended to process any metadata in a DataObject before invoking a SubCrawler.

Specified by:
runSubCrawler in interface Crawler
Parameters:
localSubCrawler - the subcrawler to be used
object - the parent data object, its metadata may be augmented by the SubCrawler
stream - the InputStream for the SubCrawler to work on. Note that even though there may be additional resources stored in the DataObject itself (like an InputStream or a File) they are not used.
charset - the charset in which the input stream is encoded (optional)
mimeType - the mime type of the input stream (optional)
Throws:
SubCrawlerException - if some error during SubCrawling occurs.
See Also:
Crawler.runSubCrawler(SubCrawler, DataObject, InputStream, Charset, String)
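The threading and reentrancy constraints above suggest the following shape for a handler callback. This is a hedged sketch: processMetadata and openStream are hypothetical helpers, the "message/rfc822" MIME type is an arbitrary example, and error handling is elided.

```java
// Sketch: invoking a SubCrawler from a CrawlerHandler callback, on the
// crawling thread, after the object's own metadata has been processed.
public void objectNew(Crawler crawler, DataObject object) {
    processMetadata(object);  // hypothetical: handle metadata before sub-crawling
    try (InputStream stream = openStream(object)) {  // hypothetical stream accessor
        // charset is optional and may be null; MIME type chosen for illustration
        crawler.runSubCrawler(subCrawler, object, stream, null, "message/rfc822");
    }
    catch (SubCrawlerException | IOException e) {
        // handle or log the failure (logging framework not shown)
    }
}
```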


Copyright © 2010 Aperture Development Team. All Rights Reserved.