org.semanticdesktop.aperture.crawler.base
Class CrawlerHandlerBase

java.lang.Object
  extended by org.semanticdesktop.aperture.crawler.base.CrawlerHandlerBase
All Implemented Interfaces:
CrawlerHandler

public class CrawlerHandlerBase
extends Object
implements CrawlerHandler

A base implementation of the CrawlerHandler interface. Each method is implemented in the simplest way that fulfills the contract; applications are expected to override the methods they need. The processBinary(Crawler, DataObject) method is provided as a reference implementation that shows how to use the MIME type detector and the extractors.

Subclassing CrawlerHandlerBase

Create a subclass of this class to integrate Aperture into an existing application. The objectXXX methods need to be overridden to do something with the data objects found by Aperture.
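
A minimal sketch of this subclassing pattern follows. The Crawler, DataObject, and HandlerBase types are simplified stand-ins so that the example compiles on its own; they are not the real Aperture classes, and getId() is a hypothetical accessor. A real application would extend CrawlerHandlerBase and work with the Aperture interfaces instead.

```java
import java.util.ArrayList;
import java.util.List;

class SubclassingSketch {
    // Simplified stand-ins (assumptions), NOT the real Aperture types.
    interface Crawler {}

    interface DataObject {
        String getId();   // hypothetical accessor for the object's identifier
        void dispose();
    }

    // Mirrors the documented defaults: dispose the object (or do nothing at all).
    static class HandlerBase {
        public void objectNew(Crawler crawler, DataObject object) { object.dispose(); }
        public void objectChanged(Crawler crawler, DataObject object) { object.dispose(); }
        public void objectRemoved(Crawler crawler, String url) { }
    }

    // Application handler: overrides only the callbacks it cares about.
    static class IndexingHandler extends HandlerBase {
        final List<String> log = new ArrayList<>();

        @Override
        public void objectNew(Crawler crawler, DataObject object) {
            try {
                log.add("new " + object.getId());
            } finally {
                object.dispose();   // release the object once it has been handled
            }
        }

        @Override
        public void objectChanged(Crawler crawler, DataObject object) {
            try {
                log.add("changed " + object.getId());
            } finally {
                object.dispose();
            }
        }

        @Override
        public void objectRemoved(Crawler crawler, String url) {
            log.add("removed " + url);
        }
    }

    public static void main(String[] args) {
        IndexingHandler handler = new IndexingHandler();
        DataObject object = new DataObject() {
            public String getId() { return "file:/docs/a.txt"; }
            public void dispose() { }
        };
        handler.objectNew(null, object);
        handler.objectRemoved(null, "file:/docs/old.txt");
        System.out.println(handler.log);
    }
}
```

Because the default objectNew and objectChanged implementations dispose the data object, the overrides in this sketch dispose it as well, in a finally block so that it also happens when handling fails.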

Author:
Leo Sauermann, Antoni Mylka

Field Summary
protected  boolean extractingContents
          Whether binaries should be processed.
protected  ExtractorRegistry extractorRegistry
          Extractor registry; may be set by subclasses that use processBinary.
protected  MimeTypeIdentifier mimeTypeIdentifier
          MIME type identifier; must be set by subclasses that use processBinary.
protected  SubCrawlerRegistry subCrawlerRegistry
          Subcrawler registry; may be set by subclasses that use processBinary.
 
Constructor Summary
CrawlerHandlerBase()
          Constructs an empty CrawlerHandlerBase.
CrawlerHandlerBase(MimeTypeIdentifier mimeTypeIdentifier, ExtractorRegistry extractorRegistry, SubCrawlerRegistry subCrawlerRegistry)
          Constructs an initialised CrawlerHandlerBase.
 
Method Summary
 void accessingObject(Crawler crawler, String url)
          This implementation does nothing; it is meant to be overridden.
 void clearFinished(Crawler crawler, ExitCode exitCode)
          This implementation does nothing; it is meant to be overridden.
 void clearingObject(Crawler crawler, String url)
          This implementation does nothing; it is meant to be overridden.
 void clearStarted(Crawler crawler)
          This implementation does nothing; it is meant to be overridden.
 void crawlStarted(Crawler crawler)
          This implementation does nothing; it is meant to be overridden.
 void crawlStopped(Crawler crawler, ExitCode exitCode)
          This implementation does nothing; it is meant to be overridden.
 RDFContainerFactory getRDFContainerFactory(Crawler crawler, String url)
          Returns an RDF container factory.
 boolean isExtractingContents()
          Returns whether binaries are processed.
 void objectChanged(Crawler crawler, DataObject object)
          This implementation only disposes the data object and does nothing more.
 void objectNew(Crawler crawler, DataObject object)
          This implementation only disposes the data object and does nothing more.
 void objectNotModified(Crawler crawler, String url)
          This implementation does nothing; it is meant to be overridden.
 void objectRemoved(Crawler crawler, String url)
          This implementation does nothing; it is meant to be overridden.
protected  void processBinary(Crawler crawler, DataObject dataObject)
          Default and reference implementation for handling objects found during crawling: identifies the MIME type and invokes the extractors.
 void setExtractingContents(boolean extractingContents)
          Sets whether binaries should be processed.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

extractingContents

protected boolean extractingContents
Whether binaries should be processed.


mimeTypeIdentifier

protected MimeTypeIdentifier mimeTypeIdentifier
MIME type identifier; must be set by subclasses that use processBinary.


extractorRegistry

protected ExtractorRegistry extractorRegistry
Extractor registry; may be set by subclasses that use processBinary.


subCrawlerRegistry

protected SubCrawlerRegistry subCrawlerRegistry
Subcrawler registry; may be set by subclasses that use processBinary.

Constructor Detail

CrawlerHandlerBase

public CrawlerHandlerBase()
Constructs an empty CrawlerHandlerBase. Set the extractorRegistry, mimeTypeIdentifier, and subCrawlerRegistry yourself.


CrawlerHandlerBase

public CrawlerHandlerBase(MimeTypeIdentifier mimeTypeIdentifier,
                          ExtractorRegistry extractorRegistry,
                          SubCrawlerRegistry subCrawlerRegistry)
Constructs an initialised CrawlerHandlerBase. Pass the objects needed for binary handling.

Parameters:
mimeTypeIdentifier - initialised MimeTypeIdentifier
extractorRegistry - initialised ExtractorRegistry, can be null if binary handling is not needed
subCrawlerRegistry - initialised SubCrawlerRegistry, can be null if binary handling is not needed
Method Detail

getRDFContainerFactory

public RDFContainerFactory getRDFContainerFactory(Crawler crawler,
                                                  String url)
Returns an RDF container factory. This implementation returns a factory that delivers simple RDFContainers backed by in-memory models obtained from the RDF2Go.getModelFactory() method. Each container gets its own separate model.

Specified by:
getRDFContainerFactory in interface CrawlerHandler
Parameters:
crawler - The requesting Crawler.
url - The url of the resource that is currently being accessed.
Returns:
an RDFContainerFactory instance.
See Also:
CrawlerHandler.getRDFContainerFactory(Crawler, String)
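
The "separate model per container" behaviour can be pictured with the sketch below. The types are simplified stand-ins, not the real RDFContainerFactory, RDFContainer, or RDF2Go classes: a plain list of statement strings plays the role of an in-memory model, and all method names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ContainerFactorySketch {
    // Simplified stand-ins (assumptions), NOT the real Aperture or RDF2Go types.
    static class SimpleContainer {
        final String describedUri;
        // Plays the role of a private in-memory model.
        final List<String> statements = new ArrayList<>();

        SimpleContainer(String describedUri) { this.describedUri = describedUri; }

        void add(String property, String value) {
            statements.add(describedUri + " " + property + " " + value);
        }
    }

    static class SimpleContainerFactory {
        // Every call creates a fresh container with its own backing store,
        // so metadata of different resources never lands in a shared model.
        SimpleContainer getContainer(String uri) {
            return new SimpleContainer(uri);
        }
    }

    public static void main(String[] args) {
        SimpleContainerFactory factory = new SimpleContainerFactory();
        SimpleContainer a = factory.getContainer("file:/docs/a.pdf");
        SimpleContainer b = factory.getContainer("file:/docs/b.pdf");
        a.add("title", "Report");
        System.out.println(a.statements.size() + " statement(s) for a, "
                + b.statements.size() + " for b");
    }
}
```

An application that prefers to collect all metadata in one shared store could override getRDFContainerFactory and return a factory backed by that store.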

accessingObject

public void accessingObject(Crawler crawler,
                            String url)
This implementation does nothing; it is meant to be overridden.

Specified by:
accessingObject in interface CrawlerHandler
Parameters:
crawler - The reporting Crawler.
url - The url of the resource that is going to be accessed.
See Also:
CrawlerHandler.accessingObject(Crawler, String)

clearFinished

public void clearFinished(Crawler crawler,
                          ExitCode exitCode)
This implementation does nothing; it is meant to be overridden.

Specified by:
clearFinished in interface CrawlerHandler
Parameters:
crawler - The concerning Crawler.
exitCode - The status with which the clearing stopped.
See Also:
CrawlerHandler.clearFinished(Crawler, ExitCode)

clearingObject

public void clearingObject(Crawler crawler,
                           String url)
This implementation does nothing; it is meant to be overridden.

Specified by:
clearingObject in interface CrawlerHandler
Parameters:
crawler - The reporting Crawler.
url - The url of the resource whose crawl results are being cleared.
See Also:
CrawlerHandler.clearingObject(Crawler, String)

clearStarted

public void clearStarted(Crawler crawler)
This implementation does nothing; it is meant to be overridden.

Specified by:
clearStarted in interface CrawlerHandler
Parameters:
crawler - The reporting Crawler.
See Also:
CrawlerHandler.clearStarted(Crawler)

crawlStarted

public void crawlStarted(Crawler crawler)
This implementation does nothing; it is meant to be overridden.

Specified by:
crawlStarted in interface CrawlerHandler
Parameters:
crawler - The reporting Crawler.
See Also:
CrawlerHandler.crawlStarted(Crawler)

crawlStopped

public void crawlStopped(Crawler crawler,
                         ExitCode exitCode)
This implementation does nothing; it is meant to be overridden.

Specified by:
crawlStopped in interface CrawlerHandler
Parameters:
crawler - The reporting Crawler.
exitCode - The status with which the crawling stopped.
See Also:
CrawlerHandler.crawlStopped(Crawler, ExitCode)
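
Since the lifecycle callbacks are no-ops by default, a subclass can override just the ones it needs. The sketch below gathers simple statistics for one crawl. Crawler, ExitCode, and HandlerBase are simplified stand-ins rather than the real Aperture types, and the ExitCode constants are assumptions.

```java
class LifecycleSketch {
    // Simplified stand-ins (assumptions), NOT the real Aperture types.
    interface Crawler {}

    enum ExitCode { COMPLETED, STOP_REQUESTED, FATAL_ERROR }   // hypothetical constants

    // Mirrors the documented no-op defaults.
    static class HandlerBase {
        public void crawlStarted(Crawler crawler) { }
        public void accessingObject(Crawler crawler, String url) { }
        public void crawlStopped(Crawler crawler, ExitCode exitCode) { }
    }

    // Collects simple statistics over one crawl.
    static class StatsHandler extends HandlerBase {
        int accessed;
        boolean running;
        ExitCode lastExitCode;

        @Override
        public void crawlStarted(Crawler crawler) {
            accessed = 0;          // reset counters for the new crawl
            running = true;
            lastExitCode = null;
        }

        @Override
        public void accessingObject(Crawler crawler, String url) {
            accessed++;
        }

        @Override
        public void crawlStopped(Crawler crawler, ExitCode exitCode) {
            running = false;
            lastExitCode = exitCode;
        }
    }

    public static void main(String[] args) {
        StatsHandler handler = new StatsHandler();
        Crawler crawler = new Crawler() {};
        handler.crawlStarted(crawler);
        handler.accessingObject(crawler, "file:/docs/a.txt");
        handler.accessingObject(crawler, "file:/docs/b.txt");
        handler.crawlStopped(crawler, ExitCode.COMPLETED);
        System.out.println(handler.accessed + " objects accessed, exit code " + handler.lastExitCode);
    }
}
```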

objectChanged

public void objectChanged(Crawler crawler,
                          DataObject object)
This method implementation only disposes the data object and does nothing more. It is meant to be overridden.

Specified by:
objectChanged in interface CrawlerHandler
Parameters:
crawler - The reporting Crawler.
object - The constructed DataObject modeling the changed resource.
See Also:
CrawlerHandler.objectChanged(Crawler, DataObject)

objectNew

public void objectNew(Crawler crawler,
                      DataObject object)
This method implementation only disposes the data object and does nothing more. It is meant to be overridden.

Specified by:
objectNew in interface CrawlerHandler
Parameters:
crawler - The reporting Crawler.
object - The constructed DataObject modeling the new resource.
See Also:
CrawlerHandler.objectNew(Crawler, DataObject)

objectNotModified

public void objectNotModified(Crawler crawler,
                              String url)
This implementation does nothing; it is meant to be overridden.

Specified by:
objectNotModified in interface CrawlerHandler
Parameters:
crawler - The reporting Crawler.
url - The url of the unmodified resource.
See Also:
CrawlerHandler.objectNotModified(Crawler, String)

objectRemoved

public void objectRemoved(Crawler crawler,
                          String url)
This implementation does nothing; it is meant to be overridden.

Specified by:
objectRemoved in interface CrawlerHandler
Parameters:
crawler - The reporting Crawler.
url - The url that could no longer be found.
See Also:
CrawlerHandler.objectRemoved(Crawler, String)

processBinary

protected void processBinary(Crawler crawler,
                             DataObject dataObject)
                      throws IOException,
                             ExtractorException,
                             SubCrawlerException
Default and reference implementation for handling objects found during crawling: identifies the MIME type and invokes the extractors. Honours the boolean extractingContents flag, which is true by default.

Parameters:
crawler - the crawler that reported the dataObject. The crawler will be used to invoke subcrawlers, if needed. The control then stays within the crawler's thread.
dataObject - the data object to process. When the passed DataObject is not a FileDataObject, nothing will be done.
Throws:
IOException - when the stream cannot be read
ExtractorException - when the extractor fails
SubCrawlerException - when the extraction of contents using a SubCrawler failed.
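
The flow described for processBinary (check the flag, ignore anything that is not a FileDataObject, identify the MIME type, invoke a matching extractor) can be sketched as follows. This is not the actual implementation: every type is a simplified stand-in, the registry is reduced to a map from MIME type to extractor, and subcrawler handling and the checked exceptions are left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class ProcessBinarySketch {
    // Simplified stand-ins (assumptions), NOT the real Aperture types.
    interface DataObject {}

    static class FileDataObject implements DataObject {
        final byte[] content;
        String extractedText;   // filled in by an extractor
        FileDataObject(byte[] content) { this.content = content; }
    }

    interface MimeTypeIdentifier { String identify(byte[] firstBytes); }

    interface Extractor { void extract(FileDataObject object); }

    boolean extractingContents = true;            // documented default
    MimeTypeIdentifier mimeTypeIdentifier;        // must be set before use
    Map<String, Extractor> extractorRegistry;     // MIME type -> extractor; may be null

    void processBinary(DataObject dataObject) {
        // 1. Honour the flag and ignore objects without binary content.
        if (!extractingContents || !(dataObject instanceof FileDataObject)) {
            return;
        }
        FileDataObject file = (FileDataObject) dataObject;

        // 2. Identify the MIME type from the content.
        String mimeType = mimeTypeIdentifier.identify(file.content);
        if (mimeType == null || extractorRegistry == null) {
            return;
        }

        // 3. Look up a matching extractor and let it process the object.
        Extractor extractor = extractorRegistry.get(mimeType);
        if (extractor != null) {
            extractor.extract(file);
        }
    }

    public static void main(String[] args) {
        ProcessBinarySketch handler = new ProcessBinarySketch();
        handler.mimeTypeIdentifier = bytes -> "text/plain";
        handler.extractorRegistry = new HashMap<>();
        handler.extractorRegistry.put("text/plain",
                file -> file.extractedText = new String(file.content, StandardCharsets.UTF_8));

        FileDataObject note = new FileDataObject("hello".getBytes(StandardCharsets.UTF_8));
        handler.processBinary(note);
        System.out.println(note.extractedText);
    }
}
```

Setting extractingContents to false turns the whole method into a no-op, which is how a handler can skip content extraction without changing its other callbacks.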

isExtractingContents

public boolean isExtractingContents()
Returns whether binaries are processed.

Returns:
true if binaries are processed

setExtractingContents

public void setExtractingContents(boolean extractingContents)
Sets whether binaries should be processed.

Parameters:
extractingContents - set to true to extract the contents when calling processBinary(Crawler, DataObject)


Copyright © 2010 Aperture Development Team. All Rights Reserved.