CrawlerHandlerBase (Aperture Core 1.5.0 API)

org.semanticdesktop.aperture.crawler.base
Class CrawlerHandlerBase

java.lang.Object
  org.semanticdesktop.aperture.crawler.base.CrawlerHandlerBase

All Implemented Interfaces:: CrawlerHandler

public class CrawlerHandlerBase
extends Object
implements CrawlerHandler

A base implementation of the CrawlerHandler interface. The method implementations are the simplest possible ones that fulfill the contract. Applications are expected to override the methods they need. The processBinary(Crawler, DataObject) method is provided as a reference implementation that shows how to use the MIME type detector and the extractors.

Subclassing CrawlerHandlerBase

Create a subclass of this class to integrate Aperture into existing applications.

Write a constructor or initializing method to set the mimeTypeIdentifier, extractorRegistry and subCrawlerRegistry.
Review the method getRDFContainerFactory(Crawler, String) to influence the RDF containers used.
Override all objectXXX methods to handle the data, possibly calling processBinary(Crawler, DataObject) to extract the contents of binary streams.

Override the objectXXX methods to do something with the data objects found by Aperture.
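The subclassing pattern described above can be sketched as follows. This is a minimal, runnable illustration only: SketchDataObject, SketchHandlerBase and IndexingHandler are hypothetical stand-ins declared in the block, not the real Aperture types, and the Crawler parameter is omitted for brevity. The sketch also assumes that an overriding handler stays responsible for disposing the data object, since the default implementations do exactly that.

```java
import java.util.ArrayList;
import java.util.List;

// Runnable sketch of the subclassing pattern. All types in this file are
// simplified, hypothetical stand-ins and NOT the real Aperture API.
class SubclassingSketch {
    public static void main(String[] args) {
        IndexingHandler handler = new IndexingHandler();
        handler.objectNew(new SketchDataObject("file:/tmp/a.txt"));
        handler.objectChanged(new SketchDataObject("file:/tmp/b.txt"));
        System.out.println(handler.indexed);
    }
}

/** Stand-in for DataObject: an identifier plus a dispose() hook. */
class SketchDataObject {
    final String id;
    boolean disposed;

    SketchDataObject(String id) {
        this.id = id;
    }

    void dispose() {
        disposed = true;
    }
}

/** Stand-in for CrawlerHandlerBase: the defaults only dispose the object. */
class SketchHandlerBase {
    void objectNew(SketchDataObject object) {
        object.dispose();
    }

    void objectChanged(SketchDataObject object) {
        object.dispose();
    }

    /** Placeholder for processBinary; it does nothing in this sketch. */
    protected void processBinary(SketchDataObject object) {
    }
}

/** Application handler: overrides the objectXXX methods it cares about. */
class IndexingHandler extends SketchHandlerBase {
    final List<String> indexed = new ArrayList<>();

    @Override
    void objectNew(SketchDataObject object) {
        handle(object);
    }

    @Override
    void objectChanged(SketchDataObject object) {
        handle(object);
    }

    private void handle(SketchDataObject object) {
        try {
            processBinary(object);   // optional content extraction step
            indexed.add(object.id);  // application-specific handling
        } finally {
            object.dispose();        // assumption: the handler disposes the object
        }
    }
}
```

In a real integration the subclass would extend CrawlerHandlerBase itself and pass the MimeTypeIdentifier, ExtractorRegistry and SubCrawlerRegistry to the three-argument constructor.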

Author:: Leo Sauermann, Antoni Mylka

Field Summary
`protected boolean`	`extractingContents` Whether binaries should be processed.
`protected ExtractorRegistry`	`extractorRegistry` Extractor registry, may be set by overriding classes to use processBinary
`protected MimeTypeIdentifier`	`mimeTypeIdentifier` Mime-type identifier, must be set by overriding classes to use processBinary
`protected SubCrawlerRegistry`	`subCrawlerRegistry` Subcrawler registry, may be set by overriding classes to use processBinary

Constructor Summary
`CrawlerHandlerBase()` Construct an empty CrawlerHandlerBase.
`CrawlerHandlerBase(MimeTypeIdentifier mimeTypeIdentifier, ExtractorRegistry extractorRegistry, SubCrawlerRegistry subCrawlerRegistry)` Construct an initialised CrawlerHandlerBase.

Method Summary
`void`	`accessingObject(Crawler crawler, String url)` This method implementation does nothing; it is meant to be overridden.
`void`	`clearFinished(Crawler crawler, ExitCode exitCode)` This method implementation does nothing; it is meant to be overridden.
`void`	`clearingObject(Crawler crawler, String url)` This method implementation does nothing; it is meant to be overridden.
`void`	`clearStarted(Crawler crawler)` This method implementation does nothing; it is meant to be overridden.
`void`	`crawlStarted(Crawler crawler)` This method implementation does nothing; it is meant to be overridden.
`void`	`crawlStopped(Crawler crawler, ExitCode exitCode)` This method implementation does nothing; it is meant to be overridden.
`RDFContainerFactory`	`getRDFContainerFactory(Crawler crawler, String url)` Returns an RDF container factory.
`boolean`	`isExtractingContents()` Whether binaries should be processed.
`void`	`objectChanged(Crawler crawler, DataObject object)` This method implementation only disposes the data object and does nothing more.
`void`	`objectNew(Crawler crawler, DataObject object)` This method implementation only disposes the data object and does nothing more.
`void`	`objectNotModified(Crawler crawler, String url)` This method implementation does nothing; it is meant to be overridden.
`void`	`objectRemoved(Crawler crawler, String url)` This method implementation does nothing; it is meant to be overridden.
`protected void`	`processBinary(Crawler crawler, DataObject dataObject)` Default and reference implementation for handling objects found during crawling: identify the MIME type, then invoke the extractors.
`void`	`setExtractingContents(boolean extractingContents)` Sets whether binaries should be processed.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

extractingContents

protected boolean extractingContents

Whether binaries should be processed.

mimeTypeIdentifier

protected MimeTypeIdentifier mimeTypeIdentifier

Mime-type identifier, must be set by overriding classes to use processBinary

extractorRegistry

protected ExtractorRegistry extractorRegistry

Extractor registry, may be set by overriding classes to use processBinary

subCrawlerRegistry

protected SubCrawlerRegistry subCrawlerRegistry

Subcrawler registry, may be set by overriding classes to use processBinary

Constructor Detail

CrawlerHandlerBase

public CrawlerHandlerBase()

Construct an empty CrawlerHandlerBase. Set the extractorRegistry, mimeTypeIdentifier, and subCrawlerRegistry yourself.

CrawlerHandlerBase

public CrawlerHandlerBase(MimeTypeIdentifier mimeTypeIdentifier,
                          ExtractorRegistry extractorRegistry,
                          SubCrawlerRegistry subCrawlerRegistry)

Construct an initialised CrawlerHandlerBase. Pass the objects needed for binary handling.

Parameters:: mimeTypeIdentifier - initialised MimeTypeIdentifier; extractorRegistry - initialised ExtractorRegistry, can be null if binary handling is not needed; subCrawlerRegistry - initialised SubCrawlerRegistry, can be null if binary handling is not needed

Method Detail

getRDFContainerFactory

public RDFContainerFactory getRDFContainerFactory(Crawler crawler,
                                                  String url)

Returns an RDF container factory. This implementation returns a factory that delivers simple RDFContainers backed by in-memory models obtained from the RDF2Go.getModelFactory() method. Each model is separate.

Specified by:: getRDFContainerFactory in interface CrawlerHandler

Parameters:: crawler - The requesting Crawler.; url - The url of the resource that is currently being accessed.
Returns:: an RDFContainerFactory instance.
See Also:: CrawlerHandler.getRDFContainerFactory(Crawler, String)

accessingObject

public void accessingObject(Crawler crawler,
                            String url)

This method implementation does nothing; it is meant to be overridden.

Specified by:: accessingObject in interface CrawlerHandler

Parameters:: crawler - The reporting Crawler.; url - The url of the resource that is going to be accessed.
See Also:: CrawlerHandler.accessingObject(Crawler, String)

clearFinished

public void clearFinished(Crawler crawler,
                          ExitCode exitCode)

This method implementation does nothing; it is meant to be overridden.

Specified by:: clearFinished in interface CrawlerHandler

Parameters:: crawler - The concerning Crawler.; exitCode - The status with which the clearing stopped.
See Also:: CrawlerHandler.clearFinished(Crawler, ExitCode)

clearingObject

public void clearingObject(Crawler crawler,
                           String url)

This method implementation does nothing; it is meant to be overridden.

Specified by:: clearingObject in interface CrawlerHandler

Parameters:: crawler - The reporting Crawler.; url - The url of the resource whose crawl results are being cleared.
See Also:: CrawlerHandler.clearingObject(Crawler, String)

clearStarted

public void clearStarted(Crawler crawler)

This method implementation does nothing; it is meant to be overridden.

Specified by:: clearStarted in interface CrawlerHandler

Parameters:: crawler - The reporting Crawler.
See Also:: CrawlerHandler.clearStarted(Crawler)

crawlStarted

public void crawlStarted(Crawler crawler)

This method implementation does nothing; it is meant to be overridden.

Specified by:: crawlStarted in interface CrawlerHandler

Parameters:: crawler - The reporting Crawler.
See Also:: CrawlerHandler.crawlStarted(Crawler)

crawlStopped

public void crawlStopped(Crawler crawler,
                         ExitCode exitCode)

This method implementation does nothing; it is meant to be overridden.

Specified by:: crawlStopped in interface CrawlerHandler

Parameters:: crawler - The reporting Crawler.; exitCode - The status with which the crawling stopped.
See Also:: CrawlerHandler.crawlStopped(Crawler, ExitCode)

objectChanged

public void objectChanged(Crawler crawler,
                          DataObject object)

This method implementation only disposes the data object and does nothing more. It is meant to be overridden.

Specified by:: objectChanged in interface CrawlerHandler

Parameters:: crawler - The reporting Crawler.; object - The constructed DataObject modeling the changed resource.
See Also:: CrawlerHandler.objectChanged(Crawler, DataObject)

objectNew

public void objectNew(Crawler crawler,
                      DataObject object)

This method implementation only disposes the data object and does nothing more. It is meant to be overridden.

Specified by:: objectNew in interface CrawlerHandler

Parameters:: crawler - The reporting Crawler.; object - The constructed DataObject modeling the new resource.
See Also:: CrawlerHandler.objectNew(Crawler, DataObject)

objectNotModified

public void objectNotModified(Crawler crawler,
                              String url)

This method implementation does nothing; it is meant to be overridden.

Specified by:: objectNotModified in interface CrawlerHandler

Parameters:: crawler - The reporting Crawler.; url - The url of the unmodified resource.
See Also:: CrawlerHandler.objectNotModified(Crawler, String)

objectRemoved

public void objectRemoved(Crawler crawler,
                          String url)

This method implementation does nothing; it is meant to be overridden.

Specified by:: objectRemoved in interface CrawlerHandler

Parameters:: crawler - The reporting Crawler.; url - The url that could no longer be found.
See Also:: CrawlerHandler.objectRemoved(Crawler, String)

processBinary

protected void processBinary(Crawler crawler,
                             DataObject dataObject)
                      throws IOException,
                             ExtractorException,
                             SubCrawlerException

Default and reference implementation for handling objects found during crawling: it identifies the MIME type and invokes the extractors. It honors the boolean extractingContents field, which is true by default.

Parameters:: crawler - the crawler that reported the dataObject. The crawler will be used to invoke subcrawlers, if needed. The control then stays within the crawler's thread.; dataObject - the data object to process. When the passed DataObject is not a FileDataObject, nothing will be done.
Throws:: IOException - when the stream cannot be read; ExtractorException - when the extractor fails; SubCrawlerException - when the extraction of contents using a SubCrawler failed.
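The control flow documented for processBinary can be sketched roughly as below. This is not the Aperture source: SketchFile, the extension-based identify method and the map of extractor functions are hypothetical stand-ins for FileDataObject, MimeTypeIdentifier and ExtractorRegistry, subcrawler handling is left out, and the method returns the extracted text (instead of void) only so the result is observable.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of the documented processBinary flow using hypothetical stand-in
// types; the real method works on Aperture's DataObject and registries.
class ProcessBinarySketch {
    // Mirrors the extractingContents field, which is documented as true by default.
    boolean extractingContents = true;

    // Stand-in for ExtractorRegistry: MIME type -> text extraction function.
    final Map<String, Function<byte[], String>> extractors = new HashMap<>();

    /** Stand-in for MIME type identification; guesses from the file name only. */
    static String identify(String name) {
        return name.endsWith(".txt") ? "text/plain" : "application/octet-stream";
    }

    /** Returns the extracted text, or null when nothing was done. */
    String processBinary(Object dataObject) {
        if (!extractingContents) {
            return null;                       // extraction switched off
        }
        if (!(dataObject instanceof SketchFile)) {
            return null;                       // only file-like objects are processed
        }
        SketchFile file = (SketchFile) dataObject;
        String mimeType = identify(file.name); // step 1: identify the MIME type
        Function<byte[], String> extractor = extractors.get(mimeType);
        if (extractor == null) {
            return null;                       // no extractor registered for this type
        }
        return extractor.apply(file.content);  // step 2: invoke the extractor
    }

    public static void main(String[] args) {
        ProcessBinarySketch sketch = new ProcessBinarySketch();
        sketch.extractors.put("text/plain",
                bytes -> new String(bytes, StandardCharsets.UTF_8));
        SketchFile file = new SketchFile("notes.txt",
                "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(sketch.processBinary(file));
    }
}

/** Stand-in for a FileDataObject: a name plus the raw content. */
class SketchFile {
    final String name;
    final byte[] content;

    SketchFile(String name, byte[] content) {
        this.name = name;
        this.content = content;
    }
}
```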

isExtractingContents

public boolean isExtractingContents()

Whether binaries should be processed.

Returns:: true when binaries are processed

setExtractingContents

public void setExtractingContents(boolean extractingContents)

Sets whether binaries should be processed.

Parameters:: extractingContents - set to true to extract the contents when calling processBinary(Crawler, DataObject)
