org.semanticdesktop.aperture.crawler.base
Class CrawlerBase

java.lang.Object
  extended by org.semanticdesktop.aperture.crawler.base.CrawlerBase
All Implemented Interfaces:
Crawler
Direct Known Subclasses:
AbstractJavaMailCrawler, AbstractTagCrawler, AddressbookCrawler, BibsonomyCrawler, FileSystemCrawler, FlickrCrawler, IcalCrawler, OutlookCrawler, SambaCrawler, WebCrawler

public abstract class CrawlerBase
extends Object
implements Crawler

An implementation of the Crawler interface that offers generic implementations for some of its methods.


Field Summary
protected  AccessData accessData
          The current AccessData instance.
protected  DataAccessorRegistry accessorRegistry
          The current DataAccessorRegistry.
protected  File crawlReportFile
          The file for persistent storage of CrawlReports.
protected  DataSource source
          The DataSource representing the physical source of information.
protected  boolean stopRequested
          Flag indicating that this Crawler should stop scanning or clearing as soon as possible.
 
Constructor Summary
CrawlerBase()
          The default constructor
 
Method Summary
 void clear()
          Reports all IDs stored in the AccessData to the CrawlerHandler as cleared, and then disposes of the AccessData instance.
protected  void clear(String url)
           
 void crawl()
          Starts crawling the domain defined in the DataSource of this Crawler.
protected abstract  ExitCode crawlObjects()
          Method called by crawl() that should implement the actual crawling of the DataSource.
 AccessData getAccessData()
          Returns the AccessData instance used by the crawler
 CrawlerHandler getCrawlerHandler()
          Returns the crawler handler
 CrawlReport getCrawlReport()
          Gets the CrawlReport of the last performed crawl, or the current crawl when it is in progress.
 File getCrawlReportFile()
          Returns the file where the crawl report is to be saved
 DataAccessorRegistry getDataAccessorRegistry()
          Returns the data accessor registry
 DataSource getDataSource()
          Returns the data source
protected  RDFContainerFactory getRDFContainerFactory(String url)
           
protected  boolean inDomain(String uri)
           
 boolean isStopRequested()
          Returns true if the crawler is currently stopping, false otherwise
protected  void reportAccessingObject(String url)
           
protected  void reportDeletedDataObject(String url)
           
protected  ExitCode reportFatalErrorCause(String msg)
          Reports the cause of the fatal error.
protected  ExitCode reportFatalErrorCause(String msg, Throwable cause)
          Reports the cause of the fatal error.
protected  ExitCode reportFatalErrorCause(Throwable t)
          Reports the cause of the fatal error.
protected  void reportModifiedDataObject(DataObject object)
           
protected  void reportNewDataObject(DataObject object)
           
protected  void reportUnmodifiedDataObject(String url)
           
protected  void reportUntouched()
           
 void runSubCrawler(SubCrawler localSubCrawler, DataObject object, InputStream stream, Charset charset, String mimeType)
          Runs the given SubCrawler on the given stream.
 void setAccessData(AccessData accessData)
          Sets the AccessData instance to be used by the crawler
 void setCrawlerHandler(CrawlerHandler handler)
          Sets the crawler handler
 void setCrawlReportFile(File file)
          Sets the file where the crawl report is to be saved
 void setDataAccessorRegistry(DataAccessorRegistry accessorRegistry)
          Sets the data accessor registry
 void setDataSource(DataSource source)
          Sets the data source
 void stop()
          Stops a running crawl or clear operation as fast as possible.
protected  void storeCrawlReport()
          Stores the current CrawlReport, if any, to the crawl report file, if one is set.
protected  void touchObject(String string)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

source

protected DataSource source
The DataSource representing the physical source of information.


accessorRegistry

protected DataAccessorRegistry accessorRegistry
The current DataAccessorRegistry.


accessData

protected AccessData accessData
The current AccessData instance.


crawlReportFile

protected File crawlReportFile
The file for persistent storage of CrawlReports.


stopRequested

protected boolean stopRequested
Flag indicating that this Crawler should stop scanning or clearing as soon as possible.

Constructor Detail

CrawlerBase

public CrawlerBase()
The default constructor

Method Detail

setDataSource

public void setDataSource(DataSource source)
Sets the data source

Parameters:
source - the new data source

getDataSource

public DataSource getDataSource()
Returns the data source

Specified by:
getDataSource in interface Crawler
Returns:
the data source
See Also:
Crawler.getDataSource()

setDataAccessorRegistry

public void setDataAccessorRegistry(DataAccessorRegistry accessorRegistry)
Sets the data accessor registry

Specified by:
setDataAccessorRegistry in interface Crawler
Parameters:
accessorRegistry - the new data accessor registry
See Also:
Crawler.setDataAccessorRegistry(DataAccessorRegistry)

getDataAccessorRegistry

public DataAccessorRegistry getDataAccessorRegistry()
Returns the data accessor registry

Specified by:
getDataAccessorRegistry in interface Crawler
Returns:
the data accessor registry
See Also:
Crawler.getDataAccessorRegistry()

setAccessData

public void setAccessData(AccessData accessData)
Sets the AccessData instance to be used by the crawler

Specified by:
setAccessData in interface Crawler
Parameters:
accessData - the AccessData instance to be used by the crawler
See Also:
Crawler.setAccessData(AccessData)

getAccessData

public AccessData getAccessData()
Returns the AccessData instance used by the crawler

Specified by:
getAccessData in interface Crawler
Returns:
the AccessData instance used by the crawler
See Also:
Crawler.getAccessData()

setCrawlerHandler

public void setCrawlerHandler(CrawlerHandler handler)
Sets the crawler handler

Specified by:
setCrawlerHandler in interface Crawler
Parameters:
handler - the crawler handler
See Also:
Crawler.setCrawlerHandler(CrawlerHandler)

getCrawlerHandler

public CrawlerHandler getCrawlerHandler()
Returns the crawler handler

Specified by:
getCrawlerHandler in interface Crawler
Returns:
the crawler handler
See Also:
Crawler.getCrawlerHandler()

crawl

public void crawl()
Description copied from interface: Crawler
Starts crawling the domain defined in the DataSource of this Crawler. If this is a subsequent run of this method, it will only report the differences with the previous run, unless the previous scan results have been cleared. Any CrawlerListeners registered on this Crawler will get notified about the crawling progress.

Specified by:
crawl in interface Crawler
See Also:
Crawler.crawl()
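The setup required before calling crawl() can be sketched as follows. This is a hedged usage sketch, not code from the Aperture distribution: FileSystemCrawler is one of the subclasses listed above, and the dataSource, accessorRegistry, accessData, and handler variables are assumed to be configured elsewhere.

```java
// Usage sketch (assumed setup): wire the collaborators, then crawl.
CrawlerBase crawler = new FileSystemCrawler();
crawler.setDataSource(dataSource);                  // the physical source to crawl (assumed configured)
crawler.setDataAccessorRegistry(accessorRegistry);  // registry used to obtain DataAccessors (assumed)
crawler.setAccessData(accessData);                  // optional; enables incremental (diff-only) recrawls
crawler.setCrawlerHandler(handler);                 // receives progress and DataObject callbacks
crawler.crawl();  // a subsequent call reports only differences with the previous run
```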

crawlObjects

protected abstract ExitCode crawlObjects()
Method called by crawl() that should implement the actual crawling of the DataSource. The return value of this method should indicate whether the scanning completed successfully (i.e., whether it ran to completion without being interrupted). This method is also expected to update the deprecatedUrls set, as any URLs remaining in that set will be reported as removed after this method completes.

Returns:
An ExitCode indicating how the crawl procedure terminated.
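A subclass implementation of crawlObjects() might follow the pattern below. This is a minimal sketch, assuming a hypothetical in-memory backing store; the ExitCode constant names COMPLETED and STOP_REQUESTED are assumptions based on the ExitCode type referenced in this class, and the report* methods are the protected helpers documented above.

```java
// Hypothetical minimal subclass; the backing map is invented for illustration.
public class InMemoryCrawler extends CrawlerBase {

    private final java.util.Map<String, DataObject> objects;

    public InMemoryCrawler(java.util.Map<String, DataObject> objects) {
        this.objects = objects;
    }

    @Override
    protected ExitCode crawlObjects() {
        for (java.util.Map.Entry<String, DataObject> entry : objects.entrySet()) {
            if (stopRequested) {
                // honor stop() as soon as possible
                return ExitCode.STOP_REQUESTED;  // assumed constant name
            }
            reportAccessingObject(entry.getKey());
            reportNewDataObject(entry.getValue());
        }
        return ExitCode.COMPLETED;  // assumed constant name
    }
}
```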

stop

public void stop()
Description copied from interface: Crawler
Stops a running crawl or clear operation as fast as possible. This method may return before the crawling has actually stopped.

Specified by:
stop in interface Crawler
See Also:
Crawler.stop()
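Because stop() may return before crawling has actually stopped, a controlling thread typically joins the crawling thread afterwards. A sketch, with the threading arrangement assumed:

```java
// Sketch: stopping a long-running crawl from another thread.
Thread crawlThread = new Thread(crawler::crawl);
crawlThread.start();
// ... later, from the controlling thread:
crawler.stop();       // requests the stop; may return before the crawl has halted
crawlThread.join();   // wait for the crawl to actually wind down
```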

isStopRequested

public boolean isStopRequested()
Returns true if the crawler is currently stopping, false otherwise

Returns:
true if the crawler is currently stopping, false otherwise

clear

public void clear()
Reports all IDs stored in the AccessData as being cleared to the CrawlerHandler and then gets rid of the AccessData instance.

Specified by:
clear in interface Crawler

clear

protected void clear(String url)

setCrawlReportFile

public void setCrawlReportFile(File file)
Sets the file where the crawl report is to be saved

Parameters:
file - the file where the crawl report is to be saved

getCrawlReportFile

public File getCrawlReportFile()
Returns the file where the crawl report is to be saved

Returns:
the file where the crawl report is to be saved

getCrawlReport

public CrawlReport getCrawlReport()
Description copied from interface: Crawler
Gets the CrawlReport of the last performed crawl, or the current crawl when it is in progress. Returns null when no crawl has been performed in this application's session yet and there is no report available from the previous session.

Specified by:
getCrawlReport in interface Crawler
Returns:
The CrawlReport of the last run, or null when this is not available.
See Also:
Crawler.getCrawlReport()

reportAccessingObject

protected void reportAccessingObject(String url)

reportNewDataObject

protected void reportNewDataObject(DataObject object)

touchObject

protected void touchObject(String string)

reportModifiedDataObject

protected void reportModifiedDataObject(DataObject object)

reportUnmodifiedDataObject

protected void reportUnmodifiedDataObject(String url)

reportDeletedDataObject

protected void reportDeletedDataObject(String url)

reportUntouched

protected void reportUntouched()

reportFatalErrorCause

protected ExitCode reportFatalErrorCause(String msg)
Reports the cause of the fatal error. Subclasses should always call this method when crawlObjects() is to return ExitCode.FATAL_ERROR.

Parameters:
msg - the message describing the fatal error
Returns:
ExitCode.FATAL_ERROR

reportFatalErrorCause

protected ExitCode reportFatalErrorCause(String msg,
                                         Throwable cause)
Reports the cause of the fatal error. Subclasses should always call this method when crawlObjects() is to return ExitCode.FATAL_ERROR.

Parameters:
msg - the message describing the fatal error
cause - the Throwable that caused the fatal error
Returns:
ExitCode.FATAL_ERROR

reportFatalErrorCause

protected ExitCode reportFatalErrorCause(Throwable t)
Reports the cause of the fatal error. Subclasses should always call this method when crawlObjects() is to return ExitCode.FATAL_ERROR.

Parameters:
t - the Throwable that caused the fatal error
Returns:
ExitCode.FATAL_ERROR
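The intended call pattern is to report the cause and return its result in one step from within crawlObjects(). A sketch, with the exception type and ExitCode.COMPLETED constant assumed:

```java
// Sketch: reporting a fatal error from a crawlObjects() implementation.
@Override
protected ExitCode crawlObjects() {
    try {
        // ... actual crawling of the DataSource ...
        return ExitCode.COMPLETED;  // assumed constant name
    }
    catch (IOException e) {
        // records the cause for the CrawlReport and returns ExitCode.FATAL_ERROR
        return reportFatalErrorCause("I/O error while crawling", e);
    }
}
```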

getRDFContainerFactory

protected RDFContainerFactory getRDFContainerFactory(String url)

storeCrawlReport

protected void storeCrawlReport()
Stores the current CrawlReport, if any, to the crawl report file, if one is set.


inDomain

protected boolean inDomain(String uri)

runSubCrawler

public void runSubCrawler(SubCrawler localSubCrawler,
                          DataObject object,
                          InputStream stream,
                          Charset charset,
                          String mimeType)
                   throws SubCrawlerException
Description copied from interface: Crawler
Runs the given SubCrawler on the given stream.

This method uses the information stored within the crawler to provide appropriate arguments to the SubCrawler.subCrawl(...) method. DataObjects found by the SubCrawler will be reported to the CrawlerHandler registered with this crawler with the Crawler.setCrawlerHandler(CrawlerHandler) method. The AccessData and the internal data structures of this crawler will be updated correctly. The SubCrawler will be stopped if the Crawler.stop() method is invoked on this crawler.

In most cases, a SubCrawler is used to extract additional information from a DataObject found by a normal Crawler. In such cases, using this method is strongly recommended over invoking SubCrawler.subCrawl(...) directly; bypassing it may cause the Crawler to behave unpredictably.

IMPORTANT

There are two important issues to take care about when calling this method.

First, if this method is called from a CrawlerHandler method (e.g. CrawlerHandler.objectNew(Crawler, DataObject) or CrawlerHandler.objectChanged(Crawler, DataObject)) that has been invoked by a running crawler, it SHOULD be run on the same thread that called the CrawlerHandler method (i.e. the crawling thread). Running this method in a new thread may result in unpredictable behavior of the Crawler.

Second, after this method is called, the crawler will report new or modified objects before this method returns, so implementations of CrawlerHandler methods must be reentrant. It is therefore recommended to process any metadata in a DataObject before invoking a SubCrawler.

Specified by:
runSubCrawler in interface Crawler
Parameters:
localSubCrawler - the subcrawler to be used
object - the parent data object, its metadata may be augmented by the SubCrawler
stream - the InputStream for the SubCrawler to work on. Note that even though there may be additional resources stored in the DataObject itself (like an InputStream or a File) they are not used.
charset - the charset in which the input stream is encoded (optional)
mimeType - the mime type of the input stream (optional)
Throws:
SubCrawlerException - if some error during SubCrawling occurs.
See Also:
Crawler.runSubCrawler(SubCrawler, DataObject, InputStream, Charset, String)
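The threading and reentrancy constraints above suggest the following shape for a handler callback. This is a hedged sketch: processMetadata and openStream are hypothetical helpers, the "message/rfc822" MIME type is an arbitrary example, and error handling is elided.

```java
// Sketch: invoking a SubCrawler from a CrawlerHandler callback, on the
// crawling thread, after the object's own metadata has been processed.
public void objectNew(Crawler crawler, DataObject object) {
    processMetadata(object);  // hypothetical: handle metadata before sub-crawling
    try (InputStream stream = openStream(object)) {  // hypothetical stream accessor
        // charset is optional and may be null; MIME type chosen for illustration
        crawler.runSubCrawler(subCrawler, object, stream, null, "message/rfc822");
    }
    catch (SubCrawlerException | IOException e) {
        // handle or log the failure (logging framework not shown)
    }
}
```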


Copyright © 2010 Aperture Development Team. All Rights Reserved.