org.semanticdesktop.aperture.crawler
Interface Crawler

All Known Implementing Classes:
AbstractJavaMailCrawler, AbstractTagCrawler, AddressbookCrawler, BibsonomyCrawler, CrawlerBase, DeliciousCrawler, FileSystemCrawler, FlickrCrawler, IcalCrawler, ImapCrawler, MboxCrawler, OutlookCrawler, SambaCrawler, ThunderbirdCrawler, WebCrawler

public interface Crawler

A Crawler accesses the physical source represented by a DataSource and delivers a stream of DataObjects representing the resources in that source.

An AccessData instance can optionally be specified to a Crawler, allowing it to perform incremental crawling, i.e. to scan and report the differences in the data source since the last crawl.
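For illustration, here is a minimal sketch of a typical crawl, assuming a concrete implementation such as FileSystemCrawler and the setDataSource(DataSource) method that CrawlerBase provides; the dataSource, registry, accessData and handler objects are assumed to have been configured elsewhere:

    // Sketch: a full first crawl followed by an incremental second crawl.
    FileSystemCrawler crawler = new FileSystemCrawler();
    crawler.setDataSource(dataSource);          // defined on CrawlerBase, not on this interface
    crawler.setDataAccessorRegistry(registry);  // supplies DataAccessorFactories
    crawler.setAccessData(accessData);          // optional: enables incremental crawling
    crawler.setCrawlerHandler(handler);         // receives the scanned resources
    crawler.crawl();                            // first run: everything is reported as new
    // ... the data source changes ...
    crawler.crawl();                            // second run: only the differences are reported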


Method Summary
 void clear()
          Clears the information the crawler had about the state of the data source.
 void crawl()
          Starts crawling the domain defined in the DataSource of this Crawler.
 AccessData getAccessData()
          Returns the AccessData used by this Crawler.
 CrawlerHandler getCrawlerHandler()
          Returns the currently registered CrawlerHandler.
 CrawlReport getCrawlReport()
          Gets the CrawlReport of the last performed crawl, or the current crawl when it is in progress.
 DataAccessorRegistry getDataAccessorRegistry()
          Returns the DataAccessorRegistry currently used by this Crawler.
 DataSource getDataSource()
          Returns the DataSource crawled by this Crawler.
 void runSubCrawler(SubCrawler subCrawler, DataObject object, InputStream stream, Charset charset, String mimeType)
          Runs the given SubCrawler on the given stream.
 void setAccessData(AccessData accessData)
          Sets the AccessData instance to be used.
 void setCrawlerHandler(CrawlerHandler handler)
          Sets the CrawlerHandler to which this Crawler should report any scanned or cleared resources and from which it obtains RDFContainers.
 void setDataAccessorRegistry(DataAccessorRegistry registry)
          Sets the DataAccessorRegistry to obtain DataAccessorFactories from.
 void stop()
          Stops a running crawl or clear operation as fast as possible.
 

Method Detail

getDataSource

DataSource getDataSource()
Returns the DataSource crawled by this Crawler.

Returns:
the DataSource crawled by this Crawler.

setDataAccessorRegistry

void setDataAccessorRegistry(DataAccessorRegistry registry)
Sets the DataAccessorRegistry to obtain DataAccessorFactories from.

Parameters:
registry - The DataAccessorRegistry to use, or 'null' when the DataAccessorRegistry should be unset.
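
A sketch, assuming the DefaultDataAccessorRegistry implementation shipped with Aperture:

    // Sketch: install a registry holding the default set of
    // DataAccessorFactories (DefaultDataAccessorRegistry is an assumption).
    crawler.setDataAccessorRegistry(new DefaultDataAccessorRegistry());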

getDataAccessorRegistry

DataAccessorRegistry getDataAccessorRegistry()
Returns the DataAccessorRegistry currently used by this Crawler.

Returns:
The currently used DataAccessorRegistry, or 'null' when no DataAccessorRegistry has been set.

setAccessData

void setAccessData(AccessData accessData)
Sets the AccessData instance to be used.

Parameters:
accessData - The AccessData instance to use, or 'null' when no AccessData is to be used.
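
A sketch, assuming an in-memory AccessData implementation named AccessDataImpl; a persistent implementation would preserve the crawl state across application restarts:

    // Sketch: register AccessData before the first crawl to enable
    // incremental crawling; pass null to disable it again.
    crawler.setAccessData(new AccessDataImpl());  // AccessDataImpl is an assumption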

getAccessData

AccessData getAccessData()
Returns the AccessData used by this Crawler.

Returns:
The AccessData used by this Crawler, or 'null' when no AccessData is used.

crawl

void crawl()
Starts crawling the domain defined in the DataSource of this Crawler. If this is a subsequent run of this method, it will only report the differences from the previous run, unless the previous crawl results have been cleared. The CrawlerHandler registered on this Crawler will be notified about the crawling progress.


clear

void clear()
Clears the information the crawler had about the state of the data source.

This means deleting the stored crawl results from the AccessData instance registered with this Crawler via setAccessData(AccessData). Note that this clears ONLY the information stored in that AccessData instance, not the information stored in the data source itself.

The CrawlerHandler registered with this Crawler is notified of the removal of the individual crawl results. Starting the clearing process results in a call to CrawlerHandler.clearStarted(Crawler). Afterwards, each deleted entry in the AccessData is reported to the CrawlerHandler with a call to CrawlerHandler.clearingObject(Crawler, String). At the end, the CrawlerHandler receives a call to CrawlerHandler.clearFinished(Crawler, ExitCode).

As a result of calling this method, the AccessData instance is left in an empty state, meaning that the next call to crawl() will report all DataObjects in the data source as new, via CrawlerHandler.objectNew(Crawler, org.semanticdesktop.aperture.accessor.DataObject).
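
The callback sequence can be observed with handler methods like the following sketch; only the three callbacks named above are shown, and a real implementation must also provide the remaining CrawlerHandler methods:

    // Sketch: the notifications a CrawlerHandler receives during clear().
    public void clearStarted(Crawler crawler) {
        System.out.println("clearing started");
    }

    public void clearingObject(Crawler crawler, String url) {
        System.out.println("cleared from AccessData: " + url);
    }

    public void clearFinished(Crawler crawler, ExitCode exitCode) {
        System.out.println("clearing finished with exit code " + exitCode);
    }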


stop

void stop()
Stops a running crawl or clear operation as fast as possible. This method may return before the crawling has actually stopped.
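
Since stop() interrupts a crawl in progress, it is typically called from a different thread than the one executing crawl(); a sketch, assuming crawl() runs synchronously on the calling thread:

    // Sketch: run the blocking crawl on a worker thread so that stop()
    // can be invoked from elsewhere ('crawler' is assumed to be final).
    Thread worker = new Thread(new Runnable() {
        public void run() {
            crawler.crawl();
        }
    });
    worker.start();
    // ... later, e.g. in response to user input:
    crawler.stop();   // may return before crawling has actually stopped
    worker.join();    // declares InterruptedException; waits for actual termination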


getCrawlReport

CrawlReport getCrawlReport()
Gets the CrawlReport of the last performed crawl, or the current crawl when it is in progress. Returns null when no crawl has been performed in this application's session yet and there is no report available from the previous session.

Returns:
The CrawlReport of the last run, or null when this is not available.
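
A sketch of inspecting the report after a crawl; the getExitCode() accessor on CrawlReport is an assumption, as it is not documented on this page:

    // Sketch: check whether a report exists and how the last crawl ended.
    CrawlReport report = crawler.getCrawlReport();
    if (report != null) {
        System.out.println("last crawl exited with: " + report.getExitCode());
    }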

setCrawlerHandler

void setCrawlerHandler(CrawlerHandler handler)
Sets the CrawlerHandler to which this Crawler should report any scanned or cleared resources and from which it obtains RDFContainers.

Parameters:
handler - The CrawlerHandler to register.
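
A partial sketch of a handler; only callbacks named on this page are shown, and a real implementation must also provide the remaining CrawlerHandler methods, including the one that supplies RDFContainers:

    // Sketch: a CrawlerHandler fragment reacting to new and changed resources
    // (getID() and dispose() on DataObject are assumptions).
    public void objectNew(Crawler crawler, DataObject object) {
        System.out.println("new resource: " + object.getID());
        object.dispose();  // release the object's resources when done
    }

    public void objectChanged(Crawler crawler, DataObject object) {
        System.out.println("changed resource: " + object.getID());
        object.dispose();
    }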

getCrawlerHandler

CrawlerHandler getCrawlerHandler()
Returns the currently registered CrawlerHandler.

Returns:
The current CrawlerHandler.

runSubCrawler

void runSubCrawler(SubCrawler subCrawler,
                   DataObject object,
                   InputStream stream,
                   Charset charset,
                   String mimeType)
                   throws SubCrawlerException
Runs the given SubCrawler on the given stream.

This method uses the information stored within the crawler to provide appropriate arguments to the SubCrawler.subCrawl(...) method. DataObjects found by the SubCrawler will be reported to the CrawlerHandler registered with this crawler with the setCrawlerHandler(CrawlerHandler) method. The AccessData and the internal data structures of this crawler will be updated correctly. The SubCrawler will be stopped if the stop() method is invoked on this crawler.

In most cases, a SubCrawler is used to extract additional information from a DataObject found by a normal Crawler. In such cases, using this method instead of invoking SubCrawler.subCrawl(...) directly is strongly recommended; otherwise the Crawler may behave unpredictably.

IMPORTANT

There are two important issues to be aware of when calling this method.

First, if this method is called from a CrawlerHandler method (e.g. CrawlerHandler.objectNew(Crawler, DataObject) or CrawlerHandler.objectChanged(Crawler, DataObject)) that has been invoked by a running crawler, it SHOULD be run on the same thread that called the CrawlerHandler method (i.e. the crawling thread). Running this method in a new thread may result in unpredictable behavior of the Crawler.

Second, the crawler will report new or modified objects before this method returns, so implementations of CrawlerHandler methods must be reentrant. It is therefore recommended to process any metadata in a DataObject before invoking a SubCrawler.
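
For illustration, a sketch of invoking a SubCrawler from within a CrawlerHandler callback, on the crawling thread as required; the subCrawler instance, the FileDataObject cast with its getContent() method, and the MIME type are assumptions:

    // Sketch: inside CrawlerHandler.objectNew(...), i.e. on the crawling thread.
    public void objectNew(Crawler crawler, DataObject object) {
        storeMetadata(object);  // hypothetical: process metadata BEFORE subcrawling,
                                // since further objects may be reported re-entrantly
        if (object instanceof FileDataObject) {
            try {
                InputStream stream = ((FileDataObject) object).getContent();
                crawler.runSubCrawler(subCrawler, object, stream, null, "message/rfc822");
            } catch (Exception e) {  // e.g. SubCrawlerException
                // log and continue with the rest of the crawl
            }
        }
    }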

Parameters:
subCrawler - the subcrawler to be used
object - the parent data object, its metadata may be augmented by the SubCrawler
stream - the InputStream for the SubCrawler to work on. Note that even though additional resources may be stored in the DataObject itself (such as an InputStream or a File), they are not used.
charset - the charset in which the input stream is encoded (optional)
mimeType - the mime type of the input stream (optional)
Throws:
SubCrawlerException - if some error during SubCrawling occurs.


Copyright © 2010 Aperture Development Team. All Rights Reserved.