public interface Crawler
A Crawler accesses the physical source represented by a DataSource and delivers a stream of DataObjects representing the resources in that source.
An AccessData instance can optionally be specified to a Crawler, allowing it to perform incremental crawling, i.e. to scan and report the differences in the data source since the last crawl.
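The incremental crawling described above can be illustrated with a minimal sketch. It does not use the real Aperture types: `SketchAccessData` and `IncrementalCrawlSketch` are hypothetical stand-ins, and the "modification stamp" per resource is an assumption made only to have something to compare between crawls.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for AccessData: remembers what the previous crawl saw.
// The real AccessData interface is richer; this is only an illustration.
class SketchAccessData {
    final Map<String, Long> lastSeen = new HashMap<>();
}

class IncrementalCrawlSketch {
    /**
     * Compares the current state of a source (id -> modification stamp)
     * with the stored access data and returns one event per difference.
     * With accessData == null every resource is reported as new.
     */
    static List<String> crawl(Map<String, Long> source, SketchAccessData accessData) {
        List<String> events = new ArrayList<>();
        // TreeMap is used only to make the event order deterministic.
        for (Map.Entry<String, Long> e : new TreeMap<String, Long>(source).entrySet()) {
            Long previous = accessData == null ? null : accessData.lastSeen.get(e.getKey());
            if (previous == null) {
                events.add("new:" + e.getKey());
            } else if (!previous.equals(e.getValue())) {
                events.add("changed:" + e.getKey());
            }
            // Unchanged resources produce no event in this sketch.
        }
        if (accessData != null) {
            // Anything remembered from the last crawl but no longer present was removed.
            for (String id : new TreeMap<String, Long>(accessData.lastSeen).keySet()) {
                if (!source.containsKey(id)) {
                    events.add("removed:" + id);
                }
            }
            // Store the current state for the next crawl.
            accessData.lastSeen.clear();
            accessData.lastSeen.putAll(source);
        }
        return events;
    }
}
```

Calling `crawl` twice with the same `SketchAccessData` yields only the differences on the second call, while passing `null` makes every call a full crawl.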
Method Summary
Return type | Method | Description
---|---|---
void | clear() | Clears the information the crawler had about the state of the data source.
void | crawl() | Starts crawling the domain defined in the DataSource of this Crawler.
AccessData | getAccessData() | Returns the AccessData used by this Crawler.
CrawlerHandler | getCrawlerHandler() | Returns the currently registered CrawlerHandler.
CrawlReport | getCrawlReport() | Gets the CrawlReport of the last performed crawl, or the current crawl when it is in progress.
DataAccessorRegistry | getDataAccessorRegistry() | Returns the DataAccessorRegistry currently used by this Crawler.
DataSource | getDataSource() | Returns the DataSource crawled by this Crawler.
void | runSubCrawler(SubCrawler subCrawler, DataObject object, InputStream stream, Charset charset, String mimeType) | Runs the given SubCrawler on the given stream.
void | setAccessData(AccessData accessData) | Sets the AccessData instance to be used.
void | setCrawlerHandler(CrawlerHandler handler) | Sets the CrawlerHandler to which this Crawler should report any scanned or cleared resources and from which it obtains RDFContainer instances.
void | setDataAccessorRegistry(DataAccessorRegistry registry) | Sets the DataAccessorRegistry to obtain DataAccessorFactories from.
void | stop() | Stops a running crawl or clear operation as fast as possible.
Method Detail |
---|
DataSource getDataSource()
Returns the DataSource crawled by this Crawler.
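void setDataAccessorRegistry(DataAccessorRegistry registry)
Sets the DataAccessorRegistry to obtain DataAccessorFactories from.
registry - The DataAccessorRegistry to use, or 'null' when the DataAccessorRegistry should be unset.
DataAccessorRegistry getDataAccessorRegistry()
Returns the DataAccessorRegistry currently used by this Crawler.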
void setAccessData(AccessData accessData)
Sets the AccessData instance to be used.
accessData - The AccessData instance to use, or 'null' when no AccessData is to be used.
AccessData getAccessData()
Returns the AccessData used by this Crawler.
void crawl()
Starts crawling the domain defined in the DataSource of this Crawler.
void clear()
Clears the information the crawler had about the state of the data source, that is, the information held in the AccessData instance set with setAccessData(AccessData). Note that this entails clearing ONLY the information stored in that AccessData instance, not the information stored in the data source itself.
The start of the operation is reported to the CrawlerHandler with a call to CrawlerHandler.clearStarted(Crawler). Afterwards each deleted entry in the AccessData is reported to the CrawlerHandler with a call to CrawlerHandler.clearingObject(Crawler, String). At the end, the CrawlerHandler receives a call to CrawlerHandler.clearFinished(Crawler, ExitCode).
After clearing, crawl() will report all DataObjects in the data source as new, via CrawlerHandler.objectNew(Crawler, org.semanticdesktop.aperture.accessor.DataObject).
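The callback order during a clear operation (start, one call per deleted entry, finish) can be sketched as follows. The types here are hypothetical simplifications: the real CrawlerHandler callbacks also receive the Crawler, and clearFinished receives an ExitCode, both of which this sketch leaves out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified handler. The real callbacks take extra arguments
// (the Crawler, and an ExitCode for clearFinished) that are omitted here.
interface SketchClearHandler {
    void clearStarted();
    void clearingObject(String url);
    void clearFinished();
}

class ClearSketch {
    /** Empties the stored ids and reports each step to the handler, in order. */
    static void clear(Set<String> storedIds, SketchClearHandler handler) {
        handler.clearStarted();
        // Iterate over a copy because the set is modified while reporting.
        for (String id : new ArrayList<String>(storedIds)) {
            handler.clearingObject(id);
            storedIds.remove(id);
        }
        handler.clearFinished();
    }
}

// Handler that simply records the calls it receives.
class RecordingClearHandler implements SketchClearHandler {
    final List<String> calls = new ArrayList<>();

    public void clearStarted() {
        calls.add("clearStarted");
    }

    public void clearingObject(String url) {
        calls.add("clearingObject:" + url);
    }

    public void clearFinished() {
        calls.add("clearFinished");
    }
}
```

Only the stored bookkeeping is emptied; nothing in the sketch touches the underlying source, which mirrors the note that clearing affects the AccessData alone.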
void stop()
Stops a running crawl or clear operation as fast as possible.
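How an implementation honours stop() is not specified here. One common approach, shown below purely as an assumption, is a cooperative flag that the crawling loop checks between items. `StoppableCrawlSketch` and its `stopAfter` parameter are hypothetical; `stopAfter` only simulates an external stop request so the example is deterministic.

```java
import java.util.ArrayList;
import java.util.List;

class StoppableCrawlSketch {
    // volatile so that a stop() issued from another thread becomes visible
    // to the crawling thread.
    private volatile boolean stopRequested = false;
    final List<Integer> processed = new ArrayList<>();

    void stop() {
        stopRequested = true;
    }

    /** Processes items in order; returns true if it completed, false if stopped early. */
    boolean crawl(int itemCount, int stopAfter) {
        stopRequested = false;
        for (int i = 0; i < itemCount; i++) {
            if (stopRequested) {
                return false; // the flag is checked between items
            }
            processed.add(i);
            if (processed.size() == stopAfter) {
                stop(); // simulates a stop request arriving during the crawl
            }
        }
        return true;
    }
}
```

With this design a stop takes effect at the next item boundary, so "as fast as possible" depends on how long a single item takes to process.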
CrawlReport getCrawlReport()
Gets the CrawlReport of the last performed crawl, or the current crawl when it is in progress.
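void setCrawlerHandler(CrawlerHandler handler)
Sets the CrawlerHandler to which this Crawler should report any scanned or cleared resources and from which it obtains RDFContainer instances.
handler - The CrawlerHandler to register.
CrawlerHandler getCrawlerHandler()
Returns the currently registered CrawlerHandler.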
void runSubCrawler(SubCrawler subCrawler, DataObject object, InputStream stream, Charset charset, String mimeType) throws SubCrawlerException
Runs the given SubCrawler on the given stream. Use this method instead of calling the SubCrawler.subCrawl(...) method. DataObjects found by the SubCrawler will be reported to the CrawlerHandler registered with this crawler with the setCrawlerHandler(CrawlerHandler) method. The AccessData and the internal data structures of this crawler will be updated correctly. The SubCrawler will be stopped if the stop() method is invoked on this crawler.
Avoid calling SubCrawler.subCrawl(...) directly; otherwise the Crawler may behave unpredictably.
If this method is called from a CrawlerHandler method (e.g. CrawlerHandler.objectNew(Crawler, DataObject) or CrawlerHandler.objectChanged(Crawler, DataObject)) that has been invoked by a running crawler, it SHOULD be run on the same thread that called the CrawlerHandler method (i.e. the crawling thread). Trying to run this method in a new thread may result in unpredictable behavior of the Crawler.
This means that CrawlerHandler methods must be reentrant (see Wikipedia). It is recommended that processing any metadata in a DataObject takes place before the invocation of a SubCrawler.
subCrawler - the SubCrawler to be used
object - the parent data object; its metadata may be augmented by the SubCrawler
stream - the InputStream for the SubCrawler to work on. Note that even though there may be additional resources stored in the DataObject itself (like an InputStream or a File), they are not used.
charset - the charset in which the input stream is encoded (optional)
mimeType - the MIME type of the input stream (optional)
SubCrawlerException - if an error occurs during sub-crawling.
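The reentrancy requirement can be made concrete with a small sketch. All types below are hypothetical stand-ins with simplified signatures (plain strings instead of DataObjects, no streams). The point is only that a sub-crawl started from inside objectNew, on the crawling thread, re-enters objectNew before the outer call has returned.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; names and signatures are simplified for illustration.
interface SketchHandler {
    void objectNew(SketchCrawler crawler, String id);
}

class SketchCrawler {
    private final SketchHandler handler;

    SketchCrawler(SketchHandler handler) {
        this.handler = handler;
    }

    void crawl(List<String> ids) {
        for (String id : ids) {
            handler.objectNew(this, id);
        }
    }

    // Stand-in for runSubCrawler: reports each child synchronously,
    // on the thread of the caller.
    void runSubCrawler(String parentId, List<String> children) {
        for (String child : children) {
            handler.objectNew(this, parentId + "!/" + child);
        }
    }
}

class ReentrantHandler implements SketchHandler {
    final List<String> seen = new ArrayList<>();
    int depth = 0;
    int maxDepth = 0;

    public void objectNew(SketchCrawler crawler, String id) {
        depth++;
        maxDepth = Math.max(maxDepth, depth);
        // Handle this object's own metadata first, before starting the sub-crawl.
        seen.add(id);
        if (id.endsWith(".zip")) {
            // Same thread: objectNew is re-entered for each child before this call returns.
            crawler.runSubCrawler(id, List.of("readme.txt"));
        }
        depth--;
    }
}
```

Because the handler finishes its own bookkeeping (`seen.add(id)`) before invoking the sub-crawl, the nested call finds the handler in a consistent state, which is the reason for the recommendation to process metadata first.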