java.lang.Object
  org.semanticdesktop.aperture.crawler.base.CrawlerBase
public abstract class CrawlerBase
An implementation of the Crawler interface that offers generic implementations for some of its methods.
Field Summary | |
---|---|
protected AccessData | accessData - The current AccessData instance. |
protected DataAccessorRegistry | accessorRegistry - The current DataAccessorRegistry. |
protected File | crawlReportFile - The file for persistent storage of CrawlReports. |
protected DataSource | source - The DataSource representing the physical source of information. |
protected boolean | stopRequested - Flag indicating that this Crawler should stop scanning or clearing as soon as possible. |
Constructor Summary | |
---|---|
CrawlerBase() | The default constructor. |
Method Summary | |
---|---|
void | clear() - Reports all IDs stored in the AccessData as being cleared to the CrawlerHandler and then gets rid of the AccessData instance. |
protected void | clear(String url) |
void | crawl() - Starts crawling the domain defined in the DataSource of this Crawler. |
protected abstract ExitCode | crawlObjects() - Method called by crawl() that should implement the actual crawling of the DataSource. |
AccessData | getAccessData() - Returns the AccessData instance used by the crawler. |
CrawlerHandler | getCrawlerHandler() - Returns the crawler handler. |
CrawlReport | getCrawlReport() - Gets the CrawlReport of the last performed crawl, or the current crawl when it is in progress. |
File | getCrawlReportFile() - Returns the file where the crawl report is to be saved. |
DataAccessorRegistry | getDataAccessorRegistry() - Returns the data accessor registry. |
DataSource | getDataSource() - Returns the data source. |
protected RDFContainerFactory | getRDFContainerFactory(String url) |
protected boolean | inDomain(String uri) |
boolean | isStopRequested() - Returns true if the crawler is currently stopping, false otherwise. |
protected void | reportAccessingObject(String url) |
protected void | reportDeletedDataObject(String url) |
protected ExitCode | reportFatalErrorCause(String msg) - Reports the cause of the fatal error. |
protected ExitCode | reportFatalErrorCause(String msg, Throwable cause) - Reports the cause of the fatal error. |
protected ExitCode | reportFatalErrorCause(Throwable t) - Reports the cause of the fatal error. |
protected void | reportModifiedDataObject(DataObject object) |
protected void | reportNewDataObject(DataObject object) |
protected void | reportUnmodifiedDataObject(String url) |
protected void | reportUntouched() |
void | runSubCrawler(SubCrawler localSubCrawler, DataObject object, InputStream stream, Charset charset, String mimeType) - Runs the given SubCrawler on the given stream. |
void | setAccessData(AccessData accessData) - Sets the AccessData instance to be used by the crawler. |
void | setCrawlerHandler(CrawlerHandler handler) - Sets the crawler handler. |
void | setCrawlReportFile(File file) - Sets the file where the crawl report is to be saved. |
void | setDataAccessorRegistry(DataAccessorRegistry accessorRegistry) - Sets the data accessor registry. |
void | setDataSource(DataSource source) - Sets the data source. |
void | stop() - Stops a running crawl or clear operation as fast as possible. |
protected void | storeCrawlReport() - Stores the current CrawlReport, if any, to the crawl report file, if set. |
protected void | touchObject(String string) |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected DataSource source
protected DataAccessorRegistry accessorRegistry
protected AccessData accessData
protected File crawlReportFile
protected boolean stopRequested
Constructor Detail |
---|
public CrawlerBase()
Method Detail |
---|
public void setDataSource(DataSource source)
Sets the data source.
Parameters:
source - the new data source
public DataSource getDataSource()
Returns the data source.
Specified by:
getDataSource in interface Crawler
See Also:
Crawler.getDataSource()
public void setDataAccessorRegistry(DataAccessorRegistry accessorRegistry)
Sets the data accessor registry.
Specified by:
setDataAccessorRegistry in interface Crawler
Parameters:
accessorRegistry - the new data accessor registry
See Also:
Crawler.setDataAccessorRegistry(DataAccessorRegistry)
public DataAccessorRegistry getDataAccessorRegistry()
Returns the data accessor registry.
Specified by:
getDataAccessorRegistry in interface Crawler
See Also:
Crawler.getDataAccessorRegistry()
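public void setAccessData(AccessData accessData)
Sets the AccessData instance to be used by the crawler.
Specified by:
setAccessData in interface Crawler
Parameters:
accessData - the AccessData instance to be used by the crawler
See Also:
Crawler.setAccessData(AccessData)
public AccessData getAccessData()
Returns the AccessData instance used by the crawler.
Specified by:
getAccessData in interface Crawler
See Also:
Crawler.getAccessData()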
public void setCrawlerHandler(CrawlerHandler handler)
Sets the crawler handler.
Specified by:
setCrawlerHandler in interface Crawler
Parameters:
handler - the crawler handler
See Also:
Crawler.setCrawlerHandler(CrawlerHandler)
public CrawlerHandler getCrawlerHandler()
Returns the crawler handler.
Specified by:
getCrawlerHandler in interface Crawler
See Also:
Crawler.getCrawlerHandler()
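public void crawl()
Description copied from interface: Crawler
Starts crawling the domain defined in the DataSource of this Crawler.
Specified by:
crawl in interface Crawler
See Also:
Crawler.crawl()
protected abstract ExitCode crawlObjects()
Method called by crawl() that should implement the actual crawling of the DataSource.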
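The split between crawl() and crawlObjects() is a template-method arrangement: the base class owns the generic entry point and the subclass supplies the traversal. The following is a minimal, self-contained sketch of that shape only. `BaseCrawler`, `ListCrawler`, `getLastExitCode()` and the `ExitCode` enum are hypothetical stand-ins written for this example, not the Aperture types, and the constants other than FATAL_ERROR are assumed. What crawl() really does around the call to crawlObjects() is not documented on this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins so the sketch compiles on its own; NOT the Aperture classes.
final class CrawlTemplateSketch {
    // Assumed constants; only FATAL_ERROR is named on this page.
    enum ExitCode { COMPLETED, STOP_REQUESTED, FATAL_ERROR }

    // Generic part: crawl() delegates the actual work to crawlObjects().
    abstract static class BaseCrawler {
        private ExitCode lastExitCode;

        public final void crawl() {
            // The real class presumably does more bookkeeping here (assumption).
            lastExitCode = crawlObjects();
        }

        protected abstract ExitCode crawlObjects();

        public ExitCode getLastExitCode() {
            return lastExitCode;
        }
    }

    // Concrete part: a toy crawler that "visits" a fixed list of URLs.
    static final class ListCrawler extends BaseCrawler {
        private final List<String> urls;
        final List<String> visited = new ArrayList<>();

        ListCrawler(List<String> urls) {
            this.urls = urls;
        }

        @Override
        protected ExitCode crawlObjects() {
            for (String url : urls) {
                visited.add(url);
            }
            return ExitCode.COMPLETED;
        }
    }

    public static void main(String[] args) {
        ListCrawler crawler = new ListCrawler(Arrays.asList("file:/a.txt", "file:/b.txt"));
        crawler.crawl();
        System.out.println(crawler.getLastExitCode() + " " + crawler.visited);
    }
}
```

Callers only ever invoke crawl(); a subclass never needs to be called through crawlObjects() directly, which is why that method is protected.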
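public void stop()
Description copied from interface: Crawler
Stops a running crawl or clear operation as fast as possible.
Specified by:
stop in interface Crawler
See Also:
Crawler.stop()
public boolean isStopRequested()
Returns true if the crawler is currently stopping, false otherwise.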
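Since stop() only asks for termination "as fast as possible", a crawlObjects() implementation presumably has to poll the flag between units of work. The sketch below illustrates that cooperative pattern under stated assumptions: `PollingCrawler`, its `stopAfter` parameter and the `ExitCode` constants are hypothetical and exist only for this example. The stop request is issued from inside the loop to keep the run deterministic; in practice it would typically come from another thread, which is why the flag is volatile here.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins; NOT the Aperture classes.
final class StopFlagSketch {
    enum ExitCode { COMPLETED, STOP_REQUESTED }  // assumed constants

    static final class PollingCrawler {
        private volatile boolean stopRequested;  // volatile: may be set by another thread
        private final List<String> urls;
        private final int stopAfter;             // illustration only: when to simulate stop()
        private int processed;

        PollingCrawler(List<String> urls, int stopAfter) {
            this.urls = urls;
            this.stopAfter = stopAfter;
        }

        public void stop() {
            stopRequested = true;
        }

        public boolean isStopRequested() {
            return stopRequested;
        }

        public int getProcessed() {
            return processed;
        }

        ExitCode crawlObjects() {
            for (String url : urls) {
                // Check the flag before starting each new unit of work.
                if (isStopRequested()) {
                    return ExitCode.STOP_REQUESTED;
                }
                processed++;
                if (processed == stopAfter) {
                    stop();  // simulates an external stop request
                }
            }
            return ExitCode.COMPLETED;
        }
    }

    public static void main(String[] args) {
        PollingCrawler crawler =
                new PollingCrawler(Arrays.asList("a", "b", "c", "d"), 2);
        System.out.println(crawler.crawlObjects() + " after " + crawler.getProcessed());
    }
}
```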
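public void clear()
Reports all IDs stored in the AccessData as being cleared to the CrawlerHandler and then gets rid of the AccessData instance.
Specified by:
clear in interface Crawler
protected void clear(String url)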
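public void setCrawlReportFile(File file)
Sets the file where the crawl report is to be saved.
Parameters:
file - the file where the crawl report is to be saved
public File getCrawlReportFile()
Returns the file where the crawl report is to be saved.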
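public CrawlReport getCrawlReport()
Description copied from interface: Crawler
Gets the CrawlReport of the last performed crawl, or the current crawl when it is in progress.
Specified by:
getCrawlReport in interface Crawler
See Also:
Crawler.getCrawlReport()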
protected void reportAccessingObject(String url)
protected void reportNewDataObject(DataObject object)
protected void touchObject(String string)
protected void reportModifiedDataObject(DataObject object)
protected void reportUnmodifiedDataObject(String url)
protected void reportDeletedDataObject(String url)
protected void reportUntouched()
protected ExitCode reportFatalErrorCause(String msg)
Reports the cause of the fatal error. crawlObjects() is to return ExitCode.FATAL_ERROR.
Returns:
ExitCode.FATAL_ERROR
protected ExitCode reportFatalErrorCause(String msg, Throwable cause)
Reports the cause of the fatal error. crawlObjects() is to return ExitCode.FATAL_ERROR.
Returns:
ExitCode.FATAL_ERROR
protected ExitCode reportFatalErrorCause(Throwable t)
Reports the cause of the fatal error. crawlObjects() is to return ExitCode.FATAL_ERROR.
Returns:
ExitCode.FATAL_ERROR
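Because each reportFatalErrorCause overload returns ExitCode.FATAL_ERROR, a crawlObjects() implementation can plausibly report and exit in a single statement. The sketch below shows that idiom with hypothetical stand-ins: `BaseCrawler`, `FailingCrawler`, `openSource()` and the getters are invented for the example, and storing the message in a field is an assumption about what "reporting" means, not the documented behaviour.

```java
import java.io.IOException;

// Hypothetical stand-ins; NOT the Aperture classes.
final class FatalErrorSketch {
    enum ExitCode { COMPLETED, FATAL_ERROR }  // COMPLETED is an assumed constant

    abstract static class BaseCrawler {
        private String fatalErrorMessage;
        private Throwable fatalErrorCause;

        // Records the cause (assumed behaviour) and hands back FATAL_ERROR
        // so the caller can return it directly.
        protected ExitCode reportFatalErrorCause(String msg, Throwable cause) {
            fatalErrorMessage = msg;
            fatalErrorCause = cause;
            return ExitCode.FATAL_ERROR;
        }

        protected ExitCode reportFatalErrorCause(Throwable t) {
            return reportFatalErrorCause(t.getMessage(), t);
        }

        String getFatalErrorMessage() {
            return fatalErrorMessage;
        }

        Throwable getFatalErrorCause() {
            return fatalErrorCause;
        }

        protected abstract ExitCode crawlObjects();
    }

    static final class FailingCrawler extends BaseCrawler {
        @Override
        protected ExitCode crawlObjects() {
            try {
                openSource();
                return ExitCode.COMPLETED;
            } catch (IOException e) {
                // Report and exit in one statement.
                return reportFatalErrorCause("could not open source", e);
            }
        }

        // Hypothetical helper that always fails, to exercise the error path.
        private void openSource() throws IOException {
            throw new IOException("disk unavailable");
        }
    }

    public static void main(String[] args) {
        FailingCrawler crawler = new FailingCrawler();
        System.out.println(crawler.crawlObjects() + ": " + crawler.getFatalErrorMessage());
    }
}
```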
protected RDFContainerFactory getRDFContainerFactory(String url)
protected void storeCrawlReport()
protected boolean inDomain(String uri)
public void runSubCrawler(SubCrawler localSubCrawler, DataObject object, InputStream stream, Charset charset, String mimeType) throws SubCrawlerException
Description copied from interface: Crawler
Runs the given SubCrawler on the given stream. Use this method rather than calling the SubCrawler.subCrawl(...) method directly; if subCrawl(...) is called directly, the Crawler may behave unpredictably. DataObjects found by the SubCrawler will be reported to the CrawlerHandler registered with this crawler with the Crawler.setCrawlerHandler(CrawlerHandler) method. The AccessData and the internal data structures of this crawler will be updated correctly. The SubCrawler will be stopped if the Crawler.stop() method is invoked on this crawler.
When this method is called from a CrawlerHandler method (e.g. CrawlerHandler.objectNew(Crawler, DataObject) or CrawlerHandler.objectChanged(Crawler, DataObject)) that has been invoked by a running crawler, it SHOULD be run on the same thread that called the CrawlerHandler method (i.e. the crawling thread). Trying to run this method in a new thread may result in unpredictable behavior of the Crawler.
This means that CrawlerHandler methods must be reentrant (see Wikipedia). It is recommended that processing any metadata in a DataObject takes place before the invocation of a SubCrawler.
Specified by:
runSubCrawler in interface Crawler
Parameters:
localSubCrawler - the subcrawler to be used
object - the parent data object; its metadata may be augmented by the SubCrawler
stream - the InputStream for the SubCrawler to work on. Note that even though there may be additional resources stored in the DataObject itself (like an InputStream or a File), they are not used.
charset - the charset in which the input stream is encoded (optional)
mimeType - the MIME type of the input stream (optional)
Throws:
SubCrawlerException - if an error occurs during subcrawling
See Also:
Crawler.runSubCrawler(SubCrawler, DataObject, InputStream, Charset, String)
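The reentrancy requirement is easier to see in a small model. In the sketch below, a handler's objectNew calls runSubCrawler on the crawling thread, which in turn calls objectNew again for each nested object while the outer call is still on the stack. Everything here is a hypothetical simplification for illustration: `MiniCrawler`, `MiniSubCrawler`, `Handler`, `ArchiveHandler`, `report()` and the `maxDepth` counter are invented, URLs replace DataObjects and streams, and the sub-crawler returns a list instead of reporting through a handler as the real SubCrawler is described as doing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, heavily simplified model; NOT the Aperture API.
final class SubCrawlReentrancySketch {
    interface Handler {
        void objectNew(MiniCrawler crawler, String url);
    }

    interface MiniSubCrawler {
        List<String> subCrawl(String parentUrl);
    }

    static final class MiniCrawler {
        private final Handler handler;
        private int depth;  // number of runSubCrawler calls currently on the stack
        int maxDepth;       // deepest nesting observed

        MiniCrawler(Handler handler) {
            this.handler = handler;
        }

        // Reports a top-level object to the handler.
        void report(String url) {
            handler.objectNew(this, url);
        }

        // Runs on the caller's thread and reports each child to the handler,
        // so the handler is re-entered while this call is still active.
        void runSubCrawler(MiniSubCrawler sub, String parentUrl) {
            depth++;
            maxDepth = Math.max(maxDepth, depth);
            try {
                for (String child : sub.subCrawl(parentUrl)) {
                    handler.objectNew(this, child);
                }
            } finally {
                depth--;
            }
        }
    }

    static final class ArchiveHandler implements Handler {
        final List<String> seen = new ArrayList<>();
        private final Map<String, List<String>> contents = new HashMap<>();
        private final MiniSubCrawler sub = parentUrl -> contents.get(parentUrl);

        ArchiveHandler() {
            contents.put("outer.zip", Arrays.asList("outer.zip!/a.txt", "outer.zip!/inner.zip"));
            contents.put("outer.zip!/inner.zip", Arrays.asList("outer.zip!/inner.zip!/b.txt"));
        }

        @Override
        public void objectNew(MiniCrawler crawler, String url) {
            // Process the object's own data first, then descend into it.
            seen.add(url);
            if (contents.containsKey(url)) {
                crawler.runSubCrawler(sub, url);  // same thread, nested call
            }
        }
    }

    public static void main(String[] args) {
        ArchiveHandler handler = new ArchiveHandler();
        MiniCrawler crawler = new MiniCrawler(handler);
        crawler.report("outer.zip");
        System.out.println(handler.seen + " maxDepth=" + crawler.maxDepth);
    }
}
```

The handler keeps no per-call state in fields that a nested call could overwrite, which is the property the reentrancy note asks for. Recording the object before descending follows the recommendation to process metadata before invoking a SubCrawler.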