java.lang.Object
  org.semanticdesktop.aperture.subcrawler.base.ThreadedSubCrawlerWrapper
public class ThreadedSubCrawlerWrapper
A ThreadedSubCrawlerWrapper wraps a SubCrawler and executes it on a separate thread, bailing out if
the wrapped SubCrawler appears to be hanging. The heuristic for determining whether the SubCrawler
is hanging is whether the InputStream is accessed regularly. Any
ThreadedSubCrawlerWrapper.SubCrawlingAbortedExceptions thrown by the wrapped SubCrawler are
eventually thrown by the ThreadedSubCrawlerWrapper.
Furthermore, a ThreadedSubCrawlerWrapper can be requested to stop processing, causing it to
throw an IOException on the InputStream the next time it is accessed by the wrapped SubCrawler.
This allows for interrupting the subcrawling process upon user request, for example because it
has been processing a single file for a very long time. This implementation strategy is preferred
over interrupting the Thread, as that should only be used as a last resort to stop a thread.
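The stream-based heuristic and the stop request described above can be illustrated with a small stand-alone sketch. It is not the Aperture implementation: the class name AccessTrackingStream and its methods requestStop() and idleMillis() are hypothetical, chosen only to show how a stream wrapper might record read accesses and fail the next read once a stop has been requested.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration only; not the Aperture StreamMonitor.
class AccessTrackingStream extends FilterInputStream {
    // Time of the most recent read, visible to a monitoring thread.
    private volatile long lastAccess = System.currentTimeMillis();
    // Set by another thread to make the next read fail.
    private volatile boolean stopRequested = false;

    AccessTrackingStream(InputStream in) {
        super(in);
    }

    /** Asks the stream to throw an IOException on its next read. */
    void requestStop() {
        stopRequested = true;
    }

    /** Milliseconds since the consumer last read from the stream. */
    long idleMillis() {
        return System.currentTimeMillis() - lastAccess;
    }

    // Called before every read: fail if stopped, otherwise record the access.
    private void touch() throws IOException {
        if (stopRequested) {
            throw new IOException("processing stopped on request");
        }
        lastAccess = System.currentTimeMillis();
    }

    @Override
    public int read() throws IOException {
        touch();
        return super.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        touch();
        return super.read(b, off, len);
    }
}
```

A monitoring thread could poll idleMillis() and treat a value above the configured idle limit as a sign that the consumer is hanging, while a user-initiated stop only needs to call requestStop() and wait for the consumer's next read.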
The class differs from ThreadedExtractorWrapper in that it suspends the monitoring when control is
passed to the SubCrawlerHandler. This allows subcrawlers to be nested arbitrarily. If the user has
many levels of subcrawlers (a zip within a zip within a zip within a tar.gz attached to a mail, etc.),
a timeout on the lowest level won't break an otherwise perfectly OK crawling process. See
StreamMonitor.suspendMonitoring() and StreamMonitor.resumeMonitoring() for more details.
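To make the effect of suspending the monitoring concrete, here is a minimal sketch of an idle monitor that ignores time spent while suspended. It is an assumption about the general idea, not the real StreamMonitor: the class SuspendableIdleMonitor is hypothetical, and its methods take explicit timestamps (unlike the no-argument methods named above) so that the behaviour is deterministic.

```java
// Hypothetical illustration only; the real StreamMonitor may work differently.
class SuspendableIdleMonitor {
    private final long maxIdleMillis;
    private long lastActivity;
    private boolean suspended = false;

    SuspendableIdleMonitor(long maxIdleMillis, long now) {
        this.maxIdleMillis = maxIdleMillis;
        this.lastActivity = now;
    }

    /** Records activity, for example a read on the monitored stream. */
    synchronized void touch(long now) {
        lastActivity = now;
    }

    /** Called when control passes to the handler (a nested subcrawler may run). */
    synchronized void suspendMonitoring() {
        suspended = true;
    }

    /** Called when control returns; the idle clock restarts from 'now'. */
    synchronized void resumeMonitoring(long now) {
        suspended = false;
        lastActivity = now;
    }

    /** True only if monitoring is active and the idle limit has been exceeded. */
    synchronized boolean isHanging(long now) {
        return !suspended && now - lastActivity > maxIdleMillis;
    }
}
```

With this shape, time spent inside a nested subcrawler never counts against the outer one, because the outer monitor is suspended for that whole period and its clock restarts on resume.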
Nested Class Summary | |
---|---|
static class | ThreadedSubCrawlerWrapper.SubCrawlingAbortedException: An exception that gets thrown if the underlying SubCrawler hangs. |
static class | ThreadedSubCrawlerWrapper.SubCrawlingInterruptedException: An exception that gets thrown if the subcrawling is interrupted per user request, i.e. when the stop() method is called. |
Field Summary | |
---|---|
static long | DEFAULT_MAX_IDLE_READ_TIME: The maximum time between two reads that the wrapped SubCrawler is allowed to work on the read data before it is considered to be hanging. |
static long | DEFAULT_MAX_PROCESSING_TIME_PER_MB: The maximum time per MB of data that the wrapped SubCrawler is allowed to work on the read data before it is considered to be hanging, in milliseconds. |
static long | DEFAULT_MINIMUM_MAX_PROCESSING_TIME: The minimum maximum processing time that the wrapped SubCrawler is allowed to work on the read data before it is considered to be hanging. |
Constructor Summary | |
---|---|
ThreadedSubCrawlerWrapper(SubCrawler subcrawler) | Creates a new wrapper for the specified SubCrawler. |
ThreadedSubCrawlerWrapper(SubCrawler subCrawler, long maxProcessingTimePerMb, long minimumMaxProcessingTime, long maxIdleReadTime) | Creates a new wrapper for the specified SubCrawler. |
Method Summary | |
---|---|
void | stop(): Interrupts processing of the wrapped SubCrawler as soon as possible. |
void | subCrawl(URI id, InputStream input, SubCrawlerHandler handler, DataSource dataSource, AccessData accessData, Charset charset, String mimeType, RDFContainer parentMetadata): Starts the subcrawling process using the wrapped SubCrawler on a separate thread. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final long DEFAULT_MAX_PROCESSING_TIME_PER_MB
The maximum time per MB of data that the wrapped SubCrawler is allowed to work on the read data
before it is considered to be hanging, in milliseconds.
public static final long DEFAULT_MINIMUM_MAX_PROCESSING_TIME
The minimum maximum processing time that the wrapped SubCrawler is allowed to work on the read data
before it is considered to be hanging. This minimum gives a lower bound for very small files.
public static final long DEFAULT_MAX_IDLE_READ_TIME
The maximum time between two reads that the wrapped SubCrawler is allowed to work on the read data
before it is considered to be hanging.
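The documentation does not spell out how the per-MB limit and the minimum are combined. One plausible reading, given that the minimum is described as a lower bound for very small files, is sketched below. The class TimeoutSketch, the method maxProcessingTime and the numbers in the usage note are hypothetical and are not taken from Aperture.

```java
// Hypothetical illustration; the actual formula used by the wrapper is not documented here.
class TimeoutSketch {
    private static final double BYTES_PER_MB = 1024.0 * 1024.0;

    /**
     * Returns an allowed processing time in milliseconds: the per-MB limit
     * scaled by the amount of data, but never below the given minimum.
     */
    static long maxProcessingTime(long bytesRead, long perMbMillis, long minimumMillis) {
        long scaled = (long) Math.ceil(bytesRead / BYTES_PER_MB * perMbMillis);
        return Math.max(minimumMillis, scaled);
    }
}
```

Under this assumption, with a per-MB limit of 1000 ms and a minimum of 5000 ms, a 10 MB input would be allowed 10000 ms, while a 1 KB input would still be allowed the 5000 ms minimum.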
Constructor Detail |
---|
public ThreadedSubCrawlerWrapper(SubCrawler subcrawler)
Creates a new wrapper for the specified SubCrawler. It uses default timeout values. It is
equivalent to
new ThreadedSubCrawlerWrapper(subcrawler, DEFAULT_MAX_PROCESSING_TIME_PER_MB, DEFAULT_MINIMUM_MAX_PROCESSING_TIME, DEFAULT_MAX_IDLE_READ_TIME);
Parameters:
subcrawler - The SubCrawler to wrap.
See Also:
DEFAULT_MAX_PROCESSING_TIME_PER_MB, DEFAULT_MINIMUM_MAX_PROCESSING_TIME, DEFAULT_MAX_IDLE_READ_TIME
public ThreadedSubCrawlerWrapper(SubCrawler subCrawler, long maxProcessingTimePerMb, long minimumMaxProcessingTime, long maxIdleReadTime)
Creates a new wrapper for the specified SubCrawler. It allows the user to customize the timeout
values.
Parameters:
subCrawler - The SubCrawler to wrap.
maxProcessingTimePerMb - see DEFAULT_MAX_PROCESSING_TIME_PER_MB
minimumMaxProcessingTime - see DEFAULT_MINIMUM_MAX_PROCESSING_TIME
maxIdleReadTime - see DEFAULT_MAX_IDLE_READ_TIME
Method Detail |
---|
public void stop()
Interrupts processing of the wrapped SubCrawler as soon as possible.
public void subCrawl(URI id, InputStream input, SubCrawlerHandler handler, DataSource dataSource, AccessData accessData, Charset charset, String mimeType, RDFContainer parentMetadata) throws SubCrawlerException
Starts the subcrawling process using the wrapped SubCrawler on a separate thread. This Thread is
interrupted as soon as no progress is reported. In this case a
ThreadedSubCrawlerWrapper.SubCrawlingAbortedException will be thrown.
Throws:
SubCrawlerException - if any problem with the subcrawler occurs; this is exactly the same Exception
instance as the one thrown by the subcrawler.
ThreadedSubCrawlerWrapper.SubCrawlingAbortedException - if the wrapper decided that the subcrawler
has stalled and the subcrawling has been aborted
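The behaviour described for subCrawl (run on a separate thread, abort when no progress is reported, rethrow the worker's own exception instance) follows a common watchdog pattern, which might be sketched as below. This is a generic illustration under stated assumptions, not the Aperture source: WatchdogRunner, Task, AbortedException and runWatched are hypothetical names, and progress is reported through a shared timestamp rather than through a monitored InputStream.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration of the watchdog pattern; not the Aperture implementation.
class WatchdogRunner {
    /** Thrown when the task reports no progress for too long. */
    static class AbortedException extends Exception {
        AbortedException(String message) {
            super(message);
        }
    }

    /** A task that reports progress by updating the shared timestamp. */
    interface Task {
        void run(AtomicLong lastProgress) throws Exception;
    }

    static void runWatched(Task task, long maxIdleMillis) throws Exception {
        AtomicLong lastProgress = new AtomicLong(System.currentTimeMillis());
        AtomicReference<Exception> failure = new AtomicReference<>();

        // Run the task on its own thread and remember any exception it throws.
        Thread worker = new Thread(() -> {
            try {
                task.run(lastProgress);
            } catch (Exception e) {
                failure.set(e);
            }
        });
        worker.setDaemon(true);
        worker.start();

        // Poll until the worker finishes or stops reporting progress.
        while (worker.isAlive()) {
            worker.join(10);
            long idle = System.currentTimeMillis() - lastProgress.get();
            if (worker.isAlive() && idle > maxIdleMillis) {
                worker.interrupt();
                throw new AbortedException("no progress for more than " + maxIdleMillis + " ms");
            }
        }

        // Rethrow the very same exception instance the task threw, if any.
        Exception thrown = failure.get();
        if (thrown != null) {
            throw thrown;
        }
    }
}
```

A caller passes the work as a lambda that updates the timestamp whenever it makes progress; a task that sleeps without updating it is interrupted and reported through AbortedException, while a task that fails on its own has its exception passed through unchanged.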