org.semanticdesktop.aperture.subcrawler.base
Class ThreadedSubCrawlerWrapper

java.lang.Object
  extended by org.semanticdesktop.aperture.subcrawler.base.ThreadedSubCrawlerWrapper

public class ThreadedSubCrawlerWrapper
extends Object

A ThreadedSubCrawlerWrapper wraps a SubCrawler and executes it on a separate thread, bailing out if the wrapped SubCrawler appears to be hanging. The heuristic for detecting a hang is whether the InputStream is accessed regularly. Any SubCrawlerExceptions thrown by the wrapped SubCrawler are eventually rethrown by the ThreadedSubCrawlerWrapper.
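
The idle-read heuristic described above can be illustrated with a small standalone sketch. It is not Aperture's implementation: AccessTrackingStream and IdleWatchdog are hypothetical names, and the 10 ms polling interval is an arbitrary assumption. It only shows the general idea of running work on a separate thread and giving up when the stream has not been read for too long.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper (not part of Aperture): remembers when the stream was last read. */
class AccessTrackingStream extends FilterInputStream {
    private volatile long lastAccess = System.currentTimeMillis();

    AccessTrackingStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        lastAccess = System.currentTimeMillis();
        return super.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        lastAccess = System.currentTimeMillis();
        return super.read(b, off, len);
    }

    /** Milliseconds since the consumer last touched the stream. */
    long idleMillis() {
        return System.currentTimeMillis() - lastAccess;
    }
}

/** Hypothetical watchdog: runs a task on its own thread and watches the stream for inactivity. */
final class IdleWatchdog {
    /**
     * Returns true if the task finished, false if the stream stayed idle longer
     * than maxIdleMillis (or the calling thread was interrupted while waiting).
     */
    static boolean runWithIdleLimit(Runnable task, AccessTrackingStream stream, long maxIdleMillis) {
        Thread worker = new Thread(task, "subcrawl-worker");
        worker.setDaemon(true); // a hung worker must not keep the JVM alive
        worker.start();
        while (worker.isAlive()) {
            try {
                worker.join(10); // poll every 10 ms (arbitrary choice)
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            if (worker.isAlive() && stream.idleMillis() > maxIdleMillis) {
                return false; // no recent reads: treat the task as hanging
            }
        }
        return true;
    }
}
```

In this sketch the caller decides what to do when false is returned; the wrapper documented here reports that situation by throwing ThreadedSubCrawlerWrapper.SubCrawlingAbortedException.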

Furthermore, a ThreadedSubCrawlerWrapper can be requested to stop processing, causing it to throw an IOException on the InputStream the next time the wrapped SubCrawler accesses it. This allows the subcrawling process to be interrupted upon user request, for example because it has been processing a single file for a very long time. This implementation strategy is preferred over interrupting the Thread, which should only be used as a last resort.
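
The stop-by-IOException strategy can be sketched as follows. This is an illustration under assumptions, not the wrapper's actual code: StoppableStream and requestStop() are hypothetical names. The point is that the consumer is stopped cooperatively at its next read instead of through Thread.interrupt().

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch (not Aperture's code): a stream that fails once a stop was requested. */
class StoppableStream extends FilterInputStream {
    private final AtomicBoolean stopRequested = new AtomicBoolean(false);

    StoppableStream(InputStream in) {
        super(in);
    }

    /** May be called from any thread, e.g. in response to a user clicking "cancel". */
    void requestStop() {
        stopRequested.set(true);
    }

    private void checkStop() throws IOException {
        if (stopRequested.get()) {
            throw new IOException("processing stopped on request");
        }
    }

    @Override
    public int read() throws IOException {
        checkStop();
        return super.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        checkStop();
        return super.read(b, off, len);
    }
}
```

A consumer that is busy without reading will not notice the request until its next access to the stream, which matches the "next time it is accessed" wording above.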

The class differs from ThreadedExtractorWrapper in that it suspends the monitoring when control is passed to the SubCrawlerHandler. This allows subcrawlers to be nested arbitrarily. If the user has many levels of subcrawlers (a zip within a zip within a zip within a tar.gz attached to a mail, etc.), a timeout on the lowest level won't break an otherwise perfectly OK crawling process. See StreamMonitor.suspendMonitoring() and StreamMonitor.resumeMonitoring() for more details.
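
Why suspending the monitoring matters for nesting can be shown with a toy model. PausableIdleClock is a hypothetical class and StreamMonitor may work quite differently; the sketch only assumes that time spent while monitoring is suspended does not count as idle time. Timestamps are passed in explicitly to keep the example deterministic.

```java
/** Hypothetical illustration only; Aperture's StreamMonitor may be implemented differently. */
class PausableIdleClock {
    private long lastAccess;
    private boolean suspended;

    PausableIdleClock(long now) {
        this.lastAccess = now;
    }

    /** Called whenever the monitored stream is read. */
    synchronized void touch(long now) {
        lastAccess = now;
    }

    /** Called before control passes to a handler that may run nested work. */
    synchronized void suspend() {
        suspended = true;
    }

    /** Called when control returns; the idle period restarts from this moment. */
    synchronized void resume(long now) {
        suspended = false;
        lastAccess = now;
    }

    /** True only if monitoring is active and the idle limit has been exceeded. */
    synchronized boolean isHanging(long now, long maxIdleMillis) {
        return !suspended && now - lastAccess > maxIdleMillis;
    }
}
```

With this model, an outer subcrawler that hands an entry to its handler at time 0 and gets control back at time 10000 is not reported as hanging during that interval, even with an idle limit of 50.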


Nested Class Summary
static class ThreadedSubCrawlerWrapper.SubCrawlingAbortedException
          An exception that gets thrown if the underlying SubCrawler hangs.
static class ThreadedSubCrawlerWrapper.SubCrawlingInterruptedException
          An exception that gets thrown if the subcrawling is interrupted at the user's request, i.e. when the stop() method is called.
 
Field Summary
static long DEFAULT_MAX_IDLE_READ_TIME
          The maximum time between two reads that the wrapped SubCrawler is allowed to work on the read data before it is considered to be hanging.
static long DEFAULT_MAX_PROCESSING_TIME_PER_MB
          The maximum time per MB of data that the wrapped SubCrawler is allowed to work on the read data before it is considered to be hanging, in milliseconds.
static long DEFAULT_MINIMUM_MAX_PROCESSING_TIME
          The minimum maximum processing time that the wrapped SubCrawler is allowed to work on the read data before it is considered to be hanging.
 
Constructor Summary
ThreadedSubCrawlerWrapper(SubCrawler subcrawler)
          Creates a new wrapper for the specified SubCrawler.
ThreadedSubCrawlerWrapper(SubCrawler subCrawler, long maxProcessingTimePerMb, long minimumMaxProcessingTime, long maxIdleReadTime)
          Creates a new wrapper for the specified SubCrawler.
 
Method Summary
 void stop()
          Interrupts processing of the wrapped SubCrawler as soon as possible.
 void subCrawl(URI id, InputStream input, SubCrawlerHandler handler, DataSource dataSource, AccessData accessData, Charset charset, String mimeType, RDFContainer parentMetadata)
          Starts the subcrawling process using the wrapped SubCrawler on a separate thread.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DEFAULT_MAX_PROCESSING_TIME_PER_MB

public static final long DEFAULT_MAX_PROCESSING_TIME_PER_MB
The maximum time per MB of data that the wrapped SubCrawler is allowed to work on the read data before it is considered to be hanging, in milliseconds.

See Also:
Constant Field Values

DEFAULT_MINIMUM_MAX_PROCESSING_TIME

public static final long DEFAULT_MINIMUM_MAX_PROCESSING_TIME
The minimum maximum processing time that the wrapped SubCrawler is allowed to work on the read data before it is considered to be hanging. This minimum gives a lower bound for very small files.

See Also:
Constant Field Values
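
One plausible way the per-MB limit and the minimum combine is sketched below. The formula is an assumption inferred from the field descriptions, not taken from the Aperture source: ProcessingBudget is a hypothetical class, 1 MB is assumed to be 1024 * 1024 bytes, and the numbers in the usage note are made up rather than the real default values.

```java
/** Hypothetical sketch of combining a per-MB time limit with a lower bound. */
final class ProcessingBudget {
    private static final long BYTES_PER_MB = 1024L * 1024L;

    /**
     * Returns the allowed processing time in milliseconds: proportional to the
     * amount of data read, but never less than minimumMaxMillis.
     */
    static long maxProcessingMillis(long bytesRead, long maxMillisPerMb, long minimumMaxMillis) {
        long scaled = bytesRead * maxMillisPerMb / BYTES_PER_MB;
        return Math.max(minimumMaxMillis, scaled);
    }
}
```

For example, with a hypothetical 1000 ms per MB and a 5000 ms minimum, a 10 MB stream would get 10000 ms, while a 1 KB stream would still get the 5000 ms minimum.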

DEFAULT_MAX_IDLE_READ_TIME

public static final long DEFAULT_MAX_IDLE_READ_TIME
The maximum time between two reads that the wrapped SubCrawler is allowed to work on the read data before it is considered to be hanging.

See Also:
Constant Field Values
Constructor Detail

ThreadedSubCrawlerWrapper

public ThreadedSubCrawlerWrapper(SubCrawler subcrawler)
Creates a new wrapper for the specified SubCrawler. It uses default timeout values. It is equivalent to
 new ThreadedSubCrawlerWrapper(subcrawler, DEFAULT_MAX_PROCESSING_TIME_PER_MB,
         DEFAULT_MINIMUM_MAX_PROCESSING_TIME, DEFAULT_MAX_IDLE_READ_TIME);
 

Parameters:
subcrawler - The SubCrawler to wrap.
See Also:
DEFAULT_MAX_PROCESSING_TIME_PER_MB, DEFAULT_MINIMUM_MAX_PROCESSING_TIME, DEFAULT_MAX_IDLE_READ_TIME

ThreadedSubCrawlerWrapper

public ThreadedSubCrawlerWrapper(SubCrawler subCrawler,
                                 long maxProcessingTimePerMb,
                                 long minimumMaxProcessingTime,
                                 long maxIdleReadTime)
Creates a new wrapper for the specified SubCrawler. It allows the user to customize the timeout values.

Parameters:
subCrawler - The SubCrawler to wrap.
maxProcessingTimePerMb - see DEFAULT_MAX_PROCESSING_TIME_PER_MB
minimumMaxProcessingTime - see DEFAULT_MINIMUM_MAX_PROCESSING_TIME
maxIdleReadTime - see DEFAULT_MAX_IDLE_READ_TIME
Method Detail

stop

public void stop()
Interrupts processing of the wrapped SubCrawler as soon as possible.


subCrawl

public void subCrawl(URI id,
                     InputStream input,
                     SubCrawlerHandler handler,
                     DataSource dataSource,
                     AccessData accessData,
                     Charset charset,
                     String mimeType,
                     RDFContainer parentMetadata)
              throws SubCrawlerException
Starts the subcrawling process using the wrapped SubCrawler on a separate thread. This Thread is interrupted as soon as the wrapped SubCrawler stops reporting progress. In this case a ThreadedSubCrawlerWrapper.SubCrawlingAbortedException will be thrown.

Throws:
SubCrawlerException - if any problem with the wrapped SubCrawler occurs; this is exactly the same Exception instance as the one thrown by the wrapped SubCrawler.
ThreadedSubCrawlerWrapper.SubCrawlingAbortedException - if the wrapper decided that the wrapped SubCrawler has stalled and the subcrawling has been aborted.


Copyright © 2010 Aperture Development Team. All Rights Reserved.