org.semanticdesktop.aperture.subcrawler.base
Class AbstractArchiverSubCrawler

java.lang.Object
  extended by org.semanticdesktop.aperture.subcrawler.base.AbstractSubCrawler
      extended by org.semanticdesktop.aperture.subcrawler.base.AbstractArchiverSubCrawler
All Implemented Interfaces:
SubCrawler
Direct Known Subclasses:
TarSubCrawler, ZipSubCrawler

public abstract class AbstractArchiverSubCrawler
extends AbstractSubCrawler

A SubCrawler implementation that works with archive files, i.e. files containing a number of other files. It is intended as an abstraction over common archive formats (ZIP, TAR, etc.).


Nested Class Summary
protected static class AbstractArchiverSubCrawler.ArchiveEntry
          Encapsulates an archive entry
protected static class AbstractArchiverSubCrawler.ArchiveInputStream
          An input stream encapsulating an archive stream with compressed data
 
Field Summary
static int MAX_ZIP_BOMB_REPEAT_COUNT
          The maximum number of times a path may repeat in the URI of a resource before it is considered a zip bomb of the kind exemplified by the well-known droste.zip.
 
Constructor Summary
AbstractArchiverSubCrawler()
           
 
Method Summary
protected abstract  AbstractArchiverSubCrawler.ArchiveInputStream getArchiveInputStream(InputStream compressedStream)
           
 DataObject getDataObject(URI parentUri, String path, InputStream stream, DataSource dataSource, Charset charset, String mimeType, RDFContainerFactory factory)
          Get a DataObject from the specified stream with the given path.
 void stopSubCrawler()
          Stops a running crawl as fast as possible.
 void subCrawl(URI id, InputStream stream, SubCrawlerHandler handler, DataSource dataSource, AccessData accessData, Charset charset, String mimeType, RDFContainer parentMetadata)
          Starts crawling the given stream and reports the encountered DataObjects to the given SubCrawlerHandler.
 
Methods inherited from class org.semanticdesktop.aperture.subcrawler.base.AbstractSubCrawler
createChildUri, getUriPrefix
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MAX_ZIP_BOMB_REPEAT_COUNT

public static final int MAX_ZIP_BOMB_REPEAT_COUNT
The maximum number of times a path may repeat in the URI of a resource before it is considered a zip bomb of the kind exemplified by the well-known droste.zip.

See Also:
Constant Field Values
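
The detection idea behind this constant can be pictured with the following sketch. This is an illustration of the heuristic only, not Aperture's actual code; the helper name looksLikeZipBomb is hypothetical.

    // Illustrative sketch only: detect a droste.zip-style recursive archive by counting
    // how often the same path segment repeats inside a resource URI and comparing the
    // count against a limit such as MAX_ZIP_BOMB_REPEAT_COUNT.
    static boolean looksLikeZipBomb(String resourceUri, String entryPath, int maxRepeatCount) {
        if (entryPath == null || entryPath.isEmpty()) {
            return false;
        }
        int repeats = 0;
        int index = resourceUri.indexOf(entryPath);
        while (index != -1) {
            repeats++;
            index = resourceUri.indexOf(entryPath, index + entryPath.length());
        }
        return repeats > maxRepeatCount;
    }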
Constructor Detail

AbstractArchiverSubCrawler

public AbstractArchiverSubCrawler()
Method Detail

getArchiveInputStream

protected abstract AbstractArchiverSubCrawler.ArchiveInputStream getArchiveInputStream(InputStream compressedStream)
Parameters:
compressedStream - the stream with the compressed archive data
Returns:
an ArchiveInputStream encapsulating the given compressed stream
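
For illustration, the sketch below shows the kind of entry-by-entry decoding a concrete implementation wraps. A ZIP-based subclass such as ZipSubCrawler would adapt java.util.zip.ZipInputStream into an AbstractArchiverSubCrawler.ArchiveInputStream; the adapter itself is not shown because the abstract methods of ArchiveInputStream are not documented on this page, and the class ZipEntryLister is purely hypothetical.

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    // Hypothetical helper, for illustration only: enumerates the entries of a ZIP stream,
    // which is the decoding work a ZIP-based getArchiveInputStream implementation delegates to.
    class ZipEntryLister {
        static void listEntries(InputStream compressedStream) throws IOException {
            ZipInputStream zip = new ZipInputStream(compressedStream);
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // during subCrawl, each such entry corresponds to an ArchiveEntry and is
                // reported as a DataObject to the SubCrawlerHandler
                System.out.println(entry.getName());
            }
        }
    }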

subCrawl

public void subCrawl(URI id,
                     InputStream stream,
                     SubCrawlerHandler handler,
                     DataSource dataSource,
                     AccessData accessData,
                     Charset charset,
                     String mimeType,
                     RDFContainer parentMetadata)
              throws SubCrawlerException
Description copied from interface: SubCrawler
Starts crawling the given stream and reports the encountered DataObjects to the given SubCrawlerHandler. If an AccessData instance is passed, it is used to check whether the data objects are to be reported as new, modified, or unmodified. Note that the SubCrawler will not report deleted objects.

Parameters:
id - the URI identifying the object (e.g. a file or web page) from which the stream was obtained. This URI is treated as the URI of the parent object; all objects encountered in the stream are considered to be contained within the parent object. (optional; the implementation may use this URI or the one returned by the RDFContainer.getDescribedUri() method of the parentMetadata)
stream - the stream to be crawled. (obligatory)
handler - The crawler handler that is to receive the notifications from the SubCrawler (obligatory)
dataSource - the data source that will be returned by the DataObject.getDataSource() method of the returned data objects. Some implementations may require that this reference is not null and that it contains some particular information
accessData - the AccessData used to determine whether the encountered objects are to be reported as new, modified, unmodified or deleted. Information about new or modified objects is stored within it for use in future crawls. This parameter may be null if this functionality is not desired, in which case all DataObjects will be reported as new. (optional)
charset - the charset in which the input stream is encoded (optional).
mimeType - the MIME type of the passed stream (optional).
parentMetadata - the 'parent' RDFContainer that will contain the metadata about the top-level entity in the stream. A SubCrawler may (in some cases) limit itself to augmenting the metadata in this RDFContainer without delivering any additional DataObjects. (obligatory)
Throws:
SubCrawlerException - if any of the obligatory parameters is null or if any error occurs during the crawling process
See Also:
SubCrawler.subCrawl(URI, InputStream, SubCrawlerHandler, DataSource, AccessData, Charset, String, RDFContainer)
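
As a minimal usage sketch, the call below follows the signature documented above. The handler, data source, access data and parent metadata are assumed to be supplied by the surrounding application, the helper method crawlZipArchive is hypothetical, and imports of the Aperture types are omitted because their package locations are not shown on this page. Passing null for accessData would cause every entry to be reported as new, per the contract above.

    // Hypothetical helper, for illustration only: crawls a ZIP stream with a concrete
    // archiver subclass. charset is passed as null (optional) and the MIME type hints
    // at the archive format.
    void crawlZipArchive(AbstractArchiverSubCrawler subCrawler,
                         URI archiveUri,
                         InputStream archiveStream,
                         SubCrawlerHandler handler,
                         DataSource dataSource,
                         AccessData accessData,
                         RDFContainer parentMetadata) throws SubCrawlerException {
        subCrawler.subCrawl(archiveUri, archiveStream, handler, dataSource,
                            accessData, null, "application/zip", parentMetadata);
    }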

getDataObject

public DataObject getDataObject(URI parentUri,
                                String path,
                                InputStream stream,
                                DataSource dataSource,
                                Charset charset,
                                String mimeType,
                                RDFContainerFactory factory)
                         throws SubCrawlerException,
                                PathNotFoundException
Description copied from interface: SubCrawler
Get a DataObject from the specified stream with the given path.

Specified by:
getDataObject in interface SubCrawler
Overrides:
getDataObject in class AbstractSubCrawler
Parameters:
parentUri - the URI of the parent object where the path will be looked for
path - the path of the requested resource
stream - the stream that contains the resource
dataSource - the data source that will be returned by the DataObject.getDataSource() method of the returned data object. Some implementations may require that this reference is not null and that it contains some particular information
charset - the charset in which the input stream is encoded (optional).
mimeType - the MIME type of the passed stream (optional).
factory - An RDFContainerFactory that delivers the RDFContainer to which the metadata of the DataObject should be added. The provided RDFContainer can later be retrieved as the DataObject's metadata container.
Returns:
The DataObject extracted from the given stream with the given path
Throws:
SubCrawlerException - if any I/O error occurs
PathNotFoundException - if the requested path is not found
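
A minimal sketch of resolving a single archive entry by its path, following the signature above. The helper method fetchEntry is hypothetical, the DataSource and RDFContainerFactory instances are assumed to come from the calling application, and imports of the Aperture types are omitted because their package locations are not shown on this page.

    // Hypothetical helper, for illustration only: extracts one archive entry by its path.
    // charset and mimeType are optional and passed as null here.
    DataObject fetchEntry(AbstractArchiverSubCrawler subCrawler,
                          URI archiveUri,
                          String entryPath,
                          InputStream archiveStream,
                          DataSource dataSource,
                          RDFContainerFactory factory)
            throws SubCrawlerException, PathNotFoundException {
        return subCrawler.getDataObject(archiveUri, entryPath, archiveStream,
                                        dataSource, null, null, factory);
    }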

stopSubCrawler

public void stopSubCrawler()
Description copied from interface: SubCrawler
Stops a running crawl as fast as possible. This method may return before the crawling has actually stopped.
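
Because the method may return before crawling has actually stopped, it is typically invoked from a thread other than the one running subCrawl. The sketch below is illustrative only; the helper method requestStop is hypothetical.

    // Hypothetical helper, for illustration only: asks a running crawl to stop from
    // another thread. stopSubCrawler() may return before the crawl has fully terminated.
    void requestStop(final AbstractArchiverSubCrawler subCrawler) {
        new Thread(new Runnable() {
            public void run() {
                subCrawler.stopSubCrawler();
            }
        }).start();
    }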



Copyright © 2010 Aperture Development Team. All Rights Reserved.