PlainTextExtractor (Aperture Core 1.5.0 API)

org.semanticdesktop.aperture.extractor.plaintext
Class PlainTextExtractor

java.lang.Object
  org.semanticdesktop.aperture.extractor.plaintext.PlainTextExtractor

All Implemented Interfaces: Extractor

public class PlainTextExtractor
extends Object
implements Extractor

Field Summary
`static int`	`BYTES_TEST_LENGTH`

Constructor Summary
`PlainTextExtractor()`

Method Summary
`void`	`extract(URI id, InputStream iStream, Charset charset, String mimeType, RDFContainer result)` Extracts full-text and metadata from the specified binary stream and stores the extracted information as RDF statements in the specified RDFContainer.
`static Charset`	`guessCharset(InputStream iStream)` Tries to guess the charset of the input stream.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

BYTES_TEST_LENGTH

public static final int BYTES_TEST_LENGTH

See Also: Constant Field Values

Constructor Detail

PlainTextExtractor

public PlainTextExtractor()

Method Detail

guessCharset

public static Charset guessCharset(InputStream iStream)
                            throws IOException

Tries to guess the charset of the input stream, assuming that the stream contains plain text in some encoding. The method first looks for a Byte Order Mark (BOM); if one is present, it allows a highly probable guess of the Unicode encoding. If no BOM is found, the method falls back to a heuristic approach (as implemented by the juniversalchardet library).

Parameters: iStream - the stream to examine. It must support InputStream.mark(int): the method first calls InputStream.mark(int) and, once it has finished reading, resets the stream with InputStream.reset().
Returns: the guessed charset, or null if the algorithm could not decide on one. The guess is heuristic, so a non-null return value may still be wrong, especially if the document is shorter than BYTES_TEST_LENGTH.
Throws: IOException
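
The BOM step described above can be illustrated with a minimal, JDK-only sketch. This is not the Aperture implementation: the class `BomSniffer`, the method `charsetFromBom` and the constant `MAX_BOM_LENGTH` are hypothetical names introduced for this example, only UTF-8 and UTF-16 marks are checked, and the heuristic fallback is left out. It shows the mark/reset contract that the parameter description requires.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Aperture.
final class BomSniffer {

    // Assumption: three bytes are enough for the BOMs checked here.
    // The real BYTES_TEST_LENGTH value is not stated on this page.
    static final int MAX_BOM_LENGTH = 3;

    /** Returns the charset implied by a leading BOM, or null if none is recognized. */
    static Charset charsetFromBom(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(MAX_BOM_LENGTH);                 // remember the current position
        byte[] head = new byte[MAX_BOM_LENGTH];
        int filled = 0;
        while (filled < head.length) {           // read() may return fewer bytes than asked
            int n = in.read(head, filled, head.length - filled);
            if (n < 0) {
                break;                           // end of stream
            }
            filled += n;
        }
        in.reset();                              // leave the stream untouched for the caller

        if (filled >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (filled >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        // Note: a UTF-32LE BOM also starts with FF FE; this sketch does not distinguish it.
        if (filled >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null;                             // no BOM: a heuristic would take over here
    }
}
```

A caller holding a plain FileInputStream, which does not support mark, would typically wrap it in a BufferedInputStream before passing it to a method with this contract.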

extract

public void extract(URI id,
                    InputStream iStream,
                    Charset charset,
                    String mimeType,
                    RDFContainer result)
             throws ExtractorException

Description copied from interface: Extractor

Extracts full-text and metadata from the specified binary stream and stores the extracted information as RDF statements in the specified RDFContainer. The optionally specified Charset and MIME type can be used to direct how the stream should be parsed.

The specified InputStream is expected to already use some kind of buffering so that the Extractors are not required to internally buffer bytes to improve performance.

Specified by: extract in interface Extractor

Parameters:
  id - the URI identifying the object (e.g. a file or web page) from which the stream was obtained. The generated statements should describe this URI.
  iStream - the InputStream delivering the raw bytes.
  charset - the charset in which the InputStream is encoded (optional).
  mimeType - the MIME type of the passed stream (optional).
  result - the container in which this Extractor can put its created RDF statements.
Throws: ExtractorException - in case of any error during the extraction process.
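
How an optional charset might steer the decoding of a plain-text stream can be sketched with the JDK alone. This does not reproduce the body of extract: the class `FullTextReader` and the method `readFullText` are hypothetical, the RDFContainer step is omitted, and the UTF-8 fallback for a missing charset is an assumption made for this example only. In line with the description above, the stream is assumed to be buffered by the caller already, so no extra buffering is added.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Aperture.
final class FullTextReader {

    /** Decodes the whole stream into a String, honoring an optional charset. */
    static String readFullText(InputStream in, Charset charset) throws IOException {
        // Assumption: fall back to UTF-8 when the caller passes no charset.
        Charset effective = (charset != null) ? charset : StandardCharsets.UTF_8;
        Reader reader = new InputStreamReader(in, effective);
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {  // -1 signals end of stream
            text.append(buffer, 0, n);
        }
        return text.toString();
    }
}
```

In a real caller, the charset argument could come from guessCharset when nothing better is known, keeping in mind that the guess may be null or wrong.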