PdfExtractor (Aperture Core 1.5.0 API)

org.semanticdesktop.aperture.extractor.pdf
Class PdfExtractor

java.lang.Object
  org.semanticdesktop.aperture.extractor.pdf.PdfExtractor

All Implemented Interfaces: Extractor

public class PdfExtractor
extends Object
implements Extractor

Extracts full-text and metadata from Adobe Acrobat (PDF) files.

Constructor Summary
`PdfExtractor()`
`PdfExtractor(boolean closeDocument)` If the closeDocument flag is set to false, the `PDDocument` is not closed after extraction and it can be retrieved with `getPDDocument()`.

Method Summary
`void`	`extract(URI id, InputStream stream, Charset charset, String mimeType, RDFContainer result)` Extracts full-text and metadata from the specified binary stream and stores the extracted information as RDF statements in the specified RDFContainer.
`org.apache.pdfbox.pdmodel.PDDocument`	`getPDDocument()` Retrieves the `PDDocument` of the most recently processed PDF file. The document is usable only if `false` was passed as closeDocument to the constructor `PdfExtractor(boolean)`; otherwise it has already been closed.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

PdfExtractor

public PdfExtractor()

PdfExtractor

public PdfExtractor(boolean closeDocument)

If the closeDocument flag is set to false, the PDDocument is not closed after extraction and can be retrieved with getPDDocument(). Closing it then becomes the caller's responsibility: if you specify false here you MUST always obtain the PDDocument instance and call PDDocument.close() on it. Otherwise PDFBox may create a temporary file that is never deleted.
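The closing obligation described above can be sketched with a try/finally block. This is a minimal illustration, not Aperture code: StubDocument and StubExtractor are hypothetical stand-ins that only mimic the documented contract, so the sketch compiles without Aperture or PDFBox on the classpath. With the real classes, the same shape would apply to PdfExtractor and PDDocument.

```java
import java.io.Closeable;

// Hypothetical stand-in for PDDocument; it only records whether close() ran.
final class StubDocument implements Closeable {
    private boolean closed;

    @Override
    public void close() {
        closed = true; // the real PDDocument.close() releases PDFBox resources
    }

    boolean isClosed() {
        return closed;
    }
}

// Hypothetical stand-in for PdfExtractor, mimicking the closeDocument flag.
final class StubExtractor {
    private final boolean closeDocument;
    private StubDocument document;

    StubExtractor(boolean closeDocument) {
        this.closeDocument = closeDocument;
    }

    // Mimics extract(...): "parses" a document and closes it only if asked to.
    void extract() {
        document = new StubDocument();
        if (closeDocument) {
            document.close();
        }
    }

    StubDocument getPDDocument() {
        return document;
    }
}

final class CloseAfterExtractSketch {
    // Extracts with closeDocument == false and always closes the retained
    // document, even if the work done on it throws.
    static StubDocument extractAndClose() {
        StubExtractor extractor = new StubExtractor(false);
        try {
            extractor.extract();
            StubDocument open = extractor.getPDDocument();
            // ... inspect the still-open document here ...
        } finally {
            StubDocument retained = extractor.getPDDocument();
            if (retained != null) {
                retained.close();
            }
        }
        return extractor.getPDDocument();
    }

    public static void main(String[] args) {
        System.out.println("closed: " + extractAndClose().isClosed());
    }
}
```

Fetching the document again inside the finally block means it is closed even when extraction succeeds but later processing fails.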

Method Detail

extract

public void extract(URI id,
                    InputStream stream,
                    Charset charset,
                    String mimeType,
                    RDFContainer result)
             throws ExtractorException

Description copied from interface: Extractor

Extracts full-text and metadata from the specified binary stream and stores the extracted information as RDF statements in the specified RDFContainer. The optionally specified Charset and MIME type can be used to direct how the stream should be parsed.

The specified InputStream is expected to already use some kind of buffering so that the Extractors are not required to internally buffer bytes to improve performance.

Specified by: extract in interface Extractor

Parameters:
id - the URI identifying the object (e.g. a file or web page) from which the stream was obtained. The generated statements should describe this URI.
stream - the InputStream delivering the raw bytes.
charset - the charset in which the inputstream is encoded (optional).
mimeType - the MIME type of the passed stream (optional).
result - the container in which this Extractor can put its created RDF statements.

Throws: ExtractorException - in case of any error during the extraction process.
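Because extract expects an already buffered stream, a caller would typically wrap a raw stream before passing it in. The following is a small sketch using only java.io; the helper name buffered is hypothetical, and the extract call appears only as a comment because it assumes Aperture on the classpath and that null is acceptable for the optional charset.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class BufferedStreamSketch {
    // Wraps the stream in a BufferedInputStream unless it already is one.
    static InputStream buffered(InputStream raw) {
        return (raw instanceof BufferedInputStream) ? raw : new BufferedInputStream(raw);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder bytes standing in for a real PDF file's content.
        byte[] bytes = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        try (InputStream in = buffered(new ByteArrayInputStream(bytes))) {
            // Assumed call shape with the real classes (not compiled here):
            // extractor.extract(id, in, null, "application/pdf", container);
            System.out.println("stream is buffered: " + (in instanceof BufferedInputStream));
        }
    }
}
```

Checking for an existing BufferedInputStream avoids stacking a second buffer on a stream the caller has already wrapped.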

getPDDocument

public org.apache.pdfbox.pdmodel.PDDocument getPDDocument()

Retrieves the PDDocument of the most recently processed PDF file. The document is usable only if false was passed as closeDocument to the constructor PdfExtractor(boolean); otherwise it has already been closed.

Returns:
