java.lang.Object
  org.semanticdesktop.aperture.extractor.pdf.PdfExtractor
public class PdfExtractor
Extracts full-text and metadata from Adobe Acrobat (PDF) files.
Constructor Summary | |
---|---|
PdfExtractor() | |
PdfExtractor(boolean closeDocument) | If the closeDocument flag is set to false, the PDDocument is not closed after extraction and can be retrieved with getPDDocument(). |
Method Summary | |
---|---|
void | extract(URI id, InputStream stream, Charset charset, String mimeType, RDFContainer result) - Extracts full-text and metadata from the specified binary stream and stores the extracted information as RDF statements in the specified RDFContainer. |
org.apache.pdfbox.pdmodel.PDDocument | getPDDocument() - Retrieves the PDDocument of the most recently processed PDF file. The document is usable only if false was passed to the constructor PdfExtractor(boolean), because otherwise it has already been closed. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public PdfExtractor()

public PdfExtractor(boolean closeDocument)
If the closeDocument flag is set to false, the PDDocument is not closed after extraction and can be retrieved with getPDDocument(). Closing it then becomes the caller's obligation: if you specify false here, you MUST always obtain the PDDocument instance and call PDDocument.close() on it. Otherwise PDFBox MAY create a temporary file which will not be deleted.
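
The ownership rule above can be sketched as a try/finally pattern. The sketch below does not use the Aperture or PDFBox API: RetainedDocument and SketchExtractor are hypothetical stand-ins defined inline so that the example compiles without either library on the classpath. With the real classes, the same shape would presumably apply to PdfExtractor(false), getPDDocument(), and PDDocument.close().

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;

public class CloseRetainedDocumentSketch {

    /** Hypothetical stand-in for PDDocument: it only records whether close() was called. */
    static final class RetainedDocument implements Closeable {
        private boolean closed;

        @Override
        public void close() {
            closed = true;
        }

        boolean isClosed() {
            return closed;
        }
    }

    /** Hypothetical stand-in that mirrors the documented shape of PdfExtractor. */
    static final class SketchExtractor {
        private final boolean closeDocument;
        private RetainedDocument document;

        SketchExtractor(boolean closeDocument) {
            this.closeDocument = closeDocument;
        }

        void extract(InputStream stream) throws IOException {
            document = new RetainedDocument();
            while (stream.read() != -1) {
                // placeholder for the real parsing work
            }
            if (closeDocument) {
                document.close();
            }
        }

        RetainedDocument getDocument() {
            return document;
        }
    }

    /** Extracts with closeDocument == false and always closes the retained document. */
    static RetainedDocument extractAndClose(InputStream stream) throws IOException {
        SketchExtractor extractor = new SketchExtractor(false);
        try {
            extractor.extract(stream);
            // ... inspect extractor.getDocument() here while it is still open ...
        } finally {
            RetainedDocument doc = extractor.getDocument();
            if (doc != null) {
                doc.close(); // the caller's obligation when closeDocument is false
            }
        }
        return extractor.getDocument();
    }

    public static void main(String[] args) throws IOException {
        RetainedDocument doc = extractAndClose(new ByteArrayInputStream(new byte[] {1, 2, 3}));
        System.out.println("closed: " + doc.isClosed());
    }
}
```

Placing the close() call in a finally block means the document is released even when extraction throws, which is the case most likely to leave a temporary file behind.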
Method Detail |
---|
public void extract(URI id, InputStream stream, Charset charset, String mimeType, RDFContainer result) throws ExtractorException
Description copied from interface: Extractor
Extracts full-text and metadata from the specified binary stream and stores the extracted information as RDF statements in the specified RDFContainer. The specified InputStream is expected to already use some kind of buffering, so that Extractors are not required to buffer bytes internally to improve performance.
Specified by:
extract in interface Extractor
Parameters:
id
- the URI identifying the object (e.g. a file or web page) from which the stream was obtained. The generated statements should describe this URI.
stream
- the InputStream delivering the raw bytes.
charset
- the charset in which the InputStream is encoded (optional).
mimeType
- the MIME type of the passed stream (optional).
result
- the container in which this Extractor can put its created RDF statements.
Throws:
ExtractorException
- in case of any error during the extraction process.

public org.apache.pdfbox.pdmodel.PDDocument getPDDocument()
Retrieves the PDDocument of the most recently processed PDF file. The document is usable only if false was passed to the constructor PdfExtractor(boolean), because otherwise it has already been closed after extraction.
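
The extract detail above states that the caller is expected to supply an already-buffered stream. A minimal sketch of that preparation step, using only JDK classes: the helper name openBuffered and the temporary file are illustrative assumptions, and the extract(...) call itself is left as a comment because it needs Aperture on the classpath.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class BufferedStreamSketch {

    /** Opens the file and wraps the raw stream so that reads are buffered. */
    static InputStream openBuffered(Path file) throws IOException {
        return new BufferedInputStream(Files.newInputStream(file));
    }

    public static void main(String[] args) throws IOException {
        // A throwaway file stands in for a real PDF; only the stream handling matters here.
        Path file = Files.createTempFile("sketch", ".pdf");
        Files.write(file, new byte[] {'%', 'P', 'D', 'F'});
        try (InputStream stream = openBuffered(file)) {
            // Hypothetical call site, assuming id and result are supplied by the caller:
            // extractor.extract(id, stream, null, "application/pdf", result);
            System.out.println(stream.getClass().getSimpleName());
        } finally {
            Files.delete(file);
        }
    }
}
```

Buffering once at the call site keeps the extractor free of its own buffering logic, which matches the division of labour the interface description asks for.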