org.semanticdesktop.aperture.extractor.wordperfect
Class WordPerfectExtractor
java.lang.Object
org.semanticdesktop.aperture.extractor.wordperfect.WordPerfectExtractor
- All Implemented Interfaces:
- Extractor
public class WordPerfectExtractor
- extends Object
- implements Extractor
An Extractor implementation for WordPerfect documents.
This implementation uses heuristic string extraction algorithms that are tuned for WordPerfect files but
have no intrinsic knowledge of the WordPerfect file format(s). Consequently, the extracted full-text may be
imperfect; for example, it may contain some noise that is not part of the document text. Also, the document
metadata is not extracted.
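To illustrate what format-agnostic heuristic string extraction can look like, here is a minimal sketch. It is
not the algorithm used by this class; it only shows the general technique of keeping runs of printable bytes
and discarding everything else. The names PrintableRunSketch, extractText and MIN_RUN_LENGTH, as well as the
threshold value and the ASCII-only rule, are assumptions made for this example.

```java
import java.io.IOException;
import java.io.InputStream;

// Illustrative only: a generic "printable run" heuristic, NOT Aperture's actual algorithm.
final class PrintableRunSketch {

    // Hypothetical threshold: runs shorter than this are treated as binary noise.
    static final int MIN_RUN_LENGTH = 4;

    /**
     * Reads the stream byte by byte and keeps every run of printable ASCII
     * (0x20..0x7E) that is at least MIN_RUN_LENGTH bytes long. Kept runs are
     * joined with single spaces. No file-format knowledge is used.
     */
    static String extractText(InputStream in) throws IOException {
        StringBuilder text = new StringBuilder();
        StringBuilder run = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            if (b >= 0x20 && b <= 0x7E) {
                run.append((char) b);   // still inside a printable run
            } else {
                flush(run, text);       // a non-printable byte ends the run
            }
        }
        flush(run, text);               // handle a run that reaches end of stream
        return text.toString();
    }

    // Appends the current run to the result if it is long enough, then resets it.
    private static void flush(StringBuilder run, StringBuilder text) {
        if (run.length() >= MIN_RUN_LENGTH) {
            if (text.length() > 0) {
                text.append(' ');
            }
            text.append(run);
        }
        run.setLength(0);
    }
}
```

For the byte sequence 0xFF, "WPC", 0x00, "Hello world", 0x01, "xy", 0x02, "Fonts", this sketch returns
"Hello world Fonts". The short runs are dropped, but "Fonts" survives even though it could be structural data
rather than document text, which is the kind of noise described above.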
Currently, the complete full-text is extracted from WordPerfect documents from version 4.2 up to
WordPerfect X3 (tested with 4.2, 5.0, 5.1/5.2 and X3 files, all created using WordPerfect X3). The exception
is the 5.1/5.2 Far East format, for which our test did not return any text at all, probably due to encoding
issues. The tests also showed that for WordPerfect 5.0 and 5.1 the document metadata ends up at the start of
the extracted full-text.
Method Summary
void extract(URI id, InputStream stream, Charset charset, String mimeType, RDFContainer result)
Extracts full-text and metadata from the specified binary stream and stores the extracted information
as RDF statements in the specified RDFContainer.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
WordPerfectExtractor
public WordPerfectExtractor()
extract
public void extract(URI id,
InputStream stream,
Charset charset,
String mimeType,
RDFContainer result)
throws ExtractorException
- Description copied from interface: Extractor
- Extracts full-text and metadata from the specified binary stream and stores the extracted information
as RDF statements in the specified RDFContainer. The optionally specified Charset and MIME type can be
used to direct how the stream should be parsed.
The specified InputStream is expected to already use some kind of buffering so that the Extractors are
not required to internally buffer bytes to improve performance.
- Specified by:
extract in interface Extractor
- Parameters:
id
- the URI identifying the object (e.g. a file or web page) from which the stream was obtained.
The generated statements should describe this URI.
stream
- the InputStream delivering the raw bytes.
charset
- the charset in which the InputStream is encoded (optional).
mimeType
- the MIME type of the passed stream (optional).
result
- the container in which this Extractor can put its created RDF statements.
- Throws:
ExtractorException
- in case of any error during the extraction process.
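The method description states that the InputStream is expected to be buffered already. One way a caller might
satisfy that expectation is sketched below, using only standard java.io classes. The class BufferingSketch and
the method ensureBuffered are hypothetical helpers for this example and are not part of Aperture.

```java
import java.io.BufferedInputStream;
import java.io.InputStream;

// Illustrative helper, not part of Aperture.
final class BufferingSketch {

    /**
     * Returns the stream unchanged if it is already a BufferedInputStream,
     * otherwise wraps it in one. This check is deliberately simple: other
     * stream types that buffer internally are wrapped again, which is
     * harmless but redundant.
     */
    static InputStream ensureBuffered(InputStream in) {
        return (in instanceof BufferedInputStream) ? in : new BufferedInputStream(in);
    }
}
```

A caller would then pass the result of ensureBuffered(rawStream) as the stream argument of extract, together
with the id and result arguments described above.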
Copyright © 2010 Aperture Development Team. All Rights Reserved.