- All Implemented Interfaces: Extractor
public class PlainTextExtractor
- extends Object
- implements Extractor
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
public static final int BYTES_TEST_LENGTH
- See Also:
- Constant Field Values
public static Charset guessCharset(InputStream iStream)
- Tries to guess the charset of the input stream, assuming that the stream contains plain text in
some encoding. The method first looks for a Byte Order Mark (BOM), which, if present, allows a highly
probable guess of the Unicode encoding. If the BOM is missing, the method falls back on a heuristic
approach (as implemented by the juniversalchardet library).
- Parameters:
iStream - the input stream to examine. It must support InputStream.mark(int): the method first calls
InputStream.mark(int), reads from the stream, and then resets the stream.
- Returns:
the guessed charset, or null if the algorithm could not decide on one. The guess is heuristic, so a
non-null return value may still be wrong, especially if the document is short.
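The BOM check and the mark/reset contract described above can be illustrated with a minimal sketch. This is not the Aperture implementation: the class BomSniffDemo and the helper sniffBom are hypothetical names, only the UTF-8 and UTF-16 marks are recognised, and the heuristic fallback (juniversalchardet) is left out entirely.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomSniffDemo {

    // Hypothetical helper, not part of Aperture: returns the charset implied by a
    // leading Byte Order Mark, or null if no UTF-8/UTF-16 BOM is present.
    // UTF-32 marks are deliberately ignored to keep the sketch short.
    static Charset sniffBom(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] head = new byte[3];
        int n = 0;
        in.mark(head.length);            // remember the current position
        try {
            // read() may deliver fewer bytes than requested, so loop until full or EOF
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break;
                }
                n += r;
            }
        } finally {
            in.reset();                  // rewind so the caller sees the stream from the start
        }
        int b0 = n > 0 ? head[0] & 0xFF : -1;
        int b1 = n > 1 ? head[1] & 0xFF : -1;
        int b2 = n > 2 ? head[2] & 0xFF : -1;
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (b0 == 0xFE && b1 == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (b0 == 0xFF && b1 == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null;                     // no BOM: a real implementation would try heuristics here
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        InputStream in = new ByteArrayInputStream(data);
        System.out.println("charset from BOM: " + sniffBom(in));
        System.out.println("bytes still unread: " + in.available());
    }
}
```

Because the stream is reset in a finally block, the caller can pass the same stream on to a parser afterwards and must skip the BOM bytes itself if one was found.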
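public void extract(URI id,
                    InputStream iStream,
                    Charset charset,
                    String mimeType,
                    RDFContainer result)
             throws ExtractorException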
- Description copied from interface: Extractor
- Extracts full-text and metadata from the specified binary stream and stores the extracted information
as RDF statements in the specified RDFContainer. The optionally specified Charset and MIME type can be
used to direct how the stream should be parsed.
The specified InputStream is expected to already use some kind of buffering so that the Extractors are
not required to internally buffer bytes to improve performance.
- Specified by:
extract in interface Extractor
- Parameters:
id - the URI identifying the object (e.g. a file or web page) from which the stream was obtained.
The generated statements should describe this URI.
iStream - the InputStream delivering the raw bytes.
charset - the charset in which the input stream is encoded (optional).
mimeType - the MIME type of the passed stream (optional).
result - the container in which this Extractor can put its created RDF statements.
- Throws:
ExtractorException - in case of any error during the extraction process.
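Since the stream passed to extract is expected to be buffered already, the caller has to do any wrapping. The following is a minimal sketch of that preparation step using only java.io. The class BufferedStreamDemo and the helper asBuffered are hypothetical names, and the call to the extractor itself is omitted because it depends on Aperture classes not shown here.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class BufferedStreamDemo {

    // Hypothetical helper, not part of Aperture: returns the stream unchanged if it
    // is already a BufferedInputStream, otherwise wraps it in one.
    static InputStream asBuffered(InputStream raw) {
        return (raw instanceof BufferedInputStream) ? raw : new BufferedInputStream(raw);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a file or network stream
        InputStream raw = new ByteArrayInputStream("plain text".getBytes(StandardCharsets.UTF_8));
        try (InputStream in = asBuffered(raw)) {
            // A BufferedInputStream also supports mark/reset, which guessCharset requires
            System.out.println("mark supported: " + in.markSupported());
            // The buffered stream would be handed to the extractor at this point
        }
    }
}
```

Wrapping once at the call site keeps the extractor free of its own buffering and gives the same stream the mark/reset support that guessCharset needs.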
Copyright © 2010 Aperture Development Team. All Rights Reserved.