java.lang.Object
  org.semanticdesktop.aperture.util.StringExtractor
public class StringExtractor
StringExtractor uses a set of heuristics to extract as much human-readable text as possible from a binary stream. This is useful for binary document formats that often or always store the document text as ASCII characters intermixed with binary parts (e.g. MS Office files). When such a document cannot be parsed with the appropriate library (e.g. Apache POI), a StringExtractor may still be able to produce some meaningful content and can thus serve as a fallback.
The output of StringExtractor is suited to text indexing but less so to human consumption: formatting will most likely be lost, and some unwanted characters slipping through cannot be prevented.
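To make the general idea concrete, the following is a deliberately simplified sketch of this kind of heuristic: scan the raw bytes and keep only runs of printable ASCII above a minimum length. It is not Aperture's implementation. The class name `NaiveStringExtractor`, the `MIN_RUN_LENGTH` threshold and the exact character test are assumptions made for illustration, and the real class applies further heuristics (start-line detection, line validation, post-processing) that are not modelled here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified illustration; not Aperture's StringExtractor. */
final class NaiveStringExtractor {

    /** Assumed threshold: shorter runs are treated as binary noise. */
    static final int MIN_RUN_LENGTH = 4;

    /** Treats printable ASCII (space through tilde) and tab as text. */
    static boolean isTextCharacter(int c) {
        return (c >= 0x20 && c <= 0x7E) || c == '\t';
    }

    /** Collects runs of text characters that are at least MIN_RUN_LENGTH long. */
    static List<String> extractRuns(byte[] data) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF; // read the byte as an unsigned value
            if (isTextCharacter(c)) {
                current.append((char) c);
            } else {
                flush(current, runs);
            }
        }
        flush(current, runs); // the data may end in the middle of a run
        return runs;
    }

    /** Stores the trimmed run if it is long enough, then resets the buffer. */
    private static void flush(StringBuilder current, List<String> runs) {
        String run = current.toString().trim();
        if (run.length() >= MIN_RUN_LENGTH) {
            runs.add(run);
        }
        current.setLength(0);
    }

    /** Reads the stream to its end without closing it, then joins the runs. */
    static String extract(InputStream stream) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = stream.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return String.join("\n", extractRuns(buffer.toByteArray()));
    }
}
```

Even this naive version shows why the result suits indexing better than reading: short fragments are dropped, layout is lost, and any byte sequence that happens to look like text is kept.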
Field Summary | |
---|---|
static String[] | COMMON_FONT_NAMES |
Constructor Summary | |
---|---|
StringExtractor() | |
Method Summary | |
---|---|
String | extract(InputStream stream) Extract all human-readable text from an InputStream. |
protected boolean | isNormalWord(String word) |
protected boolean | isStartLine(String lineLowerCase) Determines whether the supplied line indicates the start of the textual contents. |
protected boolean | isTextCharacter(int charNumber) Checks whether the supplied character is a text character. |
protected boolean | isValidLine(String lineLowerCase) Determines whether the supplied line should be included in the end result. |
protected String | postProcessLine(String line) |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final String[] COMMON_FONT_NAMES
Constructor Detail |
---|
public StringExtractor()
Method Detail |
---|
public String extract(InputStream stream) throws IOException
Parameters:
stream - The InputStream to read the bytes from. The stream will be fully consumed but not closed.
Throws:
IOException - If an I/O error occurs while reading from the InputStream.
protected boolean isStartLine(String lineLowerCase)
protected boolean isValidLine(String lineLowerCase)
protected boolean isTextCharacter(int charNumber)
protected String postProcessLine(String line)
protected boolean isNormalWord(String word)
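Because extract(InputStream) consumes the stream but leaves closing to the caller, a fallback arrangement has to buffer the input so that both the regular parser and the fallback see the complete data. The sketch below shows one way this might look. `FallbackExtraction`, `TextExtractor`, `extractWithFallback` and `readFully` are hypothetical names introduced here, not part of Aperture; the interface only mirrors the shape of extract(InputStream).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper; none of these names are part of Aperture. */
final class FallbackExtraction {

    /** Same shape as extract(InputStream): reads text from a stream. */
    interface TextExtractor {
        String extract(InputStream stream) throws IOException;
    }

    /**
     * Buffers the source once, tries the primary extractor, and falls back
     * to the second extractor if the primary one fails. The source stream
     * is consumed but not closed; closing remains the caller's job.
     */
    static String extractWithFallback(InputStream source,
                                      TextExtractor primary,
                                      TextExtractor fallback) throws IOException {
        byte[] data = readFully(source);
        try {
            return primary.extract(new ByteArrayInputStream(data));
        } catch (IOException | RuntimeException e) {
            // The primary parser could not handle the document: start over
            // from the buffered bytes with the fallback extractor.
            return fallback.extract(new ByteArrayInputStream(data));
        }
    }

    /** Reads all remaining bytes from the stream without closing it. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }
}
```

With Aperture on the classpath, the fallback argument could presumably be `new StringExtractor()::extract`. The caller would typically open the source in a try-with-resources statement so that it is closed after extraction. Buffering the whole document in memory is a simplification that may not suit very large files.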