org.semanticdesktop.aperture.util
Class StringExtractor

java.lang.Object
  extended by org.semanticdesktop.aperture.util.StringExtractor
Direct Known Subclasses:
WPStringExtractor

public class StringExtractor
extends Object

StringExtractor uses a set of heuristics to extract as much human-readable text as possible from a binary stream. This is useful for binary document formats that often or always store the document text as ASCII characters intermixed with binary data (e.g. MS Office files). When such a document cannot be parsed with the appropriate library (e.g. Apache POI), a StringExtractor may still be able to produce some meaningful content and can thus serve as a fallback.

The output of StringExtractor is suited to text indexing but less so to human consumption: any formatting will most likely be lost, and some unwanted characters may still slip through.
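
The general idea of pulling readable text out of mixed binary data can be illustrated with a minimal, self-contained sketch. It is not Aperture's actual algorithm: the class TextRunSketch, the method extractRuns and the minimum run length are hypothetical names and choices made for this illustration only. The sketch keeps runs of ASCII letters that reach a minimum length and drops everything else.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical illustration only; not part of Aperture. */
final class TextRunSketch {

    /** Collects runs of at least minRun ASCII letters, joined by single spaces. */
    static String extractRuns(InputStream in, int minRun) throws IOException {
        StringBuilder result = new StringBuilder();
        StringBuilder run = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            if (isAsciiLetter(b)) {
                run.append((char) b);
            } else {
                flush(run, result, minRun); // a non-letter byte ends the current run
            }
        }
        flush(run, result, minRun); // handle a run that reaches the end of the stream
        return result.toString();
    }

    private static boolean isAsciiLetter(int b) {
        return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z');
    }

    /** Appends the run to the result if it is long enough, then clears it. */
    private static void flush(StringBuilder run, StringBuilder result, int minRun) {
        if (run.length() >= minRun) {
            if (result.length() > 0) {
                result.append(' ');
            }
            result.append(run);
        }
        run.setLength(0);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {0, 1, 'H', 'e', 'l', 'l', 'o', 2, 'x', 3, 'w', 'o', 'r', 'l', 'd', (byte) 0xFF};
        // Prints "Hello world"; the single-letter run "x" is too short to keep.
        System.out.println(extractRuns(new ByteArrayInputStream(data), 3));
    }
}
```

As in the class description, the result loses all structure of the original document and a short stray run can still get through if it happens to reach the minimum length.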


Field Summary
static String[] COMMON_FONT_NAMES
           
 
Constructor Summary
StringExtractor()
           
 
Method Summary
 String extract(InputStream stream)
          Extract all human-readable text from an InputStream.
protected  boolean isNormalWord(String word)
           
protected  boolean isStartLine(String lineLowerCase)
          Determines whether the supplied line indicates the start of the textual contents.
protected  boolean isTextCharacter(int charNumber)
          Checks whether the supplied character is a text character.
protected  boolean isValidLine(String lineLowerCase)
          Determines whether the supplied line should be included in the end result.
protected  String postProcessLine(String line)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

COMMON_FONT_NAMES

public static final String[] COMMON_FONT_NAMES

Constructor Detail

StringExtractor

public StringExtractor()

Method Detail

extract

public String extract(InputStream stream)
               throws IOException
Extract all human-readable text from an InputStream.

Parameters:
stream - The InputStream to read the bytes from. The stream will be fully consumed but not closed.
Returns:
The resulting, heuristically determined text. A String is always returned, although it can be empty.
Throws:
IOException - When reading characters from the InputStream caused an IOException.
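
Because the stream is consumed but not closed, closing it is left to the caller. The following sketch shows one way a caller might handle that. The nested Extractor interface is a hypothetical stand-in that only copies the extract(InputStream) signature documented above so that the example compiles without Aperture on the classpath; extractAndClose is likewise an illustrative name.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical caller-side sketch; not part of Aperture. */
final class ExtractCallSketch {

    /** Stand-in with the same signature as the documented extract method. */
    interface Extractor {
        String extract(InputStream stream) throws IOException;
    }

    /** Runs the extractor and closes the stream afterwards, even on failure. */
    static String extractAndClose(Extractor extractor, InputStream stream) throws IOException {
        try (InputStream in = stream) {
            // Per the documentation, the result is never null but may be empty.
            return extractor.extract(in);
        }
    }

    public static void main(String[] args) throws IOException {
        // Dummy extractor that drains the stream and reports how much it read.
        Extractor counting = s -> {
            int n = 0;
            while (s.read() != -1) {
                n++;
            }
            return n + " bytes read";
        };
        System.out.println(extractAndClose(counting, new ByteArrayInputStream(new byte[] {1, 2, 3})));
    }
}
```

With the real class, the same try-with-resources pattern would presumably wrap a call to extract on a StringExtractor instance, followed by an isEmpty() check on the returned String.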

isStartLine

protected boolean isStartLine(String lineLowerCase)
Determines whether the supplied line indicates the start of the textual contents. If this method returns true, all text extracted up to this point is discarded, i.e. text extraction starts again from scratch at the current location in the stream. The specified line is expected to be fully lowercased. This default implementation returns false.
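
The restart behaviour described above can be sketched as follows. This is an illustration under assumptions, not Aperture's code: StartLineSketch and collect are hypothetical names, the hook is modelled as a Predicate, and the sketch assumes that the marker line itself is dropped, which the documentation does not state.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;

/** Hypothetical illustration of the restart semantics; not part of Aperture. */
final class StartLineSketch {

    /** Joins lines, discarding everything collected so far whenever a start line is seen. */
    static String collect(List<String> lines, Predicate<String> isStartLine) {
        StringBuilder text = new StringBuilder();
        for (String line : lines) {
            // The hook receives the lowercased line, as the documentation requires.
            if (isStartLine.test(line.toLowerCase(Locale.ROOT))) {
                text.setLength(0); // forget all text extracted up to this point
                continue;          // assumption: the marker line itself is not kept
            }
            if (text.length() > 0) {
                text.append('\n');
            }
            text.append(line);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Arial", "Times New Roman", "BODY START", "Dear reader", "Regards");
        // Only the two lines after the marker remain in the output.
        System.out.println(collect(lines, l -> l.equals("body start")));
    }
}
```

A subclass would achieve the same effect by overriding isStartLine to recognise a format-specific marker; with the default implementation nothing is ever discarded.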


isValidLine

protected boolean isValidLine(String lineLowerCase)
Determines whether the supplied line should be included in the end result. The specified line is expected to be fully lowercased.


isTextCharacter

protected boolean isTextCharacter(int charNumber)
Checks whether the supplied character is a text character. By default, this method returns true for letters and single quotes.
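
The sketch below mirrors the documented default and shows how a subclass might widen it. The classes Base and WithDigits are hypothetical stand-ins so the example compiles without Aperture, and the use of Character.isLetter is an assumption about what "letters" means here.

```java
/** Hypothetical illustration of overriding the character test; not part of Aperture. */
final class TextCharSketch {

    /** Mirrors the documented default: letters and single quotes count as text. */
    static class Base {
        protected boolean isTextCharacter(int charNumber) {
            // Assumption: "letters" is interpreted via Character.isLetter.
            return Character.isLetter(charNumber) || charNumber == '\'';
        }
    }

    /** Hypothetical subclass that additionally accepts ASCII digits. */
    static class WithDigits extends Base {
        @Override
        protected boolean isTextCharacter(int charNumber) {
            return super.isTextCharacter(charNumber)
                    || (charNumber >= '0' && charNumber <= '9');
        }
    }

    public static void main(String[] args) {
        Base base = new Base();
        Base digits = new WithDigits();
        System.out.println(base.isTextCharacter('7'));   // false
        System.out.println(digits.isTextCharacter('7')); // true
    }
}
```

Accepting more characters tends to keep more real text, such as numbers and dates, at the cost of letting more binary noise through.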


postProcessLine

protected String postProcessLine(String line)

isNormalWord

protected boolean isNormalWord(String word)


Copyright © 2010 Aperture Development Team. All Rights Reserved.