org.semanticdesktop.aperture.extractor.microsoft.util
Class PoiUtil

java.lang.Object
  extended by org.semanticdesktop.aperture.extractor.microsoft.util.PoiUtil

public class PoiUtil
extends Object

Provides Apache POI-specific utility methods for text and metadata extraction.

Some methods buffer the InputStream so that it can be reset to its start. The buffer size can be changed by setting the "aperture.poiUtil.bufferSize" system property to the number of bytes that the buffer may use.


Nested Class Summary
static class PoiUtil.NonCloseableStream
           
static interface PoiUtil.TextExtractor
          A TextExtractor is a delegate that extracts the full text from an MS Office document using a POIFSFileSystem.
 
Constructor Summary
PoiUtil()
           
 
Method Summary
static InputStream extractAll(InputStream stream, PoiUtil.TextExtractor textExtractor, RDFContainer container, Logger logger)
          Extract full-text and metadata from an MS Office document contained in the specified stream.
static void extractMetadata(org.apache.poi.poifs.filesystem.DirectoryNode dirNode, RDFContainer container)
           
static InputStream extractMetadata(InputStream stream, boolean resetStream, RDFContainer container)
          Extract all metadata from an OLE document.
static void extractMetadata(org.apache.poi.poifs.filesystem.POIFSFileSystem poiFileSystem, RDFContainer container)
          Extracts all metadata from the POIFSFileSystem's SummaryInformation and transforms it to RDF statements that are stored in the specified RDFContainer.
static int getBufferSize()
          Returns the buffer size to use when buffering the contents of a document.
static org.apache.poi.hpsf.DocumentSummaryInformation getDocumentSummaryInformation(org.apache.poi.poifs.filesystem.POIFSFileSystem poiFileSystem)
          Returns the DocumentSummaryInformation holding the document metadata from a POIFSFileSystem.
static org.apache.poi.hpsf.SummaryInformation getSummaryInformation(org.apache.poi.poifs.filesystem.POIFSFileSystem poiFileSystem)
          Returns the SummaryInformation holding the document metadata from a POIFSFileSystem.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

PoiUtil

public PoiUtil()
Method Detail

getSummaryInformation

public static org.apache.poi.hpsf.SummaryInformation getSummaryInformation(org.apache.poi.poifs.filesystem.POIFSFileSystem poiFileSystem)
Returns the SummaryInformation holding the document metadata from a POIFSFileSystem. Any POI-related or I/O Exceptions that may occur during this operation are ignored and 'null' is returned in those cases.

Parameters:
poiFileSystem - The POI file system to obtain the metadata from.
Returns:
A populated SummaryInformation, or 'null' when the relevant document parts could not be located.

getDocumentSummaryInformation

public static org.apache.poi.hpsf.DocumentSummaryInformation getDocumentSummaryInformation(org.apache.poi.poifs.filesystem.POIFSFileSystem poiFileSystem)
Returns the DocumentSummaryInformation holding the document metadata from a POIFSFileSystem. Any POI-related or I/O Exceptions that may occur during this operation are ignored and 'null' is returned in those cases.

Parameters:
poiFileSystem - The POI file system to obtain the metadata from.
Returns:
A populated DocumentSummaryInformation, or 'null' when the relevant document parts could not be located.

extractMetadata

public static InputStream extractMetadata(InputStream stream,
                                          boolean resetStream,
                                          RDFContainer container)
                                   throws IOException
Extract all metadata from an OLE document.

Parameters:
stream - The stream containing the OLE document.
resetStream - Specifies whether the stream should be buffered and reset. The buffer size is determined by the system property described in the class documentation.
container - The RDFContainer to store the metadata in.
Returns:
If the stream passed as the input parameter supports mark(), it is returned as-is; otherwise the stream is wrapped in a BufferedInputStream, which supports mark/reset, and the BufferedInputStream is returned.
Throws:
IOException - When resetting the buffer results in an IOException.
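
The return contract and the IOException case above can be illustrated with plain java.io classes. The following is a minimal sketch, not PoiUtil's actual implementation: the class MarkResetSketch and its methods ensureMarkSupported, readThenReset and plainStream are hypothetical names introduced for illustration, and the buffer size of 8 bytes is chosen only to make the failure case easy to trigger.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class MarkResetSketch {

    // Hypothetical helper: return the stream itself when it supports mark(),
    // otherwise wrap it in a BufferedInputStream of the given size.
    static InputStream ensureMarkSupported(InputStream stream, int bufferSize) {
        return stream.markSupported() ? stream : new BufferedInputStream(stream, bufferSize);
    }

    // Marks the stream, consumes up to bytesToRead bytes, then tries to reset.
    // Returns false when reset() fails, e.g. because more bytes were read
    // than the mark limit allows.
    static boolean readThenReset(InputStream stream, int markLimit, int bytesToRead) {
        try {
            stream.mark(markLimit);
            for (int i = 0; i < bytesToRead; i++) {
                if (stream.read() < 0) {
                    break;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try {
            stream.reset();
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // A stream without mark support (InputStream's default), for demonstration.
    static InputStream plainStream(byte[] data) {
        ByteArrayInputStream source = new ByteArrayInputStream(data);
        return new InputStream() {
            @Override
            public int read() {
                return source.read();
            }
        };
    }

    public static void main(String[] args) {
        InputStream raw = plainStream(new byte[100]);
        InputStream usable = ensureMarkSupported(raw, 8);

        System.out.println("wrapped: " + (usable != raw));
        System.out.println("reset within limit: " + readThenReset(usable, 8, 4));
        System.out.println("reset beyond limit: " + readThenReset(usable, 8, 50));
    }
}
```

Callers should continue reading from the returned stream rather than the one they passed in, since only the returned stream is guaranteed to support mark/reset. If a document is larger than the buffer, a reset can fail as in the last call of the sketch, which is one reason to raise the buffer size.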

extractMetadata

public static void extractMetadata(org.apache.poi.poifs.filesystem.POIFSFileSystem poiFileSystem,
                                   RDFContainer container)
Extracts all metadata from the POIFSFileSystem's SummaryInformation and transforms it to RDF statements that are stored in the specified RDFContainer.

Parameters:
poiFileSystem - The POI file system to obtain the metadata from.
container - The RDFContainer to store the created RDF statements in.

extractMetadata

public static void extractMetadata(org.apache.poi.poifs.filesystem.DirectoryNode dirNode,
                                   RDFContainer container)

extractAll

public static InputStream extractAll(InputStream stream,
                                     PoiUtil.TextExtractor textExtractor,
                                     RDFContainer container,
                                     Logger logger)
Extract full-text and metadata from an MS Office document contained in the specified stream. A TextExtractor is specified to handle the specifics of full-text extraction for this particular MS Office document type.

Returns:
If the stream passed as the input parameter supports mark(), it is returned as-is; otherwise the stream is wrapped in a BufferedInputStream, which supports mark/reset, and the BufferedInputStream is returned.

getBufferSize

public static int getBufferSize()
Returns the buffer size to use when buffering the contents of a document.

Returns:
The buffer size specified by the "aperture.poiUtil.bufferSize" system property, or the default size when that system property is not set or has an illegal value.
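
The fallback behaviour described above can be sketched as follows. This is an illustration under assumptions, not the code of getBufferSize(): the class BufferSizeSketch, the method resolveBufferSize and the constant ASSUMED_DEFAULT are hypothetical, the real default size is not stated in this documentation, and treating non-positive values as illegal is likewise an assumption. Only the property name comes from the class documentation.

```java
class BufferSizeSketch {

    // Property name as given in the class documentation.
    static final String PROPERTY = "aperture.poiUtil.bufferSize";

    // Hypothetical default; the actual default used by PoiUtil is not documented here.
    static final int ASSUMED_DEFAULT = 4 * 1024 * 1024;

    // Returns the parsed size, or defaultSize when the value is missing,
    // not a number, or not positive (the last rule is an assumption).
    static int resolveBufferSize(String rawValue, int defaultSize) {
        if (rawValue == null) {
            return defaultSize;
        }
        try {
            int size = Integer.parseInt(rawValue.trim());
            return size > 0 ? size : defaultSize;
        } catch (NumberFormatException e) {
            return defaultSize;
        }
    }

    public static void main(String[] args) {
        // Equivalent to starting the JVM with -Daperture.poiUtil.bufferSize=65536
        System.setProperty(PROPERTY, "65536");
        System.out.println(resolveBufferSize(System.getProperty(PROPERTY), ASSUMED_DEFAULT));
        // prints 65536
    }
}
```

The property has to be set before any extraction method runs, for example on the command line with -Daperture.poiUtil.bufferSize=65536.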


Copyright © 2010 Aperture Development Team. All Rights Reserved.