Extractors extract the full-text and/or metadata of a particular document type (one or more MIME types). They operate on an InputStream, optionally accompanied by a MIME type and/or a Charset to tune the processing, and produce a set of RDF statements describing the full-text and metadata.
The extractor classes are summarized in the following diagram:
The current set of Extractors focus on typical document-like formats, such as word processor documents, spreadsheets and presentations. Future implementations are planned that also target images, videos and sound files.
Below is a list of available Extractors, their external dependencies and remarks.
These implementations may vary in performance and extraction quality. Some use dedicated external libraries for handing a specific document format or family of document formats, others merely use a heuristic algorithm to extract readable text from a binary stream.
| Extractor | Dependencies | Remarks |
|---|---|---|
| ExcelExtractor | Poi libraries | Tests indicate that Excel 97 and higher are supported. Both the full-text and metadata are retrieved. |
| HtmlExtractor | Htmlparser | -- |
| MimeExtractor | JavaMail | An Extractor for message/rfc822 and message/news documents. Both the most significant headers and the body are extracted. Any attachments are ignored. |
| OfficeExtractor | Poi libraries | An Extractor that can be used as a fall-back when the MIME type identifier was able to identify a document as an MS Office document but was not able to further classify it, e.g. as an MS Word file. Both text and metadata are extracted. |
| OpenDocumentExtractor | -- | Extracts full-text and metadata from OpenDocument files and is backwards compatible with older OpenOffice (1.x) and StarOffice (6.x and 7.x) documents. |
| PdfExtractor | PDFBox | Extracts full-text and metadata from all PDF versions. |
| PlainTextExtractor | -- | -- |
| PowerPointExtractor | Poi libraries | Text and metadata are extracted. Text extraction is noisy but sufficient for text indexing, if you're willing to accept that your index will contain non-word symbols. |
| PresentationsExtractor | Poi libraries | Apparently Presentations files can have an OLE structure similar to MS Office files or use a document structure similar to WordPerfect. In both cases text can be extracted. In the first case metadata is also extracted. |
| PublisherExtractor | Poi libraries | Poi is only used for document metadata retrieval, text retrieval uses a heuristic string extraction algorithm. |
| QuattroExtractor | Poi libraries | Only recent Quattro Formats, as used by Quattro Pro 7 and Quattro Pro X3, are supported as these have a structure similar to MS Office documents. Older versions are not supported. Poi is only used for metadata retrieval, text retrieval uses a heuristic string extraction algorithm in both cases. |
| RtfExtractor | None (uses the JRE's RTFEditorKit) | Only document text is extracted. Otherwise no known issues. |
| VisioExtractor | Poi libraries | Poi is only used for document metadata retrieval, text retrieval uses a heuristic string extraction algorithm. |
| WordExtractor | Poi libraries | Tests indicate that Word 97 and higher are supported. Both text and metadata are extracted. |
| WordPerfectExtractor | -- | Implementation only extracts full-text. Text is extracted from WordPerfect documents from version 4.2 up to WordPerfect X3 (tested with 4.2, 5.0, 5.1/5.2 and X3, all created using WordPerfect X3), except for the 5.1/5.2 Far East format. Tests revealed that for WordPerfect 5.0 and 5.1 the document metadata also ends up at the start of the extracted full-text. |
| WorksExtractor | -- | Implementation only extracts text and apparently only works well on Works 3.0 and 4.0 documents and Works 4.0/2000 spreadsheets. Other versions typically produce garbage, if anything at all. |
| XmlExtractor | -- | Extracts all PCDATA and attribute values in the order in which they appear in the document. |
Note regarding the Poi libraries: in accidental cases Poi cannot process a document correctly, leading to some sort of Exception. In that case the Extractor typically catches and disposes of the Exception and falls back to applying a heuristic string extraction algorithm on the binary stream, which very often works surprisingly well on MS Office formats.
The following code demonstrates how to apply an Extractor on a given File and dump the extraction results to System.out (using NTriples encoding):
1 public class ExtractorExample {
2 public static void main(String[] args) throws Exception {
3 // create a MimeTypeIdentifier
4 MimeTypeIdentifier identifier = new MagicMimeTypeIdentifier();
5
6 // create an ExtractorRegistry containing all available
7 // ExtractorFactories
8 ExtractorRegistry extractorRegistry = new DefaultExtractorRegistry();
9
10 // read as many bytes of the file as desired by the MIME type identifier
11 File file = new File("somefile.someextension");
12 FileInputStream stream = new FileInputStream(file);
13 BufferedInputStream buffer = new BufferedInputStream(stream);
14 byte[] bytes = IOUtil.readBytes(buffer, identifier.getMinArrayLength());
15 stream.close();
16
17 // let the MimeTypeIdentifier determine the MIME type of this file
18 String mimeType = identifier.identify(bytes, file.getPath(), null);
19
20 // skip when the MIME type could not be determined
21 if (mimeType == null) {
22 System.err.println("MIME type could not be established.");
23 return;
24 }
25
26 // create the RDFContainer that will hold the RDF model
27 URI uri = URIImpl.create(file.toURI().toString());
28 Model model = new RepositoryModel(false);
29 model.open();
30 RDFContainer container = new RDFContainerImpl(model, uri);
31
32 // determine and apply an Extractor that can handle this MIME type
33 Set factories = extractorRegistry.get(mimeType);
34 if (factories != null && !factories.isEmpty()) {
35 // just fetch the first available Extractor
36 ExtractorFactory factory = (ExtractorFactory) factories.iterator()
37 .next();
38 Extractor extractor = factory.get();
39
40 // apply the extractor on the specified file
41 // (just open a new stream rather than buffer the previous stream)
42 stream = new FileInputStream(file);
43 buffer = new BufferedInputStream(stream, 8192);
44 extractor.extract(uri, buffer, null, mimeType, container);
45 stream.close();
46 }
47
48 // add the MIME type as an additional statement to the RDF model
49 container.add(DATA.mimeType, mimeType);
50
51 // report the output to System.out
52
53 container.getModel().writeTo(new PrintWriter(System.out),
54 Syntax.Ntriples);
55
56 }
57 }
58