This guide is dedicated to all those who would simply like to do what Aperture is for: crawl a filesystem and extract everything there is to be extracted, that is, file metadata and contents. All of this can be accomplished in a single class, available in the src/examples folder. It is called TutorialCrawlingExample.
Basically, in order to extract RDF information from a data source we need... well, a data source, that is, an instance of the DataSource class. DataSources come in many flavours (see DataSources). The one we'll be interested in is the FileSystemDataSource. The configuration of a data source is stored in an RDFContainer, an interface that makes access to the RDF store easy; it works more or less like a hash map. We can operate on it directly or through a convenience class called ConfigurationUtil (for details see RDF usage in Aperture). We'll choose the second approach. The configuration of a data source boils down to five lines of code:
    Model model = RDF2Go.getModelFactory().createModel();
    RDFContainer configuration = new RDFContainerImpl(model, new URIImpl("source:testSource"));
    ConfigurationUtil.setRootFolder(rootFile.getAbsolutePath(), configuration);
    DataSource source = new FileSystemDataSource();
    source.setConfiguration(configuration);

This snippet does the following things: it creates an in-memory RDF2Go model, wraps it in an RDFContainer identified by the URI source:testSource, stores the path of the folder we want to crawl in that configuration, instantiates a FileSystemDataSource and finally attaches the configuration to it.
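As an aside, the hash-map-like nature of RDFContainer mentioned above means the configuration can also be manipulated directly, without ConfigurationUtil. The following is a purely illustrative sketch: the property URI is made up for the example, and the put/getString accessors are assumed to behave as described in RDF usage in Aperture.

    // illustrative only: treat the container as a map from property URIs to values
    URI exampleProperty = new URIImpl("http://example.org/tutorial#comment"); // made-up property
    configuration.put(exampleProperty, "my test data source");
    String comment = configuration.getString(exampleProperty);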
The second stage, setting up and firing a crawler, is done in another five lines:
    FileSystemCrawler crawler = new FileSystemCrawler();
    crawler.setDataSource(source);
    crawler.setDataAccessorRegistry(new DefaultDataAccessorRegistry());
    crawler.setCrawlerHandler(new TutorialCrawlerHandler());
    crawler.crawl();

This piece of code instantiates a FileSystemCrawler, points it at the data source configured above, gives it a DefaultDataAccessorRegistry (which it uses to obtain access to the files it encounters), registers our own TutorialCrawlerHandler to be notified about everything the crawler finds, and finally starts the crawl.
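For orientation, the two snippets can be combined into a single method. The following is only a sketch: the method name and the way the root folder is obtained are illustrative, the imports are omitted, and the code is assumed to live in the same class that declares the TutorialCrawlerHandler discussed below (see the full TutorialCrawlingExample for the real thing).

    // sketch only - method name and parameter handling are illustrative
    public void doCrawling(File rootFile) throws Exception {
        // first stage: configure the data source
        Model model = RDF2Go.getModelFactory().createModel();
        RDFContainer configuration = new RDFContainerImpl(model, new URIImpl("source:testSource"));
        ConfigurationUtil.setRootFolder(rootFile.getAbsolutePath(), configuration);
        DataSource source = new FileSystemDataSource();
        source.setConfiguration(configuration);

        // second stage: set up and fire the crawler
        FileSystemCrawler crawler = new FileSystemCrawler();
        crawler.setDataSource(source);
        crawler.setDataAccessorRegistry(new DefaultDataAccessorRegistry());
        crawler.setCrawlerHandler(new TutorialCrawlerHandler());
        crawler.crawl();
    }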
The crawler handler is actually very simple. Aperture provides a class called CrawlerHandlerBase. Note that it is not available in the Aperture jar itself; you need the examples jar file to use it. It encapsulates the default methods, so in the simplest use case of a crawler only five methods need to be provided. They are summarized in this snippet:
    private class TutorialCrawlerHandler extends CrawlerHandlerBase {

        // our 'persistent' modelSet
        private ModelSet modelSet;

Constructor - initializes the underlying modelSet, the RDF store that will contain all generated RDF statements. In this example we use the default createModelSet() method, which creates a model set backed by an in-memory repository with no inference. We could just as well use a persistent model whose content is stored in a file or in a relational database (a lightweight, file-based way of keeping the results is sketched a little further below).
        public TutorialCrawlerHandler() throws ModelException {
            modelSet = RDF2Go.getModelFactory().createModelSet();
        }

crawlStopped - the method called by the crawler when it has finished the crawling process. At that point the repository (our modelSet) will contain all data that has been extracted from the file system, that is, the file metadata (names, sizes, dates of last modification etc.) and the contents extracted from files recognized as one of the supported file types (see Extractors for details on this process). Don't forget to close the modelSet after you're done with it (note the modelSet.close() call at the end of the method).
        public void crawlStopped(Crawler crawler, ExitCode exitCode) {
            try {
                modelSet.writeTo(System.out, Syntax.Trix);
            }
            catch (Exception e) {
                throw new RuntimeException(e);
            }

            modelSet.close();
        }

getRDFContainer - every time a new data object (in this case a file) is encountered, the crawler has to store the RDF data in some RDF container, so it asks the handler to provide one. This approach gives us some flexibility. In this particular program we use that flexibility to make every container a fresh one, backed by a new, empty in-memory model. As a result, the information about the different DataObjects is nicely separated: they won't interfere with each other, and we can decide for ourselves what to do with each DataObject.
        public RDFContainer getRDFContainer(URI uri) {
            // we create a new in-memory temporary model for each data object
            Model model = RDF2Go.getModelFactory().createModel(uri);
            // note that the model is opened when passed to an RDFContainer
            return new RDFContainerImpl(model, uri);
        }
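As mentioned next to the constructor, a persistent, file- or database-backed model set could be used instead of the in-memory one; setting that up depends on which RDF2Go adapter is on the classpath, so it is not shown here. A lighter-weight option is to keep the in-memory store and simply serialize it to a file when the crawl stops. The following variant of crawlStopped is a minimal sketch of that idea; it only uses the writeTo method shown above, and the file name is arbitrary.

        // sketch: a crawlStopped variant that saves the results to a file instead of System.out
        public void crawlStopped(Crawler crawler, ExitCode exitCode) {
            try {
                OutputStream out = new FileOutputStream("crawl-results.trix"); // arbitrary file name
                modelSet.writeTo(out, Syntax.Trix);
                out.close();
            }
            catch (Exception e) {
                throw new RuntimeException(e);
            }
            finally {
                // close the store regardless of whether the serialization succeeded
                modelSet.close();
            }
        }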
Now we see the power of Aperture: objectNew, the most important method in every application that uses Aperture. This method is called by the crawler whenever a new data object is found. For applications that don't keep information about previous crawls, and are thus unable to tell whether an object has been encountered before, this will be the only method that really matters. In this example we simply move the metadata from the data object (backed by the in-memory model we created in the getRDFContainer method) to our 'persistent' modelSet. We could just as well do anything we like with the data: analyze it in any way, show it to the user, serialize it to a file, feed it to Lucene for later searching. The sky is the limit.
Note that the processBinary method from CrawlerHandlerBase is used. It tries to find and use an extractor to augment the metadata provided by the crawler itself. Its implementation has been omitted from this example, but the reader is heartily advised to get acquainted with it, since using extractors is a common task for all Aperture users; a rough sketch of what such a method involves is given after the complete handler listing below.
        public void objectNew(Crawler crawler, DataObject object) {
            // first we try to extract the information from the binary file
            processBinary(object);
            // then we add this information to our persistent model
            modelSet.addModel(object.getMetadata().getModel());
            // don't forget to dispose of the DataObject
            object.dispose();
        }

objectChanged - this method is a variant of the previous one. It is used by crawlers that keep track of their crawls and can therefore distinguish between objects that have been encountered before and those that haven't. See Incremental Crawling for details.
        public void objectChanged(Crawler crawler, DataObject object) {
            // first we remove the old information about the data object
            modelSet.removeModel(object.getID());
            // then we try to extract metadata and fulltext from the file
            processBinary(object);
            // and then we add the information from the temporary model to our
            // 'persistent' model
            modelSet.addModel(object.getMetadata().getModel());
            // don't forget to dispose of the DataObject
            object.dispose();
        }

objectRemoved - lastly, another method often called by crawlers that use the incremental crawling facilities. It is called whenever the crawler finds out that a data object has been deleted from the data source (e.g. a file was deleted). This method lets us update the 'persistent' RDF store to reflect the deletion.
        public void objectRemoved(Crawler crawler, URI uri) {
            // we simply remove the information
            modelSet.removeModel(uri);
        }
    }
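As promised, here is a rough sketch of what a processBinary-style helper does. This is not the actual code of CrawlerHandlerBase but a simplified reconstruction based on Aperture's extractor API; the exact signatures (in particular MimeTypeIdentifier.identify, ExtractorRegistry.get and Extractor.extract) and the helper classes used here should be checked against the javadoc of the version you are using. Imports from java.io, java.util and the Aperture packages are omitted.

    // simplified sketch of an extraction step - verify against the real CrawlerHandlerBase
    protected void processBinary(DataObject object) {
        if (!(object instanceof FileDataObject)) {
            return; // only file-like objects carry a content stream
        }
        try {
            InputStream content = ((FileDataObject) object).getContent();
            // buffer the stream so we can peek at its first bytes and still hand it to the extractor
            BufferedInputStream buffered = new BufferedInputStream(content);

            // guess the MIME type from the first bytes of the stream
            MimeTypeIdentifier identifier = new MagicMimeTypeIdentifier();
            buffered.mark(identifier.getMinArrayLength() + 1);
            byte[] firstBytes = IOUtil.readBytes(buffered, identifier.getMinArrayLength());
            String mimeType = identifier.identify(firstBytes, null, object.getID());
            buffered.reset();

            if (mimeType != null) {
                // ask the registry for an extractor capable of handling this MIME type and let it
                // add fulltext and additional metadata to the data object's RDFContainer
                ExtractorRegistry registry = new DefaultExtractorRegistry(); // in real code, keep a single instance
                Set factories = registry.get(mimeType);
                if (!factories.isEmpty()) {
                    Extractor extractor = ((ExtractorFactory) factories.iterator().next()).get();
                    extractor.extract(object.getID(), buffered, null, mimeType, object.getMetadata());
                }
            }
        }
        catch (Exception e) {
            // in a tutorial setting we simply report the problem and continue with the next file
            e.printStackTrace();
        }
    }

One related note: the objectChanged and objectRemoved callbacks shown above only come into play when the crawler is able to remember the results of previous crawls, which requires giving it an AccessData instance before crawling; see Incremental Crawling for how to set this up.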
If this short demonstration got you interested, see the entire working example in org.semanticdesktop.aperture.examples.TutorialCrawlingExample. There are also numerous other examples in that package. Apart from the examples, there is still plenty to read in the rest of this documentation. Enjoy Aperture!