The Use of RDF in Aperture

Aperture makes heavy use of RDF graphs to communicate information between components. For example, Extractors return the full-text and metadata they extract as an RDF model and Crawlers do the same for the source-specific content and metadata they obtain through crawling.

The rationale for using RDF is that we want an expressive and flexible way to let these components communicate their results. Compared to e.g. a Java Map, an RDF model allows for much more expressive information models, enabling future API implementations to export metadata with a high level of granularity without making any assumptions upfront on what it will look like.

To streamline metadata extraction and communication, we've developed two Aperture namespaces. One is meant for modelling data sources, e.g. a root dir on a file system with include and exclude patterns. The other namespace is for describing the actual set of crawled resources, e.g. their names, content and other properties determined through crawling and content extraction.

RDFContainer

The power of RDF is also its main weakness: RDF models can be complex to handle compared to, e.g., Java Maps. RDF is also relatively new to the Java community at large, which may be a burden for new developers who consider extending and improving Aperture. There is no uniform API for handling RDF models, as there is for XML documents (DOM, SAX...). Instead, a number of competing and evolving de facto standards exist, each with its pros and cons. These systems also tend to focus on specific use cases, such as ontology engineering or storage and querying, leading to complex and specialized APIs. For example, Sesame's Repository API deals with concepts such as transactions, contexts, etc. (RDF2Go tries to solve these problems but they won't fade away overnight.) For a simple use case like an Extractor implementation, you don't want implementors to have to consider these issues at all. These should be application-level matters, not framework-level matters.

Therefore we have developed a simple API for RDF models called RDFContainer. It contains a set of methods mimicking the Map API to ease understanding while still allowing for addition of arbitrary RDF statements in a simple way. We tried to find a balance between what we envisioned we would need in our systems without overcomplicating things. RDFContainer instances are passed to e.g. Extractor and DataAccessor implementations to put all metadata in.

RDFContainer has built-in support for a number of Java core types, such as integers, booleans and dates. This relieves developers of the burden of figuring out how to convert such data types to their RDF counterparts, while also enforcing a uniform way in which these data types are converted across the entire system.
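
To give an idea of what such a conversion involves, here is a minimal, library-independent sketch. It is not Aperture code: the class TypedLiteralSketch, its toLiteral method and the particular XSD datatypes chosen for each Java type are assumptions made for illustration only. It shows a Java value being turned into a lexical form plus a datatype URI, which is roughly what a typed RDF literal consists of.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

/** Library-independent sketch of Java-to-RDF literal conversion; not Aperture code. */
class TypedLiteralSketch {

    /** Namespace of the XML Schema datatypes. */
    static final String XSD = "http://www.w3.org/2001/XMLSchema#";

    /** A typed literal: a lexical form plus a datatype URI. */
    static final class TypedLiteral {
        final String lexicalForm;
        final String datatype;

        TypedLiteral(String lexicalForm, String datatype) {
            this.lexicalForm = lexicalForm;
            this.datatype = datatype;
        }

        @Override
        public String toString() {
            return "\"" + lexicalForm + "\"^^<" + datatype + ">";
        }
    }

    /** Converts a few Java core types; the chosen XSD types are assumptions. */
    static TypedLiteral toLiteral(Object value) {
        if (value instanceof String) {
            return new TypedLiteral((String) value, XSD + "string");
        } else if (value instanceof Integer) {
            return new TypedLiteral(value.toString(), XSD + "int");
        } else if (value instanceof Boolean) {
            return new TypedLiteral(value.toString(), XSD + "boolean");
        } else if (value instanceof Calendar) {
            // Keep the calendar's own time zone when rendering the date.
            Calendar calendar = (Calendar) value;
            ZonedDateTime dateTime = ZonedDateTime.ofInstant(
                    calendar.toInstant(), calendar.getTimeZone().toZoneId());
            return new TypedLiteral(
                    dateTime.format(DateTimeFormatter.ISO_OFFSET_DATE_TIME),
                    XSD + "dateTime");
        }
        throw new IllegalArgumentException("Unsupported value: " + value);
    }

    public static void main(String[] args) {
        // Midnight UTC on 19 January 1984; clear() resets the time-of-day fields.
        Calendar dateOfBirth = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        dateOfBirth.clear();
        dateOfBirth.set(1984, Calendar.JANUARY, 19);

        System.out.println(toLiteral("John Smith"));
        System.out.println(toLiteral(85));
        System.out.println(toLiteral(true));
        System.out.println(toLiteral(dateOfBirth));
    }
}
```

Centralising this mapping in one place is what keeps the representation of, say, dates identical across all components that write to the same model.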

Simple add methods

Often we don't need to store complicated graphs to express the necessary information. In many cases what we really need is the functionality of a simple hash map that lets us store name-value pairs describing a central resource. This is exactly what the RDFContainer is all about. Conceptually it is a star-like data structure. It has a central point, a Uniform Resource Identifier (URI), that identifies what we want to describe. Around this central point there are other points, representing property values. They are connected to the central URI by links that are labeled with the appropriate property names. A visualisation of a typical, simple RDFContainer might look like this.

Here we have information about a guy named John. We know that he is a human, more specifically a man. We know that his full name is John Smith. We also know that he lives at 14 Blue Ridge Circle, Beverly Hills, USA. What's more, he was born on 19 January 1984.

The code that creates an RDFContainer holding this information, ready to be stored in an RDF store, might look as follows.

// The container describes a single central resource: John.
Model model = RDF2Go.getModelFactory().createModel();
RDFContainer container = new RDFContainerImpl(model, MY_NS + "John");

// add allows several values for the same property (here: two types).
container.add(RDF.TYPE, new URIImpl(MY_NS + "human"));
container.add(RDF.TYPE, new URIImpl(MY_NS + "man"));
container.add(new URIImpl(MY_NS + "name"), "John Smith");

// Set the date fields directly; Calendar.add would shift the current date instead.
Calendar dateOfBirth = new GregorianCalendar(1984, Calendar.JANUARY, 19);
container.add(new URIImpl(MY_NS + "dateOfBirth"), dateOfBirth);
container.add(new URIImpl(MY_NS + "weightInKg"), 85);
container.add(new URIImpl(MY_NS + "address"), "14 Blue Ridge Circle, Beverly Hills, USA");

Every variant of the add method takes two arguments. The first is the URI of the property; the second is the value. Property URIs are usually taken from a predefined vocabulary class (like RDF.TYPE in this example), but nothing stops us from devising our own. In this example we made the URIs belong to our own namespace (the MY_NS constant). The second argument can be of one of many types, and the container implementation takes care of any conversions. You don't need to care about XSD datatypes, date formats, etc. Transaction and context handling is also done behind the scenes. If you're new to RDF and those concepts hardly ring a bell, don't worry: you don't need to know them to use Aperture in your applications.

The information in an RDFContainer can be accessed with the get... methods. They come in many flavours, one for every datatype supported by add... An example that extracts the information about John looks as follows:

System.out.println(container.getString(new URIImpl(MY_NS + "name")));
System.out.println(container.getString(new URIImpl(MY_NS + "dateOfBirth")));
System.out.println(container.getString(new URIImpl(MY_NS + "weightInKg")));
System.out.println(container.getString(new URIImpl(MY_NS + "address")));

Notice how we used a single getString method. It worked even though we inserted several different datatypes, because the conversions are done automatically, provided they are possible. If they are not, you will have to deal with an exception: even Aperture wouldn't know how to convert 'true' to a date :-).

Hash map functionality with put methods

The RDFContainer contains a second set of methods whose names start with put... They differ from the add... methods in one important respect: they enforce hash-map semantics, that is, every property can have at most one value. Storing a second value with put replaces the existing one.

On the other hand, there is a problem with multiple values. If a property has more than one value (like RDF.TYPE in the previous example) and we try to get it, the RDFContainer would have to decide which one to return. The designers of Aperture decided that it is up to the user to know that such a situation can occur, and specified that the get methods throw a MultipleValuesException in that case. If you need to work with properties that have multiple values, use the getAll() method. It works as follows:

Collection types = container.getAll(RDF.TYPE);
for (Iterator iterator = types.iterator(); iterator.hasNext(); ) {
        Node node = (Node)iterator.next();
        System.out.println(node);
}

This collection contains Node objects. Node is a superinterface that encompasses all kinds of values that can occur in an RDF store: URIs, literals and blank nodes. Unfortunately, you will have to dig a bit deeper into the RDF documentation and the javadocs for the org.ontoware.rdf2go.model package to learn how to work with them. The container won't help you there.

During the development of some Extractors and Crawlers we found that the put and get methods, which enforce and ensure that a property occurs at most once for a given subject, may not always be desirable. In general, it makes a lot of sense to use them in configuration-like contexts, e.g. the definition of a DataSource. Here it can be vital that a property has a single value, or else the definition becomes ambiguous or unprocessable. However, in the context of Crawlers and Extractors we found that making such a closed world assumption about the RDF graph is less ideal: you never know who else is working on the same RDF graph and whether it makes logical sense to allow a property to have multiple values or not. It is up to you to decide which approach to use in your particular application.
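
The difference between add, put, get and getAll can be illustrated with a toy, in-memory model of the star structure. The sketch below is not Aperture's implementation: the class StarContainerSketch, its use of plain strings for URIs, its nested MultipleValuesException stand-in and the choice to return null for a missing property are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a star-shaped container; not Aperture's implementation. */
class StarContainerSketch {

    /** Stand-in for the exception named in the text; hypothetical class. */
    static class MultipleValuesException extends RuntimeException {
        MultipleValuesException(String property) {
            super("Property has more than one value: " + property);
        }
    }

    /** The central URI that all properties describe. */
    private final String describedUri;
    private final Map<String, List<Object>> values = new LinkedHashMap<>();

    StarContainerSketch(String describedUri) {
        this.describedUri = describedUri;
    }

    String getDescribedUri() {
        return describedUri;
    }

    /** add: appends a value, so a property may end up with several values. */
    void add(String property, Object value) {
        values.computeIfAbsent(property, key -> new ArrayList<>()).add(value);
    }

    /** put: hash-map semantics, any existing values are replaced. */
    void put(String property, Object value) {
        List<Object> single = new ArrayList<>();
        single.add(value);
        values.put(property, single);
    }

    /** get: returns the single value, or null if absent (an assumption). */
    Object get(String property) {
        List<Object> found = getAll(property);
        if (found.size() > 1) {
            throw new MultipleValuesException(property);
        }
        return found.isEmpty() ? null : found.get(0);
    }

    /** getAll: returns every value of the property, possibly none. */
    List<Object> getAll(String property) {
        List<Object> found = values.get(property);
        if (found == null) {
            return Collections.<Object>emptyList();
        }
        return Collections.unmodifiableList(found);
    }

    public static void main(String[] args) {
        StarContainerSketch john = new StarContainerSketch("urn:example:John");
        john.add("type", "human");
        john.add("type", "man");
        john.put("name", "John");
        john.put("name", "John Smith"); // replaces "John"

        System.out.println(john.getAll("type"));
        System.out.println(john.get("name"));
        try {
            john.get("type");
        } catch (MultipleValuesException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch a caller that uses only put can always rely on get, whereas a caller that shares the container with others has to be prepared for the exception and fall back to getAll.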

Direct access to statements

Even though the abstraction of RDFContainer is enough to perform many tasks, sometimes you simply have to resort to the full expressive power of RDF and model the knowledge with a fully-fledged graph. Let's imagine a simple case: John has a car, and this car is a green Ford. Trying to express this in a simple RDFContainer with a 'star' of properties would be unnatural. It would have to look something like this:

A much more natural way to express this fact would be to have a separate node that represents the car. It could have its own identifier and its own properties. In that case, it would be much easier to update our knowledge base when John buys himself another car :-). Let's try to build the following structure:

The first stage is to describe John using the 'normal' methods, as already described in the previous sections. The link that states that John has a car can also be inserted with a 'normal' RDFContainer method. For the description of the car itself, we'll have to do everything ourselves.

Every statement represents a single link in the knowledge graph. It consists of three parts: subject, predicate and object. RDFContainer in itself does nothing more than set the subject to the central URI. It would be possible not to use the add/put methods at all. Everything could be expressed with Statements, albeit in a less readable and more complicated way. Nevertheless, if we want to describe a URI that is not the central one, we have no other option. We can use the ValueFactory interface to get help.

Here's an example of how this can be accomplished.

ValueFactory factory = container.getValueFactory();
URI carURI = factory.createURI(MY_NS + "GreenFord");
container.add(new URIImpl(MY_NS + "hasACar"),carURI);

Statement statement1 = factory.createStatement(
        carURI,
        RDF.TYPE,
        new URIImpl(MY_NS + "Car"));
Statement statement2 = factory.createStatement(
        carURI,
        new URIImpl(MY_NS + "brand"),
        new LiteralImpl("Ford"));
Statement statement3 = factory.createStatement(
        carURI,
        new URIImpl(MY_NS + "color"),
        new LiteralImpl("green"));
container.add(statement1);
container.add(statement2);
container.add(statement3);
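
To make the relationship between the 'star' of one resource and the full graph more concrete, here is a small, library-independent sketch. It is not Aperture or RDF2Go code: the Triple class, the example namespace and the about helper are hypothetical, and plain strings stand in for URIs and literals alike. It shows that John's description and the car's description are simply statements with different subjects in the same graph.

```java
import java.util.ArrayList;
import java.util.List;

/** Library-independent sketch of statements as triples; not Aperture code. */
class TripleSketch {

    /** Hypothetical namespace used only for this illustration. */
    static final String NS = "http://example.org/ns#";
    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

    /** One link in the graph: subject, predicate and object. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        @Override
        public String toString() {
            return subject + " " + predicate + " " + object;
        }
    }

    /** Builds the John-and-car graph described in the text. */
    static List<Triple> buildGraph() {
        String john = NS + "John";
        String car = NS + "GreenFord";
        List<Triple> graph = new ArrayList<>();
        // Statements about the central resource (what add/put would produce).
        graph.add(new Triple(john, NS + "name", "John Smith"));
        graph.add(new Triple(john, NS + "hasACar", car));
        // Statements about a second resource, which need an explicit subject.
        graph.add(new Triple(car, RDF_TYPE, NS + "Car"));
        graph.add(new Triple(car, NS + "brand", "Ford"));
        graph.add(new Triple(car, NS + "color", "green"));
        return graph;
    }

    /** Selects the 'star' of a single resource from the graph. */
    static List<Triple> about(List<Triple> graph, String subject) {
        List<Triple> result = new ArrayList<>();
        for (Triple triple : graph) {
            if (triple.subject.equals(subject)) {
                result.add(triple);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Triple> graph = buildGraph();
        for (Triple triple : about(graph, NS + "GreenFord")) {
            System.out.println(triple);
        }
    }
}
```

If John buys another car, only the statements with the car as subject and the single hasACar link need to change; the rest of John's description stays untouched.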

Implementation

A single RDFContainer implementation is currently available: RDFContainerImpl. It works with an RDF2Go model. All matters regarding contexts and transactions are handled behind this implementation.