org.semanticdesktop.aperture.subcrawler
Class SubCrawlerUtil

java.lang.Object
  extended by org.semanticdesktop.aperture.subcrawler.SubCrawlerUtil

public class SubCrawlerUtil
extends Object

A utility class containing some methods useful when working with subcrawlers and subcrawled resources.


Constructor Summary
SubCrawlerUtil()
           
 
Method Summary
static URI createChildUri(URI objectUri, String childPath, String prefix)
          Creates a URI for a subcrawled entity.
static DataObject getDataObject(URI uri, InputStream stream, DataSource dataSource, Charset charset, String mimeType, RDFContainerFactory containerFactory, SubCrawlerRegistry registry)
           Tries to access a DataObject that is hidden in a stream.
static DataObject getDataObject(URI parentUri, String path, InputStream stream, DataSource dataSource, Charset charset, String mimeType, RDFContainerFactory factory, String prefix, SubCrawler sc)
           
static URI getParentObjectUri(URI subCrawledObjectUri)
           Returns the URI of the parent data object, from the URI of a subcrawled object.
static URI getRootObjectUri(URI subCrawledObjectUri)
           Returns the URI of the root object, from the URI of a subcrawled object.
static String getSubCrawledObjectPath(URI subCrawledObjectUri)
           Returns the path of the subcrawled object within the parent object.
static String getSubCrawlerPrefix(URI subCrawledObjectUri)
           Returns the subcrawler prefix from the URI of a subcrawled object.
static boolean isSubcrawledObjectUri(URI subCrawledObjectUri)
          Returns true if the given URI is the URI of a subcrawled object, false otherwise.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SubCrawlerUtil

public SubCrawlerUtil()
Method Detail

getDataObject

public static DataObject getDataObject(URI uri,
                                       InputStream stream,
                                       DataSource dataSource,
                                       Charset charset,
                                       String mimeType,
                                       RDFContainerFactory containerFactory,
                                       SubCrawlerRegistry registry)
                                throws SubCrawlerException,
                                       PathNotFoundException,
                                       IOException

Tries to access a DataObject that is hidden in a stream. This method can reach the desired object through multiple levels of nesting. For example, given the URI:

"zip:mime:file:/C:/Users/Chris/Desktop/docx%20problem/Useful%20documents1.eml!/#1!/Board+paper.docx"

This method assumes that the given stream points at the root data object, i.e.:

"file:/C:/Users/Chris/Desktop/docx%20problem/Useful%20documents1.eml"

It then applies a MimeSubCrawler to that stream to get the first attachment, and afterwards applies a ZipSubCrawler to that attachment to get the desired file.

Parameters:
uri - the uri of the subcrawled object
stream - the stream pointing at the root data object of the uri
dataSource - the data source that will be returned from the DataObject.getDataSource() method of the returned object
charset - a charset (optional)
mimeType - the mime type of the stream (optional)
containerFactory - the factory of RDFContainers
registry - a SubCrawlerRegistry, from which all the necessary SubCrawlerFactories will be obtained
Returns:
a DataObject for the given URI
Throws:
SubCrawlerException
PathNotFoundException
IOException
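
The order in which nested subcrawlers are applied can be read off the URI itself. The following is a minimal sketch of that decomposition, not the Aperture implementation: it works on plain strings instead of the URI type in the signature, and the class and method names (NestedResolutionSketch, steps) are hypothetical. It assumes that each prefix ends at a colon, that each path starts after an exclamation mark, and that the outermost prefix pairs with the last path.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical helper, not part of Aperture.
final class NestedResolutionSketch {

    // Returns the root URI followed by one "prefix -> path" entry per
    // nesting level, in the order the subcrawlers would have to be applied.
    static List<String> steps(String uri) {
        Deque<String> innermostFirst = new ArrayDeque<>();
        String current = uri;
        while (current.indexOf('!') >= 0) {
            int colon = current.indexOf(':');
            int lastBang = current.lastIndexOf('!');
            if (colon < 0 || colon > lastBang) {
                throw new IllegalArgumentException("Not a subcrawled URI: " + uri);
            }
            // The first prefix belongs to the last path: peel both off.
            String prefix = current.substring(0, colon);
            String path = current.substring(lastBang + 1);
            innermostFirst.push(prefix + " -> " + path);
            current = current.substring(colon + 1, lastBang);
        }
        List<String> result = new ArrayList<>();
        result.add("root " + current);
        result.addAll(innermostFirst); // head of the deque is the innermost level
        return result;
    }

    public static void main(String[] args) {
        String uri = "zip:mime:file:/C:/Users/Chris/Desktop/docx%20problem/"
                + "Useful%20documents1.eml!/#1!/Board+paper.docx";
        for (String step : steps(uri)) {
            System.out.println(step);
        }
    }
}
```

For the example URI above, the sketch lists the root file first, then the mime level with path "/#1", then the zip level with path "/Board+paper.docx", which matches the order described for MimeSubCrawler and ZipSubCrawler.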

getRootObjectUri

public static URI getRootObjectUri(URI subCrawledObjectUri)

Returns the URI of the root object, from the URI of a subcrawled object. E.g. for

"zip:mime:file:/C:/Users/Chris/Desktop/docx%20problem/Useful%20documents1.eml!/86b313dc282850fef1762fb400171750%2540amrapali.com#1!/Board+paper.docx"

This method will return

"file:/C:/Users/Chris/Desktop/docx%20problem/Useful%20documents1.eml"

... that is, the portion of the URI that starts at the last 'scheme' part (regex: '\w{2,}:') before the first exclamation mark and ends just before that exclamation mark. The regex requires at least two characters so that a Windows drive name (a single letter and a colon) is not mistaken for a URI scheme.

Parameters:
subCrawledObjectUri - the URI of a subcrawled object
Returns:
the URI of the root object from which the subcrawled object has been obtained (possibly) by many nested subcrawlers
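
The rule described above can be sketched with java.util.regex. This is an illustration on plain strings, not the library's code; the class and method names (RootUriSketch, rootOf) are hypothetical, and returning the input unchanged when there is no exclamation mark is an assumption of the sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aperture.
final class RootUriSketch {

    // At least two word characters followed by a colon, so "C:" is skipped.
    private static final Pattern SCHEME = Pattern.compile("\\w{2,}:");

    static String rootOf(String uri) {
        int firstBang = uri.indexOf('!');
        // Assumption: a URI without '!' is treated as its own root.
        String head = firstBang < 0 ? uri : uri.substring(0, firstBang);
        Matcher m = SCHEME.matcher(head);
        int start = 0;
        while (m.find()) {
            start = m.start(); // remember where the last scheme begins
        }
        return head.substring(start);
    }

    public static void main(String[] args) {
        String uri = "zip:mime:file:/C:/Users/Chris/Desktop/docx%20problem/"
                + "Useful%20documents1.eml!/86b313dc282850fef1762fb400171750"
                + "%2540amrapali.com#1!/Board+paper.docx";
        System.out.println(rootOf(uri));
    }
}
```

The sketch does not handle root URIs that contain further colon-terminated words, such as a host name followed by a port.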

getParentObjectUri

public static URI getParentObjectUri(URI subCrawledObjectUri)

Returns the URI of the parent data object, from the URI of a subcrawled object. E.g. for

"zip:mime:file:/C:/Users/Chris/Desktop/docx%20problem/Useful%20documents1.eml!/86b313dc282850fef1762fb400171750%2540amrapali.com#1!/Board+paper.docx"

This method will return

"mime:file:/C:/Users/Chris/Desktop/docx%20problem/Useful%20documents1.eml!/86b313dc282850fef1762fb400171750%2540amrapali.com#1"

If the URI already denotes a root data object (i.e. not a subcrawled data object), this method will return null. For example, given the URI of a normal file:

"file:/C:/Users/Chris/Desktop/docx%20problem/Useful%20documents1.eml"

This method will return null.

Parameters:
subCrawledObjectUri - the URI of a subcrawled object
Returns:
URI of the parent data object from which the sub crawled object has been obtained
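
Judging from the example, the parent URI is what remains after dropping the first prefix and the last path. A minimal string-based sketch of that reading follows; the names ParentUriSketch and parentOf are hypothetical and the code is not taken from Aperture.

```java
// Hypothetical helper, not part of Aperture.
final class ParentUriSketch {

    // Returns the parent URI, or null when the input has no '!' separator
    // (that is, when it already looks like a root object URI).
    static String parentOf(String uri) {
        int lastBang = uri.lastIndexOf('!');
        int firstColon = uri.indexOf(':');
        if (lastBang < 0 || firstColon < 0 || firstColon > lastBang) {
            return null;
        }
        // Drop "prefix:" at the front and "!path" at the end.
        return uri.substring(firstColon + 1, lastBang);
    }

    public static void main(String[] args) {
        String uri = "zip:mime:file:/C:/Users/Chris/Desktop/docx%20problem/"
                + "Useful%20documents1.eml!/86b313dc282850fef1762fb400171750"
                + "%2540amrapali.com#1!/Board+paper.docx";
        System.out.println(parentOf(uri));
        System.out.println(parentOf("file:/C:/Users/Chris/Desktop/docx%20problem/"
                + "Useful%20documents1.eml"));
    }
}
```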

getSubCrawlerPrefix

public static String getSubCrawlerPrefix(URI subCrawledObjectUri)

Returns the subcrawler prefix from the URI of a subcrawled object. This is the prefix of the immediate 'topmost' level of nesting, i.e. the first prefix in the URI. E.g. for

"zip:mime:file:/C:/Users/Chris/Desktop/docx%20problem/Useful%20documents1.eml!/86b313dc282850fef1762fb400171750%2540amrapali.com#1!/Board+paper.docx"

This method will return "zip"

If the URI already denotes a root data object (i.e. not a subcrawled data object), this method will return null. For example, given the URI of a normal file:

"file:/C:/Users/Chris/Desktop/docx%20problem/Useful%20documents1.eml"

This method will return null.

Parameters:
subCrawledObjectUri - the URI of a subcrawled object
Returns:
the subcrawler prefix from the URI of a subcrawled Object
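
As a rough illustration, the prefix in the example is everything before the first colon. The sketch below assumes that reading and uses hypothetical names (PrefixSketch, prefixOf); it is not the Aperture implementation.

```java
// Hypothetical helper, not part of Aperture.
final class PrefixSketch {

    // Returns the text before the first colon, or null when the URI has
    // no '!' separator and therefore looks like a root object URI.
    static String prefixOf(String uri) {
        if (uri.indexOf('!') < 0) {
            return null;
        }
        int firstColon = uri.indexOf(':');
        return firstColon < 0 ? null : uri.substring(0, firstColon);
    }

    public static void main(String[] args) {
        System.out.println(prefixOf("zip:file:/C:/docs/archive.zip!/Board+paper.docx"));
    }
}
```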

getSubCrawledObjectPath

public static String getSubCrawledObjectPath(URI subCrawledObjectUri)

Returns the path of the subcrawled object within the parent object. The parent here is the immediate 'topmost' data object, so the path is the part of the URI after the last exclamation mark. E.g. for

"zip:mime:file:/C:/Users/Chris/Desktop/docx%20problem/Useful%20documents1.eml!/86b313dc282850fef1762fb400171750%2540amrapali.com#1!/Board+paper.docx"

This method will return "/Board+paper.docx"

If the URI already denotes a root data object (i.e. not a subcrawled data object), this method will return null. For example, given the URI of a normal file:

"file:/C:/Users/Chris/Desktop/docx%20problem/Useful%20documents1.eml"

This method will return null.

Parameters:
subCrawledObjectUri - the URI of a subcrawled object
Returns:
the path of the subcrawled object within the parent object
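
In the example, the returned path is the text after the last exclamation mark. A small sketch under that assumption, with hypothetical names (PathSketch, pathOf) and no claim to match the library's code:

```java
// Hypothetical helper, not part of Aperture.
final class PathSketch {

    // Returns the text after the last '!', or null when there is none.
    static String pathOf(String uri) {
        int lastBang = uri.lastIndexOf('!');
        return lastBang < 0 ? null : uri.substring(lastBang + 1);
    }

    public static void main(String[] args) {
        System.out.println(pathOf("zip:file:/C:/docs/archive.zip!/Board+paper.docx"));
    }
}
```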

isSubcrawledObjectUri

public static boolean isSubcrawledObjectUri(URI subCrawledObjectUri)
Returns true if the given URI is the URI of a subcrawled object, false otherwise. A proper URI for a subcrawled object consists of a proper URI of the root object, with subcrawler prefixes at the beginning and subcrawled object paths at the end. The number of prefixes should be equal to the number of subcrawled object paths.

Parameters:
subCrawledObjectUri - the URI to check
Returns:
true if the given URI is a valid URI of a subcrawled resource, false otherwise
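
The counting rule can be sketched as follows. This is a simplified check on plain strings, not the library's validation: it assumes that every exclamation mark introduces one path and that the leading run of colon-terminated words consists of the subcrawler prefixes followed by the root scheme. The names ValiditySketch and looksSubcrawled are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Aperture.
final class ValiditySketch {

    // \G anchors each match at the end of the previous one, so only the
    // unbroken run of "word:" tokens at the start of the URI is counted.
    private static final Pattern LEADING_SCHEME = Pattern.compile("\\G\\w{2,}:");

    static boolean looksSubcrawled(String uri) {
        int paths = 0;
        for (int i = 0; i < uri.length(); i++) {
            if (uri.charAt(i) == '!') {
                paths++;
            }
        }
        int schemes = 0;
        Matcher m = LEADING_SCHEME.matcher(uri);
        while (m.find()) {
            schemes++;
        }
        int prefixes = schemes - 1; // the last leading scheme belongs to the root
        return paths > 0 && prefixes == paths;
    }

    public static void main(String[] args) {
        System.out.println(looksSubcrawled("zip:file:/C:/docs/archive.zip!/Board+paper.docx"));
        System.out.println(looksSubcrawled("file:/C:/docs/archive.zip"));
    }
}
```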

createChildUri

public static URI createChildUri(URI objectUri,
                                 String childPath,
                                 String prefix)
Creates a URI for a subcrawled entity. Uses a scheme invented within the Apache Commons VFS project.

Parameters:
objectUri - the uri of the parent data object
childPath - the path of the child object within the parent object
prefix - the subcrawler prefix to put in front of the parent URI (e.g. "zip")
Returns:
a uri for a subcrawled entity.
See Also:
VFS Filesystems Documentation
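
The example URIs elsewhere on this page suggest the layout prefix + ":" + parent URI + "!" + path. The sketch below builds a child URI that way on plain strings. It is an assumption drawn from those examples, not the Aperture implementation; the names ChildUriSketch and childUri are hypothetical, and any escaping of the path is left out.

```java
// Hypothetical helper, not part of Aperture.
final class ChildUriSketch {

    // Assumed layout: <prefix>:<parentUri>!<path>, with the path starting with '/'.
    static String childUri(String parentUri, String childPath, String prefix) {
        String path = childPath.startsWith("/") ? childPath : "/" + childPath;
        return prefix + ":" + parentUri + "!" + path;
    }

    public static void main(String[] args) {
        System.out.println(childUri("file:/C:/docs/archive.zip", "Board+paper.docx", "zip"));
    }
}
```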

getDataObject

public static DataObject getDataObject(URI parentUri,
                                       String path,
                                       InputStream stream,
                                       DataSource dataSource,
                                       Charset charset,
                                       String mimeType,
                                       RDFContainerFactory factory,
                                       String prefix,
                                       SubCrawler sc)
                                throws SubCrawlerException,
                                       PathNotFoundException
Throws:
SubCrawlerException
PathNotFoundException


Copyright © 2010 Aperture Development Team. All Rights Reserved.