org.semanticdesktop.aperture.crawler.mail
Class AbstractJavaMailCrawler

java.lang.Object
  extended by org.semanticdesktop.aperture.crawler.base.CrawlerBase
      extended by org.semanticdesktop.aperture.crawler.mail.AbstractJavaMailCrawler
All Implemented Interfaces:
DataAccessor, Crawler, DataObjectFactory.PartStreamFactory
Direct Known Subclasses:
ImapCrawler, MboxCrawler

public abstract class AbstractJavaMailCrawler
extends CrawlerBase
implements DataObjectFactory.PartStreamFactory, DataAccessor

An abstract crawler implementation that works with an email store implementation hidden behind the JavaMail API.

The details about the connection management, authentication and security are the responsibility of the concrete subclasses.


Field Summary
static String ACCESSED_KEY
          The key used in the access data to mark if a given data object has been accessed or not
protected  ArrayList baseFolders
          List of base folders - roots of the crawling
protected  javax.mail.Folder currentFolder
          The folder currently crawled by the crawler.
protected  URI currentFolderURI
          The URI of the current folder.
protected  int maxDepth
          Maximum depth below the base folders the crawler will crawl
protected  long maximumByteSize
          Maximum size of a message accepted by the crawler; larger messages are ignored
protected  javax.mail.Store store
          The underlying Store instance.
protected static String SUBFOLDERS_KEY
           
 
Fields inherited from class org.semanticdesktop.aperture.crawler.base.CrawlerBase
accessData, accessorRegistry, crawlReportFile, source, stopRequested
 
Constructor Summary
AbstractJavaMailCrawler()
           
 
Method Summary
protected  void applySpecificProcessing(DataObject object)
          This method can be overridden by subclasses wishing to perform some specific processing on the data objects reported as new or modified, before they are passed to the CrawlerHandler
protected abstract  boolean checkIfCurrentFolderHasBeenChanged(AccessData newAccessData)
          Applies source-specific methods to determine if the current folder has been changed since it has last been crawled.
protected  boolean checkSubfoldersChanged(AccessData ad)
          Checks if the list of subfolders of the current folder has been changed in comparison with the list stored in the accessData instance.
protected  void closeConnection()
          Closes the Store.
protected  void crawlFolder(javax.mail.Folder folder, int depth)
          Crawls a subfolder tree starting at the given folder up until the given depth.
protected  void crawlMessages(javax.mail.Folder folder, URI folderUri)
           
protected  void crawlSingleFolder(javax.mail.Folder folder)
           
protected  void crawlSingleMessage(javax.mail.internet.MimeMessage message, String uri, URI folderUri)
           
protected  void crawlSubFolders(javax.mail.Folder folder, int depth)
           
 MessageDataObject createDataObject(URI dataObjectId, DataSource dataSource, RDFContainer metadata, javax.mail.internet.MimeMessage msg)
          Creates a message data object for the given parameters
protected abstract  void ensureConnectedStore()
          Ensures that the crawler is connected to the underlying mail storage system and can perform the crawl.
 Map<URI,DataObject> getAllRelatedDataObjects(String url, DataSource dataSource, Map params, RDFContainerFactory containerFactory)
           
protected  int getCurrentFolderMessageCount()
          Returns the number of messages in the current folder.
protected  FolderDataObject getCurrentFolderObject(DataSource dataSource, AccessData newAccessData, RDFContainerFactory containerFactory)
          Returns a DataObject for the current JavaMail folder.
 DataObject getDataObject(String url, DataSource dataSource, Map params, RDFContainerFactory containerFactory)
          Get a DataObject for the specified url.
protected  Object getDataObjectByMessageURI(String url, DataSource dataSource, RDFContainerFactory containerFactory, javax.mail.Folder folder, boolean all)
           
 DataObject getDataObjectIfModified(String url, DataSource dataSource, AccessData newAccessData, Map params, RDFContainerFactory containerFactory)
          Get a DataObject for the specified url.
protected  Object getDataObjectOrAllObjects(String url, DataSource dataSource, AccessData newAccessData, Map params, RDFContainerFactory containerFactory, boolean all)
           
protected abstract  String getFolderName(String url)
          Extracts the name of the folder from the data object URI.
protected abstract  URI getFolderURI(javax.mail.Folder folder)
          Returns the URI of the folder, using the URI scheme appropriate for the current crawler.
protected  javax.mail.internet.MimeMessage getMessageByURI(String url, javax.mail.Folder folder)
          Returns a MimeMessage instance based on the URI of a data object.
protected  int getMessageCount(javax.mail.Message[] messages)
          Returns the number of non-removed messages in the given array.
protected  javax.mail.Message getMessageFromCurrentFolder(int index)
          Returns the message from the current folder available at the given index.
protected  long getMessageUid(javax.mail.Folder folder, javax.mail.Message message)
          Returns the UID of the message.
protected abstract  String getMessageUri(javax.mail.Folder folder, javax.mail.Message message)
          Returns the URI of the message, using the URI scheme appropriate for the current crawler.
 InputStream getPartStream(javax.mail.Part part)
          Returns an input stream with the part content.
protected  String getSubFoldersString(javax.mail.Folder folder)
          Returns a string with the names of the subfolders of the given folder.
static boolean holdsFolders(javax.mail.Folder folder)
          Does this folder hold any subfolders?
static boolean holdsMessages(javax.mail.Folder folder)
          Does this folder hold any messages?
protected  boolean isAcceptable(javax.mail.Message message)
          Returns true if this message can be crawled, according to the criteria defined for this data source.
protected  boolean isRemoved(javax.mail.Message message)
          Returns true if the given message has been marked as expunged or deleted
protected  boolean isTooLarge(javax.mail.Message message)
          Returns true if this message is larger than the maximum size defined for the current data source.
protected abstract  void recordCurrentFolderInAccessData(AccessData newAccessData)
          Records source-specific information about the current folder that will enable the crawler to detect if the folder has been changed on a future crawl.
protected  void reportNotModified(String uri)
          Reports the given uri as unmodified.
protected abstract  void retrieveConfigurationData(DataSource source)
          Performs any necessary initialization of internal data structures before any crawling can commence.
protected  void setCurrentFolder(javax.mail.Folder folder)
          Sets the current folder.
 
Methods inherited from class org.semanticdesktop.aperture.crawler.base.CrawlerBase
clear, clear, crawl, crawlObjects, getAccessData, getCrawlerHandler, getCrawlReport, getCrawlReportFile, getDataAccessorRegistry, getDataSource, getRDFContainerFactory, inDomain, isStopRequested, reportAccessingObject, reportDeletedDataObject, reportFatalErrorCause, reportFatalErrorCause, reportFatalErrorCause, reportModifiedDataObject, reportNewDataObject, reportUnmodifiedDataObject, reportUntouched, runSubCrawler, setAccessData, setCrawlerHandler, setCrawlReportFile, setDataAccessorRegistry, setDataSource, stop, storeCrawlReport, touchObject
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

maxDepth

protected int maxDepth
Maximum depth below the base folders the crawler will crawl


maximumByteSize

protected long maximumByteSize
Maximum size of a message accepted by the crawler; larger messages are ignored


baseFolders

protected ArrayList baseFolders
List of base folders - roots of the crawling


ACCESSED_KEY

public static final String ACCESSED_KEY
The key used in the access data to mark if a given data object has been accessed or not

See Also:
Constant Field Values

SUBFOLDERS_KEY

protected static final String SUBFOLDERS_KEY
See Also:
Constant Field Values

currentFolder

protected javax.mail.Folder currentFolder
The folder currently crawled by the crawler.

See Also:
setCurrentFolder(Folder)

currentFolderURI

protected URI currentFolderURI
The URI of the current folder. It is set by the setCurrentFolder(Folder) using the getFolderURI(Folder) implementation.

See Also:
setCurrentFolder(Folder)

store

protected javax.mail.Store store
The underlying Store instance.

Constructor Detail

AbstractJavaMailCrawler

public AbstractJavaMailCrawler()
Method Detail

getFolderURI

protected abstract URI getFolderURI(javax.mail.Folder folder)
                             throws javax.mail.MessagingException
Returns the URI of the folder, using the URI scheme appropriate for the current crawler.

Parameters:
folder - the Folder whose URI we'd like to obtain.
Returns:
the uri of the folder
Throws:
javax.mail.MessagingException

getMessageUri

protected abstract String getMessageUri(javax.mail.Folder folder,
                                        javax.mail.Message message)
                                 throws javax.mail.MessagingException
Returns the URI of the message, using the URI scheme appropriate for the current crawler.

Parameters:
folder - the folder where the message resides
message - the message itself
Returns:
the uri of the message
Throws:
javax.mail.MessagingException

getFolderName

protected abstract String getFolderName(String url)
                                 throws UrlNotFoundException
Extracts the name of the folder from the data object URI. The result should be a string that can be passed to the Store.getFolder(String) method to obtain the corresponding Folder instance, which directly contains the data object (message or attachment) with the given url. This method can be called ONLY after all configuration has been read from the DataSource, that is, AFTER retrieveConfigurationData(DataSource).

Parameters:
url -
Returns:
the folder name
Throws:
UrlNotFoundException - if the given url does not belong to the current Store
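
The contract of getFolderName can be illustrated with a minimal sketch. The URL layout used here (a store prefix, followed by the folder path, followed by a message identifier) is purely hypothetical, as are the class and method names; each concrete crawler defines its own URI scheme. The real method throws UrlNotFoundException, which this self-contained sketch replaces with IllegalArgumentException.

```java
final class FolderNameSketch {
    // Hypothetical URL layout: <storePrefix><folder path>/<message id>
    static String folderName(String storePrefix, String url) {
        if (!url.startsWith(storePrefix)) {
            // The real method would signal this with UrlNotFoundException.
            throw new IllegalArgumentException("URL does not belong to this store: " + url);
        }
        String rest = url.substring(storePrefix.length());
        int lastSlash = rest.lastIndexOf('/');
        if (lastSlash < 0) {
            throw new IllegalArgumentException("No folder part in: " + url);
        }
        // Everything before the last slash names the containing folder.
        return rest.substring(0, lastSlash);
    }
}
```

The returned string is what a subclass would hand to Store.getFolder(String).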

checkIfCurrentFolderHasBeenChanged

protected abstract boolean checkIfCurrentFolderHasBeenChanged(AccessData newAccessData)
                                                       throws javax.mail.MessagingException
Applies source-specific methods to determine if the current folder has been changed since it has last been crawled.

Parameters:
newAccessData - the AccessData instance that is to be consulted
Returns:
false if the information stored in the accessData instance indicates that the folder hasn't been changed, true otherwise
Throws:
javax.mail.MessagingException

recordCurrentFolderInAccessData

protected abstract void recordCurrentFolderInAccessData(AccessData newAccessData)
                                                 throws javax.mail.MessagingException
Records source-specific information about the current folder that will enable the crawler to detect if the folder has been changed on a future crawl.

Parameters:
newAccessData - the access data where the information should be stored
Throws:
javax.mail.MessagingException
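
recordCurrentFolderInAccessData and checkIfCurrentFolderHasBeenChanged form a pair: one stores a description of the folder state, the other compares the stored description with the present state. The sketch below shows that pairing under stated assumptions: a nested Map stands in for AccessData, the message count is used as the only change indicator, and the key name is hypothetical. A real subclass would record whatever its mail store offers for change detection.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for AccessData: per-URI key/value storage (hypothetical).
final class FolderStateSketch {
    // Hypothetical key; not one of the constants defined by AbstractJavaMailCrawler.
    static final String MESSAGE_COUNT_KEY = "messageCount";

    private final Map<String, Map<String, String>> accessData = new HashMap<>();

    // Plays the role of recordCurrentFolderInAccessData: remember the folder state.
    void record(String folderUri, int messageCount) {
        accessData.computeIfAbsent(folderUri, k -> new HashMap<>())
                  .put(MESSAGE_COUNT_KEY, Integer.toString(messageCount));
    }

    // Plays the role of checkIfCurrentFolderHasBeenChanged: false means "unchanged".
    boolean hasChanged(String folderUri, int currentMessageCount) {
        Map<String, String> entry = accessData.get(folderUri);
        if (entry == null) {
            return true; // never crawled before, so treat it as changed
        }
        String stored = entry.get(MESSAGE_COUNT_KEY);
        return !Integer.toString(currentMessageCount).equals(stored);
    }
}
```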

retrieveConfigurationData

protected abstract void retrieveConfigurationData(DataSource source)
Performs any necessary initialization of internal data structures before any crawling can commence.

Parameters:
source -

ensureConnectedStore

protected abstract void ensureConnectedStore()
                                      throws javax.mail.MessagingException
Ensures that the crawler is connected to the underlying mail storage system and can perform the crawl. This method may be called at any time. It should do nothing if a connection is already present and should reestablish the connection if it is not.

Throws:
javax.mail.MessagingException
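
The "call at any time" contract amounts to an idempotent connect. A minimal sketch follows, assuming a hypothetical ConnectableStore interface as a stand-in for the connection-related part of a mail store; the checked MessagingException of the real method is omitted to keep the example self-contained.

```java
// Hypothetical stand-in for the connection-related part of a mail store.
interface ConnectableStore {
    boolean isConnected();
    void connect();
}

final class EnsureConnectedSketch {
    private final ConnectableStore store;

    EnsureConnectedSketch(ConnectableStore store) {
        this.store = store;
    }

    // Safe to call repeatedly: does nothing when already connected.
    void ensureConnectedStore() {
        if (!store.isConnected()) {
            store.connect();
        }
    }
}

// Simple implementation that counts connection attempts, for demonstration.
final class CountingStore implements ConnectableStore {
    boolean connected = false;
    int connectCalls = 0;

    public boolean isConnected() {
        return connected;
    }

    public void connect() {
        connected = true;
        connectCalls++;
    }
}
```

Calling ensureConnectedStore() twice in a row connects only once; after the connection drops, the next call reconnects.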

setCurrentFolder

protected void setCurrentFolder(javax.mail.Folder folder)
                         throws javax.mail.MessagingException
Sets the current folder. Implementations are free to perform any optimizations at this point (like prefetching). This method is called AFTER the folder is opened (Folder.open(int)) but before any messages are actually crawled.

Parameters:
folder - the folder that is to become the current folder
Throws:
javax.mail.MessagingException

getMessageFromCurrentFolder

protected javax.mail.Message getMessageFromCurrentFolder(int index)
                                                  throws javax.mail.MessagingException
Returns the message from the current folder available at the given index. Note that the exact semantics of the index may be overridden by the subclasses of this class, but it will always follow the JavaMail convention that message indexes within a folder are one-based.

Parameters:
index - a one-based index. The lowest valid value is one; the highest valid value is the one returned by getCurrentFolderMessageCount()
Returns:
the message placed under the given index
Throws:
javax.mail.MessagingException
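
The one-based convention is easy to get wrong in a loop, so a short sketch may help. It uses a list of strings as a stand-in for the messages of the current folder; the class name and the collectAll helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class OneBasedFolderSketch {
    private final List<String> messages; // stand-in for the messages of the folder

    OneBasedFolderSketch(List<String> messages) {
        this.messages = messages;
    }

    int getCurrentFolderMessageCount() {
        return messages.size();
    }

    // One-based, like the documented method: valid indexes are 1..count.
    String getMessageFromCurrentFolder(int index) {
        if (index < 1 || index > messages.size()) {
            throw new IndexOutOfBoundsException(
                    "index " + index + " is outside 1.." + messages.size());
        }
        return messages.get(index - 1);
    }

    // Visits every message using the one-based convention.
    List<String> collectAll() {
        List<String> result = new ArrayList<>();
        for (int i = 1; i <= getCurrentFolderMessageCount(); i++) {
            result.add(getMessageFromCurrentFolder(i));
        }
        return result;
    }
}
```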

getCurrentFolderMessageCount

protected int getCurrentFolderMessageCount()
                                    throws javax.mail.MessagingException
Returns the number of messages in the current folder.

Returns:
the number of messages in the current folder.
Throws:
javax.mail.MessagingException

getMessageByURI

protected javax.mail.internet.MimeMessage getMessageByURI(String url,
                                                          javax.mail.Folder folder)
                                                   throws javax.mail.MessagingException
Returns a MimeMessage instance based on the URI of a data object. The URI may point at a message itself or at one of its attachments. This method is invoked by the default implementation of getDataObjectByMessageURI(String, DataSource, RDFContainerFactory, Folder, boolean). If it returns a non-null value, the resulting MimeMessage instance is used; if it returns null, that method falls back to the default strategy of iterating over all messages in the folder to find the desired one. If the underlying mail store supports direct lookup, overriding this method may drastically improve the performance of the DataAccessor methods.

Parameters:
url -
folder -
Returns:
Throws:
javax.mail.MessagingException
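
The interplay between a direct lookup and the linear fallback can be sketched as follows. A LinkedHashMap stands in for the folder content, and the fastLookup function plays the role of getMessageByURI; all names are hypothetical, and the sketch makes no claim about how the actual fallback is coded.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

final class MessageLookupSketch {
    // Stand-in folder content: message URI -> message body, in folder order.
    private final Map<String, String> folder = new LinkedHashMap<>();
    int scannedMessages = 0; // counts how many messages the fallback visited

    void add(String uri, String body) {
        folder.put(uri, body);
    }

    // fastLookup plays the role of getMessageByURI: it may return null.
    String find(String url, Function<String, String> fastLookup) {
        String direct = fastLookup.apply(url);
        if (direct != null) {
            return direct; // the store resolved the URI directly
        }
        // Default strategy: iterate over all messages in the folder.
        for (Map.Entry<String, String> e : folder.entrySet()) {
            scannedMessages++;
            if (e.getKey().equals(url)) {
                return e.getValue();
            }
        }
        return null;
    }
}
```

When the lookup function succeeds, no message is scanned at all, which is where the performance gain comes from.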

closeConnection

protected void closeConnection()
Closes the Store. Implementations should ensure that this method does not invalidate any undisposed data objects that might have been obtained from the store and can still be used (for instance, the InputStreams obtained from FileDataObject.getContent() or the MimeMessages obtained from MessageDataObject.getMimeMessage() should still be readable).


getPartStream

public InputStream getPartStream(javax.mail.Part part)
                          throws javax.mail.MessagingException,
                                 IOException
Description copied from interface: DataObjectFactory.PartStreamFactory
Returns an input stream with the part content. It's conceptually a wrapper around the Part.getInputStream() method, designed to allow for customization of the returned input stream.

Specified by:
getPartStream in interface DataObjectFactory.PartStreamFactory
Returns:
an InputStream with the content of the part
Throws:
javax.mail.MessagingException
IOException
See Also:
DataObjectFactory.PartStreamFactory.getPartStream(Part)

createDataObject

public MessageDataObject createDataObject(URI dataObjectId,
                                          DataSource dataSource,
                                          RDFContainer metadata,
                                          javax.mail.internet.MimeMessage msg)
                                   throws javax.mail.MessagingException
Description copied from interface: DataObjectFactory.PartStreamFactory
Creates a message data object for the given parameters

Specified by:
createDataObject in interface DataObjectFactory.PartStreamFactory
Returns:
a message data object for the given parameters
Throws:
javax.mail.MessagingException
See Also:
PartStreamFactory#createDataObject(URI, DataSource, RDFContainer, MimeMessage, ExecutorService)

getDataObject

public DataObject getDataObject(String url,
                                DataSource dataSource,
                                Map params,
                                RDFContainerFactory containerFactory)
                         throws UrlNotFoundException,
                                IOException
Description copied from interface: DataAccessor
Get a DataObject for the specified url.

The resulting DataObject's ID may differ from the specified url due to normalization schemes, following of redirected URLs, etc. It is required though to provide a URI through which this DataAccessor can later on also access the same resource, i.e. the URI should also be a URL.

Specific DataAccessor implementations may accept additional parameters through the params Map, e.g. to speed up this method with ready-made datastructures it can reuse. See the documentation of these implementations for information on the type of parameters they accept. However, implementations should not rely on the contents of this Map to work properly.

Specified by:
getDataObject in interface DataAccessor
Parameters:
url - The url of the requested resource.
dataSource - The DataSource to be registered as the source of the DataObject (optional).
params - Additional parameters facilitating access to the physical resource (optional).
containerFactory - An RDFContainerFactory that delivers the RDFContainer to which the metadata of the DataObject should be added. The provided RDFContainer can later be retrieved as the DataObject's metadata container.
Returns:
A DataObject for the specified URI.
Throws:
UrlNotFoundException - When the specified url did not point to an existing resource.
IOException - When any kind of I/O error occurs.

getDataObjectIfModified

public DataObject getDataObjectIfModified(String url,
                                          DataSource dataSource,
                                          AccessData newAccessData,
                                          Map params,
                                          RDFContainerFactory containerFactory)
                                   throws UrlNotFoundException,
                                          IOException
Description copied from interface: DataAccessor
Get a DataObject for the specified url.

The resulting DataObject's ID may differ from the specified url due to normalization schemes, following of redirected URLs, etc. It is required though to provide a URI through which this DataAccessor can later on also access the same resource, i.e. the URI should also be a URL.

The optionally passed AccessData can be used to let the DataAccessor store information about the created DataObject. The next time it is invoked with the same URL, it can then use this information to determine whether the resource has changed or not. The DataAccessor should return null when the resource has not changed. This facilitates fast incremental crawling of DataSources. When no AccessData is specified, no change detection takes place and a DataObject is always returned.

Specific DataAccessor implementations may accept additional parameters through the params Map, e.g. to speed up this method with ready-made datastructures it can reuse. See the documentation of these implementations for information on the type of parameters they accept. However, implementations should not rely on the contents of this Map to work properly.

Specified by:
getDataObjectIfModified in interface DataAccessor
Parameters:
url - The url of the requested resource.
dataSource - The DataSource to be registered as the source of the DataObject (optional).
newAccessData - Any access data obtained during the previous access to this DataObject (optional).
params - Additional parameters facilitating access to the physical resource (optional).
containerFactory - An RDFContainerFactory that delivers the RDFContainer to which the metadata of the DataObject should be added. The provided RDFContainer can later be retrieved as the DataObject's metadata container.
Returns:
A DataObject for the specified URI, or null when the binary resource has not been modified since the last access.
Throws:
UrlNotFoundException - When the specified url did not point to an existing resource.
IOException - When any kind of I/O error occurs.
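
The null-when-unchanged contract can be sketched in a few lines. The sketch assumes a Map as a stand-in for AccessData, a numeric modification stamp as the change indicator, and a plain string in place of a DataObject; none of these reflect the actual Aperture types.

```java
import java.util.HashMap;
import java.util.Map;

final class IfModifiedSketch {
    // Stand-in for AccessData: URL -> last seen modification stamp (hypothetical).
    private final Map<String, Long> accessData = new HashMap<>();

    // Returns the "data object" (here just a string), or null when unchanged.
    String getIfModified(String url, long currentStamp, boolean useAccessData) {
        if (useAccessData) {
            Long previous = accessData.get(url);
            if (previous != null && previous == currentStamp) {
                return null; // unchanged since the last access
            }
            accessData.put(url, currentStamp);
        }
        // Without access data no change detection takes place.
        return "object for " + url;
    }
}
```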

getAllRelatedDataObjects

public Map<URI,DataObject> getAllRelatedDataObjects(String url,
                                                    DataSource dataSource,
                                                    Map params,
                                                    RDFContainerFactory containerFactory)
                                             throws UrlNotFoundException,
                                                    IOException
Throws:
UrlNotFoundException
IOException

getDataObjectOrAllObjects

protected Object getDataObjectOrAllObjects(String url,
                                           DataSource dataSource,
                                           AccessData newAccessData,
                                           Map params,
                                           RDFContainerFactory containerFactory,
                                           boolean all)
                                    throws UrlNotFoundException,
                                           IOException
Throws:
UrlNotFoundException
IOException

getDataObjectByMessageURI

protected Object getDataObjectByMessageURI(String url,
                                           DataSource dataSource,
                                           RDFContainerFactory containerFactory,
                                           javax.mail.Folder folder,
                                           boolean all)
                                    throws javax.mail.MessagingException,
                                           UrlNotFoundException,
                                           IOException
Throws:
javax.mail.MessagingException
UrlNotFoundException
IOException

crawlFolder

protected final void crawlFolder(javax.mail.Folder folder,
                                 int depth)
                          throws javax.mail.MessagingException
Crawls a subfolder tree starting at the given folder up until the given depth. This method is to be called by the subclasses after setting up all connection parameters.

Parameters:
folder - the folder where the crawl should be started
depth - how deep the crawl should proceed
  • -1 - unlimited depth
  • 0 or 1 - only the given folder will be crawled
  • 2 - only the given folder and its direct subfolders
Throws:
javax.mail.MessagingException
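
The depth values listed above can be illustrated with a small recursive walk over a stand-in folder tree. The Node class and the visited helper are hypothetical; the recursion is one way to obtain the documented behaviour, not necessarily how crawlFolder is implemented.

```java
import java.util.ArrayList;
import java.util.List;

final class DepthSketch {
    // Minimal stand-in for a mail folder with subfolders.
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // Collects the names of the folders a crawl with the given depth would visit.
    static List<String> visited(Node folder, int depth) {
        List<String> out = new ArrayList<>();
        walk(folder, depth, out);
        return out;
    }

    private static void walk(Node folder, int depth, List<String> out) {
        out.add(folder.name); // the given folder is always crawled
        if (depth == -1) {
            for (Node child : folder.children) {
                walk(child, -1, out); // unlimited depth
            }
        } else if (depth > 1) {
            for (Node child : folder.children) {
                walk(child, depth - 1, out);
            }
        }
        // depth 0 or 1: stop here, subfolders are not visited
    }
}
```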

crawlSingleFolder

protected void crawlSingleFolder(javax.mail.Folder folder)
                          throws javax.mail.MessagingException
Throws:
javax.mail.MessagingException

crawlSubFolders

protected void crawlSubFolders(javax.mail.Folder folder,
                               int depth)

crawlMessages

protected void crawlMessages(javax.mail.Folder folder,
                             URI folderUri)
                      throws javax.mail.MessagingException
Throws:
javax.mail.MessagingException

crawlSingleMessage

protected void crawlSingleMessage(javax.mail.internet.MimeMessage message,
                                  String uri,
                                  URI folderUri)
                           throws javax.mail.MessagingException,
                                  IOException
Throws:
javax.mail.MessagingException
IOException

applySpecificProcessing

protected void applySpecificProcessing(DataObject object)
This method can be overridden by subclasses wishing to perform some specific processing on the data objects reported as new or modified, before they are passed to the CrawlerHandler

Parameters:
object -

getCurrentFolderObject

protected FolderDataObject getCurrentFolderObject(DataSource dataSource,
                                                  AccessData newAccessData,
                                                  RDFContainerFactory containerFactory)
                                           throws javax.mail.MessagingException
Returns a DataObject for the current JavaMail folder.

Parameters:
dataSource -
newAccessData -
containerFactory -
Returns:
a FolderDataObject instance for the currentFolder
Throws:
javax.mail.MessagingException

getMessageUid

protected long getMessageUid(javax.mail.Folder folder,
                             javax.mail.Message message)
                      throws javax.mail.MessagingException
Returns the UID of the message.

Parameters:
folder - the folder where the message is located
message - the message whose UID we want to fetch
Returns:
the UID of the message. This method may return -1 if the message doesn't have a UID or if the UID could not be obtained.
Throws:
javax.mail.MessagingException

getMessageCount

protected int getMessageCount(javax.mail.Message[] messages)
                       throws javax.mail.MessagingException
Returns the number of non-removed messages in the given array. Each message in the array is checked with the isRemoved(Message) method.

Parameters:
messages - the array of messages we'd like to check
Returns:
the number of messages that have not been marked as removed on the server
Throws:
javax.mail.MessagingException
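
A minimal sketch of the counting logic follows. It is generic over the message type and takes the "removed" check as a predicate, so that it runs without JavaMail; the class and method names are hypothetical.

```java
import java.util.function.Predicate;

final class MessageCountSketch {
    // Counts the elements for which the "removed" check returns false.
    static <M> int countNonRemoved(M[] messages, Predicate<M> isRemoved) {
        int count = 0;
        for (M message : messages) {
            if (!isRemoved.test(message)) {
                count++;
            }
        }
        return count;
    }
}
```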

getSubFoldersString

protected String getSubFoldersString(javax.mail.Folder folder)
                              throws javax.mail.MessagingException
Returns a string with the names of the subfolders of the given folder. This string may be used to record the state of the folder in the AccessData instance.

Parameters:
folder -
Returns:
a string with the names of the subfolders of the given folder separated with the @ sign
Throws:
javax.mail.MessagingException
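
As an illustration of the returned format, the sketch below joins a list of subfolder names with the @ sign. Whether the actual method sorts the names or appends a trailing separator is not stated in this page, so the sketch assumes neither; the class and method names are hypothetical.

```java
import java.util.List;

final class SubFoldersStringSketch {
    // Joins subfolder names with '@', the separator named in the description.
    static String subFoldersString(List<String> subfolderNames) {
        return String.join("@", subfolderNames);
    }
}
```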

reportNotModified

protected void reportNotModified(String uri)
Reports the given uri as unmodified. This method calls CrawlerBase.reportUnmodifiedDataObject(String) and updates data structures internal to the AbstractJavaMailCrawler.

Parameters:
uri - the uri to be reported as unmodified

isRemoved

protected boolean isRemoved(javax.mail.Message message)
                     throws javax.mail.MessagingException
Returns true if the given message has been marked as expunged or deleted

Parameters:
message - the message to check
Returns:
true if the given message has been marked as expunged or deleted
Throws:
javax.mail.MessagingException

isTooLarge

protected boolean isTooLarge(javax.mail.Message message)
                      throws javax.mail.MessagingException
Returns true if this message is larger than the maximum size defined for the current data source.

Parameters:
message - the message to check
Returns:
true if this message is larger than the maximum size defined for the current data source.
Throws:
javax.mail.MessagingException

isAcceptable

protected boolean isAcceptable(javax.mail.Message message)
                        throws javax.mail.MessagingException
Returns true if this message can be crawled, according to the criteria defined for this data source.

Parameters:
message - the message to check
Returns:
true if this message can be crawled, according to the criteria defined for this data source.
Throws:
javax.mail.MessagingException
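
How isRemoved, isTooLarge and isAcceptable might fit together is sketched below. The rule that a message is acceptable when it is neither removed nor too large is an assumption made for illustration; this page does not spell out the criteria. The sketch takes the removed flag and the size as plain values instead of a Message.

```java
final class AcceptanceSketch {
    private final long maximumByteSize;

    AcceptanceSketch(long maximumByteSize) {
        this.maximumByteSize = maximumByteSize;
    }

    // Plays the role of isTooLarge: compares the size with the configured maximum.
    boolean isTooLarge(long messageSizeInBytes) {
        return messageSizeInBytes > maximumByteSize;
    }

    // Assumed rule: crawl a message only if it is neither removed nor too large.
    boolean isAcceptable(boolean removed, long messageSizeInBytes) {
        return !removed && !isTooLarge(messageSizeInBytes);
    }
}
```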

checkSubfoldersChanged

protected boolean checkSubfoldersChanged(AccessData ad)
                                  throws javax.mail.MessagingException
Checks if the list of subfolders of the current folder has been changed in comparison with the list stored in the accessData instance.

Returns:
true if the subfolder list has been changed, false otherwise
Throws:
javax.mail.MessagingException
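
The comparison can be sketched as a lookup followed by a string comparison. A Map stands in for the access data entry of the current folder, and the key name is hypothetical because the value of SUBFOLDERS_KEY is not shown in this page.

```java
import java.util.Map;

final class SubfolderChangeSketch {
    // Hypothetical key name; the real value of SUBFOLDERS_KEY is not shown here.
    static final String KEY = "subfolders";

    // Returns true when the stored subfolder string differs from the current one.
    static boolean subfoldersChanged(Map<String, String> storedForFolder,
                                     String currentSubFoldersString) {
        String previous = storedForFolder.get(KEY);
        // A missing entry (previous == null) also counts as a change.
        return !currentSubFoldersString.equals(previous);
    }
}
```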

holdsFolders

public static boolean holdsFolders(javax.mail.Folder folder)
                            throws javax.mail.MessagingException
Does this folder hold any subfolders?

Parameters:
folder - the folder to be checked
Returns:
true if this folder has any subfolders, false otherwise
Throws:
javax.mail.MessagingException - if it proves impossible to find out

holdsMessages

public static boolean holdsMessages(javax.mail.Folder folder)
                             throws javax.mail.MessagingException
Does this folder hold any messages?

Parameters:
folder - the folder to be checked
Returns:
true if this folder has any messages, false otherwise
Throws:
javax.mail.MessagingException - if it proves impossible to find out


Copyright © 2010 Aperture Development Team. All Rights Reserved.