org.semanticdesktop.aperture.crawler.imap
Class ImapCrawler

java.lang.Object
  extended by org.semanticdesktop.aperture.crawler.base.CrawlerBase
      extended by org.semanticdesktop.aperture.crawler.mail.AbstractJavaMailCrawler
          extended by org.semanticdesktop.aperture.crawler.imap.ImapCrawler
All Implemented Interfaces:
DataAccessor, Crawler, DataObjectFactory.PartStreamFactory

public class ImapCrawler
extends AbstractJavaMailCrawler
implements DataAccessor

A combined Crawler and DataAccessor implementation for IMAP.

Note that the same instance of ImapCrawler cannot be used as a crawler and as a DataAccessor at the same time. Please use separate instances, or use the appropriate factory, which will enforce this for you.

A known issue: incremental crawling only works correctly for servers that persist UIDs for each folder. Otherwise each crawl starts from scratch and reports all objects as new. This occurs on IMAP servers backed by the 'mh' message storage mechanism. See this email to aperture-dev and this post from Mark Crispin, the inventor of IMAP, for more details.

A workaround for the above issue has been implemented. For each folder, the crawler checks for the UIDNOTSTICKY flag. If the flag is returned, the crawler falls back to a different URI scheme based on message ids. This is slower, but it seems to work.
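
The fallback described above can be pictured with a minimal sketch. It is not the ImapCrawler implementation: the class name MessageUriSketch, the method messageUri, and both URI layouts are hypothetical and chosen only to show the decision between a UID-based and a message-id-based identifier.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Aperture; the URI layouts are assumptions. */
final class MessageUriSketch {

    /**
     * Builds a message URI from the UID when the server keeps UIDs stable,
     * and from the Message-ID header when the folder reported UIDNOTSTICKY.
     */
    static String messageUri(String folderUri, boolean uidNotSticky, long uid, String messageId) {
        if (!uidNotSticky) {
            // Stable UIDs: the UID alone identifies the message within the folder.
            return folderUri + "/" + uid;
        }
        // Unstable UIDs: fall back to the (URL-encoded) Message-ID header value.
        String encoded = URLEncoder.encode(messageId, StandardCharsets.UTF_8);
        return folderUri + ";message-id=" + encoded;
    }
}
```

In this sketch the message-id variant needs the message headers to build the identifier, which may be one reason the real fallback is slower.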


Nested Class Summary
static class ImapCrawler.SimpleSocketFactory
          This is a socket factory that ignores SSL certificates.
 
Field Summary
 
Fields inherited from class org.semanticdesktop.aperture.crawler.mail.AbstractJavaMailCrawler
ACCESSED_KEY, baseFolders, currentFolder, currentFolderURI, maxDepth, maximumByteSize, store, SUBFOLDERS_KEY
 
Fields inherited from class org.semanticdesktop.aperture.crawler.base.CrawlerBase
accessData, accessorRegistry, crawlReportFile, source, stopRequested
 
Constructor Summary
ImapCrawler()
           
 
Method Summary
protected  boolean checkIfCurrentFolderHasBeenChanged(AccessData newAccessData)
          Applies source-specific methods to determine if the current folder has changed since it was last crawled.
 void closeConnection()
          Requests the streamPool to close the connection to the store.
protected  ExitCode crawlObjects()
          Method called by crawl() that should implement the actual crawling of the DataSource.
 MessageDataObject createDataObject(URI dataObjectId, DataSource dataSource, RDFContainer metadata, javax.mail.internet.MimeMessage msg)
          Creates a message data object for the given parameters
protected  void ensureConnectedStore()
          Ensures that the crawler is connected to the underlying mail storage system and can perform the crawl.
protected  int getCurrentFolderMessageCount()
          Returns the number of messages in the current folder.
protected  Object getDataObjectOrAllObjects(String url, DataSource dataSource, AccessData newAccessData, Map params, RDFContainerFactory containerFactory, boolean all)
           
 String getFolderName(String url)
          Returns the name of the folder with the given URL
protected  URI getFolderURI(javax.mail.Folder folder)
          Returns the URI of the folder, using the URI scheme appropriate for the current crawler.
protected  javax.mail.Message getMessageFromCurrentFolder(int index)
          Returns the message from the current folder available at the given index.
protected  String getMessageUri(javax.mail.Folder folder, javax.mail.Message message)
          Returns the URI of the message, using the URI scheme appropriate for the current crawler.
 InputStream getPartStream(javax.mail.Part part)
          Returns an input stream with the part content.
 Properties getSessionProperties()
          Returns the session properties
protected  void recordCurrentFolderInAccessData(AccessData newAccessData)
          Records source-specific information about the current folder that will enable the crawler to detect if the folder has been changed on a future crawl.
protected  void retrieveConfigurationData(DataSource dataSource)
          Prepare for accessing the specified DataSource by fetching all properties from it that are required to connect to the mail box.
protected  void setCurrentFolder(javax.mail.Folder folder)
          This method implements the incremental crawling strategy described by Chris Fluit in the sourceforge issue 1531657.
 void setSessionProperties(Properties sessionProperties)
          Sets the session properties
 
Methods inherited from class org.semanticdesktop.aperture.crawler.mail.AbstractJavaMailCrawler
applySpecificProcessing, checkSubfoldersChanged, crawlFolder, crawlMessages, crawlSingleFolder, crawlSingleMessage, crawlSubFolders, getAllRelatedDataObjects, getCurrentFolderObject, getDataObject, getDataObjectByMessageURI, getDataObjectIfModified, getMessageByURI, getMessageCount, getMessageUid, getSubFoldersString, holdsFolders, holdsMessages, isAcceptable, isRemoved, isTooLarge, reportNotModified
 
Methods inherited from class org.semanticdesktop.aperture.crawler.base.CrawlerBase
clear, clear, crawl, getAccessData, getCrawlerHandler, getCrawlReport, getCrawlReportFile, getDataAccessorRegistry, getDataSource, getRDFContainerFactory, inDomain, isStopRequested, reportAccessingObject, reportDeletedDataObject, reportFatalErrorCause, reportFatalErrorCause, reportFatalErrorCause, reportModifiedDataObject, reportNewDataObject, reportUnmodifiedDataObject, reportUntouched, runSubCrawler, setAccessData, setCrawlerHandler, setCrawlReportFile, setDataAccessorRegistry, setDataSource, stop, storeCrawlReport, touchObject
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.semanticdesktop.aperture.accessor.DataAccessor
getDataObject, getDataObjectIfModified
 

Constructor Detail

ImapCrawler

public ImapCrawler()
Method Detail

crawlObjects

protected ExitCode crawlObjects()
Description copied from class: CrawlerBase
Method called by crawl() that should implement the actual crawling of the DataSource. The return value of this method should indicate whether the scanning was completed successfully (i.e. it wasn't interrupted or anything). This method is also expected to update the deprecatedUrls set, as any URLs remaining in this set will be reported as removed after this method completes.

Specified by:
crawlObjects in class CrawlerBase
Returns:
An ExitCode indicating how the crawl procedure terminated.
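
The role of the deprecatedUrls set can be illustrated with a small, self-contained sketch. It assumes the set starts out holding every URL known from the previous crawl. The class DeprecatedUrlsSketch and the method findRemoved are hypothetical and do not use the Aperture API.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of the deprecated-URL bookkeeping; not Aperture code. */
final class DeprecatedUrlsSketch {

    /** Returns the URLs known from the previous crawl that were not seen in this one. */
    static Set<String> findRemoved(Set<String> previouslyKnown, List<String> seenNow) {
        // Start by assuming every previously known URL is gone.
        Set<String> deprecatedUrls = new HashSet<>(previouslyKnown);
        for (String url : seenNow) {
            // Each URL encountered during the crawl is still alive.
            deprecatedUrls.remove(url);
        }
        // Whatever is left was not encountered and would be reported as removed.
        return deprecatedUrls;
    }
}
```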

retrieveConfigurationData

protected void retrieveConfigurationData(DataSource dataSource)
Prepare for accessing the specified DataSource by fetching all properties from it that are required to connect to the mail box.

Specified by:
retrieveConfigurationData in class AbstractJavaMailCrawler

ensureConnectedStore

protected void ensureConnectedStore()
                             throws javax.mail.MessagingException
Description copied from class: AbstractJavaMailCrawler
Ensures that the crawler is connected to the underlying mail storage system and can perform the crawl. This method may be called at any time. It shouldn't do anything if a connection is already present, and it should reestablish the connection if it is not.

Specified by:
ensureConnectedStore in class AbstractJavaMailCrawler
Throws:
javax.mail.MessagingException
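
The contract above (do nothing when connected, reconnect otherwise) can be sketched as follows. The Store interface and CountingStore class here are hypothetical stand-ins defined only for this example. They are not javax.mail.Store, and the real method may do more work.

```java
/** Hypothetical illustration of an idempotent "ensure connected" guard. */
final class ConnectionGuardSketch {

    /** Minimal stand-in for a mail store; an assumption made for this example. */
    interface Store {
        boolean isConnected();
        void connect();
    }

    /** Test double that counts how often connect() is actually invoked. */
    static final class CountingStore implements Store {
        int connectCalls = 0;
        private boolean connected = false;

        public boolean isConnected() { return connected; }
        public void connect() { connected = true; connectCalls++; }
        void drop() { connected = false; }
    }

    /** Safe to call at any time: connects only when no connection is present. */
    static void ensureConnected(Store store) {
        if (!store.isConnected()) {
            store.connect();
        }
    }
}
```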

closeConnection

public void closeConnection()
Requests the streamPool to close the connection to the store.

Overrides:
closeConnection in class AbstractJavaMailCrawler

getDataObjectOrAllObjects

protected Object getDataObjectOrAllObjects(String url,
                                           DataSource dataSource,
                                           AccessData newAccessData,
                                           Map params,
                                           RDFContainerFactory containerFactory,
                                           boolean all)
                                    throws UrlNotFoundException,
                                           IOException
Overrides:
getDataObjectOrAllObjects in class AbstractJavaMailCrawler
Throws:
UrlNotFoundException
IOException

setCurrentFolder

protected void setCurrentFolder(javax.mail.Folder folder)
                         throws javax.mail.MessagingException
This method implements the incremental crawling strategy described by Chris Fluit in the sourceforge issue 1531657.

See http://sourceforge.net/tracker/index.php?func=detail&aid=1531657&group_id=150969&atid=779500

Overrides:
setCurrentFolder in class AbstractJavaMailCrawler
Parameters:
folder - the folder that is to become the current folder
Throws:
javax.mail.MessagingException

getCurrentFolderMessageCount

protected int getCurrentFolderMessageCount()
                                    throws javax.mail.MessagingException
Description copied from class: AbstractJavaMailCrawler
Returns the number of messages in the current folder.

Overrides:
getCurrentFolderMessageCount in class AbstractJavaMailCrawler
Returns:
the number of messages in the current folder.
Throws:
javax.mail.MessagingException

getMessageFromCurrentFolder

protected javax.mail.Message getMessageFromCurrentFolder(int index)
                                                  throws javax.mail.MessagingException
Description copied from class: AbstractJavaMailCrawler
Returns the message from the current folder available at the given index. Note that the exact semantics of the index may be overridden by the subclasses of this class, but it will always follow the JavaMail convention that folder indexes are one-based.

Overrides:
getMessageFromCurrentFolder in class AbstractJavaMailCrawler
Parameters:
index - a one-based index. The lowest valid value is one; the highest valid value is the one returned by AbstractJavaMailCrawler.getCurrentFolderMessageCount()
Returns:
the message placed under the given index
Throws:
javax.mail.MessagingException
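
As a reminder of what the one-based convention means in practice, here is a minimal sketch over a plain list. OneBasedIndexSketch and getOneBased are hypothetical names. The list stands in for a folder and its size for the value returned by getCurrentFolderMessageCount().

```java
import java.util.List;

/** Hypothetical illustration of one-based message indexing; not Aperture code. */
final class OneBasedIndexSketch {

    /** Returns the element at a one-based index; valid values are 1..messages.size(). */
    static <T> T getOneBased(List<T> messages, int index) {
        if (index < 1 || index > messages.size()) {
            throw new IndexOutOfBoundsException(
                    "index " + index + " outside 1.." + messages.size());
        }
        // Translate to the zero-based index used by java.util.List.
        return messages.get(index - 1);
    }
}
```

A loop over all messages under this convention runs from 1 up to and including the message count.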

getPartStream

public InputStream getPartStream(javax.mail.Part part)
                          throws javax.mail.MessagingException,
                                 IOException
Description copied from interface: DataObjectFactory.PartStreamFactory
Returns an input stream with the part content. It's conceptually a wrapper around the Part.getInputStream() method, designed to allow for customization of the returned input stream.

Specified by:
getPartStream in interface DataObjectFactory.PartStreamFactory
Overrides:
getPartStream in class AbstractJavaMailCrawler
Returns:
an InputStream with the content of the part
Throws:
javax.mail.MessagingException
IOException
See Also:
AbstractJavaMailCrawler.getPartStream(javax.mail.Part)

createDataObject

public MessageDataObject createDataObject(URI dataObjectId,
                                          DataSource dataSource,
                                          RDFContainer metadata,
                                          javax.mail.internet.MimeMessage msg)
                                   throws javax.mail.MessagingException
Description copied from interface: DataObjectFactory.PartStreamFactory
Creates a message data object for the given parameters

Specified by:
createDataObject in interface DataObjectFactory.PartStreamFactory
Overrides:
createDataObject in class AbstractJavaMailCrawler
Returns:
a message data object for the given parameters
Throws:
javax.mail.MessagingException
See Also:
org.semanticdesktop.aperture.crawler.mail.AbstractJavaMailCrawler#createDataObject(org.ontoware.rdf2go.model.node.URI, org.semanticdesktop.aperture.datasource.DataSource, org.semanticdesktop.aperture.rdf.RDFContainer, javax.mail.internet.MimeMessage, java.util.concurrent.ExecutorService)

recordCurrentFolderInAccessData

protected void recordCurrentFolderInAccessData(AccessData newAccessData)
                                        throws javax.mail.MessagingException
Description copied from class: AbstractJavaMailCrawler
Records source-specific information about the current folder that will enable the crawler to detect if the folder has been changed on a future crawl.

Specified by:
recordCurrentFolderInAccessData in class AbstractJavaMailCrawler
Parameters:
newAccessData - the access data where the information should be stored
Throws:
javax.mail.MessagingException

checkIfCurrentFolderHasBeenChanged

protected boolean checkIfCurrentFolderHasBeenChanged(AccessData newAccessData)
                                              throws javax.mail.MessagingException
Description copied from class: AbstractJavaMailCrawler
Applies source-specific methods to determine if the current folder has changed since it was last crawled.

Specified by:
checkIfCurrentFolderHasBeenChanged in class AbstractJavaMailCrawler
Parameters:
newAccessData - the AccessData instance that is to be consulted
Returns:
false if the information stored in the accessData instance indicates that the folder hasn't been changed, true otherwise
Throws:
javax.mail.MessagingException
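
The pairing of recordCurrentFolderInAccessData and checkIfCurrentFolderHasBeenChanged follows a record-then-compare pattern, sketched below. This page does not say which folder properties ImapCrawler actually stores. The choice of a "next UID" value and a message count, the key names, and the use of a plain Map in place of AccessData are all assumptions made for illustration.

```java
import java.util.Map;

/** Hypothetical illustration of record-then-compare change detection. */
final class FolderChangeSketch {

    /** Stores a snapshot of the folder state (assumed fields: next UID, message count). */
    static void record(Map<String, String> accessData, String folderUri,
                       long uidNext, int messageCount) {
        accessData.put(folderUri + "#uidNext", Long.toString(uidNext));
        accessData.put(folderUri + "#messageCount", Integer.toString(messageCount));
    }

    /** Returns false only if the stored snapshot matches the current folder state. */
    static boolean hasChanged(Map<String, String> accessData, String folderUri,
                              long uidNext, int messageCount) {
        String oldNext = accessData.get(folderUri + "#uidNext");
        String oldCount = accessData.get(folderUri + "#messageCount");
        // A folder that was never recorded yields nulls and counts as changed.
        boolean same = Long.toString(uidNext).equals(oldNext)
                && Integer.toString(messageCount).equals(oldCount);
        return !same;
    }
}
```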

getFolderName

public String getFolderName(String url)
Returns the name of the folder with the given URL

Specified by:
getFolderName in class AbstractJavaMailCrawler
Parameters:
url - the url of the folder
Returns:
the name of the folder with the given URL

getFolderURI

protected URI getFolderURI(javax.mail.Folder folder)
                    throws javax.mail.MessagingException
Description copied from class: AbstractJavaMailCrawler
Returns the URI of the folder, using the URI scheme appropriate for the current crawler.

Specified by:
getFolderURI in class AbstractJavaMailCrawler
Parameters:
folder - the Folder whose URI we'd like to obtain.
Returns:
the uri of the folder
Throws:
javax.mail.MessagingException

getMessageUri

protected String getMessageUri(javax.mail.Folder folder,
                               javax.mail.Message message)
                        throws javax.mail.MessagingException
Description copied from class: AbstractJavaMailCrawler
Returns the URI of the message, using the URI scheme appropriate for the current crawler.

Specified by:
getMessageUri in class AbstractJavaMailCrawler
Parameters:
folder - the folder where the message resides
message - the message itself
Returns:
the uri of the message
Throws:
javax.mail.MessagingException

setSessionProperties

public void setSessionProperties(Properties sessionProperties)
Sets the session properties

Parameters:
sessionProperties - the new session properties

getSessionProperties

public Properties getSessionProperties()
Returns the session properties

Returns:
the session properties
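
A minimal sketch of building a Properties object that could be handed to setSessionProperties. The keys mail.imap.connectiontimeout and mail.imap.timeout are settings of the JavaMail IMAP provider. How ImapCrawler combines these properties with the ones it derives from the DataSource is not stated on this page, so treat that part as an assumption. SessionPropertiesSketch and imapTimeouts are hypothetical names.

```java
import java.util.Properties;

/** Hypothetical helper that prepares JavaMail session properties. */
final class SessionPropertiesSketch {

    /** Builds properties with connect and read timeouts, both in milliseconds. */
    static Properties imapTimeouts(int connectMillis, int readMillis) {
        Properties props = new Properties();
        // Timeout for establishing the socket connection.
        props.setProperty("mail.imap.connectiontimeout", Integer.toString(connectMillis));
        // Timeout for blocking reads on an established connection.
        props.setProperty("mail.imap.timeout", Integer.toString(readMillis));
        return props;
    }
}
```

The result would then be passed along the lines of crawler.setSessionProperties(SessionPropertiesSketch.imapTimeouts(10000, 30000)) before starting the crawl.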


Copyright © 2010 Aperture Development Team. All Rights Reserved.