java.lang.Object
org.semanticdesktop.aperture.crawler.base.CrawlerBase
org.semanticdesktop.aperture.crawler.mail.AbstractJavaMailCrawler
org.semanticdesktop.aperture.crawler.imap.ImapCrawler
public class ImapCrawler
A combined Crawler and DataAccessor implementation for IMAP.
Note that the same instance of ImapCrawler cannot be used as a crawler and as a DataAccessor at the same time. Please use separate instances, or use the appropriate factory, which will enforce this for you.
Known issue: incremental crawling only works correctly for servers that persist UIDs for each folder. Otherwise each crawl starts from scratch and reports all objects as new. This occurs on IMAP servers backed by the 'mh' message storage mechanism. See this email to aperture-dev and this post from Mark Crispin, the inventor of IMAP, for more details.
A workaround for this issue has been implemented. For each folder, the crawler checks for the UIDNOTSTICKY flag. If the flag is returned, the crawler falls back to a different URI scheme based on message IDs. This is slower, but it seems to work.
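The fallback described above can be pictured with a minimal sketch. It is not the Aperture implementation: the class `MessageUriSketch`, its method, and both URI layouts (`;UID=` and `;MSGID=`) are hypothetical, chosen only to show how a crawler might key messages by UID when UIDs persist and by message ID when the server reports UIDNOTSTICKY.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: the URI layouts below are illustrative assumptions,
// not the scheme actually used by ImapCrawler.
final class MessageUriSketch {

    /** Builds a message URI keyed by UID, or by message ID when UIDs are not sticky. */
    static String messageUri(String folderUri, boolean uidNotSticky, long uid, String messageId) {
        if (!uidNotSticky) {
            // UIDs persist between sessions, so they can serve as a stable key.
            return folderUri + ";UID=" + uid;
        }
        // UIDs may change between sessions: key the message by its message ID instead.
        // The class description notes that this path is slower.
        return folderUri + ";MSGID=" + URLEncoder.encode(messageId, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String folder = "imap://user@example.org/INBOX";
        System.out.println(messageUri(folder, false, 42L, "<abc@example.org>"));
        System.out.println(messageUri(folder, true, 42L, "<abc@example.org>"));
    }
}
```

The point of the design is that a URI built from a non-persistent UID would look new on every crawl, which matches the symptom described in the known issue.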
| Nested Class Summary | |
|---|---|
| static class | ImapCrawler.SimpleSocketFactory: This is a socket factory that ignores SSL certificates. |
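As a rough illustration of what "ignores SSL certificates" usually means in Java, the sketch below builds an `SSLSocketFactory` from a trust manager that accepts every certificate chain. It is an assumption about the general technique, not the source of `ImapCrawler.SimpleSocketFactory`; the names `TrustAllSketch`, `TrustAll`, and `trustAllFactory` are hypothetical. Skipping certificate validation exposes the connection to man-in-the-middle attacks, so this is only reasonable for servers with self-signed certificates on a trusted network.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocketFactory;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical sketch, not the source of ImapCrawler.SimpleSocketFactory.
final class TrustAllSketch {

    /** A trust manager that accepts every certificate chain without checking it. */
    static final class TrustAll implements X509TrustManager {
        @Override
        public void checkClientTrusted(X509Certificate[] chain, String authType) {
            // Intentionally empty: no validation is performed.
        }

        @Override
        public void checkServerTrusted(X509Certificate[] chain, String authType) {
            // Intentionally empty: no validation is performed.
        }

        @Override
        public X509Certificate[] getAcceptedIssuers() {
            return new X509Certificate[0];
        }
    }

    /** Builds an SSLSocketFactory whose handshakes skip certificate validation. */
    static SSLSocketFactory trustAllFactory() throws GeneralSecurityException {
        SSLContext context = SSLContext.getInstance("TLS");
        context.init(null, new TrustManager[] { new TrustAll() }, new SecureRandom());
        return context.getSocketFactory();
    }

    public static void main(String[] args) throws GeneralSecurityException {
        SSLSocketFactory factory = trustAllFactory();
        System.out.println("factory created: " + (factory != null));
    }
}
```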
| Field Summary |
|---|
| Fields inherited from class org.semanticdesktop.aperture.crawler.mail.AbstractJavaMailCrawler |
|---|
| ACCESSED_KEY, baseFolders, currentFolder, currentFolderURI, maxDepth, maximumByteSize, store, SUBFOLDERS_KEY |
| Fields inherited from class org.semanticdesktop.aperture.crawler.base.CrawlerBase |
|---|
| accessData, accessorRegistry, crawlReportFile, source, stopRequested |
| Constructor Summary | |
|---|---|
| ImapCrawler() | |
| Method Summary | |
|---|---|
| protected boolean | checkIfCurrentFolderHasBeenChanged(AccessData newAccessData): Applies source-specific methods to determine if the current folder has been changed since it was last crawled. |
| void | closeConnection(): Requests the streamPool to close the connection to the store. |
| protected ExitCode | crawlObjects(): Method called by crawl() that should implement the actual crawling of the DataSource. |
| MessageDataObject | createDataObject(URI dataObjectId, DataSource dataSource, RDFContainer metadata, javax.mail.internet.MimeMessage msg): Creates a message data object for the given parameters. |
| protected void | ensureConnectedStore(): Ensures that the crawler is connected to the underlying mail storage system and can perform the crawl. |
| protected int | getCurrentFolderMessageCount(): Returns the number of messages in the current folder. |
| protected Object | getDataObjectOrAllObjects(String url, DataSource dataSource, AccessData newAccessData, Map params, RDFContainerFactory containerFactory, boolean all) |
| String | getFolderName(String url): Returns the name of the folder with the given URL. |
| protected URI | getFolderURI(javax.mail.Folder folder): Returns the URI of the folder, using the URI scheme appropriate for the current crawler. |
| protected javax.mail.Message | getMessageFromCurrentFolder(int index): Returns the message from the current folder available at the given index. |
| protected String | getMessageUri(javax.mail.Folder folder, javax.mail.Message message): Returns the URI of the message, using the URI scheme appropriate for the current crawler. |
| InputStream | getPartStream(javax.mail.Part part): Returns an input stream with the part content. |
| Properties | getSessionProperties(): Returns the session properties. |
| protected void | recordCurrentFolderInAccessData(AccessData newAccessData): Records source-specific information about the current folder that will enable the crawler to detect if the folder has been changed on a future crawl. |
| protected void | retrieveConfigurationData(DataSource dataSource): Prepare for accessing the specified DataSource by fetching all properties from it that are required to connect to the mail box. |
| protected void | setCurrentFolder(javax.mail.Folder folder): This method implements the incremental crawling strategy described by Chris Fluit in the SourceForge issue 1531657. |
| void | setSessionProperties(Properties sessionProperties): Sets the session properties. |
| Methods inherited from class org.semanticdesktop.aperture.crawler.mail.AbstractJavaMailCrawler |
|---|
| applySpecificProcessing, checkSubfoldersChanged, crawlFolder, crawlMessages, crawlSingleFolder, crawlSingleMessage, crawlSubFolders, getAllRelatedDataObjects, getCurrentFolderObject, getDataObject, getDataObjectByMessageURI, getDataObjectIfModified, getMessageByURI, getMessageCount, getMessageUid, getSubFoldersString, holdsFolders, holdsMessages, isAcceptable, isRemoved, isTooLarge, reportNotModified |
| Methods inherited from class org.semanticdesktop.aperture.crawler.base.CrawlerBase |
|---|
| clear, clear, crawl, getAccessData, getCrawlerHandler, getCrawlReport, getCrawlReportFile, getDataAccessorRegistry, getDataSource, getRDFContainerFactory, inDomain, isStopRequested, reportAccessingObject, reportDeletedDataObject, reportFatalErrorCause, reportFatalErrorCause, reportFatalErrorCause, reportModifiedDataObject, reportNewDataObject, reportUnmodifiedDataObject, reportUntouched, runSubCrawler, setAccessData, setCrawlerHandler, setCrawlReportFile, setDataAccessorRegistry, setDataSource, stop, storeCrawlReport, touchObject |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Methods inherited from interface org.semanticdesktop.aperture.accessor.DataAccessor |
|---|
| getDataObject, getDataObjectIfModified |
| Constructor Detail |
|---|
public ImapCrawler()
| Method Detail |
|---|
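protected ExitCode crawlObjects()
Description copied from class: CrawlerBase
Method called by crawl() that should implement the actual crawling of the DataSource.
crawlObjects in class CrawlerBase

protected void retrieveConfigurationData(DataSource dataSource)
retrieveConfigurationData in class AbstractJavaMailCrawler
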
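protected void ensureConnectedStore()
throws javax.mail.MessagingException
Description copied from class: AbstractJavaMailCrawler
Ensures that the crawler is connected to the underlying mail storage system and can perform the crawl.
ensureConnectedStore in class AbstractJavaMailCrawler
Throws: javax.mail.MessagingException

public void closeConnection()
closeConnection in class AbstractJavaMailCrawler
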
protected Object getDataObjectOrAllObjects(String url,
DataSource dataSource,
AccessData newAccessData,
Map params,
RDFContainerFactory containerFactory,
boolean all)
throws UrlNotFoundException,
IOException
getDataObjectOrAllObjects in class AbstractJavaMailCrawler
Throws: UrlNotFoundException, IOException

protected void setCurrentFolder(javax.mail.Folder folder)
throws javax.mail.MessagingException
setCurrentFolder in class AbstractJavaMailCrawler
Parameters: folder - the folder that is to become the current folder
Throws: javax.mail.MessagingException

protected int getCurrentFolderMessageCount()
throws javax.mail.MessagingException
Description copied from class: AbstractJavaMailCrawler
Returns the number of messages in the current folder.
getCurrentFolderMessageCount in class AbstractJavaMailCrawler
Throws: javax.mail.MessagingException

protected javax.mail.Message getMessageFromCurrentFolder(int index)
throws javax.mail.MessagingException
Description copied from class: AbstractJavaMailCrawler
Returns the message from the current folder available at the given index.
getMessageFromCurrentFolder in class AbstractJavaMailCrawler
Parameters: index - a one-based index. The lowest valid value is one; the highest valid value is the one returned by AbstractJavaMailCrawler.getCurrentFolderMessageCount()
Throws: javax.mail.MessagingException

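The one-based indexing convention is easy to get wrong, so here is a minimal sketch of a loop that respects it. `OneBasedFolderSketch` is a hypothetical in-memory stand-in, not an Aperture or JavaMail class; only the index range of 1 to the message count mirrors the parameter description above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a folder; only the one-based indexing mirrors the text.
final class OneBasedFolderSketch {
    private final List<String> messages;

    OneBasedFolderSketch(List<String> messages) {
        this.messages = new ArrayList<>(messages);
    }

    /** Returns the highest valid index, which equals the number of messages. */
    int getMessageCount() {
        return messages.size();
    }

    /** Returns the message at a one-based index in the range 1..getMessageCount(). */
    String getMessage(int index) {
        if (index < 1 || index > messages.size()) {
            throw new IndexOutOfBoundsException(
                    "index must be in 1.." + messages.size() + " but was " + index);
        }
        // Translate the one-based index to the zero-based backing list.
        return messages.get(index - 1);
    }

    /** Visits every message using the one-based convention. */
    static List<String> collectAll(OneBasedFolderSketch folder) {
        List<String> result = new ArrayList<>();
        for (int i = 1; i <= folder.getMessageCount(); i++) {
            result.add(folder.getMessage(i));
        }
        return result;
    }

    public static void main(String[] args) {
        OneBasedFolderSketch folder =
                new OneBasedFolderSketch(List.of("first", "second", "third"));
        System.out.println(collectAll(folder));
    }
}
```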
public InputStream getPartStream(javax.mail.Part part)
throws javax.mail.MessagingException,
IOException
Description copied from interface: DataObjectFactory.PartStreamFactory
Returns an input stream with the part content (see the Part.getInputStream() method); designed to allow for customization of the returned input stream.
getPartStream in interface DataObjectFactory.PartStreamFactory
getPartStream in class AbstractJavaMailCrawler
Throws: javax.mail.MessagingException, IOException
See Also: AbstractJavaMailCrawler.getPartStream(javax.mail.Part)

public MessageDataObject createDataObject(URI dataObjectId,
DataSource dataSource,
RDFContainer metadata,
javax.mail.internet.MimeMessage msg)
throws javax.mail.MessagingException
Description copied from interface: DataObjectFactory.PartStreamFactory
Creates a message data object for the given parameters.
createDataObject in interface DataObjectFactory.PartStreamFactory
createDataObject in class AbstractJavaMailCrawler
Throws: javax.mail.MessagingException
See Also: org.semanticdesktop.aperture.crawler.mail.AbstractJavaMailCrawler#createDataObject(org.ontoware.rdf2go.model.node.URI, org.semanticdesktop.aperture.datasource.DataSource, org.semanticdesktop.aperture.rdf.RDFContainer, javax.mail.internet.MimeMessage, java.util.concurrent.ExecutorService)

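protected void recordCurrentFolderInAccessData(AccessData newAccessData)
throws javax.mail.MessagingException
Description copied from class: AbstractJavaMailCrawler
Records source-specific information about the current folder that will enable the crawler to detect if the folder has been changed on a future crawl.
recordCurrentFolderInAccessData in class AbstractJavaMailCrawler
Parameters: newAccessData - the access data where the information should be stored
Throws: javax.mail.MessagingException
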
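The pairing of recordCurrentFolderInAccessData with checkIfCurrentFolderHasBeenChanged can be sketched as a record-then-compare cycle. This page does not say what ImapCrawler actually stores, so everything here is an assumption: the class `FolderChangeSketch`, the plain map standing in for AccessData, and the choice of UIDVALIDITY, UIDNEXT, and message count as the folder fingerprint. A fingerprint like this would notice added or removed messages but not flag changes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch: the recorded values are an assumption, not ImapCrawler's actual data.
final class FolderChangeSketch {

    /** Stand-in for AccessData: maps a folder URI to the state recorded on the last crawl. */
    private final Map<String, String> accessData = new HashMap<>();

    /** Combines the folder attributes into a single comparable string. */
    static String fingerprint(long uidValidity, long uidNext, int messageCount) {
        return uidValidity + ":" + uidNext + ":" + messageCount;
    }

    /** Records the current state of a folder for comparison on a future crawl. */
    void record(String folderUri, long uidValidity, long uidNext, int messageCount) {
        accessData.put(folderUri, fingerprint(uidValidity, uidNext, messageCount));
    }

    /** Returns true if the folder was never recorded or its state differs from the record. */
    boolean hasChanged(String folderUri, long uidValidity, long uidNext, int messageCount) {
        String previous = accessData.get(folderUri);
        return !Objects.equals(previous, fingerprint(uidValidity, uidNext, messageCount));
    }

    public static void main(String[] args) {
        FolderChangeSketch sketch = new FolderChangeSketch();
        String inbox = "imap://user@example.org/INBOX";
        sketch.record(inbox, 1L, 10L, 3);
        System.out.println(sketch.hasChanged(inbox, 1L, 10L, 3)); // same state as recorded
        System.out.println(sketch.hasChanged(inbox, 1L, 11L, 4)); // one message was added
    }
}
```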
protected boolean checkIfCurrentFolderHasBeenChanged(AccessData newAccessData)
throws javax.mail.MessagingException
Description copied from class: AbstractJavaMailCrawler
Applies source-specific methods to determine if the current folder has been changed since it was last crawled.
checkIfCurrentFolderHasBeenChanged in class AbstractJavaMailCrawler
Parameters: newAccessData - the AccessData instance that is to be consulted
Throws: javax.mail.MessagingException

public String getFolderName(String url)
getFolderName in class AbstractJavaMailCrawler
Parameters: url - the url of the folder

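protected URI getFolderURI(javax.mail.Folder folder)
throws javax.mail.MessagingException
Description copied from class: AbstractJavaMailCrawler
Returns the URI of the folder, using the URI scheme appropriate for the current crawler.
getFolderURI in class AbstractJavaMailCrawler
Parameters: folder - the Folder whose URI we'd like to obtain.
Throws: javax.mail.MessagingException
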
protected String getMessageUri(javax.mail.Folder folder,
javax.mail.Message message)
throws javax.mail.MessagingException
Description copied from class: AbstractJavaMailCrawler
Returns the URI of the message, using the URI scheme appropriate for the current crawler.
getMessageUri in class AbstractJavaMailCrawler
Parameters: folder - the folder where the message resides
message - the message itself
Throws: javax.mail.MessagingException

public void setSessionProperties(Properties sessionProperties)
Parameters: sessionProperties - the new session properties

public Properties getSessionProperties()

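A session properties object is an ordinary `java.util.Properties` instance. The sketch below builds one using two property names documented for the JavaMail IMAP provider, `mail.imap.connectiontimeout` and `mail.imap.timeout`. How ImapCrawler applies these properties to its mail session is not stated on this page, so treat the usage as an assumption; the class `SessionPropertiesSketch` and its method are hypothetical.

```java
import java.util.Properties;

// Hypothetical helper; only the two property names come from the JavaMail IMAP provider.
final class SessionPropertiesSketch {

    /** Builds properties that set connect and read timeouts, both in milliseconds. */
    static Properties imapTimeouts(int connectMillis, int readMillis) {
        Properties props = new Properties();
        props.setProperty("mail.imap.connectiontimeout", Integer.toString(connectMillis));
        props.setProperty("mail.imap.timeout", Integer.toString(readMillis));
        return props;
    }

    public static void main(String[] args) {
        Properties props = imapTimeouts(10_000, 30_000);
        props.list(System.out);
    }
}
```

With Aperture on the classpath, the resulting object would presumably be handed to setSessionProperties(Properties) before the crawl starts.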