java.lang.Object
org.semanticdesktop.aperture.crawler.base.CrawlerBase
org.semanticdesktop.aperture.crawler.mail.AbstractJavaMailCrawler
org.semanticdesktop.aperture.crawler.imap.ImapCrawler
public class ImapCrawler extends AbstractJavaMailCrawler
A combined Crawler and DataAccessor implementation for IMAP.
Note that the same instance of ImapCrawler cannot be used as a crawler and as a DataAccessor at the same time. Please use separate instances, or use the appropriate factory, which will enforce this for you.
A known issue: incremental crawling only works correctly for servers that persist UIDs for each folder. Otherwise each crawl starts from scratch and reports all objects as new. This occurs on IMAP servers backed by the 'mh' message storage mechanism. See this email to aperture-dev and this post from Mark Crispin, the inventor of IMAP, for more details.
A workaround for the above issue has been implemented. For each folder, the crawler checks for the UIDNOTSTICKY flag. If the flag is returned, the crawler falls back to a different URI scheme based on message ids. This is slower, but it seems to work.
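The fallback described above can be pictured with a small, self-contained sketch. It is not the crawler's actual code: the class name, the method name, and both URI layouts are assumptions made for illustration. It only shows the decision itself, keying a message by its UID when the folder's UIDs are persistent and by its message id otherwise.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; ImapCrawler's real URI schemes are not documented on this page.
class MessageUriSketch {

    /**
     * Builds a message URI below the given folder URI.
     *
     * @param folderUri      URI of the folder (layout assumed for this sketch)
     * @param uidsPersistent false if the server reported UIDNOTSTICKY for the folder
     * @param uid            the IMAP UID of the message
     * @param messageId      the value of the Message-ID header
     */
    static String messageUri(String folderUri, boolean uidsPersistent, long uid, String messageId) {
        if (uidsPersistent) {
            // UIDs survive between sessions, so they can identify the message on the next crawl.
            return folderUri + "/" + uid;
        }
        // UIDs are not stable, so fall back to the (URL-encoded) message id.
        return folderUri + "/" + URLEncoder.encode(messageId, StandardCharsets.UTF_8);
    }
}
```

With UID-based URIs a later crawl can recognise a message it has already seen. When UIDs are not persistent, the same UID may refer to a different message on the next connection, which is why a message-id-based key is the safer choice in that case.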
Nested Class Summary | | |
---|---|---|
static class | ImapCrawler.SimpleSocketFactory | This is a socket factory that ignores SSL certificates. |
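For readers unfamiliar with the idea, the following is a minimal sketch of how a certificate-ignoring socket factory is commonly built with the standard JSSE classes. It is an assumption-laden illustration, not the source of ImapCrawler.SimpleSocketFactory; the class name TrustAllSocketFactorySketch is hypothetical.

```java
import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocketFactory;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical illustration of a "trust everything" socket factory.
class TrustAllSocketFactorySketch {

    static SSLSocketFactory create() {
        // A trust manager whose checks never throw, so every certificate chain is accepted.
        TrustManager[] trustAll = { new X509TrustManager() {
            @Override
            public void checkClientTrusted(X509Certificate[] chain, String authType) {
                // intentionally empty: no validation
            }

            @Override
            public void checkServerTrusted(X509Certificate[] chain, String authType) {
                // intentionally empty: no validation
            }

            @Override
            public X509Certificate[] getAcceptedIssuers() {
                return new X509Certificate[0];
            }
        } };

        try {
            SSLContext context = SSLContext.getInstance("TLS");
            // No key managers, the trust-all manager, default source of randomness.
            context.init(null, trustAll, null);
            return context.getSocketFactory();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not set up the TLS context", e);
        }
    }
}
```

A factory like this disables server authentication entirely, so connections made through it are open to man-in-the-middle attacks. It is useful for servers with self-signed certificates in a trusted environment, and should be avoided elsewhere.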
Field Summary |
---|
Fields inherited from class org.semanticdesktop.aperture.crawler.mail.AbstractJavaMailCrawler |
---|
ACCESSED_KEY, baseFolders, currentFolder, currentFolderURI, maxDepth, maximumByteSize, store, SUBFOLDERS_KEY |
Fields inherited from class org.semanticdesktop.aperture.crawler.base.CrawlerBase |
---|
accessData, accessorRegistry, crawlReportFile, source, stopRequested |
Constructor Summary |
---|
ImapCrawler() |
Method Summary | | |
---|---|---|
protected boolean | checkIfCurrentFolderHasBeenChanged(AccessData newAccessData) | Applies source-specific methods to determine whether the current folder has changed since it was last crawled. |
void | closeConnection() | Requests the streamPool to close the connection to the store. |
protected ExitCode | crawlObjects() | Method called by crawl() that should implement the actual crawling of the DataSource. |
MessageDataObject | createDataObject(URI dataObjectId, DataSource dataSource, RDFContainer metadata, javax.mail.internet.MimeMessage msg) | Creates a message data object for the given parameters. |
protected void | ensureConnectedStore() | Ensures that the crawler is connected to the underlying mail storage system and can perform the crawl. |
protected int | getCurrentFolderMessageCount() | Returns the number of messages in the current folder. |
protected Object | getDataObjectOrAllObjects(String url, DataSource dataSource, AccessData newAccessData, Map params, RDFContainerFactory containerFactory, boolean all) | |
String | getFolderName(String url) | Returns the name of the folder with the given URL. |
protected URI | getFolderURI(javax.mail.Folder folder) | Returns the URI of the folder, using the URI scheme appropriate for the current crawler. |
protected javax.mail.Message | getMessageFromCurrentFolder(int index) | Returns the message from the current folder available at the given index. |
protected String | getMessageUri(javax.mail.Folder folder, javax.mail.Message message) | Returns the URI of the message, using the URI scheme appropriate for the current crawler. |
InputStream | getPartStream(javax.mail.Part part) | Returns an input stream with the part content. |
Properties | getSessionProperties() | Returns the session properties. |
protected void | recordCurrentFolderInAccessData(AccessData newAccessData) | Records source-specific information about the current folder that will enable the crawler to detect whether the folder has changed on a future crawl. |
protected void | retrieveConfigurationData(DataSource dataSource) | Prepares for accessing the specified DataSource by fetching all properties from it that are required to connect to the mailbox. |
protected void | setCurrentFolder(javax.mail.Folder folder) | This method implements the incremental crawling strategy described by Chris Fluit in the SourceForge issue 1531657. |
void | setSessionProperties(Properties sessionProperties) | Sets the session properties. |
Methods inherited from class org.semanticdesktop.aperture.crawler.mail.AbstractJavaMailCrawler |
---|
applySpecificProcessing, checkSubfoldersChanged, crawlFolder, crawlMessages, crawlSingleFolder, crawlSingleMessage, crawlSubFolders, getAllRelatedDataObjects, getCurrentFolderObject, getDataObject, getDataObjectByMessageURI, getDataObjectIfModified, getMessageByURI, getMessageCount, getMessageUid, getSubFoldersString, holdsFolders, holdsMessages, isAcceptable, isRemoved, isTooLarge, reportNotModified |
Methods inherited from class org.semanticdesktop.aperture.crawler.base.CrawlerBase |
---|
clear, clear, crawl, getAccessData, getCrawlerHandler, getCrawlReport, getCrawlReportFile, getDataAccessorRegistry, getDataSource, getRDFContainerFactory, inDomain, isStopRequested, reportAccessingObject, reportDeletedDataObject, reportFatalErrorCause, reportFatalErrorCause, reportFatalErrorCause, reportModifiedDataObject, reportNewDataObject, reportUnmodifiedDataObject, reportUntouched, runSubCrawler, setAccessData, setCrawlerHandler, setCrawlReportFile, setDataAccessorRegistry, setDataSource, stop, storeCrawlReport, touchObject |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Methods inherited from interface org.semanticdesktop.aperture.accessor.DataAccessor |
---|
getDataObject, getDataObjectIfModified |
Constructor Detail |
---|
public ImapCrawler()
Method Detail |
---|
protected ExitCode crawlObjects()
Description copied from class: CrawlerBase
crawlObjects in class CrawlerBase
protected void retrieveConfigurationData(DataSource dataSource)
retrieveConfigurationData in class AbstractJavaMailCrawler
protected void ensureConnectedStore() throws javax.mail.MessagingException
Description copied from class: AbstractJavaMailCrawler
ensureConnectedStore in class AbstractJavaMailCrawler
Throws:
javax.mail.MessagingException
public void closeConnection()
closeConnection in class AbstractJavaMailCrawler
protected Object getDataObjectOrAllObjects(String url, DataSource dataSource, AccessData newAccessData, Map params, RDFContainerFactory containerFactory, boolean all) throws UrlNotFoundException, IOException
getDataObjectOrAllObjects in class AbstractJavaMailCrawler
Throws:
UrlNotFoundException
IOException
protected void setCurrentFolder(javax.mail.Folder folder) throws javax.mail.MessagingException
setCurrentFolder in class AbstractJavaMailCrawler
Parameters:
folder - the folder that is to become the current folder
Throws:
javax.mail.MessagingException
protected int getCurrentFolderMessageCount() throws javax.mail.MessagingException
Description copied from class: AbstractJavaMailCrawler
getCurrentFolderMessageCount in class AbstractJavaMailCrawler
Throws:
javax.mail.MessagingException
protected javax.mail.Message getMessageFromCurrentFolder(int index) throws javax.mail.MessagingException
Description copied from class: AbstractJavaMailCrawler
getMessageFromCurrentFolder in class AbstractJavaMailCrawler
Parameters:
index - a one-based index. The lowest valid value is one; the highest valid value is the one returned by AbstractJavaMailCrawler.getCurrentFolderMessageCount()
Throws:
javax.mail.MessagingException
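Because the index is one-based, a loop over all messages runs from 1 up to and including the message count. The sketch below illustrates this with a stand-in class backed by a plain list; OneBasedFolderSketch and its String "messages" are assumptions for the example and do not use the JavaMail types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a folder that exposes its messages by one-based index.
class OneBasedFolderSketch {
    private final List<String> messages;

    OneBasedFolderSketch(List<String> messages) {
        this.messages = messages;
    }

    int getCurrentFolderMessageCount() {
        return messages.size();
    }

    String getMessageFromCurrentFolder(int index) {
        // Valid indices are 1..count, so 0 is rejected just like count + 1.
        if (index < 1 || index > messages.size()) {
            throw new IndexOutOfBoundsException("index " + index + " is outside 1.." + messages.size());
        }
        return messages.get(index - 1);
    }

    // Visits every message exactly once using the one-based convention.
    static List<String> collectAll(OneBasedFolderSketch folder) {
        List<String> result = new ArrayList<>();
        for (int i = 1; i <= folder.getCurrentFolderMessageCount(); i++) {
            result.add(folder.getMessageFromCurrentFolder(i));
        }
        return result;
    }
}
```

The convention matches javax.mail.Folder.getMessage(int), where message numbers also start at 1.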
public InputStream getPartStream(javax.mail.Part part) throws javax.mail.MessagingException, IOException
Description copied from interface: DataObjectFactory.PartStreamFactory
Returns an input stream with the part content (compare the Part.getInputStream() method), designed to allow for customization of the returned input stream.
getPartStream in interface DataObjectFactory.PartStreamFactory
getPartStream in class AbstractJavaMailCrawler
Throws:
javax.mail.MessagingException
IOException
See Also:
AbstractJavaMailCrawler.getPartStream(javax.mail.Part)
public MessageDataObject createDataObject(URI dataObjectId, DataSource dataSource, RDFContainer metadata, javax.mail.internet.MimeMessage msg) throws javax.mail.MessagingException
Description copied from interface: DataObjectFactory.PartStreamFactory
createDataObject in interface DataObjectFactory.PartStreamFactory
createDataObject in class AbstractJavaMailCrawler
Throws:
javax.mail.MessagingException
See Also:
org.semanticdesktop.aperture.crawler.mail.AbstractJavaMailCrawler#createDataObject(org.ontoware.rdf2go.model.node.URI, org.semanticdesktop.aperture.datasource.DataSource, org.semanticdesktop.aperture.rdf.RDFContainer, javax.mail.internet.MimeMessage, java.util.concurrent.ExecutorService)
protected void recordCurrentFolderInAccessData(AccessData newAccessData) throws javax.mail.MessagingException
Description copied from class: AbstractJavaMailCrawler
recordCurrentFolderInAccessData in class AbstractJavaMailCrawler
Parameters:
newAccessData - the access data where the information should be stored
Throws:
javax.mail.MessagingException
protected boolean checkIfCurrentFolderHasBeenChanged(AccessData newAccessData) throws javax.mail.MessagingException
Description copied from class: AbstractJavaMailCrawler
checkIfCurrentFolderHasBeenChanged in class AbstractJavaMailCrawler
Parameters:
newAccessData - the AccessData instance that is to be consulted
Throws:
javax.mail.MessagingException
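The interplay between recordCurrentFolderInAccessData and checkIfCurrentFolderHasBeenChanged can be sketched as a record-then-compare pattern. Which values ImapCrawler actually stores is not stated on this page; the sketch below assumes the IMAP UIDVALIDITY, UIDNEXT and message count, and uses a plain map as a stand-in for AccessData. All names in it are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of recording folder state on one crawl and comparing it on the next.
class FolderChangeSketch {
    // Stand-in for AccessData: folder URI -> state recorded during the previous crawl.
    private final Map<String, String> recorded = new HashMap<>();

    // Condenses the assumed folder attributes into one comparable value.
    static String fingerprint(long uidValidity, long uidNext, int messageCount) {
        return uidValidity + ":" + uidNext + ":" + messageCount;
    }

    // Counterpart of "record": remember the current state of the folder.
    void record(String folderUri, long uidValidity, long uidNext, int messageCount) {
        recorded.put(folderUri, fingerprint(uidValidity, uidNext, messageCount));
    }

    // Counterpart of "check": a folder never seen before, or one whose state differs, counts as changed.
    boolean hasChanged(String folderUri, long uidValidity, long uidNext, int messageCount) {
        String previous = recorded.get(folderUri);
        return previous == null || !previous.equals(fingerprint(uidValidity, uidNext, messageCount));
    }
}
```

Under these assumptions an unchanged folder can be skipped without fetching any messages, while a changed UIDVALIDITY forces the folder to be treated as changed, which mirrors the known issue described in the class comment.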
public String getFolderName(String url)
getFolderName in class AbstractJavaMailCrawler
Parameters:
url - the url of the folder
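As a rough illustration of mapping a folder URL to a folder name, the sketch below takes the path component of the URL and strips the surrounding slashes. The URL layout (imap://user@host/folder/path) is an assumption for this example, as are the class and method names; the actual scheme used by ImapCrawler may differ.

```java
import java.net.URI;

// Hypothetical illustration; the real URL layout is not documented on this page.
class FolderNameSketch {

    static String folderName(String url) {
        // For a hierarchical URI the path is everything after the authority, e.g. "/INBOX/Work".
        String path = URI.create(url).getPath();
        if (path == null || path.isEmpty() || path.equals("/")) {
            return ""; // no folder part present
        }
        String name = path.startsWith("/") ? path.substring(1) : path;
        if (name.endsWith("/")) {
            name = name.substring(0, name.length() - 1);
        }
        return name;
    }
}
```

For example, under the assumed layout, folderName("imap://user@example.org/INBOX/Work") yields "INBOX/Work".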
protected URI getFolderURI(javax.mail.Folder folder) throws javax.mail.MessagingException
Description copied from class: AbstractJavaMailCrawler
getFolderURI in class AbstractJavaMailCrawler
Parameters:
folder - the Folder whose URI we'd like to obtain.
Throws:
javax.mail.MessagingException
protected String getMessageUri(javax.mail.Folder folder, javax.mail.Message message) throws javax.mail.MessagingException
Description copied from class: AbstractJavaMailCrawler
getMessageUri in class AbstractJavaMailCrawler
Parameters:
folder - the folder where the message resides
message - the message itself
Throws:
javax.mail.MessagingException
public void setSessionProperties(Properties sessionProperties)
Parameters:
sessionProperties - the new session properties
public Properties getSessionProperties()
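A session properties object is an ordinary java.util.Properties instance. The sketch below builds one with two timeout settings understood by the JavaMail IMAP provider (mail.imap.connectiontimeout and mail.imap.timeout, both in milliseconds). It assumes that the properties given to setSessionProperties end up in the JavaMail session; this page does not confirm how ImapCrawler uses them, and the helper class is hypothetical.

```java
import java.util.Properties;

// Hypothetical helper that prepares session properties for an IMAP connection.
class SessionPropertiesSketch {

    static Properties imapTimeouts(int connectMillis, int readMillis) {
        Properties props = new Properties();
        // Maximum time to wait for the socket connection to be established.
        props.setProperty("mail.imap.connectiontimeout", Integer.toString(connectMillis));
        // Maximum time to wait for a response on an established connection.
        props.setProperty("mail.imap.timeout", Integer.toString(readMillis));
        return props;
    }
}
```

The result of imapTimeouts(5000, 30000) could then be passed to setSessionProperties(Properties) before the crawl starts.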