java.lang.Object
  org.semanticdesktop.aperture.crawler.base.CrawlerBase
    org.semanticdesktop.aperture.crawler.mail.AbstractJavaMailCrawler
public abstract class AbstractJavaMailCrawler
An abstract crawler implementation that works with an email store implementation hidden behind the JavaMail API.
The details of connection management, authentication and security are the responsibility of the concrete subclasses.
Field Summary | |
---|---|
static String | ACCESSED_KEY - The key used in the access data to mark if a given data object has been accessed or not. |
protected ArrayList | baseFolders - List of base folders, the roots of the crawl. |
protected javax.mail.Folder | currentFolder - The folder currently crawled by the crawler. |
protected URI | currentFolderURI - The URI of the current folder. |
protected int | maxDepth - Maximum depth below the base folders the crawler will crawl. |
protected long | maximumByteSize - Maximum size of a message accepted by the crawler; bigger messages will be ignored. |
protected javax.mail.Store | store - The underlying Store instance. |
protected static String | SUBFOLDERS_KEY |
Fields inherited from class org.semanticdesktop.aperture.crawler.base.CrawlerBase |
---|
accessData, accessorRegistry, crawlReportFile, source, stopRequested |
Constructor Summary | |
---|---|
AbstractJavaMailCrawler() | |
Method Summary | |
---|---|
protected void | applySpecificProcessing(DataObject object) - This method can be overridden by subclasses wishing to perform some specific processing on the data objects reported as new or modified, before they are passed to the CrawlerHandler. |
protected abstract boolean | checkIfCurrentFolderHasBeenChanged(AccessData newAccessData) - Applies source-specific methods to determine if the current folder has been changed since it was last crawled. |
protected boolean | checkSubfoldersChanged(AccessData ad) - Checks if the list of subfolders of the current folder has been changed in comparison with the list stored in the accessData instance. |
protected void | closeConnection() - Closes the Store. |
protected void | crawlFolder(javax.mail.Folder folder, int depth) - Crawls a subfolder tree starting at the given folder, up to the given depth. |
protected void | crawlMessages(javax.mail.Folder folder, URI folderUri) |
protected void | crawlSingleFolder(javax.mail.Folder folder) |
protected void | crawlSingleMessage(javax.mail.internet.MimeMessage message, String uri, URI folderUri) |
protected void | crawlSubFolders(javax.mail.Folder folder, int depth) |
MessageDataObject | createDataObject(URI dataObjectId, DataSource dataSource, RDFContainer metadata, javax.mail.internet.MimeMessage msg) - Creates a message data object for the given parameters. |
protected abstract void | ensureConnectedStore() - Ensures that the crawler is connected to the underlying mail storage system and can perform the crawl. |
Map<URI,DataObject> | getAllRelatedDataObjects(String url, DataSource dataSource, Map params, RDFContainerFactory containerFactory) |
protected int | getCurrentFolderMessageCount() - Returns the number of messages in the current folder. |
protected FolderDataObject | getCurrentFolderObject(DataSource dataSource, AccessData newAccessData, RDFContainerFactory containerFactory) - Returns a DataObject for the current JavaMail folder. |
DataObject | getDataObject(String url, DataSource dataSource, Map params, RDFContainerFactory containerFactory) - Get a DataObject for the specified url. |
protected Object | getDataObjectByMessageURI(String url, DataSource dataSource, RDFContainerFactory containerFactory, javax.mail.Folder folder, boolean all) |
DataObject | getDataObjectIfModified(String url, DataSource dataSource, AccessData newAccessData, Map params, RDFContainerFactory containerFactory) - Get a DataObject for the specified url. |
protected Object | getDataObjectOrAllObjects(String url, DataSource dataSource, AccessData newAccessData, Map params, RDFContainerFactory containerFactory, boolean all) |
protected abstract String | getFolderName(String url) - Extracts the name of the folder from the data object URI. |
protected abstract URI | getFolderURI(javax.mail.Folder folder) - Returns the URI of the folder, using the URI scheme appropriate for the current crawler. |
protected javax.mail.internet.MimeMessage | getMessageByURI(String url, javax.mail.Folder folder) - Returns a MimeMessage instance based on the URI of a data object. |
protected int | getMessageCount(javax.mail.Message[] messages) - Returns the number of non-removed messages in the given array. |
protected javax.mail.Message | getMessageFromCurrentFolder(int index) - Returns the message from the current folder available at the given index. |
protected long | getMessageUid(javax.mail.Folder folder, javax.mail.Message message) - Returns the UID of the message. |
protected abstract String | getMessageUri(javax.mail.Folder folder, javax.mail.Message message) - Returns the URI of the message, using the URI scheme appropriate for the current crawler. |
InputStream | getPartStream(javax.mail.Part part) - Returns an input stream with the part content. |
protected String | getSubFoldersString(javax.mail.Folder folder) - Returns a string with the names of the subfolders of the given folder. |
static boolean | holdsFolders(javax.mail.Folder folder) - Does this folder hold any subfolders? |
static boolean | holdsMessages(javax.mail.Folder folder) - Does this folder hold any messages? |
protected boolean | isAcceptable(javax.mail.Message message) - Returns true if this message can be crawled, according to the criteria defined for this data source. |
protected boolean | isRemoved(javax.mail.Message message) - Returns true if the given message has been marked as expunged or deleted. |
protected boolean | isTooLarge(javax.mail.Message message) - Returns true if this message is larger than the maximum size defined for the current data source. |
protected abstract void | recordCurrentFolderInAccessData(AccessData newAccessData) - Records source-specific information about the current folder that will enable the crawler to detect if the folder has been changed on a future crawl. |
protected void | reportNotModified(String uri) - Reports the given uri as unmodified. |
protected abstract void | retrieveConfigurationData(DataSource source) - Performs any necessary initialization of internal data structures before any crawling can commence. |
protected void | setCurrentFolder(javax.mail.Folder folder) - Sets the current folder. |
Methods inherited from class org.semanticdesktop.aperture.crawler.base.CrawlerBase |
---|
clear, clear, crawl, crawlObjects, getAccessData, getCrawlerHandler, getCrawlReport, getCrawlReportFile, getDataAccessorRegistry, getDataSource, getRDFContainerFactory, inDomain, isStopRequested, reportAccessingObject, reportDeletedDataObject, reportFatalErrorCause, reportFatalErrorCause, reportFatalErrorCause, reportModifiedDataObject, reportNewDataObject, reportUnmodifiedDataObject, reportUntouched, runSubCrawler, setAccessData, setCrawlerHandler, setCrawlReportFile, setDataAccessorRegistry, setDataSource, stop, storeCrawlReport, touchObject |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
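protected int maxDepth
Maximum depth below the base folders the crawler will crawl.
protected long maximumByteSize
Maximum size of a message accepted by the crawler; bigger messages will be ignored.
protected ArrayList baseFolders
List of base folders, the roots of the crawl.
public static final String ACCESSED_KEY
The key used in the access data to mark if a given data object has been accessed or not.
protected static final String SUBFOLDERS_KEY
protected javax.mail.Folder currentFolder
The folder currently crawled by the crawler; see setCurrentFolder(Folder).
protected URI currentFolderURI
The URI of the current folder. It is set by setCurrentFolder(Folder) using the getFolderURI(Folder) implementation.
See Also:
setCurrentFolder(Folder)
protected javax.mail.Store store
The underlying Store instance.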
Constructor Detail |
---|
public AbstractJavaMailCrawler()
Method Detail |
---|
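protected abstract URI getFolderURI(javax.mail.Folder folder) throws javax.mail.MessagingException
Returns the URI of the folder, using the URI scheme appropriate for the current crawler.
Parameters:
folder - the Folder whose URI we'd like to obtain.
Throws:
javax.mail.MessagingException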
protected abstract String getMessageUri(javax.mail.Folder folder, javax.mail.Message message) throws javax.mail.MessagingException
Returns the URI of the message, using the URI scheme appropriate for the current crawler.
Parameters:
folder - the folder where the message resides
message - the message itself
Throws:
javax.mail.MessagingException
protected abstract String getFolderName(String url) throws UrlNotFoundException
Extracts the name of the folder from the data object URI. The name can be passed to the Store.getFolder(String) method to obtain the corresponding Folder instance which directly contains the data object (message or attachment) with the given url.
This method can be called ONLY when all configuration has been read from the DataObject, that is, AFTER retrieveConfigurationData(DataSource).
Parameters:
url
Throws:
UrlNotFoundException - if the given url does not belong to the current Store
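protected abstract boolean checkIfCurrentFolderHasBeenChanged(AccessData newAccessData) throws javax.mail.MessagingException
Applies source-specific methods to determine if the current folder has been changed since it was last crawled.
Parameters:
newAccessData - the AccessData instance that is to be consulted
Throws:
javax.mail.MessagingException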
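protected abstract void recordCurrentFolderInAccessData(AccessData newAccessData) throws javax.mail.MessagingException
Records source-specific information about the current folder that will enable the crawler to detect if the folder has been changed on a future crawl.
Parameters:
newAccessData - the access data where the information should be stored
Throws:
javax.mail.MessagingException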
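protected abstract void retrieveConfigurationData(DataSource source)
Performs any necessary initialization of internal data structures before any crawling can commence.
Parameters:
source
protected abstract void ensureConnectedStore() throws javax.mail.MessagingException
Ensures that the crawler is connected to the underlying mail storage system and can perform the crawl.
Throws:
javax.mail.MessagingException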
protected void setCurrentFolder(javax.mail.Folder folder) throws javax.mail.MessagingException
Sets the current folder. This happens after the folder has been opened (with Folder.open(int)) but before any messages are actually crawled.
Parameters:
folder - the folder that is to become the current folder
Throws:
javax.mail.MessagingException
protected javax.mail.Message getMessageFromCurrentFolder(int index) throws javax.mail.MessagingException
Returns the message from the current folder available at the given index.
Parameters:
index - a one-based index. The lowest valid value is one; the highest valid value is the one returned by getCurrentFolderMessageCount()
Throws:
javax.mail.MessagingException
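The one-based indexing of getMessageFromCurrentFolder(int) is easy to get wrong in a loop. The following is only a sketch of the convention: `OneBasedIndexSketch` is a hypothetical stand-in that keeps messages as plain strings and does not use the JavaMail API, and its range check is an assumption about how an out-of-range index might be handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OneBasedIndexSketch {

    // Hypothetical stand-in for the current folder: messages are plain strings.
    private final List<String> messages;

    OneBasedIndexSketch(List<String> messages) {
        this.messages = messages;
    }

    int getCurrentFolderMessageCount() {
        return messages.size();
    }

    // One-based access: valid indices are 1..getCurrentFolderMessageCount().
    String getMessageFromCurrentFolder(int index) {
        if (index < 1 || index > messages.size()) {
            throw new IndexOutOfBoundsException("index out of range: " + index);
        }
        return messages.get(index - 1);
    }

    // Collects every message using the one-based convention.
    List<String> readAll() {
        List<String> result = new ArrayList<>();
        for (int i = 1; i <= getCurrentFolderMessageCount(); i++) {
            result.add(getMessageFromCurrentFolder(i));
        }
        return result;
    }

    public static void main(String[] args) {
        OneBasedIndexSketch folder =
                new OneBasedIndexSketch(Arrays.asList("first", "second", "third"));
        System.out.println(folder.readAll()); // prints [first, second, third]
    }
}
```

Note that the loop runs from 1 up to and including the message count, unlike the usual zero-based Java idiom.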
protected int getCurrentFolderMessageCount() throws javax.mail.MessagingException
Returns the number of messages in the current folder.
Throws:
javax.mail.MessagingException
protected javax.mail.internet.MimeMessage getMessageByURI(String url, javax.mail.Folder folder) throws javax.mail.MessagingException
Returns a MimeMessage instance based on the URI of a data object. The URI may point at a message itself, or at one of its attachments. This method is invoked by the default implementation of getDataObjectByMessageURI(String, DataSource, RDFContainerFactory, Folder, boolean). If it returns non-null, the resulting MimeMessage instance is used. If it returns null, that method reverts to the default strategy of iterating over all messages in the folder in order to find the desired one. If the underlying mail store allows a direct lookup, this may drastically improve the performance of the DataAccessor methods.
Parameters:
url
folder
Throws:
javax.mail.MessagingException
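The relationship between getMessageByURI and the default iteration strategy is a "fast path with fallback" pattern. The sketch below illustrates that pattern only: `LookupFallbackSketch`, `fastLookup`, `find` and the `mail:` URIs are hypothetical names, messages are plain strings, and nothing here is taken from the actual Aperture implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LookupFallbackSketch {

    // Hypothetical folder content: message URI -> message body, in folder order.
    final Map<String, String> folder = new LinkedHashMap<>();

    // Counts how many messages the linear scan had to inspect.
    int scanned = 0;

    // Hook in the spirit of getMessageByURI: null means "no fast path here".
    String fastLookup(String uri) {
        return null;
    }

    // Tries the hook first and falls back to iterating over all messages.
    String find(String uri) {
        String direct = fastLookup(uri);
        if (direct != null) {
            return direct;
        }
        for (Map.Entry<String, String> entry : folder.entrySet()) {
            scanned++;
            if (entry.getKey().equals(uri)) {
                return entry.getValue();
            }
        }
        return null; // not found in this folder
    }

    // Variant for a store that supports direct access by URI.
    static class Indexed extends LookupFallbackSketch {
        @Override
        String fastLookup(String uri) {
            return folder.get(uri);
        }
    }

    static void fill(LookupFallbackSketch sketch) {
        sketch.folder.put("mail:1", "hello");
        sketch.folder.put("mail:2", "status report");
        sketch.folder.put("mail:3", "invoice");
    }

    public static void main(String[] args) {
        LookupFallbackSketch plain = new LookupFallbackSketch();
        fill(plain);
        System.out.println(plain.find("mail:3") + ", scanned " + plain.scanned);

        LookupFallbackSketch indexed = new Indexed();
        fill(indexed);
        System.out.println(indexed.find("mail:3") + ", scanned " + indexed.scanned);
    }
}
```

In this sketch the plain variant inspects every message up to the match, while the indexed variant never enters the loop, which is the kind of saving the method description alludes to.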
protected void closeConnection()
Closes the Store. Implementations should ensure that this method does not invalidate any undisposed data objects that might have been obtained from the store and can still be used (for instance, the InputStreams obtained from FileDataObject.getContent() or the MimeMessages obtained from MessageDataObject.getMimeMessage() should still be readable).
public InputStream getPartStream(javax.mail.Part part) throws javax.mail.MessagingException, IOException
Description copied from interface: DataObjectFactory.PartStreamFactory
Returns an input stream with the part content. A wrapper around the Part.getInputStream() method, designed to allow for customization of the returned input stream.
Specified by:
getPartStream in interface DataObjectFactory.PartStreamFactory
Throws:
javax.mail.MessagingException
IOException
See Also:
DataObjectFactory.PartStreamFactory.getPartStream(Part)
public MessageDataObject createDataObject(URI dataObjectId, DataSource dataSource, RDFContainer metadata, javax.mail.internet.MimeMessage msg) throws javax.mail.MessagingException
Description copied from interface: DataObjectFactory.PartStreamFactory
Creates a message data object for the given parameters.
Specified by:
createDataObject in interface DataObjectFactory.PartStreamFactory
Throws:
javax.mail.MessagingException
See Also:
PartStreamFactory#createDataObject(URI, DataSource, RDFContainer, MimeMessage, ExecutorService)
public DataObject getDataObject(String url, DataSource dataSource, Map params, RDFContainerFactory containerFactory) throws UrlNotFoundException, IOException
Description copied from interface: DataAccessor
Get a DataObject for the specified url.
The resulting DataObject's ID may differ from the specified url due to normalization schemes, following of redirected URLs, etc. It is required, though, to provide a URI through which this DataAccessor can later access the same resource, i.e. the URI should also be a URL.
Specific DataAccessor implementations may accept additional parameters through the params Map, e.g. to speed up this method with ready-made data structures it can reuse. See the documentation of these implementations for information on the type of parameters they accept. However, implementations should not rely on the contents of this Map to work properly.
Specified by:
getDataObject in interface DataAccessor
Parameters:
url - The url of the requested resource.
dataSource - The DataSource to be registered as the source of the DataObject (optional).
params - Additional parameters facilitating access to the physical resource (optional).
containerFactory - An RDFContainerFactory that delivers the RDFContainer to which the metadata of the DataObject should be added. The provided RDFContainer can later be retrieved as the DataObject's metadata container.
Throws:
UrlNotFoundException - When the specified url did not point to an existing resource.
IOException - When any kind of I/O error occurs.
public DataObject getDataObjectIfModified(String url, DataSource dataSource, AccessData newAccessData, Map params, RDFContainerFactory containerFactory) throws UrlNotFoundException, IOException
Description copied from interface: DataAccessor
Get a DataObject for the specified url.
The resulting DataObject's ID may differ from the specified url due to normalization schemes, following of redirected URLs, etc. It is required, though, to provide a URI through which this DataAccessor can later access the same resource, i.e. the URI should also be a URL.
The optionally passed AccessData can be used to let the DataAccessor store information about the created DataObject. The next time it is invoked with the same URL, it can then use this information to determine whether the resource has changed or not. The DataAccessor should return null when the resource has not changed. This facilitates fast incremental crawling of DataSources. When no AccessData is specified, no change detection takes place and a DataObject is always returned.
Specific DataAccessor implementations may accept additional parameters through the params Map, e.g. to speed up this method with ready-made data structures it can reuse. See the documentation of these implementations for information on the type of parameters they accept. However, implementations should not rely on the contents of this Map to work properly.
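The "return null when unchanged" contract can be illustrated with a minimal sketch. It does not use the Aperture types: `IfModifiedSketch` and `getIfModified` are hypothetical, a plain Map stands in for AccessData, content is a string, and the content hash used as a change fingerprint is an assumption chosen for brevity.

```java
import java.util.HashMap;
import java.util.Map;

class IfModifiedSketch {

    // Hypothetical resource store: url -> current content.
    final Map<String, String> resources = new HashMap<>();

    // Returns the content if it changed since the access recorded in accessData,
    // or null if it is unchanged. A null accessData disables change detection.
    String getIfModified(String url, Map<String, Integer> accessData) {
        String content = resources.get(url);
        if (content == null) {
            throw new IllegalArgumentException("unknown url: " + url);
        }
        if (accessData == null) {
            return content; // no change detection, always return the object
        }
        Integer fingerprint = content.hashCode(); // assumed fingerprint, for illustration
        Integer previous = accessData.put(url, fingerprint);
        return fingerprint.equals(previous) ? null : content;
    }

    public static void main(String[] args) {
        IfModifiedSketch accessor = new IfModifiedSketch();
        accessor.resources.put("mail:1", "v1");
        Map<String, Integer> accessData = new HashMap<>();

        System.out.println(accessor.getIfModified("mail:1", accessData)); // first access
        System.out.println(accessor.getIfModified("mail:1", accessData)); // unchanged
        accessor.resources.put("mail:1", "v2");
        System.out.println(accessor.getIfModified("mail:1", accessData)); // changed
    }
}
```

A caller doing an incremental crawl would treat a null result as "skip this resource" rather than as an error.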
Specified by:
getDataObjectIfModified in interface DataAccessor
Parameters:
url - The url of the requested resource.
dataSource - The DataSource to be registered as the source of the DataObject (optional).
newAccessData - Any access data obtained during the previous access to this DataObject (optional).
params - Additional parameters facilitating access to the physical resource (optional).
containerFactory - An RDFContainerFactory that delivers the RDFContainer to which the metadata of the DataObject should be added. The provided RDFContainer can later be retrieved as the DataObject's metadata container.
Throws:
UrlNotFoundException - When the specified url did not point to an existing resource.
IOException - When any kind of I/O error occurs.
public Map<URI,DataObject> getAllRelatedDataObjects(String url, DataSource dataSource, Map params, RDFContainerFactory containerFactory) throws UrlNotFoundException, IOException
Throws:
UrlNotFoundException
IOException
protected Object getDataObjectOrAllObjects(String url, DataSource dataSource, AccessData newAccessData, Map params, RDFContainerFactory containerFactory, boolean all) throws UrlNotFoundException, IOException
Throws:
UrlNotFoundException
IOException
protected Object getDataObjectByMessageURI(String url, DataSource dataSource, RDFContainerFactory containerFactory, javax.mail.Folder folder, boolean all) throws javax.mail.MessagingException, UrlNotFoundException, IOException
Throws:
javax.mail.MessagingException
UrlNotFoundException
IOException
protected final void crawlFolder(javax.mail.Folder folder, int depth) throws javax.mail.MessagingException
Crawls a subfolder tree starting at the given folder, up to the given depth.
Parameters:
folder - the folder where the crawl should be started
depth - how deep the crawl should proceed
Throws:
javax.mail.MessagingException
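A depth-limited crawl of a folder tree, as crawlFolder is described, can be sketched as a simple recursion. The example below does not use the JavaMail API: `FolderNode` and `crawl` are hypothetical stand-ins, and the reading that depth 0 means "the start folder only" is an assumption that this page does not confirm.

```java
import java.util.ArrayList;
import java.util.List;

class DepthLimitedCrawlSketch {

    // Hypothetical stand-in for javax.mail.Folder: a name plus child folders.
    static final class FolderNode {
        final String name;
        final List<FolderNode> children = new ArrayList<>();

        FolderNode(String name) {
            this.name = name;
        }

        FolderNode add(FolderNode child) {
            children.add(child);
            return this;
        }
    }

    // Visits the start folder and descends at most `depth` levels below it.
    // Assumption: depth 0 means "the start folder only".
    static void crawl(FolderNode folder, int depth, List<String> visited) {
        visited.add(folder.name);
        if (depth <= 0) {
            return; // depth budget exhausted, do not descend further
        }
        for (FolderNode child : folder.children) {
            crawl(child, depth - 1, visited);
        }
    }

    public static void main(String[] args) {
        FolderNode inbox = new FolderNode("INBOX")
                .add(new FolderNode("Work").add(new FolderNode("Projects")))
                .add(new FolderNode("Private"));
        List<String> visited = new ArrayList<>();
        crawl(inbox, 1, visited);
        System.out.println(visited); // prints [INBOX, Work, Private]
    }
}
```

With a depth of 1 the nested "Projects" folder is not reached, because the budget is spent on the first level below the start folder.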
protected void crawlSingleFolder(javax.mail.Folder folder) throws javax.mail.MessagingException
Throws:
javax.mail.MessagingException
protected void crawlSubFolders(javax.mail.Folder folder, int depth)
protected void crawlMessages(javax.mail.Folder folder, URI folderUri) throws javax.mail.MessagingException
Throws:
javax.mail.MessagingException
protected void crawlSingleMessage(javax.mail.internet.MimeMessage message, String uri, URI folderUri) throws javax.mail.MessagingException, IOException
Throws:
javax.mail.MessagingException
IOException
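protected void applySpecificProcessing(DataObject object)
This method can be overridden by subclasses wishing to perform some specific processing on the data objects reported as new or modified, before they are passed to the CrawlerHandler.
Parameters:
object
protected FolderDataObject getCurrentFolderObject(DataSource dataSource, AccessData newAccessData, RDFContainerFactory containerFactory) throws javax.mail.MessagingException
Returns a DataObject for the current JavaMail folder.
Parameters:
dataSource
newAccessData
containerFactory
Throws:
javax.mail.MessagingException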
protected long getMessageUid(javax.mail.Folder folder, javax.mail.Message message) throws javax.mail.MessagingException
Returns the UID of the message.
Parameters:
folder - the folder where the message is located
message - the message whose UID we want to fetch
Throws:
javax.mail.MessagingException
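protected int getMessageCount(javax.mail.Message[] messages) throws javax.mail.MessagingException
Returns the number of non-removed messages in the given array. Removal is determined with the isRemoved(Message) method.
Parameters:
messages - the array of messages we'd like to check
Throws:
javax.mail.MessagingException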
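Counting only the non-removed messages, as getMessageCount is described, amounts to filtering the array with the removal check. The sketch below shows that idea with a hypothetical `Msg` stand-in carrying two boolean flags; how the real isRemoved(Message) reads those flags from a JavaMail message is not shown on this page and is not reproduced here.

```java
class NonRemovedCountSketch {

    // Hypothetical stand-in for javax.mail.Message with two removal flags.
    static final class Msg {
        final boolean expunged;
        final boolean deleted;

        Msg(boolean expunged, boolean deleted) {
            this.expunged = expunged;
            this.deleted = deleted;
        }
    }

    // Follows the documented idea of isRemoved: marked as expunged or deleted.
    static boolean isRemoved(Msg message) {
        return message.expunged || message.deleted;
    }

    // Counts only the messages that are not removed.
    static int getMessageCount(Msg[] messages) {
        int count = 0;
        for (Msg message : messages) {
            if (!isRemoved(message)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Msg[] messages = {
            new Msg(false, false), // live message
            new Msg(true, false),  // expunged
            new Msg(false, true),  // deleted
            new Msg(false, false)  // live message
        };
        System.out.println(getMessageCount(messages)); // prints 2
    }
}
```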
protected String getSubFoldersString(javax.mail.Folder folder) throws javax.mail.MessagingException
Returns a string with the names of the subfolders of the given folder.
Parameters:
folder
Throws:
javax.mail.MessagingException
protected void reportNotModified(String uri)
Reports the given uri as unmodified. Calls CrawlerBase.reportUnmodifiedDataObject(String) and updates data structures internal to the AbstractJavaMailCrawler.
Parameters:
uri - the uri to be reported as unmodified
protected boolean isRemoved(javax.mail.Message message) throws javax.mail.MessagingException
Returns true if the given message has been marked as expunged or deleted.
Parameters:
message - the message to check
Throws:
javax.mail.MessagingException
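protected boolean isTooLarge(javax.mail.Message message) throws javax.mail.MessagingException
Returns true if this message is larger than the maximum size defined for the current data source.
Parameters:
message - the message to check
Throws:
javax.mail.MessagingException
protected boolean isAcceptable(javax.mail.Message message) throws javax.mail.MessagingException
Returns true if this message can be crawled, according to the criteria defined for this data source.
Parameters:
message - the message to check
Throws:
javax.mail.MessagingException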
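protected boolean checkSubfoldersChanged(AccessData ad) throws javax.mail.MessagingException
Checks if the list of subfolders of the current folder has been changed in comparison with the list stored in the accessData instance.
Throws:
javax.mail.MessagingException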
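The comparison that checkSubfoldersChanged is documented to perform, the current subfolder list against the one stored in the access data, can be sketched as follows. Everything concrete here is an assumption: a plain Map stands in for AccessData, the key value "subfolders" and the comma-joined format of the subfolder string are invented for illustration, and a missing record is treated as a change.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SubfolderChangeSketch {

    // Hypothetical key value; the real SUBFOLDERS_KEY value is not shown on this page.
    static final String SUBFOLDERS_KEY = "subfolders";

    // Assumed format: subfolder names joined with commas, in listing order.
    static String subFoldersString(List<String> names) {
        return String.join(",", names);
    }

    // True if nothing was recorded yet or the recorded list differs from the current one.
    static boolean subfoldersChanged(Map<String, String> accessData, List<String> current) {
        String stored = accessData.get(SUBFOLDERS_KEY);
        return stored == null || !stored.equals(subFoldersString(current));
    }

    // Stores the current list so that the next crawl can compare against it.
    static void record(Map<String, String> accessData, List<String> current) {
        accessData.put(SUBFOLDERS_KEY, subFoldersString(current));
    }

    public static void main(String[] args) {
        Map<String, String> accessData = new HashMap<>();
        List<String> first = Arrays.asList("Work", "Private");

        System.out.println(subfoldersChanged(accessData, first)); // prints true
        record(accessData, first);
        System.out.println(subfoldersChanged(accessData, first)); // prints false
        System.out.println(subfoldersChanged(accessData,
                Arrays.asList("Work", "Private", "Archive")));   // prints true
    }
}
```

Because the sketch compares joined strings, a reordered listing would also count as a change; whether the real crawler behaves that way is not stated here.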
public static boolean holdsFolders(javax.mail.Folder folder) throws javax.mail.MessagingException
Does this folder hold any subfolders?
Parameters:
folder - the folder to be checked
Throws:
javax.mail.MessagingException - if it proves impossible to find out
public static boolean holdsMessages(javax.mail.Folder folder) throws javax.mail.MessagingException
Does this folder hold any messages?
Parameters:
folder - the folder to be checked
Throws:
javax.mail.MessagingException - if it proves impossible to find out