org.semanticdesktop.aperture.crawler.web
Class WebCrawler
java.lang.Object
org.semanticdesktop.aperture.crawler.base.CrawlerBase
org.semanticdesktop.aperture.crawler.web.WebCrawler
- All Implemented Interfaces:
- Crawler
public class WebCrawler
- extends CrawlerBase
A Crawler implementation for WebDataSources.
Because incremental crawling requires storing a large amount of information, the use of a persistent
(non-in-memory) AccessData implementation is advised. Note that the entire hypertext graph is stored in
this AccessData.
Implementation note: this WebCrawler fetches URLs one by one on a single thread. Previous
implementations used a configurable number of threads to fetch the URLs. However, it turned out that even
when running with a single thread, bandwidth was by far the biggest bottleneck for crawling websites,
rather than document processing or network latency. In other words, using multiple fetch threads gave no
performance gain but made the implementation a lot more complicated, especially because the
listeners handling the results assume that they run in a single thread. Therefore we've decided to keep
this implementation simple and single-threaded.
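The one-by-one fetching described in the implementation note can be pictured with a minimal sketch. It is an illustration only, not the WebCrawler source: the class SingleThreadedCrawlSketch, the crawl method, and the fetchLinks function (a stand-in for "download the page and extract its links") are hypothetical names that do not belong to the Aperture API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical illustration; none of these names come from Aperture.
class SingleThreadedCrawlSketch {

    /**
     * Visits every URL reachable from startUrl, one at a time, on the calling thread.
     * fetchLinks is an assumed stand-in for fetching a page and extracting its links.
     */
    static List<String> crawl(String startUrl, Function<String, List<String>> fetchLinks) {
        Deque<String> queue = new ArrayDeque<>();   // URLs waiting to be fetched
        Set<String> seen = new HashSet<>();         // URLs already queued or fetched
        List<String> visitOrder = new ArrayList<>();

        queue.add(startUrl);
        seen.add(startUrl);

        while (!queue.isEmpty()) {
            String url = queue.poll();
            // A result listener would be notified here, always on this same thread.
            visitOrder.add(url);
            for (String link : fetchLinks.apply(url)) {
                if (seen.add(link)) {               // add() returns false for known URLs
                    queue.add(link);
                }
            }
        }
        return visitOrder;
    }

    public static void main(String[] args) {
        // A tiny in-memory "website" used instead of real HTTP requests.
        Map<String, List<String>> site = Map.of(
            "http://example.org/", List.of("http://example.org/a", "http://example.org/b"),
            "http://example.org/a", List.of("http://example.org/b", "http://example.org/"),
            "http://example.org/b", List.of());
        System.out.println(crawl("http://example.org/", url -> site.getOrDefault(url, List.of())));
    }
}
```

Because every fetch and every listener notification happens inside the same loop, the listeners need no synchronization, which is the simplification the note refers to.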
Methods inherited from class org.semanticdesktop.aperture.crawler.base.CrawlerBase
clear, clear, crawl, getAccessData, getCrawlerHandler, getCrawlReport, getCrawlReportFile, getDataAccessorRegistry, getDataSource, getRDFContainerFactory, inDomain, isStopRequested, reportAccessingObject, reportDeletedDataObject, reportFatalErrorCause, reportFatalErrorCause, reportFatalErrorCause, reportModifiedDataObject, reportNewDataObject, reportUnmodifiedDataObject, reportUntouched, runSubCrawler, setAccessData, setCrawlerHandler, setCrawlReportFile, setDataAccessorRegistry, setDataSource, stop, storeCrawlReport, touchObject
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
WebCrawler
public WebCrawler()
setMimeTypeIdentifier
public void setMimeTypeIdentifier(MimeTypeIdentifier mimeTypeIdentifier)
getMimeTypeIdentifier
public MimeTypeIdentifier getMimeTypeIdentifier()
setLinkExtractorRegistry
public void setLinkExtractorRegistry(LinkExtractorRegistry linkExtractorRegistry)
getLinkExtractorRegistry
public LinkExtractorRegistry getLinkExtractorRegistry()
crawlObjects
protected ExitCode crawlObjects()
- Description copied from class:
CrawlerBase
- Method called by crawl() that should implement the actual crawling of the DataSource. The return value
of this method should indicate whether the scanning was completed successfully (i.e. it wasn't
interrupted or otherwise aborted). This method is also expected to update the deprecatedUrls set, as any
URLs remaining in this set will be reported as removed after this method completes.
- Specified by:
crawlObjects
in class CrawlerBase
- Returns:
- An ExitCode indicating how the crawl procedure terminated.
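The deprecatedUrls bookkeeping described above can be sketched as follows. This is a self-contained illustration under stated assumptions, not the CrawlerBase implementation: DeprecatedUrlsSketch, crawlPass, Result, and the Outcome enum (a stand-in for ExitCode) are hypothetical names. The choice to report nothing as removed after an interrupted pass is likewise an assumption of this sketch; the description above does not say how that case is handled.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of the bookkeeping; names are not Aperture's.
class DeprecatedUrlsSketch {

    /** Stand-in for an exit code; the real ExitCode values are not shown here. */
    enum Outcome { COMPLETED, INTERRUPTED }

    /** How one crawl pass ended and which previously known URLs were never touched. */
    static final class Result {
        final Outcome outcome;
        final Set<String> removedUrls;

        Result(Outcome outcome, Set<String> removedUrls) {
            this.outcome = outcome;
            this.removedUrls = removedUrls;
        }
    }

    /**
     * knownFromLastCrawl: URLs recorded by the previous crawl.
     * foundThisCrawl:     URLs encountered during this crawl, in order.
     * stopAfter:          simulates a stop request after that many URLs.
     */
    static Result crawlPass(Set<String> knownFromLastCrawl, List<String> foundThisCrawl, int stopAfter) {
        // Start by assuming every previously known URL has disappeared.
        Set<String> deprecatedUrls = new TreeSet<>(knownFromLastCrawl);
        int processed = 0;
        for (String url : foundThisCrawl) {
            if (processed == stopAfter) {
                // Assumption of this sketch: an interrupted pass has an incomplete
                // picture, so it reports no URL as removed.
                return new Result(Outcome.INTERRUPTED, Set.of());
            }
            deprecatedUrls.remove(url);   // the URL still exists, so it is not deprecated
            processed++;
        }
        // Whatever is left was known before but not seen in this pass.
        return new Result(Outcome.COMPLETED, deprecatedUrls);
    }

    public static void main(String[] args) {
        Result r = crawlPass(Set.of("a", "b", "c"), List.of("a", "c", "d"), Integer.MAX_VALUE);
        System.out.println(r.outcome + " removed=" + r.removedUrls);
    }
}
```

The point of the pattern is that removal is detected by elimination: a URL counts as removed only because a complete pass never touched it.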
Copyright © 2010 Aperture Development Team. All Rights Reserved.