public class WebCrawler
extends CrawlerBase
A Crawler implementation for WebDataSources.
Because a large amount of information needs to be stored for incremental crawling, the use of a
non-in-memory AccessData implementation is advised. Please note that the entire hypertext graph will be
stored in this AccessData.
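To make the incremental-crawling state concrete, the sketch below shows what an access-data store conceptually holds: per-URL bookkeeping values (such as a last-modified date) that a later crawl consults to detect unmodified, modified, and removed pages. The class and method names here are illustrative stand-ins, not Aperture's actual AccessData API.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch only: maps each crawled URL to bookkeeping values
 * so that a subsequent incremental crawl can compare them. This toy
 * version holds everything in memory; for large web crawls (where the
 * whole hypertext graph ends up in the store) a persistent, disk-backed
 * implementation is what the documentation above recommends.
 */
class InMemoryAccessDataSketch {
    private final Map<String, Map<String, String>> store = new HashMap<>();

    /** Record a key/value pair (e.g. "date" -> timestamp) for a URL. */
    public void put(String url, String key, String value) {
        store.computeIfAbsent(url, u -> new HashMap<>()).put(key, value);
    }

    /** Look up a previously stored value, or null if unknown. */
    public String get(String url, String key) {
        Map<String, String> info = store.get(url);
        return info == null ? null : info.get(key);
    }

    /** True if the URL was seen in an earlier crawl. */
    public boolean isKnown(String url) {
        return store.containsKey(url);
    }
}
```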
Implementation note: this WebCrawler fetches URLs one by one in a single-threaded manner. Previous
implementations used a configurable number of threads to fetch the URLs. However, it turned out that even
when running with a single thread, bandwidth was by far the biggest bottleneck when crawling websites,
rather than document processing or network latency. In other words, there was no performance gain in
using multiple fetch threads, while the implementation was considerably more complicated, especially
because the listeners handling the results assumed they were running in a single thread. We have therefore
decided to keep this implementation simple and single-threaded.
Methods inherited from class org.semanticdesktop.aperture.crawler.base.CrawlerBase:
clear, clear, crawl, getAccessData, getCrawlerHandler, getCrawlReport, getCrawlReportFile, getDataAccessorRegistry, getDataSource, getRDFContainerFactory, inDomain, isStopRequested, reportAccessingObject, reportDeletedDataObject, reportFatalErrorCause, reportFatalErrorCause, reportFatalErrorCause, reportModifiedDataObject, reportNewDataObject, reportUnmodifiedDataObject, reportUntouched, runSubCrawler, setAccessData, setCrawlerHandler, setCrawlReportFile, setDataAccessorRegistry, setDataSource, stop, storeCrawlReport, touchObject
Methods inherited from class java.lang.Object:
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
public void setMimeTypeIdentifier(MimeTypeIdentifier mimeTypeIdentifier)
public MimeTypeIdentifier getMimeTypeIdentifier()
public void setLinkExtractorRegistry(LinkExtractorRegistry linkExtractorRegistry)
public LinkExtractorRegistry getLinkExtractorRegistry()
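The two collaborators configured above work together during a crawl: the MimeTypeIdentifier determines a fetched document's MIME type, and the LinkExtractorRegistry supplies an extractor for links in documents of that type. The toy registry below illustrates this dispatch pattern; the interface and class names are simplified stand-ins, not Aperture's actual API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in for a link extractor: given document content,
// return the URLs it links to. (Hypothetical interface, for the sketch.)
interface LinkExtractorSketch {
    List<String> extractLinks(String content);
}

// Toy registry mapping MIME types to extractors, mirroring the idea of
// looking up an extractor per document type (not Aperture's real interface).
class LinkExtractorRegistrySketch {
    private final Map<String, LinkExtractorSketch> extractors = new HashMap<>();

    public void register(String mimeType, LinkExtractorSketch extractor) {
        extractors.put(mimeType, extractor);
    }

    // Returns null when no extractor is known for the type, in which case
    // a crawler would simply not follow links from that document.
    public LinkExtractorSketch get(String mimeType) {
        return extractors.get(mimeType);
    }
}
```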
protected ExitCode crawlObjects()
- Description copied from class: CrawlerBase
- Method called by crawl() that should implement the actual crawling of the DataSource. The return value
of this method should indicate whether the scanning completed successfully (i.e. it was not
interrupted). This method is also expected to update the deprecatedUrls set, as any URLs
remaining in this set will be reported as removed after this method completes.
- Specified by:
crawlObjects in class CrawlerBase
- Returns:
an ExitCode indicating how the crawl procedure terminated.
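The contract described above can be sketched with a toy base class: the subclass's crawlObjects() removes every URL it re-encounters from deprecatedUrls, so that whatever remains afterwards can be treated as removed. The classes below (including the ExitCode enum) are simplified stand-ins mirroring the description, not Aperture's actual CrawlerBase.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified stand-in for an exit code, for illustration only.
enum ExitCodeSketch { COMPLETED, STOP_REQUESTED, FATAL_ERROR }

// Toy base class modeling the crawlObjects() contract described above.
abstract class CrawlerBaseSketch {
    // URLs seen in a previous crawl; crawlObjects() must remove every URL it
    // re-encounters, so the leftovers can be reported as removed afterwards.
    protected final Set<String> deprecatedUrls = new HashSet<>();

    protected abstract ExitCodeSketch crawlObjects();

    /** Runs the crawl and returns the URLs that should be reported as removed. */
    public Set<String> crawl() {
        ExitCodeSketch code = crawlObjects();
        // Only a completed crawl justifies treating leftovers as removed.
        return code == ExitCodeSketch.COMPLETED
                ? new HashSet<>(deprecatedUrls)
                : new HashSet<>();
    }
}

class ToyWebCrawler extends CrawlerBaseSketch {
    private final Set<String> reachableUrls;

    ToyWebCrawler(Set<String> previousUrls, Set<String> reachableUrls) {
        this.deprecatedUrls.addAll(previousUrls);
        this.reachableUrls = reachableUrls;
    }

    @Override
    protected ExitCodeSketch crawlObjects() {
        // "Fetch" each reachable URL and drop it from the deprecated set.
        for (String url : reachableUrls) {
            deprecatedUrls.remove(url);
        }
        return ExitCodeSketch.COMPLETED;
    }
}
```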
Copyright © 2010 Aperture Development Team. All Rights Reserved.