Class WebCrawler

java.lang.Object
  extended by org.semanticdesktop.aperture.crawler.base.CrawlerBase
      extended by org.semanticdesktop.aperture.crawler.web.WebCrawler
public class WebCrawler
extends CrawlerBase

A Crawler implementation for WebDataSources.

Due to the large amount of information that must be stored for incremental crawling, using a non-in-memory AccessData implementation is advised. Note that the entire hypertext graph will be stored in this AccessData.

Implementation note: this WebCrawler fetches URLs one-by-one in a single-threaded manner. Previous implementations used a configurable number of threads to fetch URLs. However, it turned out that even when running with a single thread, bandwidth, rather than document processing or network latency, was by far the biggest bottleneck when crawling websites. In other words, there was no performance gain in using multiple fetch threads, while the implementation was considerably more complicated, especially because the listeners handling the results assumed they were running in a single thread. We therefore decided to keep this implementation simple and single-threaded.
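The single-threaded strategy described above can be sketched as a plain frontier-queue loop. This is a schematic illustration, not the actual Aperture code: the LinkSource interface and fetchLinks method are hypothetical stand-ins for the fetch-and-extract-links step that WebCrawler performs per URL.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class SingleThreadedCrawlSketch {

    // Hypothetical stand-in for the fetch + link-extraction step.
    public interface LinkSource {
        List<String> fetchLinks(String url);
    }

    // One queue of URLs to visit, one set of URLs already seen,
    // fetched strictly one at a time on the calling thread.
    public static Set<String> crawl(String seed, LinkSource source) {
        Set<String> visited = new HashSet<>();
        Queue<String> queue = new ArrayDeque<>();
        queue.add(seed);
        while (!queue.isEmpty()) {
            String url = queue.remove();
            if (!visited.add(url)) {
                continue; // already processed
            }
            // Single-threaded fetch: bandwidth, not thread count,
            // is the bottleneck, so no worker pool is used.
            for (String link : source.fetchLinks(url)) {
                if (!visited.contains(link)) {
                    queue.add(link);
                }
            }
        }
        return visited;
    }
}
```

Because everything runs on one thread, result listeners need no synchronization, which is exactly the simplification the implementation note motivates.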

Field Summary
Fields inherited from class org.semanticdesktop.aperture.crawler.base.CrawlerBase
accessData, accessorRegistry, crawlReportFile, source, stopRequested
Constructor Summary
WebCrawler()
Method Summary
protected  ExitCode crawlObjects()
          Method called by crawl() that should implement the actual crawling of the DataSource.
 LinkExtractorRegistry getLinkExtractorRegistry()
 MimeTypeIdentifier getMimeTypeIdentifier()
 void setLinkExtractorRegistry(LinkExtractorRegistry linkExtractorRegistry)
 void setMimeTypeIdentifier(MimeTypeIdentifier mimeTypeIdentifier)
Methods inherited from class org.semanticdesktop.aperture.crawler.base.CrawlerBase
clear, clear, crawl, getAccessData, getCrawlerHandler, getCrawlReport, getCrawlReportFile, getDataAccessorRegistry, getDataSource, getRDFContainerFactory, inDomain, isStopRequested, reportAccessingObject, reportDeletedDataObject, reportFatalErrorCause, reportFatalErrorCause, reportFatalErrorCause, reportModifiedDataObject, reportNewDataObject, reportUnmodifiedDataObject, reportUntouched, runSubCrawler, setAccessData, setCrawlerHandler, setCrawlReportFile, setDataAccessorRegistry, setDataSource, stop, storeCrawlReport, touchObject
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail


public WebCrawler()
Method Detail


public void setMimeTypeIdentifier(MimeTypeIdentifier mimeTypeIdentifier)


public MimeTypeIdentifier getMimeTypeIdentifier()


public void setLinkExtractorRegistry(LinkExtractorRegistry linkExtractorRegistry)


public LinkExtractorRegistry getLinkExtractorRegistry()


protected ExitCode crawlObjects()
Description copied from class: CrawlerBase
Method called by crawl() that should implement the actual crawling of the DataSource. The return value of this method should indicate whether the scanning completed successfully (i.e., it was not interrupted). This method is also expected to update the deprecatedUrls set, as any URLs remaining in this set will be reported as removed after this method completes.

Specified by:
crawlObjects in class CrawlerBase
Returns:
An ExitCode indicating how the crawl procedure terminated.
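The crawlObjects() contract can be illustrated schematically: the crawler removes every URL it actually touches from the deprecatedUrls set, so whatever remains afterwards is reported as removed. This is not Aperture source; the simplified ExitCode enum and the method parameters here are assumptions for illustration only.

```java
import java.util.Set;

public class CrawlObjectsContractSketch {

    // Simplified stand-in for Aperture's ExitCode.
    public enum ExitCode { COMPLETED, STOP_REQUESTED }

    // deprecatedUrls starts out holding everything seen in the previous
    // crawl; each URL encountered in the current crawl is removed from it.
    public static ExitCode crawlObjects(Set<String> deprecatedUrls,
                                        Iterable<String> currentUrls,
                                        boolean stopRequested) {
        for (String url : currentUrls) {
            if (stopRequested) {
                return ExitCode.STOP_REQUESTED; // crawl was interrupted
            }
            deprecatedUrls.remove(url); // still present, so not deprecated
        }
        // Any URLs left in deprecatedUrls are reported as removed
        // by the caller once this method returns.
        return ExitCode.COMPLETED;
    }
}
```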

Copyright © 2010 Aperture Development Team. All Rights Reserved.