org.semanticdesktop.aperture.crawler.web
Class WebCrawler

java.lang.Object
  extended by org.semanticdesktop.aperture.crawler.base.CrawlerBase
      extended by org.semanticdesktop.aperture.crawler.web.WebCrawler
All Implemented Interfaces:
Crawler

public class WebCrawler
extends CrawlerBase

A Crawler implementation for WebDataSources.

Because incremental crawling requires storing a large amount of information, the use of a non-in-memory AccessData implementation is advised. Please note that the entire hypertext graph will be stored in this AccessData.

Implementation note: this WebCrawler fetches URLs one by one in a single-threaded manner. Previous implementations used a configurable number of threads to fetch the URLs. However, it turned out that even when running with a single thread, bandwidth, rather than document processing or network latency, was by far the biggest bottleneck when crawling websites. In other words, using multiple fetch threads yielded no performance gain while making the implementation considerably more complicated, especially because the listeners handling the results assumed they were running in a single thread. We have therefore kept this implementation simple and single-threaded.
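The single-threaded design described above can be pictured as a plain FIFO frontier processed one URL at a time. The following is a minimal, self-contained sketch of that loop, not Aperture's actual code: the toy in-memory "web", the class name, and the visited set (standing in for AccessData) are all hypothetical.

```java
import java.util.*;

// Minimal single-threaded crawl loop sketch (hypothetical, not Aperture's code):
// a FIFO frontier of URLs, a "seen" set playing the role of AccessData, and a
// link-extraction step per fetched document.
public class SingleThreadedCrawlSketch {

    // Toy "web": each URL maps to the links extracted from its document.
    static final Map<String, List<String>> WEB = Map.of(
        "http://example.org/",  List.of("http://example.org/a", "http://example.org/b"),
        "http://example.org/a", List.of("http://example.org/b"),
        "http://example.org/b", List.of("http://example.org/")
    );

    // Crawl breadth-first from a seed, fetching one URL at a time;
    // returns the order in which URLs were visited.
    static List<String> crawl(String seed) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();         // stands in for AccessData
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();           // fetch next URL (single thread)
            visited.add(url);
            for (String link : WEB.getOrDefault(url, List.of())) {
                if (seen.add(link)) {               // schedule only unseen URLs
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(crawl("http://example.org/"));
    }
}
```

Because each fetch is network-bound, the loop above spends most of its time waiting on bandwidth, which is why adding fetch threads, as the note explains, did not pay off in practice.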


Field Summary
 
Fields inherited from class org.semanticdesktop.aperture.crawler.base.CrawlerBase
accessData, accessorRegistry, crawlReportFile, source, stopRequested
 
Constructor Summary
WebCrawler()
           
 
Method Summary
protected  ExitCode crawlObjects()
          Method called by crawl() that should implement the actual crawling of the DataSource.
 LinkExtractorRegistry getLinkExtractorRegistry()
           
 MimeTypeIdentifier getMimeTypeIdentifier()
           
 void setLinkExtractorRegistry(LinkExtractorRegistry linkExtractorRegistry)
           
 void setMimeTypeIdentifier(MimeTypeIdentifier mimeTypeIdentifier)
           
 
Methods inherited from class org.semanticdesktop.aperture.crawler.base.CrawlerBase
clear, clear, crawl, getAccessData, getCrawlerHandler, getCrawlReport, getCrawlReportFile, getDataAccessorRegistry, getDataSource, getRDFContainerFactory, inDomain, isStopRequested, reportAccessingObject, reportDeletedDataObject, reportFatalErrorCause, reportFatalErrorCause, reportFatalErrorCause, reportModifiedDataObject, reportNewDataObject, reportUnmodifiedDataObject, reportUntouched, runSubCrawler, setAccessData, setCrawlerHandler, setCrawlReportFile, setDataAccessorRegistry, setDataSource, stop, storeCrawlReport, touchObject
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WebCrawler

public WebCrawler()
Method Detail

setMimeTypeIdentifier

public void setMimeTypeIdentifier(MimeTypeIdentifier mimeTypeIdentifier)

getMimeTypeIdentifier

public MimeTypeIdentifier getMimeTypeIdentifier()

setLinkExtractorRegistry

public void setLinkExtractorRegistry(LinkExtractorRegistry linkExtractorRegistry)

getLinkExtractorRegistry

public LinkExtractorRegistry getLinkExtractorRegistry()

crawlObjects

protected ExitCode crawlObjects()
Description copied from class: CrawlerBase
Method called by crawl() that should implement the actual crawling of the DataSource. The return value of this method should indicate whether the scan completed successfully (i.e., it was not interrupted). This method is also expected to update the deprecatedUrls set, as any URLs remaining in that set will be reported as removed after this method completes.

Specified by:
crawlObjects in class CrawlerBase
Returns:
An ExitCode indicating how the crawl procedure terminated.
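To make the crawlObjects() contract concrete, here is a self-contained sketch of its shape using stand-in types. The ExitCode values, the deprecatedUrls set, and the stopRequested flag are modeled on the description above; the class name and method bodies are hypothetical, not Aperture's actual implementation.

```java
import java.util.*;

// Sketch of the crawlObjects() contract (stand-in types, not Aperture's API):
// clear still-reachable URLs out of deprecatedUrls while crawling, and signal
// via the exit code whether the scan ran to completion.
public class CrawlObjectsSketch {

    enum ExitCode { COMPLETED, STOP_REQUESTED }

    // URLs known from a previous crawl; anything left here afterwards
    // is reported as removed.
    final Set<String> deprecatedUrls = new HashSet<>();
    volatile boolean stopRequested = false;

    ExitCode crawlObjects(List<String> currentUrls) {
        for (String url : currentUrls) {
            if (stopRequested) {
                return ExitCode.STOP_REQUESTED; // interrupted scan
            }
            deprecatedUrls.remove(url);         // still reachable, so not deprecated
            // ... fetch url, extract links, report new/modified/unmodified objects ...
        }
        return ExitCode.COMPLETED;              // leftovers in deprecatedUrls get removed
    }
}
```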


Copyright © 2010 Aperture Development Team. All Rights Reserved.