org.semanticdesktop.aperture.crawler.web
Class WebCrawler

java.lang.Object
  extended by org.semanticdesktop.aperture.crawler.base.CrawlerBase
      extended by org.semanticdesktop.aperture.crawler.web.WebCrawler
All Implemented Interfaces:
Crawler

public class WebCrawler
extends CrawlerBase

A Crawler implementation for WebDataSources.

Because incremental crawling requires storing a large amount of information, the use of a non-in-memory AccessData implementation is advised. Please note that the entire hypertext graph will be stored in this AccessData.

Implementation note: this WebCrawler fetches URLs one by one in a single-threaded manner. Previous implementations used a configurable number of threads to fetch the URLs. However, it turned out that even when running with a single thread, bandwidth, rather than document processing or network latency, was by far the biggest bottleneck when crawling websites. In other words, using multiple fetch threads yielded no performance gain while making the implementation considerably more complicated, especially because the listeners handling the results assumed they were running in a single thread. We have therefore kept this implementation simple and single-threaded.
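The single-threaded design described above can be pictured as a plain FIFO frontier processed one URL at a time. The following is a minimal, self-contained sketch of that loop, not Aperture's actual code: the toy in-memory "web", the class name, and the visited set (standing in for AccessData) are all hypothetical.

```java
import java.util.*;

// Minimal single-threaded crawl loop sketch (hypothetical, not Aperture's code):
// a FIFO frontier of URLs, a "seen" set playing the role of AccessData, and a
// link-extraction step per fetched document.
public class SingleThreadedCrawlSketch {

    // Toy "web": each URL maps to the links extracted from its document.
    static final Map<String, List<String>> WEB = Map.of(
        "http://example.org/",  List.of("http://example.org/a", "http://example.org/b"),
        "http://example.org/a", List.of("http://example.org/b"),
        "http://example.org/b", List.of("http://example.org/")
    );

    // Crawl breadth-first from a seed, fetching one URL at a time;
    // returns the order in which URLs were visited.
    static List<String> crawl(String seed) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();         // stands in for AccessData
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();           // fetch next URL (single thread)
            visited.add(url);
            for (String link : WEB.getOrDefault(url, List.of())) {
                if (seen.add(link)) {               // schedule only unseen URLs
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(crawl("http://example.org/"));
    }
}
```

Because each fetch is network-bound, the loop above spends most of its time waiting on bandwidth, which is why adding fetch threads, as the note explains, did not pay off in practice.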


Field Summary
 
Fields inherited from class org.semanticdesktop.aperture.crawler.base.CrawlerBase
accessData, accessorRegistry, crawlReportFile, source, stopRequested
 
Constructor Summary
WebCrawler()
           
 
Method Summary
protected  ExitCode crawlObjects()
          Method called by crawl() that should implement the actual crawling of the DataSource.
 LinkExtractorRegistry getLinkExtractorRegistry()
           
 MimeTypeIdentifier getMimeTypeIdentifier()
           
 void setLinkExtractorRegistry(LinkExtractorRegistry linkExtractorRegistry)
           
 void setMimeTypeIdentifier(MimeTypeIdentifier mimeTypeIdentifier)
           
 
Methods inherited from class org.semanticdesktop.aperture.crawler.base.CrawlerBase
clear, clear, crawl, getAccessData, getCrawlerHandler, getCrawlReport, getCrawlReportFile, getDataAccessorRegistry, getDataSource, getRDFContainerFactory, inDomain, isStopRequested, reportAccessingObject, reportDeletedDataObject, reportFatalErrorCause, reportFatalErrorCause, reportFatalErrorCause, reportModifiedDataObject, reportNewDataObject, reportUnmodifiedDataObject, reportUntouched, runSubCrawler, setAccessData, setCrawlerHandler, setCrawlReportFile, setDataAccessorRegistry, setDataSource, stop, storeCrawlReport, touchObject
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WebCrawler

public WebCrawler()
Method Detail

setMimeTypeIdentifier

public void setMimeTypeIdentifier(MimeTypeIdentifier mimeTypeIdentifier)

getMimeTypeIdentifier

public MimeTypeIdentifier getMimeTypeIdentifier()

setLinkExtractorRegistry

public void setLinkExtractorRegistry(LinkExtractorRegistry linkExtractorRegistry)

getLinkExtractorRegistry

public LinkExtractorRegistry getLinkExtractorRegistry()

crawlObjects

protected ExitCode crawlObjects()
Description copied from class: CrawlerBase
Method called by crawl() that should implement the actual crawling of the DataSource. The return value of this method should indicate whether the scan completed successfully (i.e., it was not interrupted). This method is also expected to update the deprecatedUrls set, as any URLs remaining in that set will be reported as removed after this method completes.

Specified by:
crawlObjects in class CrawlerBase
Returns:
An ExitCode indicating how the crawl procedure terminated.
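To make the crawlObjects() contract concrete, here is a self-contained sketch of its shape using stand-in types. The ExitCode values, the deprecatedUrls set, and the stopRequested flag are modeled on the description above; the class name and method bodies are hypothetical, not Aperture's actual implementation.

```java
import java.util.*;

// Sketch of the crawlObjects() contract (stand-in types, not Aperture's API):
// clear still-reachable URLs out of deprecatedUrls while crawling, and signal
// via the exit code whether the scan ran to completion.
public class CrawlObjectsSketch {

    enum ExitCode { COMPLETED, STOP_REQUESTED }

    // URLs known from a previous crawl; anything left here afterwards
    // is reported as removed.
    final Set<String> deprecatedUrls = new HashSet<>();
    volatile boolean stopRequested = false;

    ExitCode crawlObjects(List<String> currentUrls) {
        for (String url : currentUrls) {
            if (stopRequested) {
                return ExitCode.STOP_REQUESTED; // interrupted scan
            }
            deprecatedUrls.remove(url);         // still reachable, so not deprecated
            // ... fetch url, extract links, report new/modified/unmodified objects ...
        }
        return ExitCode.COMPLETED;              // leftovers in deprecatedUrls get removed
    }
}
```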


Copyright © 2010 Aperture Development Team. All Rights Reserved.