Java Doc for AbstractFrontier.java in Web Crawler » heritrix » org.archive.crawler.frontier

org.archive.crawler.settings.ModuleType
   org.archive.crawler.frontier.AbstractFrontier

All known Subclasses:   org.archive.crawler.frontier.WorkQueueFrontier

AbstractFrontier
abstract public class AbstractFrontier extends ModuleType implements CrawlStatusListener, Frontier, FetchStatusCodes, CoreAttributeConstants, Serializable(Code)
Shared facilities for Frontier implementations.
author:
   gojomo

Field Summary
final protected static String ACCEPTABLE_FORCE_QUEUE

final public static String ATTR_DELAY_FACTOR

final public static String ATTR_FORCE_QUEUE

final public static String ATTR_MAX_DELAY

final public static String ATTR_MAX_HOST_BANDWIDTH_USAGE

final public static String ATTR_MAX_OVERALL_BANDWIDTH_USAGE

final public static String ATTR_MAX_RETRIES

final public static String ATTR_MIN_DELAY

final public static String ATTR_PAUSE_AT_FINISH

final public static String ATTR_PAUSE_AT_START

final public static String ATTR_PREFERENCE_EMBED_HOPS

final public static String ATTR_QUEUE_ASSIGNMENT_POLICY

final protected static String ATTR_RECOVERY_ENABLED
     Attribute that turns the recover log on or off.
final public static String ATTR_RETRY_DELAY

final public static String ATTR_SOURCE_TAG_SEEDS

final protected static Boolean DEFAULT_ATTR_RECOVERY_ENABLED

final protected static Float DEFAULT_DELAY_FACTOR

final protected static String DEFAULT_FORCE_QUEUE

final protected static Integer DEFAULT_MAX_DELAY

final protected static Integer DEFAULT_MAX_HOST_BANDWIDTH_USAGE

final protected static Integer DEFAULT_MAX_OVERALL_BANDWIDTH_USAGE

final protected static Integer DEFAULT_MAX_RETRIES

final protected static Integer DEFAULT_MIN_DELAY

final protected static Boolean DEFAULT_PAUSE_AT_FINISH

final protected static Boolean DEFAULT_PAUSE_AT_START

final protected static Integer DEFAULT_PREFERENCE_EMBED_HOPS

final protected static Long DEFAULT_RETRY_DELAY

final protected static Boolean DEFAULT_SOURCE_TAG_SEEDS

final public static String IGNORED_SEEDS_FILENAME

protected transient CrawlController controller

protected long disregardedUriCount

protected long failedFetchCount

protected int lastMaxBandwidthKB

protected AtomicLong nextOrdinal

protected long processedBytesAfterLastEmittedURI

protected transient QueueAssignmentPolicy queueAssignmentPolicy

protected long queuedUriCount

protected boolean shouldPause

protected transient boolean shouldTerminate

protected long succeededFetchCount

protected long totalProcessedBytes
     Used when bandwidth constraints are in effect.

Constructor Summary
public AbstractFrontier(String name, String description)


Method Summary
protected void applySpecialHandling(CrawlURI curi)
     Perform any special handling of the CrawlURI, such as promoting its URI to seed-status, or preferencing it because it is an embed.
protected CrawlURI asCrawlUri(CandidateURI caUri)

protected String canonicalize(UURI uuri)
     Canonicalize passed uuri.
protected String canonicalize(CandidateURI cauri)
     Canonicalize passed CandidateURI.
public void crawlCheckpoint(File checkpointDir)

public void crawlEnded(String sExitMessage)

public void crawlEnding(String sExitMessage)

public void crawlPaused(String statusMessage)

public void crawlPausing(String statusMessage)

public void crawlResuming(String statusMessage)

public void crawlStarted(String message)

protected synchronized void decrementQueuedCount(long numberOfDeletes)
     Note that a number of queued URIs have been deleted.
public long disregardedUriCount()

protected void doJournalAdded(CrawlURI c)

protected void doJournalEmitted(CrawlURI c)

protected void doJournalFinishedFailure(CrawlURI c)

protected void doJournalFinishedSuccess(CrawlURI c)

protected void doJournalRescheduled(CrawlURI c)

public long failedFetchCount()

public long finishedUriCount()

public String getClassKey(CandidateURI cauri)

Parameters:
  cauri - CandidateURI to get a queue key for.
public FrontierJournal getFrontierJournal()
     RecoveryJournal instance.
protected CrawlServer getServer(CrawlURI curi)

public void importRecoverLog(String pathToLog, boolean retainFailures)

protected synchronized void incrementDisregardedUriCount()
     Increment the running count of disregarded URIs.
protected synchronized void incrementFailedFetchCount()
     Increment the running count of failed URIs.
protected synchronized void incrementQueuedUriCount()
     Increment the running count of queued URIs.
protected synchronized void incrementQueuedUriCount(long increment)
     Increment the running count of queued URIs.
protected synchronized void incrementSucceededFetchCount()
     Increment the running count of successfully fetched URIs.
public void initialize(CrawlController c)

protected boolean isDisregarded(CrawlURI curi)

public synchronized boolean isEmpty()

public void kickUpdate()

public void loadSeeds()
     Load up the seeds.
protected void log(CrawlURI curi)

protected void logLocalizedErrors(CrawlURI curi)
     Take note of any processor-local errors that have been entered into the CrawlURI.
protected boolean needsRetrying(CrawlURI curi)

protected void noteAboutToEmit(CrawlURI curi, WorkQueue q)
     Perform fixups on a CrawlURI about to be returned via next().
protected boolean overMaxRetries(CrawlURI curi)

public synchronized void pause()

protected long politenessDelayFor(CrawlURI curi)
     Update any scheduling structures with the new information in this CrawlURI.
protected synchronized void preNext(long now)

public long queuedUriCount()

public void reportTo(PrintWriter writer)

protected long retryDelayFor(CrawlURI curi)
     Return a suitable value to wait before retrying the given URI.
public static void saveIgnoredItems(String ignoredItems, File dir)
     Dump ignored seed items (if any) to disk; delete file otherwise.
protected File scratchDirFor(String key)
     Utility method to return a scratch dir for the given key's temp files. Every key gets its own subdir.
public String singleLineReport()

public void start()

public long succeededFetchCount()

public synchronized void terminate()

public long totalBytesWritten()

public synchronized void unpause()


Field Detail
ACCEPTABLE_FORCE_QUEUE
final protected static String ACCEPTABLE_FORCE_QUEUE(Code)

ATTR_DELAY_FACTOR
final public static String ATTR_DELAY_FACTOR(Code)
how many multiples of last fetch elapsed time to wait before recontacting same server

ATTR_FORCE_QUEUE
final public static String ATTR_FORCE_QUEUE(Code)
queue assignment to force onto CrawlURIs; intended to be overridden

ATTR_MAX_DELAY
final public static String ATTR_MAX_DELAY(Code)
never wait more than this long, regardless of multiple

ATTR_MAX_HOST_BANDWIDTH_USAGE
final public static String ATTR_MAX_HOST_BANDWIDTH_USAGE(Code)
maximum per-host bandwidth usage

ATTR_MAX_OVERALL_BANDWIDTH_USAGE
final public static String ATTR_MAX_OVERALL_BANDWIDTH_USAGE(Code)
maximum overall bandwidth usage

ATTR_MAX_RETRIES
final public static String ATTR_MAX_RETRIES(Code)
maximum times to emit a CrawlURI without final disposition

ATTR_MIN_DELAY
final public static String ATTR_MIN_DELAY(Code)
always wait this long after one completion before recontacting same server, regardless of multiple

ATTR_PAUSE_AT_FINISH
final public static String ATTR_PAUSE_AT_FINISH(Code)
whether to pause, rather than finish, when crawl appears done

ATTR_PAUSE_AT_START
final public static String ATTR_PAUSE_AT_START(Code)
whether to pause at crawl start

ATTR_PREFERENCE_EMBED_HOPS
final public static String ATTR_PREFERENCE_EMBED_HOPS(Code)
number of hops of embeds (ERX) to bump to front of host queue

ATTR_QUEUE_ASSIGNMENT_POLICY
final public static String ATTR_QUEUE_ASSIGNMENT_POLICY(Code)

ATTR_RECOVERY_ENABLED
final protected static String ATTR_RECOVERY_ENABLED(Code)
Attribute that turns the recover log on or off.

ATTR_RETRY_DELAY
final public static String ATTR_RETRY_DELAY(Code)
for retryable problems, seconds to wait before a retry

ATTR_SOURCE_TAG_SEEDS
final public static String ATTR_SOURCE_TAG_SEEDS(Code)
whether to tag seeds with their own URI as a heritable 'source' string

DEFAULT_ATTR_RECOVERY_ENABLED
final protected static Boolean DEFAULT_ATTR_RECOVERY_ENABLED(Code)

DEFAULT_DELAY_FACTOR
final protected static Float DEFAULT_DELAY_FACTOR(Code)

DEFAULT_FORCE_QUEUE
final protected static String DEFAULT_FORCE_QUEUE(Code)

DEFAULT_MAX_DELAY
final protected static Integer DEFAULT_MAX_DELAY(Code)

DEFAULT_MAX_HOST_BANDWIDTH_USAGE
final protected static Integer DEFAULT_MAX_HOST_BANDWIDTH_USAGE(Code)

DEFAULT_MAX_OVERALL_BANDWIDTH_USAGE
final protected static Integer DEFAULT_MAX_OVERALL_BANDWIDTH_USAGE(Code)

DEFAULT_MAX_RETRIES
final protected static Integer DEFAULT_MAX_RETRIES(Code)

DEFAULT_MIN_DELAY
final protected static Integer DEFAULT_MIN_DELAY(Code)

DEFAULT_PAUSE_AT_FINISH
final protected static Boolean DEFAULT_PAUSE_AT_FINISH(Code)

DEFAULT_PAUSE_AT_START
final protected static Boolean DEFAULT_PAUSE_AT_START(Code)

DEFAULT_PREFERENCE_EMBED_HOPS
final protected static Integer DEFAULT_PREFERENCE_EMBED_HOPS(Code)

DEFAULT_RETRY_DELAY
final protected static Long DEFAULT_RETRY_DELAY(Code)

DEFAULT_SOURCE_TAG_SEEDS
final protected static Boolean DEFAULT_SOURCE_TAG_SEEDS(Code)

IGNORED_SEEDS_FILENAME
final public static String IGNORED_SEEDS_FILENAME(Code)
file collecting report of ignored seed-file entries (if any)

controller
protected transient CrawlController controller(Code)

disregardedUriCount
protected long disregardedUriCount(Code)

failedFetchCount
protected long failedFetchCount(Code)

lastMaxBandwidthKB
protected int lastMaxBandwidthKB(Code)

nextOrdinal
protected AtomicLong nextOrdinal(Code)
ordinal numbers to assign to created CrawlURIs

processedBytesAfterLastEmittedURI
protected long processedBytesAfterLastEmittedURI(Code)

queueAssignmentPolicy
protected transient QueueAssignmentPolicy queueAssignmentPolicy(Code)
Policy for assigning CrawlURIs to named queues

queuedUriCount
protected long queuedUriCount(Code)

shouldPause
protected boolean shouldPause(Code)
should the frontier hold any threads asking for URIs?

shouldTerminate
protected transient boolean shouldTerminate(Code)
should the frontier send an EndedException to any threads asking for URIs?

succeededFetchCount
protected long succeededFetchCount(Code)

totalProcessedBytes
protected long totalProcessedBytes(Code)
Used when bandwidth constraints are in effect.

Constructor Detail
AbstractFrontier
public AbstractFrontier(String name, String description)(Code)

Parameters:
  name - Name of this frontier.
  description - Description for this frontier.

Method Detail
applySpecialHandling
protected void applySpecialHandling(CrawlURI curi)(Code)
Perform any special handling of the CrawlURI, such as promoting its URI to seed-status, or preferencing it because it is an embed.
Parameters:
  curi -

asCrawlUri
protected CrawlURI asCrawlUri(CandidateURI caUri)(Code)

canonicalize
protected String canonicalize(UURI uuri)(Code)
Canonicalize passed uuri. It would be sweeter if this canonicalize function was encapsulated by that which it canonicalizes, but because settings change with context (i.e. there may be overrides in operation for a particular URI) it's not so easy; each CandidateURI would need a reference to the settings system. That's awkward to pass in.
Parameters:
  uuri - Candidate URI to canonicalize.
Returns:
  Canonicalized version of passed uuri.

canonicalize
protected String canonicalize(CandidateURI cauri)(Code)
Canonicalize passed CandidateURI. This method differs from AbstractFrontier.canonicalize(UURI) in that it takes a look at the CandidateURI context, possibly overriding any canonicalization effect if it could make us miss content. If canonicalization produces a URL that was 'alreadyseen', but the entry in the 'alreadyseen' database did nothing but redirect to the current URL, we won't get the current URL; we'll think we've already seen it. Examples would be archive.org redirecting to www.archive.org or the inverse, www.netarkivet.net redirecting to netarkivet.net (assuming stripWWW rule enabled).
Note: under some circumstances this method sets the forceFetch flag.
Parameters:
  cauri - CandidateURI to examine.
Returns:
  Canonicalized cauri.
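
The redirect case described above can be made concrete with a small sketch. It is not the Heritrix code: the class, the string-based URIs, the `isRedirect` flag and the www-stripping rule are all assumptions chosen for illustration. The idea shown is that a URI reached by redirect whose canonical form equals the canonical form of the URI it was redirected from would look 'alreadyseen', so it is marked for a forced fetch.

```java
/** Illustrative sketch only; names and logic are assumptions, not Heritrix's implementation. */
final class CanonicalizeSketch {

    /** Hypothetical stand-in for a configured stripWWW rule: drops a leading "www." from the host. */
    static String canonicalize(String uri) {
        return uri.replaceFirst("^(https?://)www\\.", "$1");
    }

    /**
     * Decide whether a URI should be fetched even though its canonical form
     * may already be recorded as seen.
     *
     * @param uri        the URI being considered
     * @param via        the URI that led to it
     * @param isRedirect true if uri was discovered as a redirect target of via
     */
    static boolean shouldForceFetch(String uri, String via, boolean isRedirect) {
        return isRedirect
                && !uri.equals(via)                               // not a redirect back to the same page
                && canonicalize(via).equals(canonicalize(uri));   // both collapse to one canonical form
    }
}
```

With these assumptions, a redirect from `http://archive.org/` to `http://www.archive.org/` is force-fetched, while a redirect to a different page on the same host is left to the normal already-seen check.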

crawlCheckpoint
public void crawlCheckpoint(File checkpointDir) throws Exception(Code)

crawlEnded
public void crawlEnded(String sExitMessage)(Code)

crawlEnding
public void crawlEnding(String sExitMessage)(Code)

crawlPaused
public void crawlPaused(String statusMessage)(Code)

crawlPausing
public void crawlPausing(String statusMessage)(Code)

crawlResuming
public void crawlResuming(String statusMessage)(Code)

crawlStarted
public void crawlStarted(String message)(Code)

decrementQueuedCount
protected synchronized void decrementQueuedCount(long numberOfDeletes)(Code)
Note that a number of queued URIs have been deleted.
Parameters:
  numberOfDeletes -

disregardedUriCount
public long disregardedUriCount()(Code)

doJournalAdded
protected void doJournalAdded(CrawlURI c)(Code)

doJournalEmitted
protected void doJournalEmitted(CrawlURI c)(Code)

doJournalFinishedFailure
protected void doJournalFinishedFailure(CrawlURI c)(Code)

doJournalFinishedSuccess
protected void doJournalFinishedSuccess(CrawlURI c)(Code)

doJournalRescheduled
protected void doJournalRescheduled(CrawlURI c)(Code)

failedFetchCount
public long failedFetchCount()(Code)
See Also:   org.archive.crawler.framework.Frontier.failedFetchCount

finishedUriCount
public long finishedUriCount()(Code)
See Also:   org.archive.crawler.framework.Frontier.finishedUriCount

getClassKey
public String getClassKey(CandidateURI cauri)(Code)

Parameters:
  cauri - CandidateURI to get a queue key for.
Returns:
  a String token representing a queue

getFrontierJournal
public FrontierJournal getFrontierJournal()(Code)
Returns:
  RecoveryJournal instance. May be null.

getServer
protected CrawlServer getServer(CrawlURI curi)(Code)

Parameters:
  curi -
Returns:
  the CrawlServer to be associated with this CrawlURI

importRecoverLog
public void importRecoverLog(String pathToLog, boolean retainFailures) throws IOException(Code)

incrementDisregardedUriCount
protected synchronized void incrementDisregardedUriCount()(Code)
Increment the running count of disregarded URIs. Synchronized because operations on longs are not atomic.

incrementFailedFetchCount
protected synchronized void incrementFailedFetchCount()(Code)
Increment the running count of failed URIs. Synchronized because operations on longs are not atomic.

incrementQueuedUriCount
protected synchronized void incrementQueuedUriCount()(Code)
Increment the running count of queued URIs. Synchronized because operations on longs are not atomic.

incrementQueuedUriCount
protected synchronized void incrementQueuedUriCount(long increment)(Code)
Increment the running count of queued URIs. Synchronized because operations on longs are not atomic.
Parameters:
  increment - amount to increment the queued count

incrementSucceededFetchCount
protected synchronized void incrementSucceededFetchCount()(Code)
Increment the running count of successfully fetched URIs. Synchronized because operations on longs are not atomic.
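
The "Synchronized because operations on longs are not atomic" note on the increment methods above can be illustrated with a short sketch. The class below is hypothetical and only mirrors the method names from this page. It shows why the methods are synchronized: `count++` is a read-modify-write sequence, and the Java Language Specification also allows writes to a non-volatile long to be split into two 32-bit writes.

```java
/** Illustrative sketch only; a hypothetical counter holder, not the Heritrix class. */
final class FetchCountersSketch {
    private long succeededFetchCount;

    /** Synchronized so that concurrent increments are never lost. */
    synchronized void incrementSucceededFetchCount() {
        succeededFetchCount++;
    }

    /** Synchronized read, so callers always observe a complete, up-to-date value. */
    synchronized long succeededFetchCount() {
        return succeededFetchCount;
    }

    /** Helper for demonstration: increments from several threads and returns the final count. */
    static long countConcurrently(int threads, int incrementsPerThread) {
        FetchCountersSketch counters = new FetchCountersSketch();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    counters.incrementSucceededFetchCount();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counters.succeededFetchCount();
    }
}
```

A `java.util.concurrent.atomic.AtomicLong`, the type used for the nextOrdinal field, would be another way to get the same guarantee for a single counter.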

initialize
public void initialize(CrawlController c) throws FatalConfigurationException, IOException(Code)

isDisregarded
protected boolean isDisregarded(CrawlURI curi)(Code)

isEmpty
public synchronized boolean isEmpty()(Code)
Frontier is empty only if all queues are empty and no URIs are in-process.
Returns:
  True if queues are empty.

kickUpdate
public void kickUpdate()(Code)

loadSeeds
public void loadSeeds()(Code)
Load up the seeds. This method is called on initialize and by the CrawlController when it wants to force reloading of configuration.
See Also:   org.archive.crawler.framework.CrawlController.kickUpdate

log
protected void log(CrawlURI curi)(Code)
Log to the main crawl.log
Parameters:
  curi -

logLocalizedErrors
protected void logLocalizedErrors(CrawlURI curi)(Code)
Take note of any processor-local errors that have been entered into the CrawlURI.
Parameters:
  curi -

needsRetrying
protected boolean needsRetrying(CrawlURI curi)(Code)
Checks if a recently completed CrawlURI that did not finish successfully needs to be retried (processed again after some time elapses).
Parameters:
  curi - The CrawlURI to check
Returns:
  True if we need to retry.

noteAboutToEmit
protected void noteAboutToEmit(CrawlURI curi, WorkQueue q)(Code)
Perform fixups on a CrawlURI about to be returned via next().
Parameters:
  curi - CrawlURI about to be returned by next()
  q - the queue from which the CrawlURI came

overMaxRetries
protected boolean overMaxRetries(CrawlURI curi)(Code)

pause
public synchronized void pause()(Code)

politenessDelayFor
protected long politenessDelayFor(CrawlURI curi)(Code)
Update any scheduling structures with the new information in this CrawlURI. Chiefly, this means making the necessary arrangements so that no other URIs at the same host are visited within the appropriate politeness window.
Parameters:
  curi - The CrawlURI
Returns:
  millisecond politeness delay
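
The descriptions of ATTR_DELAY_FACTOR, ATTR_MIN_DELAY and ATTR_MAX_DELAY in the Field Detail section suggest how such a delay could be derived. The sketch below is an assumption based only on those descriptions: the class and parameter names are hypothetical, all values are taken to be milliseconds, and the bandwidth limits are ignored. It multiplies the last fetch duration by the delay factor and clamps the result between the minimum and maximum delay.

```java
/** Illustrative sketch only; not the Heritrix implementation. */
final class PolitenessDelaySketch {

    /**
     * @param lastFetchDurationMs elapsed time of the last fetch from this server
     * @param delayFactor         how many multiples of that time to wait
     * @param minDelayMs          always wait at least this long
     * @param maxDelayMs          never wait longer than this
     * @return delay in milliseconds before recontacting the same server
     */
    static long politenessDelayMs(long lastFetchDurationMs, float delayFactor,
                                  long minDelayMs, long maxDelayMs) {
        long delay = (long) (lastFetchDurationMs * delayFactor);
        if (delay < minDelayMs) {
            delay = minDelayMs;   // floor, regardless of multiple
        }
        if (delay > maxDelayMs) {
            delay = maxDelayMs;   // ceiling, regardless of multiple
        }
        return delay;
    }
}
```

Because the ceiling is applied last, the maximum wins in this sketch if the two limits are configured inconsistently.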

preNext
protected synchronized void preNext(long now) throws InterruptedException, EndedException(Code)

Parameters:
  now -
throws:
  InterruptedException -
  EndedException -
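
The shouldPause and shouldTerminate fields describe holding threads that ask for URIs, or sending them an EndedException. A minimal sketch of that gating with `wait()`/`notifyAll()` follows. It is an assumption about the general pattern, not the actual preNext body; the class is hypothetical and the nested EndedException is a stand-in for the real one.

```java
/** Illustrative sketch only; a hypothetical pause/terminate gate. */
final class PauseGateSketch {

    /** Stand-in for the crawler's EndedException. */
    static final class EndedException extends Exception {
        EndedException(String message) {
            super(message);
        }
    }

    private boolean shouldPause;
    private boolean shouldTerminate;

    synchronized void pause() {
        shouldPause = true;
        notifyAll();
    }

    synchronized void unpause() {
        shouldPause = false;
        notifyAll();   // wake threads held in preNext()
    }

    synchronized void terminate() {
        shouldTerminate = true;
        notifyAll();   // wake held threads so they can see the termination
    }

    /** Called by a worker before it asks for the next URI. */
    synchronized void preNext() throws InterruptedException, EndedException {
        while (true) {
            if (shouldTerminate) {
                throw new EndedException("terminated");
            }
            if (!shouldPause) {
                return;
            }
            wait();    // held until unpause() or terminate()
        }
    }
}
```

The loop rechecks both flags after every wake-up, so a spurious wake-up cannot release a thread while the gate is still paused.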

queuedUriCount
public long queuedUriCount()(Code)
See Also:   org.archive.crawler.framework.Frontier.queuedUriCount

reportTo
public void reportTo(PrintWriter writer)(Code)

retryDelayFor
protected long retryDelayFor(CrawlURI curi)(Code)
Return a suitable value to wait before retrying the given URI.
Parameters:
  curi - CrawlURI to be retried
Returns:
  millisecond delay before retry

saveIgnoredItems
public static void saveIgnoredItems(String ignoredItems, File dir)(Code)
Dump ignored seed items (if any) to disk; delete file otherwise. Static to allow non-derived sibling classes (frontiers not yet subclassed here) to reuse.
Parameters:
  ignoredItems -
  dir -
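
The "dump to disk; delete file otherwise" behaviour can be sketched as follows. This is not the Heritrix code: the class is hypothetical, the file name is passed in as an extra parameter because the value of IGNORED_SEEDS_FILENAME is not shown on this page, and treating an empty string like null is an assumption.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

/** Illustrative sketch only; not the Heritrix implementation. */
final class IgnoredItemsSketch {

    /**
     * Write the ignored items to dir/fileName, or remove a stale file
     * from an earlier run when there is nothing to report.
     */
    static void saveIgnoredItems(String ignoredItems, File dir, String fileName) {
        File target = new File(dir, fileName);
        try {
            if (ignoredItems == null || ignoredItems.isEmpty()) {
                Files.deleteIfExists(target.toPath());
            } else {
                Files.write(target.toPath(),
                        ignoredItems.getBytes(StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Deleting the file in the empty case keeps an old report from being mistaken for the result of the latest seed load.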

scratchDirFor
protected File scratchDirFor(String key)(Code)
Utility method to return a scratch dir for the given key's temp files. Every key gets its own subdir. To avoid having any one directory with thousands of files, there are also two levels of enclosing directory named by the least-significant hex digits of the key string's Java hashCode.
Parameters:
  key -
Returns:
  File representing scratch directory
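
The directory layout described above can be sketched like this. Several details are assumptions, since the page does not state them: two hex digits per enclosing level, the lowest two digits used for the outer level, and a `stateDir` parameter standing in for wherever the frontier keeps its state. The class name is hypothetical.

```java
import java.io.File;

/** Illustrative sketch only; digit counts and ordering are assumptions. */
final class ScratchDirSketch {

    static File scratchDirFor(File stateDir, String key) {
        String hex = Integer.toHexString(key.hashCode());
        // Left-pad so that at least four hex digits are available.
        while (hex.length() < 4) {
            hex = "0" + hex;
        }
        int len = hex.length();
        File outer = new File(stateDir, hex.substring(len - 2));           // lowest two digits
        File inner = new File(outer, hex.substring(len - 4, len - 2));     // next two digits
        return new File(inner, key);                                       // per-key subdir
    }
}
```

For example, the key `"a"` has hash code 97 (hex `61`), so under these assumptions its scratch directory would be `61/00/a` below the state directory. With two hex digits per level, each enclosing directory holds at most 256 entries.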

singleLineReport
public String singleLineReport()(Code)

start
public void start()(Code)

succeededFetchCount
public long succeededFetchCount()(Code)
See Also:   org.archive.crawler.framework.Frontier.succeededFetchCount

terminate
public synchronized void terminate()(Code)

totalBytesWritten
public long totalBytesWritten()(Code)

unpause
public synchronized void unpause()(Code)

Methods inherited from org.archive.crawler.settings.ModuleType
public Type addElement(CrawlerSettings settings, Type type) throws InvalidAttributeValueException(Code)(Java Doc)
protected void listUsedFiles(List<String> list)(Code)(Java Doc)
