Method Summary |
|
public void | aboutToLog() |
public static void | addAlistPersistentMember(Object key) Add the key of an AList item that should persist across
processing runs. |
public void | addAnnotation(String annotation) Add an annotation: an abbreviated indication of something special
about this URI that need not be present in every crawl.log line,
but should be noted for future reference. |
public void | addCredentialAvatar(CredentialAvatar ca) Add an avatar. |
public void | addLocalizedError(String processorName, Throwable ex, String message) Make note of a non-fatal error, local to a particular Processor,
which should be logged somewhere, but allows processing to continue. |
public void | addOutLink(Link link) Add a discovered Link, unless it would exceed the max number
to accept. |
protected boolean | annotationContains(String str2Find) |
public void | clearOutlinks() |
public void | createAndAddLink(String url, CharSequence context, char hopType) |
public void | createAndAddLinkRelativeToBase(String url, CharSequence context, char hopType) |
public void | createAndAddLinkRelativeToVia(String url, CharSequence context, char hopType) Convenience method for creating a Link with the given string and
context, relative to this CrawlURI's via UURI if available. |
public Link | createLink(String url, CharSequence context, char hopType) |
public static String | fetchStatusCodesToString(int code) Takes a status code and converts it into a human readable string. |
public static CrawlURI | from(CandidateURI caUri, long ordinal) Make a CrawlURI from the passed CandidateURI.
It is safe to pass a CrawlURI instance. |
public String | getAnnotations() Get the annotations set for this uri. |
public UURI | getBaseURI() Get the (HTML) Base URI used for derelativizing internal URIs. |
protected String | getClassSimpleName(Class c) |
public Object | getContentDigest() Return the retained content-digest value, if any. |
public String | getContentDigestSchemeString() |
public String | getContentDigestString() |
public long | getContentLength() For completed HTTP transactions, the length of the content-body. |
public long | getContentSize() Get the size in bytes of this URI's recorded content, inclusive
of things like protocol headers. |
public String | getContentType() Get the content type of this fetched URI. |
public String | getCrawlURIString() |
public Set<CredentialAvatar> | getCredentialAvatars() Credential avatars. |
public int | getDeferrals() Get the deferral count. |
public int | getEmbedHopCount() Get the embedded hop count. |
public int | getFetchAttempts() Get the number of attempts at getting the document referenced by this
URI. |
public int | getFetchStatus() Return the overall/fetch status of this CrawlURI for its
current trip through the processing loop. |
public Object | getHolder() Return the 'holder' for the convenience of
an external facility. |
public int | getHolderCost() |
public Object | getHolderKey() Return the 'holderKey' for convenience of
an external facility (Frontier). |
public HttpRecorder | getHttpRecorder() Get the HttpRecorder associated with this URI. |
public int | getLinkHopCount() Get the link hop count. |
public long | getOrdinal() Get the ordinal (serial number) assigned at creation. |
public Collection<CandidateURI> | getOutCandidates() Returns discovered candidate URIs. |
public Collection<Link> | getOutLinks() Returns discovered links. |
public Collection<Object> | getOutObjects() Returns all of the outbound objects. |
public AList | getPersistentAList() |
public Object | getPrerequisiteUri() Get the prerequisite for this URI. |
public long | getRecordedSize() |
public int | getThreadNumber() Get the number of the ToeThread responsible for processing this uri. |
public String | getUserAgent() Get the user agent to use for crawling this URI. |
public boolean | hasBeenLinkExtracted() If true then a link extractor has already claimed this CrawlURI and
performed link extraction on the document content. |
public boolean | hasCredentialAvatars() |
public boolean | hasPrerequisiteUri() |
public boolean | hasRfc2617CredentialAvatar() |
public void | incrementDeferrals() Increment the deferral count. |
public int | incrementFetchAttempts() Increment the number of attempts at getting the document referenced by
this URI. |
public boolean | is2XXSuccess() |
public boolean | isHeaderTruncatedFetch() |
public boolean | isHttpTransaction() Return true if this is an HTTP transaction. |
public boolean | isLengthTruncatedFetch() |
public boolean | isPost() Returns true if this URI should be fetched by sending an HTTP POST request. |
public boolean | isPrerequisite() Returns true if this CrawlURI is a prerequisite. |
public boolean | isSuccess() Ask this URI if it was a success or not.
Only makes sense to call this method after execution of
HttpMethod#execute. |
public boolean | isTimeTruncatedFetch() |
public boolean | isTruncatedFetch() TODO: Implement truncation using booleans rather than as this
ugly String parse. |
public void | linkExtractorFinished() Note that link extraction has been performed on this CrawlURI. |
public void | markAsSeed() Mark this uri as being a seed. |
public void | markPrerequisite(String preq, ProcessorChain lastProcessorChain) Do all actions associated with setting a CrawlURI as
requiring a prerequisite.
Parameters: lastProcessorChain - Last processor chain reference. |
public Processor | nextProcessor() Get the next processor to process this URI. |
public ProcessorChain | nextProcessorChain() Get the processor chain that should be processing this URI after the
current chain is finished with it. |
public int | outlinksSize() |
public void | processingCleanup() Clean up after a run through the processing chain.
Called at the end of the processing chain by Frontier#finish. |
public static boolean | removeAlistPersistentMember(Object key) Parameters: key - Key to remove. |
public boolean | removeCredentialAvatar(CredentialAvatar ca) Remove the passed credential avatar from this crawl URI.
Parameters: ca - Avatar to remove. |
public void | removeCredentialAvatars() Remove all credential avatars from this crawl uri. |
public void | replaceOutlinks(Collection<CandidateURI> links) Replace the current collection of links with the passed list. |
public void | resetDeferrals() Reset deferrals counter. |
public void | resetFetchAttempts() Reset fetchAttempts counter. |
public void | setBaseURI(String baseHref) Set the (HTML) Base URI used for derelativizing internal URIs. |
public void | setContentDigest(byte[] digestValue) Set the retained content-digest value (usu. |
public void | setContentDigest(String scheme, byte[] digestValue) |
public void | setContentSize(long l) Sets the 'content size' for the URI, which is considered inclusive
of all recorded material (such as protocol headers) or even material
'virtually' considered (as in material from a previous fetch
confirmed unchanged with a server). |
public void | setContentType(String ct) Set a fetched URI's content type.
Parameters: ct - Content type. |
public void | setFetchStatus(int newstatus) Set the overall/fetch status of this CrawlURI for
its current trip through the processing loop. |
public void | setHolder(Object obj) Remember a 'holder' to which some enclosing/queueing
facility has assigned this CrawlURI. |
public void | setHolderCost(int cost) |
public void | setHolderKey(Object obj) Remember a 'holderKey' which some enclosing/queueing
facility has assigned this CrawlURI. |
public void | setHttpRecorder(HttpRecorder httpRecorder) Set the http recorder to be associated with this uri. |
public void | setNextProcessor(Processor processor) Set the next processor to process this URI. |
public void | setNextProcessorChain(ProcessorChain nextProcessorChain) Set the next processor chain to process this URI. |
public void | setPost(boolean b) Set whether this URI should be fetched by sending an HTTP POST request.
Otherwise an HTTP GET request will be used.
Parameters: b - Set whether this curi is to be POST'd. |
public void | setPrerequisite(boolean prerequisite) Set if this CrawlURI is itself a prerequisite URI. |
public void | setPrerequisiteUri(Object link) Set a prerequisite for this URI. |
public void | setThreadNumber(int i) Set the number of the ToeThread responsible for processing this uri. |
public void | setUserAgent(String string) Set the user agent to use when crawling this URI. |
public void | skipToProcessor(ProcessorChain processorChain, Processor processor) Set which processor should be the next processor to process this uri
instead of using the default next processor. |
public void | skipToProcessorChain(ProcessorChain processorChain) Set which processor chain should be processing this uri next. |
public void | stripToMinimal() Remove all attributes set on this uri. |
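
The summary describes three small pieces of bookkeeping: addOutLink accepts a link only while a maximum has not been reached, annotations accumulate and can be searched, and the fetch-attempt counter can be incremented and reset. The sketch below models that behaviour in a standalone class so it can be run without the crawler. It is not the CrawlURI implementation: the class name CrawlUriSketch, the MAX_OUTLINKS value of 3, the comma separator for annotations, and the use of plain strings in place of Link objects are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical stand-in for a small part of CrawlURI. The class name, the
 * MAX_OUTLINKS value, and the comma separator are assumptions of this sketch;
 * none of them is taken from the method summary.
 */
final class CrawlUriSketch {
    /** Assumed cap on accepted outlinks (illustrative value only). */
    static final int MAX_OUTLINKS = 3;

    private final List<String> outLinks = new ArrayList<>();
    private final StringBuilder annotations = new StringBuilder();
    private int fetchAttempts;

    /** Adds a discovered link unless doing so would exceed the cap. */
    void addOutLink(String link) {
        if (outLinks.size() < MAX_OUTLINKS) {
            outLinks.add(link);
        }
    }

    int outlinksSize() {
        return outLinks.size();
    }

    /** Read-only view of the links accepted so far. */
    List<String> getOutLinks() {
        return Collections.unmodifiableList(outLinks);
    }

    void clearOutlinks() {
        outLinks.clear();
    }

    /** Appends an annotation, separating entries with a comma (assumed). */
    void addAnnotation(String annotation) {
        if (annotations.length() > 0) {
            annotations.append(',');
        }
        annotations.append(annotation);
    }

    String getAnnotations() {
        return annotations.toString();
    }

    /** Plain substring search over the accumulated annotations. */
    boolean annotationContains(String str2Find) {
        return annotations.indexOf(str2Find) >= 0;
    }

    /** Increments the attempt counter and returns the new value. */
    int incrementFetchAttempts() {
        return ++fetchAttempts;
    }

    int getFetchAttempts() {
        return fetchAttempts;
    }

    void resetFetchAttempts() {
        fetchAttempts = 0;
    }

    public static void main(String[] args) {
        CrawlUriSketch curi = new CrawlUriSketch();
        for (int i = 0; i < 5; i++) {
            curi.addOutLink("http://example.com/page" + i);
        }
        System.out.println(curi.outlinksSize());           // prints 3: two links were dropped
        curi.addAnnotation("noteA");
        curi.addAnnotation("noteB");
        System.out.println(curi.getAnnotations());         // prints noteA,noteB
        System.out.println(curi.incrementFetchAttempts()); // prints 1
    }
}
```

Silently dropping links beyond the cap matches the wording "unless it would exceed the max number to accept"; whether the real class also logs or annotates the overflow is not stated in the summary.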