A processor for calculating custom HTTP content digests in place of the
default (if any) computed by the HTTP fetcher processors.
This processor allows the user to specify a regular expression called
strip-reg-expr. Any segment of a document (text only; binary files will
be skipped) that matches this regular expression will be rewritten with
the blank character (character 32 in the ASCII character set) for the
purpose of the digest. This has no effect on the document for subsequent
processing or archiving.
NOTE: Content digest only accounts for the document body, not headers.
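The stripping step described above can be illustrated with a minimal sketch. It is not the processor's actual code: the class and method names are hypothetical, and the choice of SHA-1, UTF-8 decoding, and replacing each match with a single space are assumptions made for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of the crawler's API.
class StrippedDigestSketch {

    // Returns a copy of the body in which every match of stripRegExpr is
    // replaced by a single space (assumption: one space per match).
    // The original body string is left untouched.
    static String blankOutMatches(String body, Pattern stripRegExpr) {
        return stripRegExpr.matcher(body).replaceAll(" ");
    }

    // Computes a hex digest over the stripped copy of the body only
    // (headers are not included). SHA-1 and UTF-8 are assumptions.
    static String digestOf(String body, Pattern stripRegExpr) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] hash = md.digest(
                blankOutMatches(body, stripRegExpr).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical strip-reg-expr: a clock time that changes on every fetch.
        Pattern strip = Pattern.compile("\\d{2}:\\d{2}:\\d{2}");
        String first = "<p>Served at 10:15:30</p>";
        String second = "<p>Served at 23:59:01</p>";
        // Prints whether the two fetches yield the same digest after stripping.
        System.out.println(digestOf(first, strip).equals(digestOf(second, strip)));
    }
}
```

Because only a copy of the body is rewritten, the document passed on to later processing stays byte-for-byte identical to what was fetched.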
The operator can also specify a maximum length for documents evaluated by
this processor. Documents exceeding that length will be ignored.
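The two skip conditions (non-text documents and documents over the maximum length) amount to a simple eligibility check. The sketch below is hypothetical: the class, method, and parameter names are invented for illustration, and detecting text by a "text/" content-type prefix is an assumption, not necessarily how the processor decides.

```java
import java.util.Locale;

// Hypothetical illustration; not part of the crawler's API.
class DigestEligibilitySketch {

    // Returns true only for text documents whose length does not exceed
    // the operator-supplied maximum. Everything else is left alone.
    static boolean isEligible(String contentType, long lengthBytes, long maxLengthBytes) {
        if (contentType == null) {
            return false; // unknown type: treat as non-text (assumption)
        }
        boolean isText = contentType.toLowerCase(Locale.ROOT).startsWith("text/");
        return isText && lengthBytes <= maxLengthBytes;
    }

    public static void main(String[] args) {
        System.out.println(isEligible("text/html; charset=UTF-8", 20_000, 1_000_000));
        System.out.println(isEligible("image/png", 20_000, 1_000_000));
        System.out.println(isEligible("text/html", 5_000_000, 1_000_000));
    }
}
```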
To further discriminate by file type or URL, an operator should use the
override and refinement options.
It is generally recommended that this recalculation only be performed when
absolutely needed (for example, to strip data that changes automatically each
time the URL is fetched), as it is an expensive operation.
author: Kristinn Sigurdsson