| org.archive.crawler.url.canonicalize.BaseRule org.archive.crawler.url.canonicalize.StripWWWNRule
StripWWWNRule | public class StripWWWNRule extends BaseRule | | Strips any 'www[0-9]*' prefix found on http/https URLs, but only if they have some
path/query component (content after the third slash). Top-level 'slash page'
URIs are left unstripped: we prefer crawling redundant
top pages to missing an entire site that is available only from either
the www-full or the www-less hostname, but not both.
author: stack | version: $Date: 2006-09-18 20:32:47 +0000 (Mon, 18 Sep 2006) $, $Revision: 4634 $ |
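The stripping behaviour described above can be illustrated with a small standalone sketch. This is not the class's actual source: the class name StripWwwNSketch, the method canonicalize, and the regular expression are assumptions chosen to match the description (strip 'www[0-9]*' on http/https only when something follows the third slash), and the sketch does not use BaseRule or any other crawler API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the StripWWWNRule implementation.
public class StripWwwNSketch {

    // Group 1: the http/https scheme and "//".
    // The "www[0-9]*." label is matched but not captured, so it is dropped.
    // Group 2: the rest of the host, the third slash, and at least one
    // further character (the path/query component).
    private static final Pattern WWWN =
        Pattern.compile("(?i)^(https?://)www[0-9]*\\.([^/]*/.+)$");

    // Returns the URL without its www[0-9]* prefix, or the URL unchanged
    // when the pattern does not apply (for example, a top 'slash page').
    public static String canonicalize(String url) {
        Matcher m = WWWN.matcher(url);
        return m.matches() ? m.group(1) + m.group(2) : url;
    }

    public static void main(String[] args) {
        // Has a path after the third slash: prefix is stripped.
        System.out.println(canonicalize("http://www.example.com/index.html"));
        // Numbered prefix with a query string: prefix is stripped.
        System.out.println(canonicalize("https://www2.example.com/a?b=1"));
        // Top 'slash page': left unstripped.
        System.out.println(canonicalize("http://www.example.com/"));
        // Not http/https: left unstripped.
        System.out.println(canonicalize("ftp://www.example.com/file"));
    }
}
```

Requiring at least one character after the third slash is what keeps top 'slash page' URIs intact in this sketch, so both the www-full and www-less front pages would still be crawled.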
Fields inherited from org.archive.crawler.url.canonicalize.BaseRule | final public static String ATTR_ENABLED
|