| org.archive.crawler.settings.ModuleType org.archive.crawler.url.canonicalize.BaseRule
All Known Subclasses: org.archive.crawler.url.canonicalize.StripExtraSlashes, org.archive.crawler.url.canonicalize.LowercaseRule, org.archive.crawler.url.canonicalize.StripSessionIDs, org.archive.crawler.url.canonicalize.StripSessionCFIDs, org.archive.crawler.url.canonicalize.RegexRule, org.archive.crawler.url.canonicalize.StripWWWNRule, org.archive.crawler.url.canonicalize.StripUserinfoRule, org.archive.crawler.url.canonicalize.StripWWWRule, org.archive.crawler.url.canonicalize.FixupQueryStr
BaseRule | public BaseRule(String name, String description) | | Constructor.
Parameters: name - Name of this canonicalization rule. description - Description of what this rule does. |
doStripRegexMatch | protected String doStripRegexMatch(String url, Matcher matcher) | | Runs a regex that strips elements from a string.
Assumes the regex is written to strip elements from the passed
string, and that, on a match, appending group 2 to group 1
yields the desired result.
Parameters: url - URL to search in. matcher - Matcher whose pattern yields a group 1 and a group 2 on a match (non-null). Returns: the original url if there is no match; otherwise the concatenation of group 1 and group 2. |
|
|
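The contract described for doStripRegexMatch can be illustrated with a small standalone sketch. It is not the Heritrix source: the class name StripRegexSketch, the SESSION_ID pattern, and the nullToEmpty helper are hypothetical, and returning the original url for a null matcher and treating an unmatched group as an empty string are assumptions made for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StripRegexSketch {

    // Hypothetical pattern: group 1 is everything before a ";jsessionid=..."
    // path parameter, group 2 is everything after it.
    public static final Pattern SESSION_ID =
        Pattern.compile("^(.+?);jsessionid=[0-9A-Za-z]+(.*)$");

    // Sketch of the documented behavior: on a full match, return
    // group 1 + group 2; otherwise return the url unchanged.
    public static String stripRegexMatch(String url, Matcher matcher) {
        if (matcher != null && matcher.matches()) {
            return nullToEmpty(matcher.group(1)) + nullToEmpty(matcher.group(2));
        }
        return url;
    }

    // Assumption: a group that did not participate in the match
    // contributes nothing to the result.
    public static String nullToEmpty(String s) {
        return s == null ? "" : s;
    }

    public static void main(String[] args) {
        String withId = "http://example.com/page;jsessionid=ABC123?x=1";
        // prints "http://example.com/page?x=1"
        System.out.println(stripRegexMatch(withId, SESSION_ID.matcher(withId)));

        String withoutId = "http://example.com/page";
        // No match, so the url is printed unchanged.
        System.out.println(stripRegexMatch(withoutId, SESSION_ID.matcher(withoutId)));
    }
}
```

Because the method only concatenates two groups, a pattern used with it needs exactly two capturing groups that bracket the text to be removed.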