org.apache.solr.analysis |
Java Source File Name | Type | Comment |
BaseTokenFilterFactory.java | Class | Simple abstract implementation that handles init arg processing. |
BaseTokenizerFactory.java | Class | Simple abstract implementation that handles init arg processing. |
BaseTokenTestCase.java | Class | |
BufferedTokenStream.java | Class | |
EdgeNGramTokenizerFactory.java | Class | Creates new instances of EdgeNGramTokenizer. |
EnglishPorterFilterFactory.java | Class | |
HTMLStripReader.java | Class | A Reader that wraps another reader and attempts to strip out HTML constructs. |
HTMLStripStandardTokenizerFactory.java | Class | |
HTMLStripWhitespaceTokenizerFactory.java | Class | |
HyphenatedWordsFilter.java | Class | When plain text is extracted from documents, many words are often hyphenated and broken across two lines. |
HyphenatedWordsFilterFactory.java | Class | |
ISOLatin1AccentFilterFactory.java | Class | |
KeywordTokenizerFactory.java | Class | |
LengthFilter.java | Class | |
LengthFilterFactory.java | Class | |
LetterTokenizerFactory.java | Class | |
LowerCaseFilterFactory.java | Class | |
LowerCaseTokenizerFactory.java | Class | |
NGramTokenizerFactory.java | Class | Creates new instances of NGramTokenizer. |
PatternReplaceFilter.java | Class | A TokenFilter which applies a Pattern to each token in the stream, replacing match occurrences with the specified replacement string. |
PatternReplaceFilterFactory.java | Class | |
PatternTokenizerFactory.java | Class | This tokenizer uses regex pattern matching to construct distinct tokens for the input stream. |
PhoneticFilter.java | Class | Create tokens for phonetic matches. |
PhoneticFilterFactory.java | Class | |
PorterStemFilterFactory.java | Class | |
RemoveDuplicatesTokenFilter.java | Class | A TokenFilter which filters out Tokens that have the same position and Term text as the previous token in the stream. |
RemoveDuplicatesTokenFilterFactory.java | Class | |
SnowballPorterFilterFactory.java | Class | Factory for SnowballFilters, with configurable language.
Browsing the code, SnowballFilter uses reflection to adapt to Lucene... |
SolrAnalyzer.java | Class | |
StandardFilterFactory.java | Class | |
StandardTokenizerFactory.java | Class | |
StopFilterFactory.java | Class | |
SynonymFilter.java | Class | SynonymFilter handles multi-token synonyms with variable position increment offsets.
The matched tokens from the input stream may be optionally passed through (includeOrig=true)
or discarded. |
SynonymFilterFactory.java | Class | |
SynonymMap.java | Class | |
TestBufferedTokenStream.java | Class | Test that BufferedTokenStream behaves as advertised in subclasses. |
TestHyphenatedWordsFilter.java | Class | |
TestPatternReplaceFilter.java | Class | |
TestPatternTokenizerFactory.java | Class | |
TestPhoneticFilter.java | Class | |
TestRemoveDuplicatesTokenFilter.java | Class | |
TestSynonymFilter.java | Class | |
TestTrimFilter.java | Class | |
TestWordDelimiterFilter.java | Class | New WordDelimiterFilter tests... |
TokenFilterFactory.java | Interface | A TokenFilterFactory creates a TokenFilter to transform one TokenStream into another. |
TokenizerChain.java | Class | |
TokenizerFactory.java | Interface | A TokenizerFactory breaks up a stream of characters into tokens. |
TrimFilter.java | Class | Trims leading and trailing whitespace from Tokens in the stream. |
TrimFilterFactory.java | Class | |
WhitespaceTokenizerFactory.java | Class | |
WordDelimiterFilter.java | Class | Splits words into subwords and performs optional transformations on subword groups.
Words are split into subwords with the following rules:
- split on intra-word delimiters (by default, all non-alphanumeric characters).
- "Wi-Fi" -> "Wi", "Fi"
- split on case transitions
- "PowerShot" -> "Power", "Shot"
- split on letter-number transitions
- "SD500" -> "SD", "500"
- leading and trailing intra-word delimiters on each subword are ignored
- "//hello---there, 'dude'" -> "hello", "there", "dude"
- a trailing "'s" is removed from each subword
- "O'Neil's" -> "O", "Neil"
- Note: this step isn't performed in a separate filter because of possible subword combinations.
The combinations parameter affects how subwords are combined:
- combinations="0" causes no subword combinations.
- "PowerShot" -> 0:"Power", 1:"Shot" (0 and 1 are the token positions)
- combinations="1" means that, in addition to the subwords, maximal runs of non-numeric subwords are concatenated and produced at the same position as the last subword in the run.
- "PowerShot" -> 0:"Power", 1:"Shot", 1:"PowerShot"
- "A's+B's&C's" -> 0:"A", 1:"B", 2:"C", 2:"ABC"
- "Super-Duper-XL500-42-AutoCoder!" -> 0:"Super", 1:"Duper", 2:"XL", 2:"SuperDuperXL", 3:"500", 4:"42", 5:"Auto", 6:"Coder", 6:"AutoCoder"
One use for WordDelimiterFilter is to help match words with different subword delimiters.
For example, if the source text contained "wi-fi", one may want the queries "wifi", "WiFi", "wi-fi", and "wi+fi" to all match.
One way of doing so is to specify combinations="1" in the analyzer used for indexing, and combinations="0" (the default) in the analyzer used for querying. |
WordDelimiterFilterFactory.java | Class | |
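
The splitting and combination rules listed for WordDelimiterFilter can be illustrated with a small standalone sketch. This is plain Java with no Lucene or Solr dependency, and it is not the code in WordDelimiterFilter.java: the class name WordDelimiterSketch and the methods split and withCombinations are hypothetical, positions are rendered as "position:text" strings rather than real Token objects, and two details are assumptions read off the examples above (a possessive "'s" is dropped only when it ends a subword, and a combined token is produced only for runs of two or more subwords).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the WordDelimiterFilter rules; not the Solr implementation.
class WordDelimiterSketch {

    /** Splits one word into subwords using the rules listed in the table. */
    static List<String> split(String word) {
        List<String> subwords = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int n = word.length();
        for (int i = 0; i < n; i++) {
            char c = word.charAt(i);
            if (!Character.isLetterOrDigit(c)) {
                // Assumption: "'s" is dropped only when it ends the current subword.
                boolean possessive = c == '\'' && current.length() > 0 && i + 1 < n
                        && Character.toLowerCase(word.charAt(i + 1)) == 's'
                        && (i + 2 == n || !Character.isLetterOrDigit(word.charAt(i + 2)));
                if (possessive) {
                    i++; // skip the 's' as well as the apostrophe
                }
                flush(current, subwords); // every delimiter ends the current subword
                continue;
            }
            if (current.length() > 0) {
                char prev = current.charAt(current.length() - 1);
                boolean caseTransition = Character.isLowerCase(prev) && Character.isUpperCase(c);
                boolean letterNumberTransition = Character.isLetter(prev) != Character.isLetter(c);
                if (caseTransition || letterNumberTransition) {
                    flush(current, subwords);
                }
            }
            current.append(c);
        }
        flush(current, subwords);
        return subwords;
    }

    /** Emits "position:text" entries as described for combinations="1". */
    static List<String> withCombinations(List<String> subwords) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        int runLength = 0;
        for (int pos = 0; pos < subwords.size(); pos++) {
            String sub = subwords.get(pos);
            out.add(pos + ":" + sub);
            if (isNumeric(sub)) {
                continue; // numeric subwords never join a run
            }
            run.append(sub);
            runLength++;
            boolean runEnds = pos == subwords.size() - 1 || isNumeric(subwords.get(pos + 1));
            if (runEnds) {
                // Assumption: single-subword runs do not produce a duplicate token.
                if (runLength > 1) {
                    out.add(pos + ":" + run);
                }
                run.setLength(0);
                runLength = 0;
            }
        }
        return out;
    }

    // Subwords are all letters or all digits, so the first character decides.
    private static boolean isNumeric(String subword) {
        return Character.isDigit(subword.charAt(0));
    }

    // Moves a non-empty buffer into the result list and clears it.
    private static void flush(StringBuilder current, List<String> subwords) {
        if (current.length() > 0) {
            subwords.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(split("O'Neil's"));                    // [O, Neil]
        System.out.println(withCombinations(split("PowerShot"))); // [0:Power, 1:Shot, 1:PowerShot]
    }
}
```

With combinations="0", the output corresponds to calling split alone and numbering the subwords in order. The index-time versus query-time setup described for "wi-fi" would pair withCombinations(split(...)) on the indexing side with split(...) on the query side.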