public final class PatternTokenizer
extends org.apache.lucene.analysis.Tokenizer
group=-1 (the default) is equivalent to "split". In this case, the tokens are the same as the output of
String.split(java.lang.String), except that empty tokens are omitted.
Using group >= 0 selects the matching group as the token. For example, if you have:
pattern = \'([^\']+)\'
group = 0
input = aaa 'bbb' 'ccc'
the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input but using group=1, the output would be: bbb and ccc (no ' marks).
NOTE: This Tokenizer does not output tokens that are of zero length.
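The group semantics described above can be illustrated with plain java.util.regex, without Lucene on the classpath. This is only a sketch of the documented behavior, not the tokenizer's actual implementation; the class name PatternGroupSketch and the helper tokens(...) are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that mimics the documented group semantics.
final class PatternGroupSketch {

    // group >= 0: each match contributes the selected group as a token.
    // group < 0:  the pattern is a delimiter and the text between matches is a token.
    // Zero-length tokens are dropped in both modes.
    static List<String> tokens(String input, Pattern pattern, int group) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        if (group >= 0) {
            while (m.find()) {
                String token = m.group(group);
                if (token != null && !token.isEmpty()) {
                    out.add(token);
                }
            }
        } else {
            int last = 0; // end of the previous delimiter match
            while (m.find()) {
                if (m.start() > last) {
                    out.add(input.substring(last, m.start()));
                }
                last = m.end();
            }
            if (last < input.length()) {
                out.add(input.substring(last)); // trailing text after the last delimiter
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Pattern quoted = Pattern.compile("'([^']+)'");
        String input = "aaa 'bbb' 'ccc'";
        System.out.println(tokens(input, quoted, 0)); // ['bbb', 'ccc']
        System.out.println(tokens(input, quoted, 1)); // [bbb, ccc]

        Pattern comma = Pattern.compile(",\\s*");
        System.out.println(tokens("a, b,,c", comma, -1));              // [a, b, c]
        System.out.println(Arrays.toString("a, b,,c".split(",\\s*"))); // [a, b, , c]
    }
}
```

The last two lines show the difference the NOTE refers to: String.split keeps the empty string between the two adjacent commas, while the split mode sketched here drops it.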
See Also: Pattern

| Modifier and Type | Field and Description |
|---|---|
| (package private) char[] | buffer |
| private int | group |
| private int | index |
| private java.util.regex.Matcher | matcher |
| private org.apache.lucene.analysis.tokenattributes.OffsetAttribute | offsetAtt |
| private java.util.regex.Pattern | pattern |
| private java.lang.StringBuilder | str |
| private org.apache.lucene.analysis.tokenattributes.CharTermAttribute | termAtt |

| Constructor and Description |
|---|
| PatternTokenizer(java.io.Reader input, java.util.regex.Pattern pattern, int group) Creates a new PatternTokenizer returning tokens from group (-1 for split functionality). |

| Modifier and Type | Method and Description |
|---|---|
| void | end() |
| private void | fillBuffer(java.lang.StringBuilder sb, java.io.Reader input) |
| boolean | incrementToken() |
| void | reset(java.io.Reader input) |

Methods inherited from class org.apache.lucene.util.AttributeSource: addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState, toString

private final org.apache.lucene.analysis.tokenattributes.CharTermAttribute termAtt
private final org.apache.lucene.analysis.tokenattributes.OffsetAttribute offsetAtt
private final java.lang.StringBuilder str
private int index
private final java.util.regex.Pattern pattern
private final int group
private final java.util.regex.Matcher matcher
final char[] buffer
public PatternTokenizer(java.io.Reader input,
java.util.regex.Pattern pattern,
int group)
throws java.io.IOException
Throws: java.io.IOException

public boolean incrementToken()
throws java.io.IOException
Specified by: incrementToken in class org.apache.lucene.analysis.TokenStream
Throws: java.io.IOException

public void end()
throws java.io.IOException
Overrides: end in class org.apache.lucene.analysis.TokenStream
Throws: java.io.IOException

public void reset(java.io.Reader input)
throws java.io.IOException
Overrides: reset in class org.apache.lucene.analysis.Tokenizer
Throws: java.io.IOException

private void fillBuffer(java.lang.StringBuilder sb,
java.io.Reader input)
throws java.io.IOException
Throws: java.io.IOException