public class QueryAutoStopWordAnalyzer
extends org.apache.lucene.analysis.Analyzer
Analyzer used primarily at query time to wrap another analyzer and provide a layer of protection
which prevents very common words from being passed into queries.
For very large indexes the cost of reading TermDocs for a very common word can be high. This analyzer was created after experience with a 38 million doc index which had a term in around 50% of docs and was causing TermQueries for this term to take 2 seconds.
Use the various "addStopWords" methods in this class to automate the identification and addition of stop words found in an already existing index.
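
The idea described above, picking out terms whose document frequency exceeds a threshold, can be sketched without Lucene. The following is a simplified in-memory model, not this class's implementation: the class `DocFreqStopWords`, the method `findStopWords`, and the sample documents are all hypothetical, and it assumes that "exceeding" means strictly greater than the threshold.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Simplified in-memory model of stop word detection; not Lucene code. */
final class DocFreqStopWords {

    /**
     * Returns the terms that occur in more than maxDocFreq documents.
     * Each inner list holds the terms of one document for a single field.
     */
    static Set<String> findStopWords(List<List<String>> docs, int maxDocFreq) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs) {
            // Count each term once per document, however often it repeats.
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        Set<String> stopWords = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : docFreq.entrySet()) {
            // Assumption: "exceeding" is a strict comparison.
            if (entry.getValue() > maxDocFreq) {
                stopWords.add(entry.getKey());
            }
        }
        return stopWords;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("the", "quick", "fox"),
                Arrays.asList("the", "lazy", "dog", "the"),
                Arrays.asList("the", "fox"));
        // "the" occurs in 3 documents, "fox" in 2; only "the" exceeds 2.
        System.out.println(findStopWords(docs, 2)); // prints [the]
    }
}
```

In the real class the document frequencies come from an existing index through an IndexReader rather than from in-memory lists.
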
| Modifier and Type | Field and Description |
|---|---|
| static float | defaultMaxDocFreqPercent |
| Constructor and Description |
|---|
| QueryAutoStopWordAnalyzer(org.apache.lucene.analysis.Analyzer delegate) Deprecated. Use QueryAutoStopWordAnalyzer(Version, Analyzer) instead |
| QueryAutoStopWordAnalyzer(org.apache.lucene.util.Version matchVersion, org.apache.lucene.analysis.Analyzer delegate) Initializes this analyzer with the Analyzer object that actually produces the tokens |
| Modifier and Type | Method and Description |
|---|---|
| int | addStopWords(org.apache.lucene.index.IndexReader reader) Automatically adds stop words for all fields with terms exceeding the defaultMaxDocFreqPercent |
| int | addStopWords(org.apache.lucene.index.IndexReader reader, float maxPercentDocs) Automatically adds stop words for all fields with terms exceeding the maxPercentDocs |
| int | addStopWords(org.apache.lucene.index.IndexReader reader, int maxDocFreq) Automatically adds stop words for all fields with terms exceeding the maxDocFreq |
| int | addStopWords(org.apache.lucene.index.IndexReader reader, java.lang.String fieldName, float maxPercentDocs) Automatically adds stop words for the given field with terms exceeding the maxPercentDocs |
| int | addStopWords(org.apache.lucene.index.IndexReader reader, java.lang.String fieldName, int maxDocFreq) Automatically adds stop words for the given field with terms exceeding the maxDocFreq |
| org.apache.lucene.index.Term[] | getStopWords() Provides information on which stop words have been identified for all fields |
| java.lang.String[] | getStopWords(java.lang.String fieldName) Provides information on which stop words have been identified for a field |
| org.apache.lucene.analysis.TokenStream | reusableTokenStream(java.lang.String fieldName, java.io.Reader reader) |
| org.apache.lucene.analysis.TokenStream | tokenStream(java.lang.String fieldName, java.io.Reader reader) |
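
The overloads above accept either a fraction of the index (maxPercentDocs) or an absolute document count (maxDocFreq). A minimal sketch of how the two could relate, assuming the fraction is multiplied by the number of documents in the index and truncated to an int. The class `DocFreqThreshold`, the method `toMaxDocFreq`, and the range check are hypothetical and are not part of this class.

```java
/** Simplified model of the percentage-based overloads; not Lucene code. */
final class DocFreqThreshold {

    /**
     * Converts a fraction of the index (0.0 to 1.0) into an absolute
     * document count. Assumption: the product is truncated toward zero.
     */
    static int toMaxDocFreq(int numDocs, float maxPercentDocs) {
        // Hypothetical guard; the documented range is 0.0 to 1.0.
        if (maxPercentDocs < 0.0f || maxPercentDocs > 1.0f) {
            throw new IllegalArgumentException(
                    "maxPercentDocs must be between 0.0 and 1.0");
        }
        return (int) (numDocs * maxPercentDocs);
    }

    public static void main(String[] args) {
        // The index from the class description: 38 million docs, 50% threshold.
        System.out.println(toMaxDocFreq(38_000_000, 0.5f)); // prints 19000000
    }
}
```

Under this assumption, a term in the 38 million doc index would have to appear in more than 19,000,000 documents before a 0.5 threshold treats it as a stop word.
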
public static final float defaultMaxDocFreqPercent

public QueryAutoStopWordAnalyzer(org.apache.lucene.analysis.Analyzer delegate)
Deprecated. Use QueryAutoStopWordAnalyzer(Version, Analyzer) instead
delegate - The choice of Analyzer that is used to produce the token stream which needs filtering

public QueryAutoStopWordAnalyzer(org.apache.lucene.util.Version matchVersion,
org.apache.lucene.analysis.Analyzer delegate)
delegate - The choice of Analyzer that is used to produce the token stream which needs filtering

public int addStopWords(org.apache.lucene.index.IndexReader reader)
throws java.io.IOException
reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
Throws: java.io.IOException

public int addStopWords(org.apache.lucene.index.IndexReader reader,
int maxDocFreq)
throws java.io.IOException
reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
maxDocFreq - The maximum number of index documents which can contain a term, after which the term is considered to be a stop word
Throws: java.io.IOException

public int addStopWords(org.apache.lucene.index.IndexReader reader,
float maxPercentDocs)
throws java.io.IOException
reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
maxPercentDocs - The maximum percentage (between 0.0 and 1.0) of index documents which contain a term, after which the word is considered to be a stop word.
Throws: java.io.IOException

public int addStopWords(org.apache.lucene.index.IndexReader reader,
java.lang.String fieldName,
float maxPercentDocs)
throws java.io.IOException
reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
fieldName - The field for which stopwords will be added
maxPercentDocs - The maximum percentage (between 0.0 and 1.0) of index documents which contain a term, after which the word is considered to be a stop word.
Throws: java.io.IOException

public int addStopWords(org.apache.lucene.index.IndexReader reader,
java.lang.String fieldName,
int maxDocFreq)
throws java.io.IOException
reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
fieldName - The field for which stopwords will be added
maxDocFreq - The maximum number of index documents which can contain a term, after which the term is considered to be a stop word.
Throws: java.io.IOException

public org.apache.lucene.analysis.TokenStream tokenStream(java.lang.String fieldName,
java.io.Reader reader)
tokenStream in class org.apache.lucene.analysis.Analyzer

public org.apache.lucene.analysis.TokenStream reusableTokenStream(java.lang.String fieldName,
java.io.Reader reader)
throws java.io.IOException
reusableTokenStream in class org.apache.lucene.analysis.Analyzer
Throws: java.io.IOException

public java.lang.String[] getStopWords(java.lang.String fieldName)
fieldName - The field for which stop words identified in "addStopWords" method calls will be returned

public org.apache.lucene.index.Term[] getStopWords()
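
The per-field behaviour of the methods above can be illustrated with a small stand-alone model: stop words are registered per field, can be read back per field, and are dropped from query tokens for that field only. This is a hedged sketch and not Lucene code. The class `PerFieldStopWords`, its `filter` method, and the empty-array result for an unknown field are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Simplified model of per-field stop word handling; not Lucene code. */
final class PerFieldStopWords {

    private final Map<String, Set<String>> stopWordsPerField = new HashMap<>();

    /** Registers stop words for a field and returns how many were new. */
    int addStopWords(String fieldName, Set<String> words) {
        Set<String> existing =
                stopWordsPerField.computeIfAbsent(fieldName, f -> new TreeSet<>());
        int before = existing.size();
        existing.addAll(words);
        return existing.size() - before;
    }

    /** Stop words for one field, sorted; assumed empty for an unknown field. */
    String[] getStopWords(String fieldName) {
        Set<String> words = stopWordsPerField.getOrDefault(
                fieldName, Collections.<String>emptySet());
        return words.toArray(new String[0]);
    }

    /** Drops the field's stop words from a list of query tokens. */
    List<String> filter(String fieldName, List<String> tokens) {
        Set<String> words = stopWordsPerField.getOrDefault(
                fieldName, Collections.<String>emptySet());
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!words.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

With "the" and "of" registered for a hypothetical "body" field, filtering the tokens the, history, of, rome for "body" keeps history and rome, while the same tokens pass through unchanged for a field with no registered stop words.
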
Copyright © 2000-2016 Apache Software Foundation. All Rights Reserved.