Fieldable.tokenStreamValue() returns null for tokenized field
I use Lucene for n-gram matching. I set a field to be analyzed with an n-gram analyzer, and I want to see what the tokens resulting from the analysis look like, to make sure the n-grams are being computed correctly.
If I call Fieldable.tokenStreamValue() on the analyzed field of a retrieved document, I get null, while Fieldable.isTokenized() returns true.
I should add that the query results are consistent with the n-grams being generated correctly.
Any explanation for this? I am essentially trying to do what is described here: How can I read a Lucene document field tokens after they are analyzed?
Here is the full code:
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class TestLuceneNgram {

    /** Builds a BooleanQuery with one SHOULD TermQuery per n-gram of the query term. */
    public static class NGramQuery extends BooleanQuery {

        public NGramQuery(final String queryTerm) throws IOException {
            StringReader strReader = new StringReader(queryTerm);
            TokenStream tokens = new NGramTokenizer(strReader);
            CharTermAttribute termAtt = tokens
                    .addAttribute(CharTermAttribute.class);
            while (tokens.incrementToken()) {
                System.out.println(termAtt);
                Term t = new Term("NGRAM_FIELD", termAtt.toString());
                add(new TermQuery(t), BooleanClause.Occur.SHOULD);
            }
        }
    }

    public static class NGramSearcher extends IndexSearcher {

        public NGramSearcher(final Directory directory)
                throws CorruptIndexException, IOException {
            super(IndexReader.open(directory));
        }

        public TopDocs search(final String term) {
            try {
                return search(new NGramQuery(term), 10);
            } catch (IOException e) {
                e.printStackTrace();
            }
            return null;
        }
    }

    /** Analyzer that splits field values into n-grams. */
    public static class SubWordAnalyzer extends Analyzer {
        @Override
        public TokenStream tokenStream(final String fieldName,
                final Reader reader) {
            return new NGramTokenizer(reader);
        }
    }

    public static Directory index(final String[] terms) {
        Directory indexDirectory = new RAMDirectory();
        IndexWriter indexWriter = null;
        try {
            indexWriter = new IndexWriter(indexDirectory,
                    new IndexWriterConfig(Version.LUCENE_32,
                            new SubWordAnalyzer()));
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        for (int i = 0; i < terms.length; ++i) {
            Document doc = new Document();
            doc.add(new Field("NGRAM_FIELD", terms[i], Field.Store.YES,
                    Field.Index.ANALYZED,
                    Field.TermVector.WITH_POSITIONS_OFFSETS));
            doc.add(new Field("ORIGINAL_FIELD", terms[i], Field.Store.YES,
                    Field.Index.NOT_ANALYZED));
            try {
                indexWriter.addDocument(doc);
            } catch (CorruptIndexException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        try {
            indexWriter.optimize();
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        try {
            indexWriter.close();
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return indexDirectory;
    }

    public static void main(final String[] args) {
        String[] terms = new String[] { "the first string", "the second one" };
        Directory dir = index(terms);
        NGramSearcher ngs = null;
        try {
            ngs = new NGramSearcher(dir);
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        TopDocs td = ngs.search("second");
        System.out.println(td.totalHits);
        for (ScoreDoc sd : td.scoreDocs) {
            System.out.println(sd.doc + "---" + sd.score);
            try {
                // This is the call that prints null.
                System.out.println(ngs.doc(sd.doc).getFieldable("NGRAM_FIELD")
                        .tokenStreamValue());
            } catch (CorruptIndexException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
The first thing to check is where the Fieldable comes from. tokenStreamValue() is only non-null for a field that was constructed from a TokenStream when the document was built; a field on a document retrieved from the index is reconstructed from its stored string value, so getting null here is the expected result, even though isTokenized() returns true. The analyzed tokens are never stored as part of the document; only the original string is.
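If the goal is just to inspect the n-grams, one option is to run the stored string back through the analyzer. A minimal sketch against the Lucene 3.2 API, reusing SubWordAnalyzer, NGRAM_FIELD, and the ngs/sd variables from the code above:

// Re-analyze the stored string to reproduce the n-grams.
String stored = ngs.doc(sd.doc).getFieldable("NGRAM_FIELD").stringValue();
TokenStream stream = new SubWordAnalyzer()
        .tokenStream("NGRAM_FIELD", new StringReader(stored));
CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
    System.out.println(term.toString());
}
stream.close();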
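Alternatively, since NGRAM_FIELD is indexed with Field.TermVector.WITH_POSITIONS_OFFSETS, the tokens can be read back from the index itself via term vectors, which shows exactly what was indexed rather than re-computing it. A sketch, assuming an IndexReader opened on the same directory (org.apache.lucene.index.TermFreqVector):

IndexReader reader = IndexReader.open(dir);
// getTermFreqVector returns the analyzed terms recorded for this document
// and field, i.e. the n-grams that actually went into the index.
TermFreqVector tfv = reader.getTermFreqVector(sd.doc, "NGRAM_FIELD");
if (tfv != null) {
    String[] grams = tfv.getTerms();
    int[] freqs = tfv.getTermFrequencies();
    for (int i = 0; i < grams.length; i++) {
        System.out.println(grams[i] + " --- " + freqs[i]);
    }
}
reader.close();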