How to handle numbers as both words and numerals ("one" vs. "1") in Zend_Lucene
I have news-article content which is being indexed using Lucene and queried using Zend_Lucene in PHP.
The content frequently makes reference to UK television channels (e.g. BBC One) but I know that our users will often enter a search term of "BBC 1" or "BBC1" rather than "BBC One".
Is there any "standard" approach to dealing with this numbers-as-words vs. numbers-as-numerals search issue?
My choices seem to be to either amend the search term whenever I see numbers so, for example, I change a search term of "BBC1" to "BBC 1 One" (or something similar) - or amend the indexed content so that numerals are converted to words and vice versa, with both versions stored in the index.
Please see this Lucene FAQ entry, which suggests using a token filter to provide aliasing of words:
26. How can I make 'pig' also match 'hog' ?:
As far as I know, Lucene does not provide a tokenzier that support term aliasing but you should be able to write one yourself. All you need is to write a TokenFilter that accepts a word pair mapping and uses it map the first word to the second.
Again, make sure to use the same analyzer both during the indexing and searching and don't forget to submit your code to the Lucene project so other can use it as well ;-)
That FAQ entry is fairly old, so this is probably easier to do nowadays, but it should still point you in the right direction.
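To make the aliasing idea concrete, here is a minimal sketch in plain Java of what such a word-pair mapping might do. It is not the Lucene TokenFilter API and not Zend_Lucene code: the class name NumberAliasExpander, the expand method and the small alias table are all hypothetical, chosen only for illustration. It splits text at letter/digit boundaries (so "BBC1" becomes "BBC" and "1") and emits each numeral or number word together with its alias.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumberAliasExpander {
    // Hypothetical alias table: numeral -> word and word -> numeral.
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        String[][] pairs = {
            {"1", "one"}, {"2", "two"}, {"3", "three"}, {"4", "four"}, {"5", "five"}
        };
        for (String[] pair : pairs) {
            ALIASES.put(pair[0], pair[1]);
            ALIASES.put(pair[1], pair[0]);
        }
    }

    // A token is a run of letters or a run of digits, so "BBC1" splits into "BBC" and "1".
    private static final Pattern TOKEN = Pattern.compile("\\p{L}+|\\p{Nd}+");

    // Lower-cases the text, tokenizes it, and adds the alias after any token that has one.
    public static List<String> expand(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text.toLowerCase(Locale.ROOT));
        while (m.find()) {
            String token = m.group();
            out.add(token);
            String alias = ALIASES.get(token);
            if (alias != null) {
                out.add(alias);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand("BBC1"));    // prints [bbc, 1, one]
        System.out.println(expand("BBC One")); // prints [bbc, one, 1]
    }
}
```

Because "BBC1", "BBC 1" and "BBC One" all expand to the same set of terms, running the same expansion over both the indexed text and the search term (as the FAQ advises for analyzers) should let any of the three forms match the others. In a real analyzer the same logic would live inside a token filter rather than a standalone class.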