Indexing file paths or URIs in Lucene
Some of the documents I store in Lucene have fields that contain file paths or URIs. I'd like users to be able to retrieve these documents if their query terms contain a path or URI segment.
For example, if the path is
C:\home\user\research\whitepapers\analysis\detail.txt
I'd like the user to be able to find it by queriying for path:whitepapers
.
Likewise, if the URI is
http://www.stackoverflow.com/questions/ask
A query containing uri:questions
would retrieve it.
Do I need to 开发者_开发百科use a special analyzer for these fields, or will StandardAnaylzer do the job? Will I need to do any pre-processing of these fields? (To replace the forward slashes or backslashes with spaces, for example?)
Suggestions welcome!
You can use StandardAnalyzer. I tested this, by adding the following function to Lucene's TestStandardAnalyzer.java:
public void testBackslashes() throws Exception {
assertAnalyzesTo(a, "C:\\home\\user\\research\\whitepapers\\analysis\\detail.txt", new String[]{"c","home", "user", "research","whitepapers", "analysis", "detail.txt"});
assertAnalyzesTo(a, "http://www.stackoverflow.com/questions/ask", new String[]{"http", "www.stackoverflow.com","questions","ask"});
}
This unit test passed using Lucene 2.9.1. You may want to try it with your specific Lucene distribution. I guess it does what you want, while keeping domain names and file names unbroken. Did I mention that I like unit tests?
精彩评论