Lucene complex structure search
I have a fairly simple database that I'd like to index with Lucene. The domain classes are:
    // Person domain
    class Person {
        Set<Pair> keys;
    }

    // Pair domain
    class Pair {
        KeyItem keyItem;
        String value;
    }

    // KeyItem domain, name is a unique field within the DB (!!)
    class KeyItem {
        String name;
    }
I have tens of millions of profiles and hundreds of millions of Pairs. However, since most KeyItem "name" values are duplicates, there are only a few dozen KeyItem instances. I came up with this structure to save on KeyItem instances.
Basically, any profile with any fields can be saved in this structure. Let's say we have a profile with these properties:
- name: Andrew Morton
- education: University of New South Wales
- country: Australia
- occupation: Linux programmer
To store it, we'll have a single Profile instance, 4 KeyItem instances (name, education, country and occupation), and 4 Pair instances with the values "Andrew Morton", "University of New South Wales", "Australia" and "Linux programmer".
All other profiles will reference (all or some of) the same KeyItem instances: name, education, country and occupation.
My question is how to index all of this so that I can search for Profiles by particular values of KeyItem::name and Pair::value. Ideally, I'd like this kind of query to work:
name:Andrew* AND occupation:Linux*
Should I create a custom Indexer and Searcher? Or could I use the standard ones and just map KeyItem and Pair to Lucene components somehow?
I believe you can use standard Lucene methodology. I would:
- Translate every profile to a Lucene Document.
- Translate every Pair to a Field in this Document. All Fields need to be indexed, but not necessarily stored.
- Add a stored Field with a profile id to the Document.
- Search using name:value pairs similarly to your example.
If you choose bare Lucene, you will need a custom Indexer and Searcher, but they are not hard to build. It may be easier for you to use Solr, where you need less programming. However, I do not know if Solr allows an open-ended schema like the one I described - I believe you have to predefine all field names, so this may prevent you from using Solr.
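The mapping in the steps above can be sketched in plain Java. This is only an illustration and deliberately makes no Lucene calls: each entry of the resulting map corresponds to what would become one indexed Field of the Document, keyed by KeyItem name. The `ProfileFlattener` class, the `id` on Person, the `profileId` key, and the `hasPrefix` helper (a rough, case-insensitive imitation of a `field:prefix*` clause) are all assumptions made for this example, not part of the question's code or of Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Plain-Java stand-in (no Lucene): flattens a Person into fieldName -> values.
final class ProfileFlattener {

    // Hypothetical minimal versions of the question's domain classes.
    static final class KeyItem {
        final String name;
        KeyItem(String name) { this.name = name; }
    }

    static final class Pair {
        final KeyItem keyItem;
        final String value;
        Pair(KeyItem keyItem, String value) { this.keyItem = keyItem; this.value = value; }
    }

    static final class Person {
        final long id;          // assumed profile id; not in the original classes
        final Set<Pair> keys;
        Person(long id, Set<Pair> keys) { this.id = id; this.keys = keys; }
    }

    // One map entry per KeyItem name, plus the profile id under an assumed key.
    static Map<String, List<String>> toFields(Person p) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.computeIfAbsent("profileId", k -> new ArrayList<>()).add(Long.toString(p.id));
        for (Pair pair : p.keys) {
            fields.computeIfAbsent(pair.keyItem.name, k -> new ArrayList<>()).add(pair.value);
        }
        return fields;
    }

    // True if any whitespace-separated token of the field starts with the prefix.
    static boolean hasPrefix(Map<String, List<String>> fields, String field, String prefix) {
        String wanted = prefix.toLowerCase(Locale.ROOT);
        for (String value : fields.getOrDefault(field, Collections.<String>emptyList())) {
            for (String token : value.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (token.startsWith(wanted)) {
                    return true;
                }
            }
        }
        return false;
    }

    // The sample profile from the question.
    static Person sample() {
        Set<Pair> keys = new LinkedHashSet<>(Arrays.asList(
                new Pair(new KeyItem("name"), "Andrew Morton"),
                new Pair(new KeyItem("education"), "University of New South Wales"),
                new Pair(new KeyItem("country"), "Australia"),
                new Pair(new KeyItem("occupation"), "Linux programmer")));
        return new Person(1L, keys);
    }

    public static void main(String[] args) {
        Map<String, List<String>> fields = toFields(sample());
        // Imitates: name:Andrew* AND occupation:Linux*
        System.out.println(hasPrefix(fields, "name", "Andrew")
                && hasPrefix(fields, "occupation", "Linux"));
    }
}
```

In a real index, the loop in `toFields` is where each Pair would be added to the Document as an indexed field, and the `profileId` entry is the one that would be stored so the profile can be looked up from a hit.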
Lucene returns the list of hit documents essentially based on the occurrence of the keyword(s), regardless of the type of query. The fundamental segment reader checks for the presence of keywords in the entire index database rather than in a specific field of interest.
I suggest introducing a custom searcher that performs the following:
1. Read the short-listed documents using the document id (I guess the collect() method can be overridden to pass along the document id from search() of the IndexSearcher class).
2. Get the field value and check for the presence of the expected keywords.
3. Submit the document for scoring only if it meets your custom criteria.
Note: The default standard searcher can be modified rather than writing a custom searcher from scratch.
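The three steps above amount to a post-filter over candidate hits, which might look roughly like this in plain Java. It is a sketch under assumptions, not Lucene's collector API: `FieldPostFilter`, `Hit` and `keepMatching` are hypothetical names, the sample values in the usage are made up, and the sketch assumes the field values of each candidate can be read back (that is, the fields were stored).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java stand-in for the custom filtering step (no Lucene calls).
final class FieldPostFilter {

    // Hypothetical hit: a document id plus its readable field values.
    static final class Hit {
        final int docId;
        final Map<String, String> storedFields;
        Hit(int docId, Map<String, String> storedFields) {
            this.docId = docId;
            this.storedFields = storedFields;
        }
    }

    // Keeps only hits whose given field contains the keyword (case-insensitive).
    static List<Hit> keepMatching(List<Hit> candidates, String field, String keyword) {
        String wanted = keyword.toLowerCase(Locale.ROOT);
        List<Hit> kept = new ArrayList<>();
        for (Hit hit : candidates) {
            String value = hit.storedFields.get(field);   // step 2: get the field value
            if (value != null && value.toLowerCase(Locale.ROOT).contains(wanted)) {
                kept.add(hit);                            // step 3: only these go on to scoring
            }
        }
        return kept;
    }
}
```

For example, given one candidate whose `occupation` is "Linux programmer" and another whose `education` merely mentions Linux, `keepMatching(candidates, "occupation", "linux")` would keep only the first.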