Full text search solutions for Java?

There's a large set of entities of different kinds:

interface Entity {
}

interface Entity1 extends Entity {
  String field1();
  String field2();
}

interface Entity2 extends Entity {
  String field1();
  String field2();
  String field3();
}

interface Entity3 extends Entity {
  String field12();
  String field23();
  String field34();
}

Set<Entity> entities = ...

The task is to implement full text search for this set. By full text search I mean that I just need to get the entities that contain the substring I'm looking for (I don't need to know the exact property, the exact offset of the substring, etc.). In the current implementation the Entity interface has a method matches(String):

interface Entity {
  boolean matches(String text);
}

Each entity class implements it depending on its internals:

class Entity1Impl implements Entity1 {
  public String field1() {...}
  public String field2() {...}

  public boolean matches(String text) {
    return field1().toLowerCase().contains(text.toLowerCase()) ||
           field2().toLowerCase().contains(text.toLowerCase());
  }
}
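
Since every implementation repeats the same lower-case-and-contains check per field, the shared part could be factored out. This is only a sketch of that idea: `TextMatch` and `matchesAny` are hypothetical names that are not in the code above, and using `Locale.ROOT` is an assumption made to keep the case folding independent of the default locale.

```java
import java.util.Locale;

class TextMatch {
    // Returns true if any of the given fields contains the query, ignoring case.
    // The query is lower-cased once instead of once per field.
    static boolean matchesAny(String query, String... fields) {
        String needle = query.toLowerCase(Locale.ROOT);
        for (String field : fields) {
            if (field != null && field.toLowerCase(Locale.ROOT).contains(needle)) {
                return true;
            }
        }
        return false;
    }
}
```

With such a helper, Entity1Impl.matches(text) would reduce to `return TextMatch.matchesAny(text, field1(), field2());`.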

I believe this approach is really awful (though it works). I'm considering using Lucene to build an index every time I get a new set. By index I mean content -> id mappings. The content is just a trivial "sum" of all the fields I'm considering, so for Entity1 the content would be the concatenation of field1() and field2(). I have some doubts about the performance: building an index is often quite an expensive operation, so I'm not really sure it would help.
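
To make the content -> id idea concrete, here is a rough sketch in plain Java rather than Lucene: it stores the concatenated content per id and answers a query with a linear scan. `ContentIndex`, `add` and `search` are hypothetical names, and the string ids are an assumption, since the entities above do not expose one. It is not a real inverted index, so it only pays off if more than one query runs against the same set.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class ContentIndex {
    // id -> lower-cased concatenation of all searchable fields
    private final Map<String, String> contentById = new LinkedHashMap<>();

    void add(String id, String... fields) {
        // The newline separator keeps a query that has no newline in it
        // from matching across the boundary between two fields.
        contentById.put(id, String.join("\n", fields).toLowerCase(Locale.ROOT));
    }

    // Returns the ids whose content contains the query, ignoring case,
    // in insertion order.
    List<String> search(String query) {
        String needle = query.toLowerCase(Locale.ROOT);
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, String> entry : contentById.entrySet()) {
            if (entry.getValue().contains(needle)) {
                ids.add(entry.getKey());
            }
        }
        return ids;
    }
}
```

Timing this against the existing matches(String) on a representative set would show whether the indexing cost of something heavier is worth paying at all.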

Do you have any other suggestions?

To clarify the details:

  1. Set<Entity> entities = ... contains ~10000 items.
  2. Set<Entity> entities = ... is not read from a DB, so I can't just add a where ... condition. The data source is quite non-trivial, so I can't solve the problem on its side.
  3. Entities should be thought of as short articles, so some fields may be up to 10KB, while others may be ~10 bytes.
  4. I need to perform this search quite often, but both the query string and the original set are different every time, so it looks like I can't just build the index once.


For such a complex object domain, you can use a Lucene wrapper such as Compass, which lets you quickly map your object graph to a Lucene index using the same approach as an ORM (like Hibernate).


I would strongly consider using Lucene with Solr: http://lucene.apache.org/java/docs/index.html
