Denormalization of database tables for Lucene indexing
I am just starting up with Lucene, and I'm trying to index a database so I can perform searches on the content. There are 3 tables that I am interested in indexing:
1. Image table - this is a table where each entry represents an image. Each image has an unique ID and some other info (title, description, etc).
2. People table - this is a table where each entry represent a person. Each person has a unique ID and other info like (name, address, company, etc)
3. Credited table - this table has 3 fields (image, person, and credit type). It's purpose is to associate some people to a image as the credits for that image. Each image can have multiple credited people (there's the director, photographer, props artist, etc). Also, a person is credited in multiple images.
I'm trying to index these tables so I can perform some searching usi开发者_开发百科ng Lucene but as I've read, I need to flatten the structure.
The first solution the came to me would be to create Lucene documents for each combination of Image/Credited Person. I'm afraid this will create a lot of duplicate content in the index (all the details of an image/person would have to be duplicated in each Document for each person that worked on the image).
Is there anybody experienced with Lucene that can help me with this? I know there is no generic solution to denormalization, that is why I provided a more specific example.
Thank you, and I will gladly provide more info on the database is anybody needs
PS: Unfortunately, there is no way for me to change the structure of the database (it belongs to the client). I have to work with what I have.
You could create a Document
for each person with all the associated images' descriptions concatenated (either appended to the person info or in a separate Field
).
Or, you could create a minimal Document
for each person, create a Document
for each image, puts the creators' names and credit info in a separate field of the image Document
and link them by putting the person ID (or person Document
id) a third, non-indexed field. (Lucene is geared toward flat document indexing, not relational data, but relations can be defined manually.)
This is really a matter of what you want to search for, images or persons, and whether each contains enough keywords for search to function. Try several options, see if they work well enough and don't exceed the available space.
The credit table will probably not be a good candidate for Document
construction, though.
精彩评论