开发者

Searchable Static String Storage

We have a a large set of objects that include composition and name properties, both string values that contain values with a lot of duplication, what would be a suitable data structure to s开发者_StackOverflowtore the strings which can be searchable and small?

The data includes many chemical and product names that are duplicates or differ only slightly. I'd like to be able to store the string content of the objects in a compressed format that can also be searched.

I've experimented with Tries to make a fast searchable index over the names but this is currently in addition to the storage of each objects string data.

This data is static and distributed as a separate binary file with the application.


I've previously had some success with a mix of LZW compression with a large table, and then interning to 32 bit identifiers. For a similar enough corpus, the LZW can compress into fewer than 32 bits, so there's a flag on the id so it is treated as a compressed bit pattern rather than a key in a hashtable. As LZW is prefix based, you can search it in a somewhat similar fashion to a trie, but it's a bit trickier; it's easier to just do a test based on whether a bit pattern contains any of the query characters when expanded, and if so expand the string and use conventional string comparison.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜