
Trie vs. suffix tree vs. suffix array

Which structure provides the best performance: a trie (prefix tree), a suffix tree, or a suffix array? Are there other similar structures? What are good Java implementations of these structures?

Edit: in this case I want to make string matching between a large dictionary of names and a large set of natural language texts, in order to identify the names of the dictionary on texts.


The trie was the first data structure of this kind discovered.

The suffix tree is an improvement over the trie: it has suffix links, which allow linear-time search with errors, and it trims the unnecessary branches of the trie, so it does not require as much space.

The suffix array is a stripped-down data structure based on the suffix tree: it has no suffix links (so matching with errors is slow), yet exact pattern matching is very fast.

The trie is not suited to real-world use because it consumes too much space.

The suffix tree is lighter and faster than the trie and is used to index DNA or to optimize some large web search engines.

The suffix array is slower than the suffix tree in some pattern searches but uses less space, and it is more widely used than the suffix tree.
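To make the suffix array's "fast pattern matching" concrete, here is a minimal Java sketch. It is only an illustration under stated assumptions: the class name NaiveSuffixArray and its methods are hypothetical, and the array is built by sorting whole suffixes, which is far too slow for genome-sized inputs. The search itself is the usual binary search over the sorted suffixes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration: naive suffix array with binary-search lookup. */
class NaiveSuffixArray {
    private final String text;
    private final Integer[] sa; // suffix start positions in lexicographic order

    NaiveSuffixArray(String text) {
        this.text = text;
        sa = new Integer[text.length()];
        for (int i = 0; i < sa.length; i++) sa[i] = i;
        // Naive construction: comparison sort on whole suffixes.
        Arrays.sort(sa, (a, b) -> text.substring(a).compareTo(text.substring(b)));
    }

    /** Compares the pattern with the suffix at 'start', truncated to the pattern length. */
    private int compare(String pattern, int start) {
        int end = Math.min(text.length(), start + pattern.length());
        return pattern.compareTo(text.substring(start, end));
    }

    /** First index whose suffix is >= pattern (lower) or > pattern (upper). */
    private int bound(String pattern, boolean upper) {
        int lo = 0, hi = sa.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            int c = compare(pattern, sa[mid]);
            if (c > 0 || (upper && c == 0)) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /** Returns all start positions of the pattern in the text, in increasing order. */
    List<Integer> occurrences(String pattern) {
        List<Integer> result = new ArrayList<>();
        for (int i = bound(pattern, false); i < bound(pattern, true); i++) result.add(sa[i]);
        Collections.sort(result);
        return result;
    }
}
```

With this sketch, new NaiveSuffixArray("banana").occurrences("ana") returns the positions 1 and 3. Each lookup does O(log n) comparisons of at most the pattern length, which is where the speed comes from; there is nothing here that helps with errors or wildcards.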

In the same family of data structures:

There are other implementations. The CST (compressed suffix tree) implements the suffix tree using a suffix array plus some additional data structures, which recovers some of the suffix tree's search capabilities.

The FCST (fully compressed suffix tree) takes this further: it implements a sampled suffix tree on top of a suffix array.

The DFCST is a dynamic version of the FCST.

Expanding:

The two important factors are space use and operation execution time. You might think that this is irrelevant on modern machines, but indexing the DNA of a single human being would require 40 gigabytes of memory (using an uncompressed and unoptimized suffix tree), and building one of these indexes over that much data can take days.

Imagine Google: it has lots of searchable data, and it needs a large index over all web data, which it does not change every time someone builds a web page. It has some form of caching for that, but the main index is probably static. Every couple of weeks or so it gathers all new web sites and data and builds a new index, replacing the old one when the new one is finished. I do not know which algorithm they use to index, but it is probably a suffix array with suffix tree properties over a partitioned database.

The CST uses 8 gigabytes; however, the speed of the suffix tree operations is heavily reduced.

The suffix array can do the same in some 700 megabytes to 2 gigabytes. However, you will not find genetic errors in the DNA with a suffix array (meaning: searching for a pattern with a wildcard is much, much slower).

The FCST (fully compressed suffix tree) can build a suffix tree in 800 megabytes to 1.5 gigabytes, with a rather small loss of speed compared to the CST.

The DFCST uses 20% more space than the FCST and is slower than the static FCST implementation (however, a dynamic index is very important).

There are not many space-viable implementations of the suffix tree, because it is very hard to make the speed-up in operations compensate for the RAM cost of the data structure.

That said, the suffix tree gives very interesting results for pattern matching with errors. Aho-Corasick is not as fast (though it is nearly as fast for some operations, not error matching), and Boyer-Moore is left in the dust.


What operations do you plan on doing? libdivsufsort was at one time the best suffix array implementation in C.


Using suffix trees you can write something that will match your dictionary to your text in O(n+m+k) time, where n is the number of letters in your dictionary, m is the number of letters in your text, and k is the number of matches. Tries are much slower for this. I'm not sure what a suffix array is, so I can't comment on that.
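To see why a plain trie is slower for this, here is a rough Java sketch of the obvious trie approach. The class name TrieMatcher and the "start:word" result format are made up for illustration. The point is the restart at every text position: the walk can cost up to the longest dictionary entry per position, so the total is roughly O(m × L) rather than O(m + k).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration: dictionary matching with a plain trie. */
class TrieMatcher {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String word; // non-null if a dictionary entry ends at this node
    }

    private final Node root = new Node();

    TrieMatcher(List<String> dictionary) {
        for (String w : dictionary) {
            Node node = root;
            for (char c : w.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.word = w;
        }
    }

    /** Returns matches as "start:word"; the trie walk restarts at every text position. */
    List<String> findAll(String text) {
        List<String> matches = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length(); i++) {
                node = node.children.get(text.charAt(i));
                if (node == null) break;          // no dictionary entry continues this way
                if (node.word != null) matches.add(start + ":" + node.word);
            }
        }
        return matches;
    }
}
```

For example, a matcher built from the names "ann", "anna" and "bob" reports 0:ann, 0:anna and 9:bob on the text "anna met bob".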

That said, it's non-trivial to code and I don't happen to know of any Java libraries that provide the necessary functions.


EDIT: In this case I want to make string matching between a large dictionary of names and a large set of natural language texts, in order to identify the names of the dictionary on texts.

This sounds like an application for the Aho-Corasick algorithm: construct an automaton from the dictionary (in linear time), which can then be used to find all the occurrences of any of the dictionary words in multiple texts (also in linear time).
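A compact Java sketch of that idea follows. It is not taken from any library: the class name AhoCorasick, the "start:word" result format and the HashMap-based nodes are all assumptions made for illustration. One simplification is worth flagging: each state copies the outputs of its failure state, which keeps the code short but can make construction more than linear for dictionaries with many nested words.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Hypothetical illustration of an Aho-Corasick automaton. */
class AhoCorasick {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> output = new ArrayList<>(); // words ending at this state
        Node fail; // state for the longest proper suffix that is also a trie path
    }

    private final Node root = new Node();

    AhoCorasick(List<String> dictionary) {
        // Phase 1: build the trie of dictionary words.
        for (String w : dictionary) {
            if (w.isEmpty()) continue;
            Node node = root;
            for (char c : w.toCharArray()) node = node.next.computeIfAbsent(c, k -> new Node());
            node.output.add(w);
        }
        // Phase 2: breadth-first pass to set failure links.
        Queue<Node> queue = new ArrayDeque<>();
        root.fail = root;
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(c)) f = f.fail;
                Node target = f.next.get(c);
                child.fail = (target != null) ? target : root;
                // Simplification: copy outputs instead of keeping dictionary-suffix links.
                child.output.addAll(child.fail.output);
                queue.add(child);
            }
        }
    }

    /** Scans the text once and returns every match as "start:word". */
    List<String> findAll(String text) {
        List<String> matches = new ArrayList<>();
        Node state = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (state != root && !state.next.containsKey(c)) state = state.fail;
            Node n = state.next.get(c);
            state = (n != null) ? n : root;
            for (String w : state.output) matches.add((i - w.length() + 1) + ":" + w);
        }
        return matches;
    }
}
```

The automaton is built once from the dictionary of names and can then be reused across all the texts; for instance, with the words "he", "she", "his" and "hers", scanning "ushers" yields 1:she, 2:he and 2:hers.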

(The description in these lecture notes, linked from the "External links" section of the Wikipedia page, is a lot easier to read than the description on the page itself.)


I prefer the suffix automaton. You can find more details on my website: http://www.fogsail.net/2019/03/06/20190306/


First, if you use the naive construction, it takes O(n^2) time to go through all the suffixes.

Instead, we use radix sort to sort the suffix array by first character.

Once the suffixes are sorted by their first character, we can reuse that information in the following steps.

The details are shown in the images (ignore the Chinese text).

We sort the array by the first key; the result is the part marked by the red rectangle.
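Since the images are missing here, the following Java sketch is only my reading of the description: it looks like the prefix-doubling construction, where the ranks from sorting by the first k characters are reused to sort by the first 2k. The class name PrefixDoubling is hypothetical, and the sketch swaps the radix sort for a comparison sort on rank pairs, which is simpler but costs O(n log^2 n) overall instead of O(n log n).

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical illustration: suffix array by prefix doubling (comparison sort, not radix sort). */
class PrefixDoubling {
    /** Returns the suffix start positions of text in lexicographic order. */
    static int[] suffixArray(String text) {
        int n = text.length();
        if (n == 0) return new int[0];
        Integer[] sa = new Integer[n];
        int[] rank = new int[n]; // rank[i]: rank of the suffix at i by its first k characters
        int[] tmp = new int[n];
        for (int i = 0; i < n; i++) {
            sa[i] = i;
            rank[i] = text.charAt(i); // round 0: order by first character
        }
        for (int k = 1; ; k <<= 1) {
            final int len = k;
            // Key is (rank of first len chars, rank of the next len chars), -1 past the end.
            Comparator<Integer> cmp = (a, b) -> {
                if (rank[a] != rank[b]) return Integer.compare(rank[a], rank[b]);
                int ra = a + len < n ? rank[a + len] : -1;
                int rb = b + len < n ? rank[b + len] : -1;
                return Integer.compare(ra, rb);
            };
            Arrays.sort(sa, cmp);
            // Re-rank: suffixes with equal keys share a rank.
            tmp[sa[0]] = 0;
            for (int i = 1; i < n; i++) {
                tmp[sa[i]] = tmp[sa[i - 1]] + (cmp.compare(sa[i - 1], sa[i]) < 0 ? 1 : 0);
            }
            System.arraycopy(tmp, 0, rank, 0, n);
            if (rank[sa[n - 1]] == n - 1) break; // all ranks distinct: fully sorted
        }
        int[] result = new int[n];
        for (int i = 0; i < n; i++) result[i] = sa[i];
        return result;
    }
}
```

For "banana" this gives 5, 3, 1, 0, 4, 2. Replacing Arrays.sort with two counting-sort passes over the rank pairs would bring it in line with the radix-sort version described above.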
