How can I build an incremental directed acyclic word graph to store and search strings?

2022-12-20 17:49

I am trying to store a large list of strings in a concise manner so that they can be very quickly analyzed/searched through.

A directed acyclic word graph (DAWG) suits this purpose wonderfully. However, I do not have a list of the strings to include in the first place, so it must be incrementally buildable. Additionally, when I search through it for a string, I need to bring back data associated with the result (not just a boolean saying if it was present).

I have found information on a modification of the DAWG for string data tracking at http://www.pathcom.com/~vadco/adtdawg.html, but it looks extremely complex and I am not sure I am capable of writing it.

I have also found a few research papers describing incremental building algorithms, though I've found that research papers in general are not very helpful.

I don't think I am advanced enough to combine these two algorithms myself. Is there an existing, documented algorithm that offers both features, or an alternative algorithm with good memory use and speed?

I wrote the ADTDAWG web page. Adding words after construction is not an option. The structure is nothing more than 4 arrays of unsigned integer types. It was designed to be immutable for total CPU cache inclusion, and minimal multi-thread access complexity.

The structure is an automaton that forms a minimal and perfect hash function. It was built for speed while traversing recursively using an explicit stack.

As published, it supports up to 18 characters. Including all 26 English chars will require further augmentation.

My advice is to use a standard Trie, with an array index stored in each node. Ya, it is going to seem infantile, but each END_OF_WORD node represents only one word. The ADTDAWG is a solution to each END_OF_WORD node in a traditional DAWG representing many, many words.
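To make the suggestion concrete, here is a minimal sketch of a trie whose nodes carry an array index. It is only an illustration under my own assumptions: the names `TrieNode`, `IndexedTrie`, `add`, `find` and `records` are hypothetical, and using `-1` to mean "no word ends here" is a choice made for this example, not something taken from the ADTDAWG page.

```python
class TrieNode:
    """One node per character; index points into a separate data array."""
    __slots__ = ("children", "index")

    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.index = -1      # -1 means no word ends at this node (assumed convention)


class IndexedTrie:
    """Incrementally buildable trie; each end-of-word node maps to one record."""

    def __init__(self):
        self.root = TrieNode()
        self.records = []    # data associated with each stored word

    def add(self, word, data):
        """Insert word (or overwrite its data) and return its array index."""
        node = self.root
        for ch in word:
            child = node.children.get(ch)
            if child is None:
                child = TrieNode()
                node.children[ch] = child
            node = child
        if node.index == -1:
            node.index = len(self.records)
            self.records.append(data)
        else:
            self.records[node.index] = data
        return node.index

    def find(self, word):
        """Return the data stored for word, or None if word is absent."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return None
        if node.index == -1:
            return None      # word is only a prefix of stored words
        return self.records[node.index]
```

Because every end-of-word node belongs to exactly one word, the index identifies that word unambiguously, so words can be added at any time without rebuilding anything.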

Minimal and perfect hash tables are not the sort of thing that you can just put together on the fly.

I am looking for something else to work on, or a job, so contact me, and I'll do what I can. For now, all I can say is that it is unrealistic to use heavy optimization on a structure that is subject to being changed frequently.

Java

For graph problems which require persistence, I'd take a look at the Neo4j graph DB project. Neo4j is designed to store large graphs and allow incremental building and modification of the data, which seems to meet the criteria you describe.

They have some good examples to get you going quickly and there's usually example code to get you started with most problems.

They have a DAG example with a link at the bottom to the full source code.

C++

If you're using C++, a common solution to graph building/analysis is to use the Boost graph library. To persist your graph you could maintain a file based version of the graph in GraphML (for example) and read and write to that file as your graph changes.

You may also want to look at a trie structure for this (potentially building a radix-tree). It seems like a decent 'simple' alternative structure.

I'm suggesting this for a few reasons:

I don't fully understand the result you are after, so a simple, general structure seems the safer choice.
Definitely incremental to build.
Leaf nodes can contain any data you wish.
Subjectively, a simple algorithm.
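As a rough illustration of those points, the following is a minimal radix-tree sketch that can be built incrementally and stores arbitrary data at the node where a key ends. All names here (`RadixNode`, `RadixTree`, `insert`, `get`) are hypothetical and chosen for this example; it assumes keys are plain strings and makes no attempt at the suffix sharing a DAWG would give you.

```python
class RadixNode:
    def __init__(self):
        self.edges = {}         # first char of label -> (label, child node)
        self.value = None       # data attached to the key ending here
        self.has_value = False  # distinguishes "no key" from a stored None


def common_prefix_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


class RadixTree:
    def __init__(self):
        self.root = RadixNode()

    def insert(self, key, value):
        node = self.root
        while key:
            edge = node.edges.get(key[0])
            if edge is None:
                # No edge shares a first character: hang the rest as one edge.
                leaf = RadixNode()
                leaf.value, leaf.has_value = value, True
                node.edges[key[0]] = (key, leaf)
                return
            label, child = edge
            k = common_prefix_len(key, label)
            if k < len(label):
                # Key diverges inside the label: split the edge at position k.
                mid = RadixNode()
                mid.edges[label[k]] = (label[k:], child)
                node.edges[key[0]] = (label[:k], mid)
                child = mid
            node = child
            key = key[k:]
        node.value, node.has_value = value, True

    def get(self, key, default=None):
        node = self.root
        while key:
            edge = node.edges.get(key[0])
            if edge is None:
                return default
            label, child = edge
            if not key.startswith(label):
                return default
            key = key[len(label):]
            node = child
        return node.value if node.has_value else default
```

Compared with a plain trie, chains of single-child nodes collapse into one labelled edge, which usually reduces the node count for long keys with few shared prefixes.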
