Matching substrings from a dictionary to other string: suggestions?

2022-12-16 02:34 问答作者：

Hellow Stack Overflow people. I'd like some suggestions regarding the following problem. I am using Java.

I have an array #1 with a number of Strings. For example, two of the strings might be: "An apple fell on Newton's head" and "Apples grow on trees".

On the other side, I have another array #2 with terms like (Fruits => Apple, Orange, Peach; Items => Pen, Book; ...). I'd call this array my "dictionary".

By comparing items from one array to the other, I need to see in which "category" the items from #1 fall into from #2. E.g. Both from #1 would fall under "Fruits".

My most important consideration is speed. I need to do those operations fast. A structure allowing constant time retrieval would be good.

I considered a Hashset with the contains() method, but it doesn't allow substrings. I also tried running regex like (apple|orange|peach|...etc) with case insensitive flag on, but I read that it will not be fast when the terms increase in number (minimum 200 to be expected). Finally, I searched, and am considering using an ArrayList with indexOf() but I don't开发者_StackOverflow中文版 know about its performance. I also need to know which of the terms actually matched, so in this case, it would be "Apple".

Please provide your views, ideas and suggestions on this problem.

I saw Aho-Corasick algorithm, but the keywords/terms are very likely to change often. So I don't think I can use that. Oh, I'm no expert in text mining and maths, so please elaborate on complex concepts.

Thank you, Stack Overflow people, for your time! :)

If you use a multimap from Google Collections, they have a function to invert the map (so you can start with a map like {"Fruits" => [Apple]}, and produce a map with {"Apple" => ["Fruits"]}. So you can lookup the word and find a list of categories for it, in one call to the map.

I would expect I'd want to split the strings myself and lookup the words in the map one at a time, so that I could do stemming (adjusting for different word endings) and stopword-filtering. Using the map should get good lookup times, plus it's easy to try out.

Would a suffix tree or similar data structure work for your application? It offers O(m) string lookup, where m is the length of the search string, after an O(n²)--or better with some trickery--initial setup, and, with some extra effort, you can associate arbitrary data, such as a reference to a category, with complete words in your dictionary. If you don't want to code it yourself, I believe the BioJava library includes an implementation.

You can also add strings to a suffix tree after initial setup, although the cost will still be around O(n²). That's probably not a big deal if you're adding short words.

If you have only 200 terms to look for, regexps might actually work for you. Of course the regular expression is large, but if you compile it once and just use this compiled Pattern the lookup time is probably linear in the combined length of all the strings in array#1 and I don't see how you can hope for being better than that.

So the algorithm would be: concatenate the words of array#2 which you want to look for into the regular expression, compile it, and then find the matches in array#1 .

(Regular expressions are compiled into a state machine - that is on each character of the string it just does a table lookup for the next state. If the regular expression is complicated you might have backtracking that increases the time, but your regular expression has a very simple structure.)

Matching substrings from a dictionary to other string: suggestions?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？