Algorithm to find very common occurrences of substrings in a set of short strings

2023-04-13 00:32

I have a list of about 1500 strings from an external database. Over time, as a group of business users managed them, the strings came to contain recurring substrings that have semantic value.

I'm building a front-end and would like to present the user with a filtering drop-down list of those substrings.

For example, if I have the input strings:

US foo
US bar (Inactive)
UK bat
UK baz (Inactive)
AU womp
AU rat

I want to get back:

US
UK
AU
Inactive

My first thought is to have a threshold parameter and a list of delimiters. For the above I might say threshold=.3 and the delimiters are space, (, and ).

Then do a string.Split using the delimiters and use a data structure, like a set that counts repeated items (?), to tally the resulting pieces...

I am not trying to have someone do my work for me here - advice on the approach to take from someone who has done this would be great.

This problem is a good candidate for a LINQ approach:

var words = from s in listOfStrings
            from word in s.Split(new[] { ' ', '(', ')' }, StringSplitOptions.RemoveEmptyEntries)
            group word by word;
var dic = words.ToDictionary(g => g.Key, g => g.Count());
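To go from those counts to the list the question asks for, one option is to filter by the threshold. The following is a minimal sketch, assuming the threshold means the fraction of input strings that contain a token. The class name CommonTokens and the method Find are hypothetical, and each token is counted at most once per string so that a repeat inside a single string does not inflate its share.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CommonTokens
{
    // Hypothetical helper: returns the tokens that occur in at least
    // `threshold` (a fraction between 0 and 1) of the input strings.
    public static List<string> Find(IReadOnlyCollection<string> listOfStrings, double threshold)
    {
        var delimiters = new[] { ' ', '(', ')' };

        // Count, for each token, how many input strings contain it.
        var counts = listOfStrings
            .SelectMany(s => s.Split(delimiters, StringSplitOptions.RemoveEmptyEntries).Distinct())
            .GroupBy(word => word)
            .ToDictionary(g => g.Key, g => g.Count());

        // Keep only tokens whose share of the input meets the threshold,
        // most frequent first.
        return counts
            .Where(kv => (double)kv.Value / listOfStrings.Count >= threshold)
            .OrderByDescending(kv => kv.Value)
            .Select(kv => kv.Key)
            .ToList();
    }
}
```

With the six sample strings from the question and a threshold of 0.3, each of US, UK, AU and Inactive appears in 2 of 6 strings (about 0.33) and is kept, while foo, bar and the rest appear in 1 of 6 (about 0.17) and are dropped.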

A simple way would be something like you stated. Have a Dictionary<String, int> set up to contain your data. Then, it's easy:

for each word in string
   if word is in dictionary
      increment dictionary value
   else
      add to dictionary with value of 1

Then, simply filter that dictionary based on a threshold, or return the entries sorted by count. You may also choose to have an "ignore list" with common words you don't want to track.

Also, if you want case-insensitivity, construct the dictionary like this: new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
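Put together, the loop, the ignore list and the case-insensitive dictionary might look like the following sketch. TokenCounter, Count and the ignoreList parameter are made-up names for illustration, and the delimiters are the ones from the question.

```csharp
using System;
using System.Collections.Generic;

public static class TokenCounter
{
    // Hypothetical helper: counts tokens case-insensitively,
    // skipping any token that appears in ignoreList.
    public static Dictionary<string, int> Count(IEnumerable<string> lines, IEnumerable<string> ignoreList)
    {
        var delimiters = new[] { ' ', '(', ')' };
        var ignored = new HashSet<string>(ignoreList, StringComparer.OrdinalIgnoreCase);
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

        foreach (var line in lines)
        {
            foreach (var word in line.Split(delimiters, StringSplitOptions.RemoveEmptyEntries))
            {
                if (ignored.Contains(word))
                    continue;

                // TryGetValue leaves current at 0 when the word is new,
                // so one assignment covers both the "add" and "increment" cases.
                counts.TryGetValue(word, out var current);
                counts[word] = current + 1;
            }
        }

        return counts;
    }
}
```

The returned dictionary can then be filtered by a minimum count or sorted by value, as described above.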

var input = new List<string>();
input.Add("Foo"); // I'd go for splitting by delimiters as well
input.Add("Bar");
input.Add("Foo");
var results = input.Distinct(); // -> Foo, Bar

I'm not quite sure what your threshold is.
