Algorithm to find very common occurences of substrings in a set of short strings
I have a list of about 1500 strings from an external database and over time, as a group of business users managed them, they came to have recurring substrings which have semantic value.
I'm building a front-end and would like to present the user with filtering drop down list of those substrings.
For example if I have the input strings:
- US foo
- US bar (Inactive)
- UK bat
- UK baz (Inactive)
- AU womp
- AU rat
I want to get back:
- US
- UK
- AU
- Inactive
My first thoughts are to have a threshold parameter and a list of delimeters. For the above I might say threshold=.3 and delimiters are space, (, and ).
Then do a string.split on using the delimiters and use a datastructure like a set that that counts repeated items (?)...
I am not trying to have some开发者_如何学编程one do my work for me here - advice on the approach to take from someone who has done this would be great.
This problem is a good candidate for a Linq approach:
var words = from s in listOfStrings
from word in s.Split(new[] { ' ', '(', ')' }, StringSplitOptions.RemoveEmptyEntries)
group word by word;
var dic = words.ToDictionary(g => g.Key, g => g.Count());
A simple way would be something like you stated. Have a Dictionary<String, int>
set up to contain your data. Then, it's easy:
for each word in string
if word is in dictionary
increment dictionary value
else
add to dictionary with value of 1
Then, simply filter that dictionary based on a threshold, or return the entries sorted by count. You may also choose to have an "ignore list" with common words you don't want to track.
Also, if you want case-insensitivity, construct the dictionary like this: new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
var input = new List<string>();
input.Add("Foo"); // I'd go for splitting by delimiters as well
input.Add("Bar");
input.Add("Foo");
var results = input.Distinct(); // -> Foo, Bar
I'm not quite sure what your threshold is.
精彩评论