How can I see if one string loosely contains another (case, extra whitespace, and punctuation ignored)?
I am writing a program in C# that compares strings similarly to the way that Google searches documents for keywords.
I am wanting a search for "stack overflow" to return true for "stack overflow" (plain), "This is the stack overflow." (in the middle), "Welcome to Stack Overflow." (case insensitive), "I like stack overflow." (variable whitespace), and "Who puts a dash in stack-overflow?", but not "stackoverflow" (no whitespace).
I was thinking that I could use a regular expression like "stack([ -]|. )+overflow", it seems overkill to have to replace every space in each keyword with a character set for each new keyword. Because "stack overflow" is not the only string I am searchi开发者_如何学Pythonng, I have to do it pragmatically.
To meet your specification, you could first do
newSearchString = Regex.Replace(Regex.Escape(searchString), @"\s+", @"[\s\p{P}]+");
(to transform your plain text search string into a regular expression that also allows punctuation in places where there used to be only whitespace), and then apply that regex to whatever text you're searching.
But of course this will fail to match on the slightest typo whereas an algorithm using Levensthein distance will also match "Stak Overfloor".
If you simply want to achieve the effect in the specific case you mentioned, you can use regular expressions to replace the tokens you want to ignore by a single space (or empty string).
If you want a more elaborate solution, you could use dynamic programming to get the smallest permutation required to transform the first string into the second. This will allow matching with (few) missing letters or typos, too.
If you are comparing against short strings, then the easiest way that I can see would be to strip out all of the white space and other characters from both strings and do a simple string.Contains
.
精彩评论