Basic Profanity Filter in Objective C for iPhone
How have you like minded individuals tackled the basic challenge of filtering profanity, obviously one can't possibly tackle every scenario but it would be nice to have one at the most basic level开发者_如何学编程 as a first line of defense.
In Obj-c I've got
NSString *tokens = [text componentsSeparatedByString:@" "];
And then I loop through each token to see if any of the keywords (I've got about 400 in a list) are found within each token.
Realising False positives are also a problem, if the word is a perfect match, its flagged as profanity otherwise if more than 3 words with profanity are found without being perfect matches it is also flagged as profanity.
Later on I will use a webservice that tackles the problem more precisely, but I really just need something basic. So if you wrote the word penis it would go yup naughty naughty, bad word written.
Obscenity Filters: Bad Idea, or Incredibly Intercoursing Bad Idea?
Jeff has an interesting article to consider before embarking on such a piece of code:
http://www.codinghorror.com/blog/2008/10/obscenity-filters-bad-idea-or-incredibly-intercoursing-bad-idea.html
I just have a suggestion for tokenizing the string. Your ways works well if the words are all separated by strings but that is rarely the case in most usage scenarios as you would normally have to deal with newlines, punctuation, etc. Try this if you are interested:
NSMutableCharacterSet *separators = [NSMutableCharacterSet punctuationCharacterSet];
[separators formUnionWithCharacterSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];
NSArray *words = [bigString componentsSeparatedByCharactersInSet:separators];
Source: http://www.tech-recipes.com/rx/3418/cocoa-explode-break-nsstring-into-individual-words/
Well, searching in that manner is certainly not the most efficient way to search for profanity... a more efficient approach would be to construct a finite state automaton to detect the words, and run the text once through that FSA. You don't really need to split strings to find profanity, and all that splitting adds extra allocation and copying overhead that you don't need. Also, there may be common patterns in some of the blacklisted words, which you are not exploiting by searching each word individually.
That said, I think 400 words is quite a lot. Who, exactly, is your audience? What if a user has a medical question? Should such questions actually be disallowed? I can only think of a handful of words that would be considered profane in any context, so you might want to rethink the filtering.
A couple of things:
- FSA won't necessarily work depending on how intelligent you want the filter to be
- Regex are generally extremely slow depending on how many you want to run
- 400 words is somewhat low, depending on your needs and langauges
- There are a number of extremely tricky cases to be careful of when filtering, particularly embedding of words such as "ASSume"
My company, Inversoft, builds a commercial filtering solution and it is quite intelligent. It doesn't use regex or FSA, but has a custom built fast-linear processing technology that makes it extremely fast and accurate (4,000+ messages per second). It also has over 600 English words in a number of categories including Slang, Racial Slurs, Drug, Gang, Religious, etc.
If you are looking for an intelligent context-aware solution with support, you should check out Clean Speak from Inversoft. Hooking it up to Obj-C should be simple using the XML WebService.
精彩评论