
How to cut specified words from a string

There is a list of banned words (or strings, to be more general) and another list with, let's say, users' mails. I would like to excise all banned words from all mails.

Trivial example:

foreach (string word in wordsList)
{
    for (int i = 0; i < mailList.Count; i++)
    {
        // string.Replace returns a new string, so the result must be stored
        mailList[i] = mailList[i].Replace(word, string.Empty);
    }
}

How can I improve this algorithm?


Thanks for the advice. I voted a few answers up, but I didn't mark any as the answer since this was more of a discussion than a single solution. Some people confused banned words with bad words; in my case I don't have to bother with recognizing 'sh1t' or anything like that.


Simple approaches to profanity filtering won't work - complex approaches don't work, for the most part, either.

What happens when you get a word like 'password' and you want to filter out 'ass'? What happens when some clever person writes 'a$$' instead - the intent is still clear, right?
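
For example, a naive string.Replace shows the substring pitfall (a hypothetical one-liner, not from the original post):

string s = "Enter your password".Replace("ass", String.Empty);
// s is now "Enter your pword" - an innocent word gets mangled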

See How do you implement a good profanity filter? for extensive discussion.


You could use RegEx to make things a little cleaner:

var bannedWords = @"\b(this|is|the|list|of|banned|words)\b";

foreach (string mail in mailList)
{
    var clean = Regex.Replace(mail, bannedWords, "", RegexOptions.IgnoreCase);
    // ... do something with clean ...
}

Even that, though, is far from perfect since people will always figure out a way around any type of filter.


You'll get the best performance by drawing up a finite state machine (FSM) (or generating one) and then parsing your input one character at a time, walking through the states.

You can do this pretty easily with a function that takes your next input character and your current state and returns the next state; you produce the output as you walk through the mail message's characters. You can draw the FSM on paper first.

Alternatively you could look into the Windows Workflow Foundation: State Machine Workflows.

That way you only need to walk each message a single time.
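
As a minimal sketch (assuming a single banned word, "spam"; a real filter over many words would add failure links, as in the Aho-Corasick automaton), the state can simply be the number of characters of the word matched so far:

using System.Text;

static string StripBanned(string input)
{
    const string banned = "spam";      // illustrative single banned word
    var output = new StringBuilder();
    int state = 0;                     // characters of "spam" matched so far

    foreach (char c in input)
    {
        if (c == banned[state])
        {
            state++;
            if (state == banned.Length)
                state = 0;             // full match: emit nothing, reset
        }
        else
        {
            output.Append(banned, 0, state);  // flush the partial match
            if (c == banned[0])
                state = 1;             // c may start a new match
            else
            {
                state = 0;
                output.Append(c);
            }
        }
    }
    output.Append(banned, 0, state);   // flush any partial match at the end
    return output.ToString();
}

Note that the simple restart after a mismatch is only correct for words like "spam" whose prefix never reappears inside the word; the general case needs KMP-style failure links.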


Constructing a regular expression from the words (word1|word2|word3|...) and using this instead of the outer loop might be faster, since then, every e-mail only needs to be parsed once. In addition, using regular expressions would enable you to remove only "complete words" by using the word boundary markers (\b(word1|word2|word3|...)\b).
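
For instance, a sketch of building such a pattern from the asker's wordsList (Regex.Escape guards against metacharacters in the banned words; the variable names are taken from the question):

using System.Linq;
using System.Text.RegularExpressions;

// One alternation pattern covering every banned word, matched as whole words.
var pattern = @"\b(" + string.Join("|", wordsList.Select(Regex.Escape)) + @")\b";
var regex = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);

for (int i = 0; i < mailList.Count; i++)
    mailList[i] = regex.Replace(mailList[i], string.Empty);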

In general, I don't think you will find a solution which is orders of magnitude faster than your current one: You will have to loop through all mails and you will have to search for all the words, there's no easy way around that.


A general algorithm would be to:

  1. Generate a list of tokens based on the input string (i.e. by treating whitespace as token separators)
  2. Compare each token against a list of banned words
  3. Replace matched tokens

A regular expression is convenient for identifying tokens, and a HashSet would provide quick lookups for your list of banned words. There is an overloaded Replace method on the Regex class that takes a function, where you could control the replace behavior based on your lookup.

HashSet<string> BannedWords = new HashSet<string>(StringComparer.InvariantCultureIgnoreCase)
{
    "bad",
};

string Input = "this is some bad text.";

// Match each word token; the MatchEvaluator masks banned words with 'x'
// characters of equal length and leaves everything else untouched.
string Output = Regex.Replace(Input, @"\b\w+\b",
    (Match m) => BannedWords.Contains(m.Value)
        ? new string('x', m.Value.Length)
        : m.Value);


Replacing it with * is annoying, but less annoying than something that removes the context of your intention by removing the word and leaving a malformed sentence. In discussing the Battle of Hastings, I'd be irritated if I saw William given the title "Grand ******* of Normandy", but at least I'd know I was playing in the small-kids playground, while his having the title of "Grand of Normandy" just looks like a mistake, or (worse) I might think that was actually his title.

Don't try replacing words with more innocuous words unless it's funny. People get the joke on 4chan, but Yahoo groups about history confused people when the 'medireview' and 'mediareview' periods came up in discussion: 'eval' (not profanity, but used in some XSS attacks Yahoo had been hit by) was being replaced with 'review' inside 'medieval' and 'mediaeval' (apparently, 'medireview' is the American spelling of 'mediareview'!).


In some circumstances it is possible to improve on it. Just for fun:

You can use a SortedList. If your mail list items are delimited (e.g. by ";"), you can do the following.

First, work out the running time of the original algorithm. Words: n items (each of O(1) length). Mail list: K items, each of average length Z. Each word in a mail item has average length Y, so the average number of words per mail item is m = Z/Y.

The original algorithm takes O(n * K * Z), even using a good string-search algorithm such as Knuth-Morris-Pratt.

1. Sort the words list: O(n log n).

2. For each mail item, split it with mailingListItem.Split(";".ToCharArray()): O(Z); then sort the resulting words: O(m log m). The total sorting takes O(K * Z) in the worst case, given that m log m << Z.

3. Use a merge algorithm to match the sorted banned words against each sorted mail item: O((m + n) * K).

The total time is O((m + n) * K + m * Z + n^2); given that m << n, the total running time is O(n^2 + Z * K) in the worst case, which is smaller than O(n * K * Z) if n < K * Z (I think so).

So if performance is very, very important, you can do this.
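
For illustration, a sketch of the merge step over two sorted word arrays (the names are illustrative; both arrays are assumed sorted with StringComparer.Ordinal, and note that sorting a mail's words discards their original order, so this shows only the matching step):

using System.Collections.Generic;

static List<string> KeepClean(string[] sortedMailWords, string[] sortedBanned)
{
    var kept = new List<string>();
    int i = 0, j = 0;
    // Walk both sorted lists in lockstep, like the merge step of mergesort.
    while (i < sortedMailWords.Length && j < sortedBanned.Length)
    {
        int cmp = string.CompareOrdinal(sortedMailWords[i], sortedBanned[j]);
        if (cmp < 0) kept.Add(sortedMailWords[i++]);    // clean word, keep it
        else if (cmp > 0) j++;                          // advance the banned list
        else i++;                                       // banned word, drop it
    }
    while (i < sortedMailWords.Length) kept.Add(sortedMailWords[i++]);
    return kept;
}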


You might consider using Regex instead of simple string matches, to avoid replacing partial content within words. A Regex would allow you to ensure you are only getting full words that match. You could use a pattern like this:

"\bBADWORD\b"

Also, you may want to iterate over the mailList on the outside, and the word list on the inner loop.
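
A sketch of that arrangement, compiling one word-boundary pattern per banned word up front (variable names borrowed from the question):

using System.Linq;
using System.Text.RegularExpressions;

// Compile each banned-word pattern once, then reuse it across all mails.
var regexes = wordsList
    .Select(w => new Regex(@"\b" + Regex.Escape(w) + @"\b",
                           RegexOptions.IgnoreCase | RegexOptions.Compiled))
    .ToList();

for (int i = 0; i < mailList.Count; i++)
    foreach (var re in regexes)
        mailList[i] = re.Replace(mailList[i], string.Empty);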


Wouldn't it be easier (and more efficient) to simply redact them by changing all their characters to * or something? That way no large string needs to be resized or moved around, and the recipients are made more aware of what happened, rather than getting nonsensical sentences with missing words.


Well, you certainly don't want to make the clbuttic mistake of using a naive string.Replace() to do it. The regex solution could work, although you'd either be iterating or using the pipe alternator (and I don't know if/how much that would slow your operation down, particularly for a large list of banned words). You could always just... not do it, since it's entirely futile no matter what - there are ways to make your intended words quite clear even without using the exact letters.

That, and it's ridiculous to have a list of words that "people find offensive" in the first place. There's someone who will be offended by pretty much any word.

/censorship is bullshit rant


I assume that you want to detect only complete words (separated by non-letter characters) and ignore words that merely contain a filter word as a substring (like the p[ass]word example). In that case you should build yourself a HashSet of filter words, scan the text for words, and for each word check whether it exists in the HashSet. If it's a filter word, build the resulting StringBuilder output without it (or with an equal number of asterisks).
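
A minimal sketch of that scan, assuming a case-insensitive HashSet and asterisk replacement (the method name is illustrative):

using System.Collections.Generic;
using System.Text;

static string Filter(string text, HashSet<string> banned)
{
    var result = new StringBuilder(text.Length);
    int i = 0;
    while (i < text.Length)
    {
        if (char.IsLetter(text[i]))
        {
            // Consume one complete word (a maximal run of letters).
            int start = i;
            while (i < text.Length && char.IsLetter(text[i])) i++;
            string word = text.Substring(start, i - start);
            // Mask banned words with asterisks, keep everything else.
            result.Append(banned.Contains(word)
                ? new string('*', word.Length)
                : word);
        }
        else
        {
            result.Append(text[i++]);  // non-letter separators pass through
        }
    }
    return result.ToString();
}

The HashSet would be constructed with StringComparer.OrdinalIgnoreCase so the lookup is case-insensitive.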


I had great results using this algorithm on codeproject.com - better than brute-force text replacements.
