How to cut specified words from string

2023-01-18 23:32 问答作者：

There is a list of banned words ( or strings to be more general) and another list with let's say users mails. I would like to excise all banned words from all mails.

Trivial example:

foreach(string word in wordsList)
{
   foreach(string mail in mailList)
   {
      mail.Replace(word,String.Empty);
   }
}

How I can improve this algorithm?

Thanks for advices. I voted few answers up but I didn't mark any 开发者_运维问答as answer since it was more like discussion than solution. Some people missed banned words with bad words. In my case I don't have to bother about recognize 'sh1t' or something like that.

Simple approaches to profanity filtering won't work - complex approaches don't work, for the most part, either.

What happens when you get a work like 'password' and you want to filter out 'ass'? What happens when some clever person writes 'a$$' instead - the intent is still clear, right?

See How do you implement a good profanity filter? for extensive discussion.

You could use RegEx to make things a little cleaner:

var bannedWords = @"\b(this|is|the|list|of|banned|words)\b";

foreach(mail in mailList)
    var clean = Regex.Replace(mail, bannedWords, "", RegexOptions.IgnoreCase);

Even that, though, is far from perfect since people will always figure out a way around any type of filter.

You'll get best performance by drawing up a finite state machine (FSM) (or generate one) and then parsing your input 1 character at a time and walking through the states.

You can do this pretty easily with a function that takes your next input char and your current state and that returns the next state, you also create output as you walk through the mail message's characters. You draw the FSM on a paper.

Alternatively you could look into the Windows Workflow Foundation: State Machine Workflows.

In that way you only need to walk each message a single time.

Constructing a regular expression from the words (word1|word2|word3|...) and using this instead of the outer loop might be faster, since then, every e-mail only needs to be parsed once. In addition, using regular expressions would enable you to remove only "complete words" by using the word boundary markers (\b(word1|word2|word3|...)\b).

In general, I don't think you will find a solution which is orders of magnitude faster than your current one: You will have to loop through all mails and you will have to search for all the words, there's no easy way around that.

A general algorithm would be to:

Generate a list of tokens based on the input string (ie. by treating whitespace as token separators)
Compare each token against a list of banned words
Replace matched tokens

A regular expression is convenient for identifying tokens, and a HashSet would provide quick lookups for your list of banned words. There is an overloaded Replace method on the Regex class that takes a function, where you could control the replace behavior based on your lookup.

HashSet<string> BannedWords = new HashSet<string>(StringComparer.InvariantCultureIgnoreCase)
{
    "bad",
};

string Input = "this is some bad text.";

string Output = Regex.Replace(Input, @"\b\w+\b", (Match m) => BannedWords.Contains(m.Value) ? new string('x', m.Value.Length) : m.Value);

Replacing it with * is annoying, but less annoying than something that removes the context of your intention by removing the word and leaving a malformed sentence. In discussing the Battle of Hastings, I'd be irritated if I saw William given the title "Grand ******* of Normandy", but at least I'd know I was playing in the small-kids playground, while his having the title of "Grand of Normandy" just looks like a mistake, or (worse) I might think that was actually his title.

Don't try replacing words with more innocuous words unless its funny. People get the joke on 4chan, but yahoo groups about history had confused people because the medireview and mediareview periods were being discussed when eval (not profanity, but is used in some XSS attacks that yahoo had been hit by) was replaced with review in medieval and mediaeval (apparantly, medireview is the American spelling of mediareview!).

In some circumstance is possible to improve it: Just for fun:

u can use SortedList, if ur mailing list is mailing list (because u have a delimiter like ";") u can do as bellow:

first calculate ur running time algorithm: Words: n item. (each item has an O(1) length). mailing list: K item. each item in mailing list average length of Z. each sub item in mailing list item average length of Y so the average number of subitems in mailing list items is m = Z/Y.

ur algorithm takes O(n*K*Z). // the best way with knut algorithm

1.now if u sort the words list in O(n log n).

2.1- use mailingListItem.Split(";".ToCharArray()) for each mailing list item: O(Z). 2.2- sort the items in mailing list: O(m * log m) total sorting takes O(K * Z) in worth case with respect to (m logm << Z).

3- use merge algorithm to merge items of bad word and specific mailing list: O((m + n) * k)

total time is O((m+n)*K + m*Z + n^2) with respect to m << n, total algorithm running time is O(n^2 + Z*K) in worth case, which is smaller than O(n*K*Z) if n < K * Z ( i think so).

So if performance is very very very important, u can do this.

You might consider using Regex instead of simple string matches, to avoid replacing partial content within words. A Regex would allow you to assure you are only getting full words that match. You could use a pattern like this:

"\bBADWORD\b"

Also, you may want to iterate over the mailList on the outside, and the word list on the inner loop.

Wouldn't it be easier (and more efficient) to simply redact them by changing all their characters to * or something? That way no large string needs to be resized or moved around, and the recipents are made more aware what happened, rather than getting nonsensical sentences with missing words.

Well, you certainly don' want to make the clbuttic mistake of naive string.Replace() to do it. The regex solution could work, although you'd either be iterating or using the pipe alternator (and I don't know if/how much that would slow your operation down, particularly for a large list of banned words). You could always just...not do it, since it's entirely futile no matter what--there are ways to make your intended words quite clear even without using the exact letters.

That, and it's ridiculous to have a list of words that "people find offensive" in the first place. There's someone who will be offended by pretty much any word

/censorship is bullshit rant

I assume that you want to detect only complete words (separated by non-letter characters) and ignore words with a filter-word substring (like a p[ass]word example). In that case you should build yourself a HashSet of filter-words, scan the text for words, and for each word check its existence in HashSet. If it's a filter word then build resulting StringBuilder object without it (or with an equal number of asterisks).

I had great results using this algorithm on codeproject.com better than brute force text replacments.

继续阅读：algorithm

How to cut specified words from string

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？