
Newbie regex question - detect spam

Here are my newbie regex questions:

  • How can I check if a string has 3 spam words? (for example: viagra, pills and shop)
  • How can I also detect variations of those spam words, like "v-iagra" or "v.iagra"? (one additional character)


Regex doesn't seem like quite the right hammer for this particular nail. For your list, you can simply put all of your blacklisted words in a sorted list of some kind and check each token against that list. Direct string operations are usually faster than invoking the regular expression engine du jour.
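As a rough sketch of that token scan in Perl: the version below swaps the sorted list for a hash, which gives constant-time exact lookups. The blacklist contents and the name blacklisted_words are made up for illustration, and tokens are split on whitespace only, so a token with trailing punctuation such as "shop." would not match here.

```perl
use strict;
use warnings;

# Hypothetical blacklist; the hash keys are the banned words.
my %blacklist = map { $_ => 1 } qw(viagra pills shop);

# Return the distinct blacklisted words found among the
# whitespace-separated tokens of $text (case-insensitive).
sub blacklisted_words {
    my ($text) = @_;
    my %found;
    for my $token (split ' ', lc $text) {
        $found{$token} = 1 if $blacklist{$token};
    }
    return sort keys %found;
}

my @hits = blacklisted_words("Cheap VIAGRA and pills at our shop");
print "spam\n" if @hits == 3;
```

No regular expression is involved in the lookup itself; each token is compared as a plain string.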

For your variations ("v-iagra", et al.), I'd remove all non-letter characters (as @Kinopiko suggested) and then run the tokens past your blacklist again. If you're wary of things like "viiagra", et cetera, I'd check out Aspell. It's a great library, and it looks like CPAN has a Perl binding.
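The strip-then-recheck idea might look something like this. It is only a sketch: normalize_token and is_blacklisted are hypothetical names, the blacklist is the example one from the question, and only the ASCII letters a-z are kept. Doubled letters such as "viiagra" are not handled; that is the case where a spell checker like Aspell would come in.

```perl
use strict;
use warnings;

# Hypothetical blacklist taken from the question's example words.
my %blacklist = map { $_ => 1 } qw(viagra pills shop);

# Lowercase a token and delete everything except a-z,
# so "V-iagra" and "v.iagra" both become "viagra".
sub normalize_token {
    my ($token) = @_;
    $token = lc $token;
    $token =~ tr/a-z//cd;    # c = complement the set, d = delete matches
    return $token;
}

# True (1) if the normalized token is on the blacklist, else 0.
sub is_blacklisted {
    my ($token) = @_;
    return $blacklist{ normalize_token($token) } ? 1 : 0;
}
```

Because digits are deleted too, a substitution like "v1agra" becomes "vagra" and slips through, so this only covers inserted punctuation.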


How can I check if a string has 3 spam words? (for example: viagra, pills and shop)

A regex to spot any one of those three words might look like this (Perl):

if ($string =~ /(viagra|pills|shop)/) {
    # spam
}

If you want to spot all three, a regex alone isn't really enough:

my %seen;
while ($string =~ /(viagra|pills|shop)/g) {
     $seen{$1} = 1;    # record each distinct word once, however often it repeats
}
if (scalar(keys %seen) == 3) {
     # spam: all three words are present
}

How can I also detect variations of those spam words, like "v-iagra" or "v.iagra"? (one additional character)

It's not so easy to do that with just a regex. You could try something like

 $string =~ s/\W//g;

to remove all non-word characters like . and -, and then check the string using the test above. Note that this strips spaces too, so the words run together.
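One possible way around the lost spaces is to strip only characters that are neither word characters nor whitespace, and then count distinct hits. This is a sketch under a few assumptions: distinct_spam_words is a made-up name, the \b word boundaries are an addition that the regex above does not have (they stop "shopping" from counting as "shop"), and because underscore counts as a word character, "v_iagra" is not caught.

```perl
use strict;
use warnings;

# Count how many of the three example words occur in $text after
# removing punctuation but keeping whitespace.
sub distinct_spam_words {
    my ($text) = @_;
    (my $clean = lc $text) =~ s/[^\w\s]//g;    # drop . - ! etc., keep spaces
    my %seen;
    $seen{$1} = 1 while $clean =~ /\b(viagra|pills|shop)\b/g;
    return scalar keys %seen;
}

print "spam\n" if distinct_spam_words("Buy v-iagra, p.ills at our shop!") == 3;
```

Keeping the whitespace means the remaining tokens can still be told apart, at the cost of missing variations that use a space as the extra character.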
