Newbie regex question - detect spam
Here are my regex newbie questions:
- How can I check if a string has 3 spam words? (for example: viagra, pills and shop)
- How can I also detect variations of those spam words like "v-iagra" or "v.iagra"? (one additional character)
Regex doesn't seem like quite the right hammer for this particular nail. For your list, you can simply throw all of your blacklisted words into a sorted list of some kind and check each token against that list. Plain string operations are generally faster than invoking the regular expression engine du jour.
For your variations ("v-iagra", et al.), I'd remove all non-letter characters (as @Kinopiko suggested) and then run the tokens past your blacklist again. If you're wary of things like "viiagra", et cetera, I'd check out Aspell. It's a great library, and it looks like CPAN has a Perl binding.
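As a rough illustration of the token-and-blacklist approach described above, here is a minimal Perl sketch. The blacklist contents, the `spam_words` name, and the choice of a hash (rather than a sorted list) are assumptions made for the example, not something prescribed by the answer.

```perl
use strict;
use warnings;

# Hypothetical blacklist; a hash makes each lookup a single key test.
my %blacklist = map { $_ => 1 } qw(viagra pills shop);

# Return the distinct blacklisted words found in a string.
sub spam_words {
    my ($text) = @_;
    my %found;
    for my $token (split ' ', lc $text) {
        # Drop everything that is not a letter, so "v-iagra" becomes "viagra".
        (my $clean = $token) =~ s/[^a-z]//g;
        $found{$clean} = 1 if $blacklist{$clean};
    }
    return sort keys %found;
}

my @hits = spam_words("Buy V.iagra and p-ills at our shop!");
print "spam\n" if @hits >= 3;
```

Because the cleanup is done per token, spaces between words are preserved and only exact blacklist entries count. Misspellings such as "viiagra" are not caught; that is where a spell checker like Aspell would come in.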
How can I check if a string has 3 spam words? (for example: viagra, pills and shop)
A regex to spot any one of those three words might look like this (Perl):
if ($string =~ /(viagra|pills|shop)/) {
    # spam
}
If you want to spot all three, a regex alone isn't really enough:
my %seen;
while ($string =~ /(viagra|pills|shop)/g) {
    $seen{$1} = 1;    # record each distinct word once, so repeats don't inflate the count
}
if (scalar(keys %seen) >= 3) {
    # spam: all three words appeared
}
How can I detect also variations of those spam words like "v-iagra" or "v.iagra" ? (one additional character)
It's not so easy to do that with just a regex. You could try something like
$string =~ s/\W//g;
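to remove all non-word characters like . and -, and then check the string using the test above. Note that this strips spaces too, so adjacent words run together and a match can span what were originally two separate words. Also, \W leaves digits and underscores in place, so "v_iagra" would still slip through.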
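If stripping the whole string is too blunt, one alternative is to build the tolerance into the pattern itself. The sketch below is only an illustration under stated assumptions: the helper names `tolerant_pattern` and `count_spam_words` are made up for the example, and the pattern allows one optional non-letter, non-space character between each pair of adjacent letters, which is looser than "exactly one additional character".

```perl
use strict;
use warnings;

# Turn "viagra" into v[^a-z\s]?i[^a-z\s]?a... so that one stray
# non-letter, non-space character may sit between adjacent letters.
sub tolerant_pattern {
    my ($word) = @_;
    return join '[^a-z\s]?', map { quotemeta($_) } split //, $word;
}

my $alternation = join '|', map { tolerant_pattern($_) } qw(viagra pills shop);
my $spam_re     = qr/($alternation)/i;

# Count how many distinct spam words (in any tolerated spelling) occur.
sub count_spam_words {
    my ($text) = @_;
    my %seen;
    while ($text =~ /$spam_re/g) {
        # Normalise the match back to its plain spelling before recording it.
        (my $word = lc $1) =~ s/[^a-z]//g;
        $seen{$word} = 1;
    }
    return scalar keys %seen;
}

print "spam\n" if count_spam_words("Cheap v-iagra, p.ills, s_hop") >= 3;
```

Excluding whitespace from the optional character keeps the pattern from matching across word boundaries (for instance, the "sh op" inside "dish operator"). Like the plain alternation, it still matches inside longer words, so "shopping" counts as "shop".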