How to define a regex to remove text-masked spam links ("spam1 dot com") from a Java String?
I have a list of sites that represent spam links:
List<String> bannedSites = Arrays.asList("spam1.com", "spam2.com", "spam3.com");
Is there a regex way of removing links matching these banned sites from this text:
Dear Arezzo,
Please check out my website at spam1.com or http://www.spam1.com
or http://spam1.com or spam1 dot com to win millions of dollars in prizes.
Thank you.
Big Spammer
Notice that the link may appear in multiple URL formats, which aioobe's solution does a good job of identifying:
String input = "Dear Arezzo,\n"
+ "Please check out my website at spam1.com or http://www.spam1.com\n"
+ "or http://spam1.com or spam1 dot com to win millions of dollars in prizes.\n"
+ "Thank you.";
List<String> bannedSites = Arrays.asList("spam1.com", "spam2.com", "spam3.com");
StringBuilder re = new StringBuilder();
for (String bannedSite : bannedSites) {
if (re.length() > 0)
re.append("|");
re.append(String.format("http://(www\\.)?%s\\S*|%1$s",
Pattern.quote(bannedSite)));
}
System.out.println(input.replaceAll(re.toString(), "LINK REMOVED"));
But while the code above works great for the URL formats spam1.com, http://www.spam1.com, and http://spam1.com, it misses the multiple text formats.
How can I modify the regex to target text formats such as these?
spam1 dot com
spam1[.com]
spam1 .com
spam1 . com
The idea is to produce a result like this:
Dear Arezzo,
Please check out my website at [LINK REMOVED] or [LINK REMOVED]
or [LINK REMOVED] or [LINK REMOVED] to win millions of dollars in prizes.
Thank you.
Big Spammer
As I remarked in the comments below, I probably don't need to ban the whole string spam1 dot com. If I can efface just the spam1 part so that it becomes [LINK REMOVED] dot com, that would do the job.
Here's a start for you.
import java.util.*;
import java.util.regex.Pattern;
class Test {
public static void main(String[] args) {
String input = "Dear Arezzo,\n"
+ "Please check out my website at spam1.com "
+ "or http://www.spam1.com or http://spam1.com or "
+ "spam1 dot com to win millions of dollars in prizes.\n"
+ "Thank you.";
List<String> bannedSites = Arrays.asList("spam1", "spam2", "spam3");
StringBuilder re = new StringBuilder();
for (String bannedSite : bannedSites) {
if (re.length() > 0)
re.append("|");
String quotedSite = Pattern.quote(bannedSite);
re.append("https?://(www\\.)?" + quotedSite + "\\S*");
re.append("|" + quotedSite + "\\s*(dot|\\.)?\\s*(com|net|org)");
//re.append("|" ... your variation here);
}
System.out.println(input.replaceAll(re.toString(), "LINK REMOVED"));
}
}
Output:
Dear Arezzo,
Please check out my website at LINK REMOVED or LINK REMOVED or LINK REMOVED or LINK REMOVED to win millions of dollars in prizes.
Thank you.
Extend the regular expression as needed.
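As one possible extension, the sketch below widens the second alternative so that it also tolerates the bracketed and spaced forms listed in the question (spam1[.com], spam1 .com, spam1 . com). The class and method names (MaskedLinkFilter, buildPattern, clean) are made up for illustration, and the bracket handling and the trailing word boundary are my own assumptions about what counts as a masked link, so treat it as a starting point rather than a complete filter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class MaskedLinkFilter {
    // Hypothetical helper: one alternation covering URL and text-masked forms.
    static Pattern buildPattern(List<String> bannedNames) {
        StringBuilder re = new StringBuilder();
        for (String name : bannedNames) {
            if (re.length() > 0)
                re.append("|");
            String q = Pattern.quote(name);
            // URL form: scheme, optional "www.", site name, rest of the token.
            re.append("https?://(?:www\\.)?").append(q).append("\\S*");
            // Masked form: optional spaces, optional '[', '.' or "dot",
            // optional spaces, a known TLD, then an optional closing ']'.
            // The \b keeps "spam1 dot comfortable" from matching.
            re.append("|").append(q)
              .append("\\s*\\[?\\s*(?:\\.|dot)\\s*(?:com|net|org)\\b(?:\\s*\\])?");
        }
        return Pattern.compile(re.toString(), Pattern.CASE_INSENSITIVE);
    }

    static String clean(String input, Pattern p) {
        return p.matcher(input).replaceAll("[LINK REMOVED]");
    }

    public static void main(String[] args) {
        Pattern p = buildPattern(Arrays.asList("spam1", "spam2", "spam3"));
        String[] samples = {"spam1 dot com", "spam1[.com]", "spam1 .com",
                            "spam1 . com", "http://www.spam1.com"};
        for (String s : samples) {
            System.out.println(clean("Visit " + s + " now", p));
        }
    }
}
```

The closing bracket is wrapped in its own optional group so that trailing whitespace is only consumed when a ']' actually follows; otherwise the space after the link would be swallowed by the replacement.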
I would suggest using a trie (http://en.wikipedia.org/wiki/Trie) data structure to store the blacklist of websites. While scanning the text, you can then compare against the trie and remove the banned sites. This should be more efficient than a regex, because with a regex you end up searching for each spam website string in the input text.
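A minimal sketch of that idea might look like the following. It is only one way to read the suggestion: the TrieFilter class and its methods are hypothetical names, matching is case-insensitive by assumption, and it only catches literal occurrences such as spam1.com. Masked forms like "spam1 dot com" would need to be normalized first or added to the trie as extra entries.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TrieFilter {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean terminal; // true if a banned site ends at this node
    }

    private final Node root = new Node();

    TrieFilter(List<String> bannedSites) {
        for (String site : bannedSites) {
            Node n = root;
            for (char c : site.toCharArray()) {
                n = n.next.computeIfAbsent(Character.toLowerCase(c), k -> new Node());
            }
            n.terminal = true;
        }
    }

    // Length of the longest banned site starting at index i, or 0 if none.
    private int longestMatchAt(String text, int i) {
        Node n = root;
        int best = 0;
        for (int j = i; j < text.length(); j++) {
            n = n.next.get(Character.toLowerCase(text.charAt(j)));
            if (n == null)
                break;
            if (n.terminal)
                best = j - i + 1;
        }
        return best;
    }

    // Single left-to-right pass; every banned site is checked at once per position.
    String clean(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int len = longestMatchAt(text, i);
            if (len > 0) {
                out.append("[LINK REMOVED]");
                i += len;
            } else {
                out.append(text.charAt(i++));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        TrieFilter f = new TrieFilter(Arrays.asList("spam1.com", "spam2.com"));
        System.out.println(f.clean("Visit spam1.com or http://www.spam2.com today"));
        // prints: Visit [LINK REMOVED] or http://www.[LINK REMOVED] today
    }
}
```

Note that this version leaves a prefix such as http://www. in place and replaces only the site itself, which is in line with the asker's remark that effacing just the site name would be enough.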
Using regular expressions for this purpose could prove a performance bottleneck as the list of spam sites, the total number of messages processed, and the message size increase.
The regular expression in the following test code works, but I would only use it after thorough testing and making all possible performance improvements.
final String[] spam = new String[] {"spam1.com", "spam2.net"};
System.out.println("***** SPAM SITES *****\n" + Arrays.toString(spam)
+ "\n");
final StringBuilder patternBuilder = new StringBuilder();
patternBuilder.append("(?i)(?:(?:f|ht)tps?://)?(?:\\S*?)(");
for (final String s : spam) {
patternBuilder
.append("(?:\\[|\\])?"
+ s.replaceAll("\\.",
"\\\\s*(?:\\\\[|\\\\])?\\\\s*(?:\\\\.|dot)\\\\s*(?:\\\\[|\\\\])?\\\\s*")
+ "\\s*(?:\\[|\\])?").append("|");
}
patternBuilder.setLength(patternBuilder.length() - 1);
patternBuilder.append(")(?:/\\S*)?(?=\\s|$)");
final String ps = patternBuilder.toString();
final String psLong = ps;
System.out.println("***** PATTERN *****\n" + psLong + "\n");
final Pattern p = Pattern.compile(ps);
for (String s : new String[] {"http://www.spam1.com",
"http://spam2.net", "www.spam1.com", "spam1 dot com",
"spam1[.com]", "spam1 .com", "spam2 . net", "no links here"})
{
final Matcher m = p.matcher(s);
if (m.matches()) {
System.out.println("Success: " + s);
} else {
System.out.println("Fail: " + s);
}
}
final String message =
"Dear Arezzo,\nPlease check out my website at spam1.com or http://www.spam1.com \nor http://spam1.com or spam1 dot com to win millions of dollars in prizes.\nThank you.\nBig Spammer\n";
final Matcher m = p.matcher(message);
System.out.println("\n\n***** ORIGINAL MESSAGE *****\n" + message
+ "\n\n***** REPLACED LINKS *****\n"
+ m.replaceAll("[LINK REMOVED]"));
Which outputs:
***** SPAM SITES *****
[spam1.com, spam2.net]
***** PATTERN *****
(?i)(?:(?:f|ht)tps?://)?(?:\S*?)((?:\[|\])?spam1\s*(?:\[|\])?\s*(?:\.|dot)\s*(?:\[|\])?\s*com\s*(?:\[|\])?|(?:\[|\])?spam2\s*(?:\[|\])?\s*(?:\.|dot)\s*(?:\[|\])?\s*net\s*(?:\[|\])?)(?:/\S*)?(?=\s|$)
Success: http://www.spam1.com
Success: http://spam2.net
Success: www.spam1.com
Success: spam1 dot com
Success: spam1[.com]
Success: spam1 .com
Success: spam2 . net
Fail: no links here
***** ORIGINAL MESSAGE *****
Dear Arezzo,
Please check out my website at spam1.com or http://www.spam1.com
or http://spam1.com or spam1 dot com to win millions of dollars in prizes.
Thank you.
Big Spammer
***** REPLACED LINKS *****
Dear Arezzo,
Please check out my website at [LINK REMOVED] or [LINK REMOVED]
or [LINK REMOVED] or [LINK REMOVED] to win millions of dollars in prizes.
Thank you.
Big Spammer
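On the performance caveat above, one inexpensive improvement is to compile the pattern once and reuse it for every message, since String.replaceAll compiles its regex argument on each call. The sketch below illustrates this with a deliberately shortened, hard-coded pattern; the SpamLinkScrubber class, its scrub method, and the pattern itself are placeholders for illustration and are not the full pattern built above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SpamLinkScrubber {
    // Compiled once at class load; Pattern instances are immutable and thread-safe.
    // Hypothetical, shortened pattern used only for this example.
    private static final Pattern BANNED = Pattern.compile(
        "(?i)(?:https?://)?(?:www\\.)?(?:spam1|spam2)\\s*(?:\\.|dot)\\s*(?:com|net)\\b");

    private static final String REPLACEMENT =
        Matcher.quoteReplacement("[LINK REMOVED]");

    static String scrub(String message) {
        // A new Matcher per call is cheap; the expensive compile step is not repeated.
        return BANNED.matcher(message).replaceAll(REPLACEMENT);
    }

    public static void main(String[] args) {
        System.out.println(scrub("go to spam1 dot com now"));
        System.out.println(scrub("see http://www.spam2.net today"));
    }
}
```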