Using regex to match HTML
Here's an input HTML string:
<p>Johnny: My favorite color is pink<br />
Sarah: My favorite color is blue<br /> Johnny: Let's swap genders?<br /> Sarah: OK!<br /> </p>
I want to regex-match the bolded part above, i.e. the speaker names. Put simply, find any matches between ">" (or beginning of line) and ":".
I made this regex (?>)[^>](.+):
but it didn't work correctly: it matched the parts below, including the <p> tag. I don't want to match any HTML tags:
<p>Johnny: My favorite color is pink<br />
Sarah: My favorite color is blue<br /> Johnny: Let's swap genders?<br /> Sarah: OK!<br /> </p>
I am using Java, with code like this:
Matcher m = Pattern.compile("`(?>)[^>](.+):`", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL).matcher(string);
The following code should work:
String str = "<p>Johnny Smith: My favorite color abc: is pink<br />" +
"Sarah: My favorite color is dark: blue<br />" +
"Johnny: Let's swap: genders?<br />" +
"Sarah: OK: sure!<br />" +
"</p>";
Pattern p = Pattern.compile("(?:>|^)([\\w\\s]+)(?=:)", Pattern.MULTILINE);
Matcher m = p.matcher(str);
while (m.find()) {
    System.out.println(m.group(1));
}
OUTPUT
Johnny Smith
Sarah
Johnny
Sarah
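For anyone who wants to paste this into a file and run it, here is a minimal self-contained sketch of the same idea. It is a variant, not the exact pattern above: it uses a lookbehind for ">" so the whole match is just the name, and it trims whitespace so inputs with a space or newline after "<br />" also work. The class name SpeakerNames and the method extract are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this example.
class SpeakerNames {
    // (?:^|(?<=>))  the name starts at a line start or right after '>'
    // [\w\s]+       word characters and whitespace (so "Johnny Smith" works)
    // (?=:)         the name must be followed by ':' (the colon is not consumed)
    private static final Pattern NAME =
            Pattern.compile("(?:^|(?<=>))[\\w\\s]+(?=:)", Pattern.MULTILINE);

    // Returns every speaker name found in the HTML fragment, trimmed.
    static List<String> extract(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = NAME.matcher(html);
        while (m.find()) {
            names.add(m.group().trim());
        }
        return names;
    }

    public static void main(String[] args) {
        String str = "<p>Johnny Smith: My favorite color abc: is pink<br />" +
                "Sarah: My favorite color is dark: blue<br />" +
                "Johnny: Let's swap: genders?<br />" +
                "Sarah: OK: sure!<br />" +
                "</p>";
        System.out.println(extract(str)); // [Johnny Smith, Sarah, Johnny, Sarah]
    }
}
```

Colons in the middle of a sentence ("abc:", "dark:") are skipped because the text before them does not start right after a ">" or at a line start.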
If you only want to match a word that is followed by ':', then "\w+:" should be enough. If you also want to allow a preceding '>', you can try:
String s = "<p>Johnny: My favorite color is pink<br />" +
"Sarah: My favorite color is blue<br />" +
"Johnny: Let's swap genders?<br />" +
"Sarah: OK!<br />" +
"</p>";
Pattern p = Pattern.compile("[>]?(\\w+):");
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.println(m.start() + " : " + m.group(1));
}
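One thing to watch out for: because "[>]?" is optional, the pattern above also matches any word that happens to sit before a colon in the middle of a sentence. The sketch below compares it with a stricter variant that requires the word to start at a line start or right after ">". The input string, the class name PrefixDemo, and the STRICT pattern are assumptions made up for this comparison, not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, named only for this example.
class PrefixDemo {
    // The pattern from the answer: the '>' is optional, so it constrains nothing.
    static final Pattern LOOSE = Pattern.compile("[>]?(\\w+):");

    // Stricter variant (an assumption, not from the answer): the word must begin
    // at a line start or directly after '>', with optional whitespace in between.
    static final Pattern STRICT =
            Pattern.compile("(?:^|(?<=>))\\s*(\\w+):", Pattern.MULTILINE);

    // Collects group(1) of every match of p in input.
    static List<String> names(Pattern p, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up input with extra colons inside the sentences.
        String s = "<p>Johnny: My favorite color: pink<br />" +
                "Sarah: OK: sure!<br /></p>";
        System.out.println(names(LOOSE, s));  // [Johnny, color, Sarah, OK]
        System.out.println(names(STRICT, s)); // [Johnny, Sarah]
    }
}
```

Note that "\w+" only covers single-word names; a name like "Johnny Smith" would need a character class that also allows spaces.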