Java regular expression matching HTML
Solution: this works:
String p = "<pre>[\\w\\W]*</pre>";
I want to match and capture the enclosing content of the <pre></pre> tag. I tried the following, but it is not working. What's wrong?
String p = "<pre>.*</pre>";
Matcher m = Pattern.compile(p, Pattern.MULTILINE | Pattern.CASE_INSENSITIVE).matcher(input);
if (m.find()) {
    String g = m.group(0);
    System.out.println("g is " + g);
}
Regex is in fact not the right tool for this. Use a parser. Jsoup is a nice one.
Document document = Jsoup.parse(html);
for (Element element : document.getElementsByTag("pre")) {
    System.out.println(element.text());
}
By the way, the parse() method can also take a URL or a File.
The reason I recommend Jsoup is that it is the least verbose of all the HTML parsers I tried. It not only provides JavaScript-like methods returning elements that implement Iterable, but also supports jQuery-like selectors, which was a big plus for me.
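You want the DOTALL flag, not MULTILINE. MULTILINE changes the behavior of ^ and $, while DOTALL is the one that lets . match line separators. Without DOTALL, .* stops at the first line break, so a <pre> block that spans several lines never matches. You probably want to use a reluctant quantifier, too, so that the match ends at the first </pre> rather than the last one: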
String p = "<pre>.*?</pre>";
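To make the two points above concrete, here is a minimal sketch that combines DOTALL with the reluctant quantifier and adds a capturing group so that only the content between the tags is returned. The class name PreTagDemo, the helper extractPreBlocks and the sample input are made up for illustration and are not part of the original question. It also assumes the tags carry no attributes (something like <pre class="x"> would not match), which is one more reason a parser is the safer choice for real HTML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PreTagDemo {
    // DOTALL lets '.' match line separators; CASE_INSENSITIVE also accepts <PRE>.
    // The reluctant '.*?' stops at the first closing tag instead of the last one.
    private static final Pattern PRE_BLOCK =
            Pattern.compile("<pre>(.*?)</pre>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the inner text of every <pre>...</pre> block.
    static List<String> extractPreBlocks(String input) {
        List<String> blocks = new ArrayList<>();
        Matcher m = PRE_BLOCK.matcher(input);
        while (m.find()) {
            blocks.add(m.group(1)); // group(1) is the content without the tags
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Illustrative input: one multi-line block and one upper-case block.
        String input = "<pre>first\nblock</pre> text <PRE>second</PRE>";
        for (String block : extractPreBlocks(input)) {
            System.out.println("g is " + block);
        }
    }
}
```

With a greedy .* the same input would produce a single match running from the first <pre> to the last </PRE>, which is usually not what you want when a page contains more than one block.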
String stringToSearch = "H1 FOUR H1 SCORE AND SEVEN YEARS AGO OUR FATHER...";
// the case-insensitive pattern we want to search for
Pattern p = Pattern.compile("H1", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(stringToSearch);
// see if we found a match
int count = 0;
while (m.find()) {
    count++;
}
System.out.println("H1 : "+count);