Optimize a regex that matches two HTML tags
((<(\\s*?)(object|OBJECT|EMBED|embed))+(.*?)+((object|OBJECT|EMBED|embed)(\\s*?)>))
I need to extract object and embed tags from some HTML files stored locally on disk. I came up with the regex above to match the tags in Java, and I then use matcher.group(1) to get the entire tag and its contents.
Can anyone improve on this? Does anything stand out immediately that I should change?
It does work, by the way. I'd just like some input on whether it can be better, because I'm fairly new to regex.
Yes, here's the improvement:
Download a full-fledged Java HTML parser such as Jsoup and put it on the classpath.
Now you can select all <object> and <embed> elements as follows:

    Document document = Jsoup.parse(new File("/path/to/file.html"), "UTF-8");
    Elements elements = document.select("object,embed");
    for (Element element : elements) {
        System.out.println(element.outerHtml());
    }
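If adding a dependency really is not an option, the pattern from the question can at least be tightened. The following is only a rough sketch, not a substitute for a parser. The class name TagExtractor and the method extract are made up for illustration. It uses a case-insensitive flag instead of spelling out both cases, DOTALL so a tag may span several lines, and a backreference so the closing tag must have the same name as the opening one. It assumes attribute values contain no ">" and that objects are not nested. An <embed> with no closing tag is returned on its own only if no later </embed> exists in the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagExtractor {
    // Opening <object ...> or <embed ...> tag, optionally followed by
    // everything up to the nearest closing tag of the same name (\1).
    private static final Pattern TAG = Pattern.compile(
            "<\\s*(object|embed)\\b[^>]*>(?:.*?<\\s*/\\s*\\1\\s*>)?",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns every matched tag, including its contents when a closing tag is found.
    public static List<String> extract(String html) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = TAG.matcher(html);
        while (matcher.find()) {
            tags.add(matcher.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        String html = "<p>intro</p><object data=\"a.swf\">\n"
                + "<param name=\"x\">\n</object><embed src=\"b.swf\">";
        for (String tag : extract(html)) {
            System.out.println(tag);
        }
    }
}
```

Because an <embed> nested inside a matched <object> is consumed as part of that match, it is not reported separately. That is one of several places where the Jsoup selector above behaves more predictably.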
See also:
- Regular Expressions - Now you have two problems
- Parsing HTML - The Cthulhu way
- Pros and cons of HTML parsers in Java