Trying to parse links in an HTML directory listing using Java regex
OK, I know everyone is going to tell me not to use regex for parsing HTML, but I'm programming on Android and don't have ready access to an HTML parser (that I'm aware of). Besides, this is server-generated HTML, which should be more consistent than user-generated HTML.
The regex looks like this:
Pattern patternMP3 = Pattern.compile(
"<A HREF=\"[^\"]+.+\\.mp3</A>",
Pattern.CASE_INSENSITIVE |
Pattern.UNICODE_CASE);
Matcher matcherMP3 = patternMP3.matcher(HTML);
while (matcherMP3.find()) { ... }
The input HTML is all on one line, which is causing the problem. When the HTML is on separate lines this pattern works. Any suggestions?
The regex
"<A HREF=\"([^\"]+)\"[^>]*>([^<]+?)\\.mp3</A>"
should match your links, and have the link and the filename in its groups.
Note, though, that the value of the href attribute does not necessarily need to be enclosed in quotes in HTML. (Or, if it needs to be, neither browsers nor developers know that =). )
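As a rough illustration of how the two groups could be read back out, here is a minimal sketch. The class name Mp3LinkExtractor, the extract method, and the sample listing are made up for the example; only the pattern itself comes from the answer above, and it still assumes quoted href values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Mp3LinkExtractor {
    // Pattern from the answer: group 1 is the href value,
    // group 2 is the link text with the trailing ".mp3" stripped.
    private static final Pattern MP3_LINK = Pattern.compile(
            "<A HREF=\"([^\"]+)\"[^>]*>([^<]+?)\\.mp3</A>",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    /** Returns one {href, name} pair for every matching anchor in the input. */
    public static List<String[]> extract(String html) {
        List<String[]> links = new ArrayList<String[]>();
        Matcher m = MP3_LINK.matcher(html);
        while (m.find()) {
            links.add(new String[] { m.group(1), m.group(2) });
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical directory listing, all on one line.
        String html = "<a href=\"/music/one.mp3\">one.mp3</a><br>"
                + "<a href=\"/music/two.mp3\">two.mp3</a><br>";
        for (String[] link : extract(html)) {
            System.out.println(link[0] + " -> " + link[1]);
        }
    }
}
```

Because every part of the pattern is bounded by a negated character class ([^\"], [^>], [^<]), a match cannot run past the end of its own anchor, so it does not matter whether the listing is on one line or many.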
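You shouldn't be matching '.+' since you've already got [^\"]+ (which is better for your particular situation). On a single line, the greedy '.+' runs all the way to the last .mp3</A> in the input, so the whole listing ends up as one match.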
Try:
"<A HREF=\"[^\"]+\\.mp3\"[^>]*>[^<]*</A>"
Also, don't forget the double-quote after the mp3.
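To see why the one-line input behaves differently, here is a small sketch that counts matches for the pattern from the question and for a variant built only from negated character classes. The class name GreedyDemo, the countMatches helper, the sample listing, and the exact BOUNDED pattern are assumptions made for the illustration, not something taken from the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {
    static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;

    // Pattern from the question: ".+" is greedy and only stops at a line break.
    public static final Pattern GREEDY =
            Pattern.compile("<A HREF=\"[^\"]+.+\\.mp3</A>", FLAGS);

    // Assumed variant: every part is bounded by a negated character class,
    // so a match cannot extend past the end of its own anchor.
    public static final Pattern BOUNDED =
            Pattern.compile("<A HREF=\"[^\"]+\\.mp3\"[^>]*>[^<]*</A>", FLAGS);

    /** Counts how many times the pattern is found in the input. */
    public static int countMatches(Pattern p, String input) {
        Matcher m = p.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Hypothetical listing with two links.
        String a = "<a href=\"/music/one.mp3\">one.mp3</a><br>";
        String b = "<a href=\"/music/two.mp3\">two.mp3</a><br>";

        System.out.println(countMatches(GREEDY, a + b));        // one line: 1
        System.out.println(countMatches(GREEDY, a + "\n" + b)); // two lines: 2
        System.out.println(countMatches(BOUNDED, a + b));       // one line: 2
    }
}
```

With the input on separate lines, '.' cannot cross the line break, which is what kept the original pattern from swallowing more than one link per match.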
For your information, on Android you can parse HTML 'properly' with a combination of org.cyberneko.html.parsers.SAXParser, org.xml.sax.* and org.dom4j.*.
http://sourceforge.net/projects/nekohtml
http://www.saxproject.org
http://dom4j.sourceforge.net