Trying to parse links in an HTML directory listing using Java regex

Ok, I know everyone is going to tell me not to use RegEx for parsing HTML, but I'm programming on Android and don't have ready access to an HTML parser (that I'm aware of). Besides, this is server-generated HTML, which should be more consistent than user-generated HTML.

The regex looks like this:

Pattern patternMP3 = Pattern.compile(
        "<A HREF=\"[^\"]+.+\\.mp3</A>",
        Pattern.CASE_INSENSITIVE |
        Pattern.UNICODE_CASE);
Matcher matcherMP3 = patternMP3.matcher(HTML);
while (matcherMP3.find()) { ... }

The input HTML is all on one line, which is causing the problem. When the HTML is on separate lines this pattern works. Any suggestions?


The regex

"<A HREF=\"([^\"]+)\"[^>]*>([^<]+?)\\.mp3</A>"

should match your links, with the link target in group 1 and the file name (without the .mp3 extension) in group 2. Note, though, that the value of href does not necessarily need to be enclosed in quotes in HTML. (Or, if it does, neither browsers nor developers know that =).)
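To make the group behaviour concrete, here is a minimal sketch that runs this pattern over a single-line listing. The class name `Mp3LinkDemo`, the helper `extract`, and the sample HTML are made up for illustration; only the regex and the flags come from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Mp3LinkDemo {
    // Pattern from the answer above: group 1 = href value,
    // group 2 = link text without the ".mp3" extension.
    static final Pattern MP3_LINK = Pattern.compile(
            "<A HREF=\"([^\"]+)\"[^>]*>([^<]+?)\\.mp3</A>",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    // Hypothetical helper: returns "href -> name" for every matching link.
    static List<String> extract(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = MP3_LINK.matcher(html);
        while (m.find()) {
            found.add(m.group(1) + " -> " + m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up directory listing, all on one line.
        String html = "<a href=\"a.mp3\">a.mp3</a>"
                + "<a href=\"notes.txt\">notes.txt</a>"
                + "<a href=\"b%20c.mp3\">b c.mp3</a>";
        for (String link : extract(html)) {
            System.out.println(link);
        }
    }
}
```

Because `[^\"]+`, `[^>]*` and `[^<]+?` can never run past the end of the current tag, each `find()` stays inside one link even when the whole listing sits on one line. The non-.mp3 link is skipped.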


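You shouldn't be matching '.+', since you've already got [^\"]+ (which is better for your particular situation). On a single line, the greedy '.+' runs to the end of the input and backtracks only as far as the last .mp3</A>, so all the links collapse into one match.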

Try:

"<A HREF=\"[^\"]+\\.mp3\"[^>]*>[^<]*</A>"

Also, don't forget the double-quote after the mp3.
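The effect of the greedy '.+' in the question's pattern can be checked with a small sketch like the one below. The class `GreedyDemo`, the helper `countMatches`, and the sample input are assumptions made for illustration. `ORIGINAL` is the pattern from the question, and `TIGHTENED` is one possible variant built only from negated character classes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {
    // Pattern from the question: ".+" is greedy and may span several links.
    static final String ORIGINAL = "<A HREF=\"[^\"]+.+\\.mp3</A>";
    // Illustrative variant: no "." at all, so a match cannot leave its own tag.
    static final String TIGHTENED = "<A HREF=\"[^\"]+\"[^>]*>[^<]+\\.mp3</A>";

    // Hypothetical helper: counts how many times the regex is found in html.
    static int countMatches(String regex, String html) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String first = "<a href=\"a.mp3\">a.mp3</a>";
        String second = "<a href=\"b.mp3\">b.mp3</a>";

        // Same two links, once on one line and once split by a newline.
        String oneLine = first + second;
        String twoLines = first + "\n" + second;

        System.out.println("original, one line:   " + countMatches(ORIGINAL, oneLine));
        System.out.println("original, two lines:  " + countMatches(ORIGINAL, twoLines));
        System.out.println("tightened, one line:  " + countMatches(TIGHTENED, oneLine));
    }
}
```

By default '.' does not match a line terminator, which is why the original pattern appears to work when every link is on its own line: the newline stops '.+' before it can reach the next link.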


For your information, on Android you can parse HTML 'properly' with a combination of org.cyberneko.html.parsers.SAXParser, org.xml.sax.* and org.dom4j.*.

http://sourceforge.net/projects/nekohtml

http://www.saxproject.org

http://dom4j.sourceforge.net
