Trying to parse links in an HTML directory listing using Java
Can someone please help me parse these links from an HTML page?
- http://nemertes.lis.upatras.gr/dspace/handle/123456789/2299
- http://nemertes.lis.upatras.gr/dspace/handle/123456789/3154
- http://nemertes.lis.upatras.gr/dspace/handle/123456789/3158
I want to select them using the word "handle", which is common to all of these links.
I'm currently using this pattern:
Pattern pattern = Pattern.compile("<a.+href=\"(.+?)\"");
but it returns every href link on the page.
Any suggestions?
Thanks.

Your regular expression matches every <a href...> tag. "handle" always appears as "/dspace/handle", so you can use something like this to scrape the URLs you're looking for:
Pattern pattern = Pattern.compile("<a.+href=\"(/dspace/handle/.+?)\"");
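One caveat: the links listed in the question are absolute (they start with http://nemertes.lis.upatras.gr), and a pattern whose capture group begins with /dspace/handle/ only matches hrefs that are relative. The following is a minimal sketch that accepts both forms. The class and method names are hypothetical, and it assumes href values are double-quoted and each tag sits inside a single <a ...> with no '>' inside attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answers.
class HandleLinkExtractor {
    // [^>]*? keeps the match inside one <a ...> tag;
    // [^"]* allows an optional scheme and host before /dspace/handle/.
    private static final Pattern HANDLE_LINK = Pattern.compile(
            "<a\\b[^>]*?\\bhref=\"([^\"]*/dspace/handle/[^\"]+)\"",
            Pattern.CASE_INSENSITIVE);

    static List<String> extractHandleLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HANDLE_LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // group 1 is the href value
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/dspace/\">Home</a> "
                + "<a class=\"x\" href=\"http://nemertes.lis.upatras.gr/dspace/handle/123456789/2299\">A</a> "
                + "<a href=\"/dspace/handle/123456789/3154\">B</a>";
        // Prints the two handle links and skips the "Home" link.
        for (String link : extractHandleLinks(html)) {
            System.out.println(link);
        }
    }
}
```

Using [^>] and [^"] instead of '.' keeps each match from running past the end of a tag or attribute value, which matters when several links share one line.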
It looks like the '.+' after '<a' in your regex is matching more than you intended. Instead of
Pattern pattern = Pattern.compile("<a.+href=\"(.+?)\"");
Try:
Pattern pattern = Pattern.compile("<a\\s+href=\"(.+?)\"");
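The 'a.+' in your first pattern matches any character one or more times, and because it is greedy it can run past other tags on the same line before it reaches an href. If you intended to match whitespace, then use '\s+' instead (written as "\\s+" in a Java string literal).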
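To illustrate the difference, here is a small sketch that runs both patterns over one line containing two links. The class name, the helper method, and the sample HTML are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class GreedyDemo {
    // Collects capture group 1 from every match of regex in html.
    static List<String> hrefs(String regex, String html) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(html);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "<a href=\"/one\">1</a> <a href=\"/two\">2</a>";
        // Greedy ".+" skips ahead to the last href on the line.
        System.out.println(hrefs("<a.+href=\"(.+?)\"", line));   // prints [/two]
        // "\\s+" only allows whitespace between "<a" and "href".
        System.out.println(hrefs("<a\\s+href=\"(.+?)\"", line)); // prints [/one, /two]
    }
}
```

Note that the '\s+' version only matches tags where href is the first attribute; a tag such as <a class="x" href="..."> would be skipped.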
The following code works as expected:
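// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
String s = "<a href=\"http://nemertes.lis.upatras.gr/dspace/handle/123456789/2299\"/> " +
        "<a href=\"http://nemertes.lis.upatras.gr/dspace/handle/123456789/3154\" /> " +
        "<a href=\"http://nemertes.lis.upatras.gr/dspace/handle/123456789/3158\"/>";
// \\s+ requires whitespace between "<a" and "href"; (.+?) lazily captures up to the next quote.
Pattern p = Pattern.compile("<a\\s+href=\"(.+?)\"", Pattern.MULTILINE);
Matcher m = p.matcher(s);
while (m.find()) {
    // m.start() is the offset of the match in s; group(1) is the href value.
    System.out.println(m.start() + " : " + m.group(1));
}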
Output:
0 : http://nemertes.lis.upatras.gr/dspace/handle/123456789/2299
72 : http://nemertes.lis.upatras.gr/dspace/handle/123456789/3154
145 : http://nemertes.lis.upatras.gr/dspace/handle/123456789/3158