Retrieving URLs from a webpage in Java

I have the most basic Java code to do an HTTP request and it works fine. I request data and a ton of HTML comes back. I want to retrieve all the URLs from that page and list them. For a simple first test I made it look like this:

// Find the start of the next "http://" and the closing quote after it.
int b = line.indexOf("http://", lastE);
int e = line.indexOf("\"", b);
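
For context, the snippet sits in a scan loop roughly like this (the loop and the result list are a sketch, not my exact code):

import java.util.ArrayList;
import java.util.List;

List<String> urls = new ArrayList<>();
int lastE = 0;
while (true) {
    int b = line.indexOf("http://", lastE);
    if (b < 0) {
        break; // no more URLs on this line
    }
    int e = line.indexOf("\"", b);
    if (e < 0) {
        break; // no closing quote after the URL
    }
    urls.add(line.substring(b, e));
    lastE = e; // continue scanning after this match
}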

This works, but as you can imagine it's horrible and only works in 80% of the cases. The only alternative I could come up with myself sounded slow and stupid. So my question is pretty much: how do I go from

String html

to

List<Url> 

?


import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Note: \w must be written as \\w inside a Java string literal.
Pattern p = Pattern.compile("http://[\\w./?=&%#~:-]++");
Matcher m = p.matcher(yourFetchedHtmlString);
while (m.find()) {
    String nextUrl = m.group(); // do whatever you want with it
}

You may also have to tweak the regex, as I have just written it without testing. This should be a very fast way to fetch URLs.
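
Tying it back to the question's String html to List shape, a complete method might look like this (the method name is illustrative):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

static List<String> extractUrls(String html) {
    List<String> urls = new ArrayList<>();
    // Collect every match of the pattern above into a list.
    Matcher m = Pattern.compile("http://[\\w./?=&%#~:-]++").matcher(html);
    while (m.find()) {
        urls.add(m.group());
    }
    return urls;
}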


I would try a library like HTML Parser to parse the HTML string and extract all the link tags from it.
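
For illustration, here is the same idea with jsoup, another HTML parsing library (the html string and base URL here are assumptions):

import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

List<String> urls = new ArrayList<>();
// The base URL lets jsoup resolve relative links to absolute ones.
Document doc = Jsoup.parse(html, "http://example.com/");
for (Element link : doc.select("a[href]")) {
    urls.add(link.attr("abs:href"));
}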


Your thinking is good, you are just missing some parts.

You should add some known extensions for URLs, like .html .aspx .php .htm .cgi .js .pl .asp

And if you want images too, then add .gif .jpg .png

I think you are doing it the best way; you just need to add more extension checking, as in the sketch below.

If you can post the full method code, I will be happy to help you make it better.
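
A minimal sketch of such an extension check (the exact extension list is up to you):

import java.util.Arrays;
import java.util.List;
import java.util.Locale;

static final List<String> EXTENSIONS = Arrays.asList(
        ".html", ".aspx", ".php", ".htm", ".cgi", ".js", ".pl", ".asp",
        ".gif", ".jpg", ".png");

static boolean hasKnownExtension(String url) {
    String path = url.toLowerCase(Locale.ROOT);
    // Ignore any query string when checking the extension.
    int q = path.indexOf('?');
    if (q >= 0) {
        path = path.substring(0, q);
    }
    for (String ext : EXTENSIONS) {
        if (path.endsWith(ext)) {
            return true;
        }
    }
    return false;
}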
