
Extract links from a web page

Using Java, how can I extract all the links from a given web page?


Download the page as plain text/HTML and pass it through Jsoup or HTML Cleaner. Both are similar and can parse even malformed HTML 4.0 syntax, and then you can use the familiar HTML DOM parsing methods like getElementsByTagName("a"). With Jsoup it's even simpler; you can just use:

import java.io.File;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Elements links = doc.select("a[href]");       // <a> elements with an href attribute
Elements pngs = doc.select("img[src$=.png]"); // <img> elements whose src ends in .png

Element masthead = doc.select("div.masthead").first(); // first <div> with class "masthead"

and find all the links, then get each link's details using

String linkHref = link.attr("href"); // where link is one Element from links

Taken from http://jsoup.org/cookbook/extracting-data/selector-syntax

The selectors have the same syntax as jQuery; if you know jQuery's function chaining, you will certainly love this.

EDIT: If you want more tutorials, you can try this one by mkyong:

http://www.mkyong.com/java/jsoup-html-parser-hello-world-examples/
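If you want to fetch a live page instead of a local file, a minimal sketch using Jsoup's connect method might look like this (http://example.com/ is just a placeholder):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Fetch the page over HTTP and print every link target.
// Note: get() can throw IOException, so handle or declare it.
Document doc = Jsoup.connect("http://example.com/").get();
for (Element link : doc.select("a[href]")) {
    System.out.println(link.attr("abs:href")); // "abs:" resolves relative URLs against the base
}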


Either use a regular expression and the appropriate classes, or use an HTML parser. Which one you want depends on whether you need to handle the whole web or just a few specific pages whose layout you know and can test against.

A simple regex which would match 99% of pages could be this:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The HTML page source, obtained elsewhere
String htmlPage = "";
Pattern linkPattern = Pattern.compile("(<a[^>]+>.+?</a>)", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher pageMatcher = linkPattern.matcher(htmlPage);
List<String> links = new ArrayList<String>();
while (pageMatcher.find()) {
    links.add(pageMatcher.group());
}
// The links list now contains every link in the page as a full HTML tag,
// i.e. <a att1="val1" ...>Text inside tag</a>

You can edit it to match more, be more standards-compliant, etc., but at that point you would want a real parser. If you are only interested in the href="" value and the text in between, you can also use this regex:

Pattern linkPattern = Pattern.compile("<a[^>]+href=[\"']?([^\"'>]+)[\"']?[^>]*>(.+?)</a>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

Then access the link part with .group(1) and the text part with .group(2).
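For example, a small sketch assuming htmlPage holds the page source as in the snippet above:

Matcher linkMatcher = linkPattern.matcher(htmlPage);
while (linkMatcher.find()) {
    String href = linkMatcher.group(1); // the link target
    String text = linkMatcher.group(2); // the anchor text (may still contain nested tags)
    System.out.println(href + " -> " + text);
}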


You can use the HTML Parser library to achieve this:

import java.util.LinkedList;
import java.util.List;

import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public static List<String> getLinksOnPage(final String url) {
    final List<String> result = new LinkedList<String>();

    try {
        // The Parser constructor can itself throw ParserException,
        // so construct it inside the try block
        final Parser htmlParser = new Parser(url);
        final NodeList tagNodeList = htmlParser.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class));
        for (int j = 0; j < tagNodeList.size(); j++) {
            final LinkTag loopLink = (LinkTag) tagNodeList.elementAt(j);
            result.add(loopLink.getLink()); // the href value of the <a> tag
        }
    } catch (ParserException e) {
        e.printStackTrace(); // TODO handle error
    }

    return result;
}
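A quick usage sketch (the URL is just an example):

List<String> links = getLinksOnPage("http://www.stackoverflow.com");
for (String link : links) {
    System.out.println(link);
}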


This simple example seems to work, using a regex from here

import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public ArrayList<String> extractUrlsFromString(String content)
{
    ArrayList<String> result = new ArrayList<String>();

    String regex = "(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";

    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(content);
    while (m.find())
    {
        result.add(m.group());
    }

    return result;
}

And if you need it, this seems to work for getting the HTML of a URL as well, returning null if it can't be fetched. It works fine with https URLs too.

import java.net.URL;

import org.apache.commons.io.IOUtils;

public String getUrlContentsAsString(String urlAsString)
{
    try
    {
        URL url = new URL(urlAsString);
        // Note: newer commons-io versions deprecate this overload in favor of
        // IOUtils.toString(url, StandardCharsets.UTF_8)
        String result = IOUtils.toString(url);
        return result;
    }
    catch (Exception e)
    {
        return null;
    }
}
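Putting the two helpers together, a hypothetical sketch (the URL is a placeholder):

String html = getUrlContentsAsString("https://example.com/");
if (html != null) {
    for (String link : extractUrlsFromString(html)) {
        System.out.println(link);
    }
}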


import java.io.*;
import java.net.*;

public class NameOfProgram {
    public static void main(String[] args) {
        URL url;
        InputStream is = null;
        BufferedReader br;
        String line;

        try {
            url = new URL("http://www.stackoverflow.com");
            is = url.openStream();  // throws an IOException
            br = new BufferedReader(new InputStreamReader(is));

            while ((line = br.readLine()) != null) {
                // crude filter: print any line that mentions an href attribute
                if (line.contains("href="))
                    System.out.println(line.trim());
            }
        } catch (MalformedURLException mue) {
             mue.printStackTrace();
        } catch (IOException ioe) {
             ioe.printStackTrace();
        } finally {
            try {
                if (is != null) is.close();
            } catch (IOException ioe) {
                // ignore the failure to close the stream
            }
        }
    }
}


You would probably need to use regular expressions to match the HTML link tags <a href=...> and </a>.
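A minimal sketch of that idea, essentially a simpler variant of the regex answers above (pageSource is a hypothetical String holding the HTML):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Match href="..." attributes anywhere in the page source.
Pattern hrefPattern = Pattern.compile("href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
Matcher hrefMatcher = hrefPattern.matcher(pageSource); // pageSource: the HTML as a String (assumed)
while (hrefMatcher.find()) {
    System.out.println(hrefMatcher.group(1)); // the URL inside the quotes
}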
