
Extract links from a web page

Using Java, how can I extract all the links from a given web page?


Download the page as plain text/HTML and pass it through Jsoup or HTML Cleaner. Both are similar and can parse even malformed HTML 4.0 syntax, and then you can use the familiar HTML DOM parsing methods like getElementsByTagName("a"). With Jsoup it's even simpler; you can just use:

import java.io.File;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Elements links = doc.select("a[href]");       // <a> elements with an href attribute
Elements pngs = doc.select("img[src$=.png]"); // <img> elements whose src ends in .png

Element masthead = doc.select("div.masthead").first(); // first <div> with class "masthead"

and find all the links, then get each link's details using

String linkHref = link.attr("href"); // where link is one Element from links

Taken from http://jsoup.org/cookbook/extracting-data/selector-syntax

The selectors have the same syntax as jQuery; if you know jQuery's function chaining, you will certainly love this.

EDIT: If you want more tutorials, you can try this one by mkyong:

http://www.mkyong.com/java/jsoup-html-parser-hello-world-examples/
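If you want to fetch a live page instead of a local file, a minimal sketch using Jsoup's connect method might look like this (http://example.com/ is just a placeholder):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Fetch the page over HTTP and print every link target.
// Note: get() can throw IOException, so handle or declare it.
Document doc = Jsoup.connect("http://example.com/").get();
for (Element link : doc.select("a[href]")) {
    System.out.println(link.attr("abs:href")); // "abs:" resolves relative URLs against the base
}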


Either use a regular expression and the appropriate classes, or use an HTML parser. Which one you want depends on whether you need to handle the whole web or just a few specific pages whose layout you know and can test against.

A simple regex which would match 99% of pages could be this:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The HTML page source, obtained elsewhere
String htmlPage = "";
Pattern linkPattern = Pattern.compile("(<a[^>]+>.+?</a>)", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher pageMatcher = linkPattern.matcher(htmlPage);
List<String> links = new ArrayList<String>();
while (pageMatcher.find()) {
    links.add(pageMatcher.group());
}
// The links list now contains every link in the page as a full HTML tag,
// i.e. <a att1="val1" ...>Text inside tag</a>

You can edit it to match more, be more standards-compliant, etc., but at that point you would want a real parser. If you are only interested in the href="" value and the text in between, you can also use this regex:

Pattern linkPattern = Pattern.compile("<a[^>]+href=[\"']?([^\"'>]+)[\"']?[^>]*>(.+?)</a>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

Then access the link part with .group(1) and the text part with .group(2).
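For example, a small sketch assuming htmlPage holds the page source as in the snippet above:

Matcher linkMatcher = linkPattern.matcher(htmlPage);
while (linkMatcher.find()) {
    String href = linkMatcher.group(1); // the link target
    String text = linkMatcher.group(2); // the anchor text (may still contain nested tags)
    System.out.println(href + " -> " + text);
}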


You can use the HTML Parser library to achieve this:

import java.util.LinkedList;
import java.util.List;

import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public static List<String> getLinksOnPage(final String url) {
    final List<String> result = new LinkedList<String>();

    try {
        // The Parser constructor can itself throw ParserException,
        // so construct it inside the try block
        final Parser htmlParser = new Parser(url);
        final NodeList tagNodeList = htmlParser.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class));
        for (int j = 0; j < tagNodeList.size(); j++) {
            final LinkTag loopLink = (LinkTag) tagNodeList.elementAt(j);
            result.add(loopLink.getLink()); // the href value of the <a> tag
        }
    } catch (ParserException e) {
        e.printStackTrace(); // TODO handle error
    }

    return result;
}
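A quick usage sketch (the URL is just an example):

List<String> links = getLinksOnPage("http://www.stackoverflow.com");
for (String link : links) {
    System.out.println(link);
}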


This simple example seems to work, using a regex from here

import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public ArrayList<String> extractUrlsFromString(String content)
{
    ArrayList<String> result = new ArrayList<String>();

    String regex = "(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";

    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(content);
    while (m.find())
    {
        result.add(m.group());
    }

    return result;
}

And if you need it, this seems to work for getting the HTML of a URL as well, returning null if it can't be fetched. It works fine with https URLs too.

import java.net.URL;

import org.apache.commons.io.IOUtils;

public String getUrlContentsAsString(String urlAsString)
{
    try
    {
        URL url = new URL(urlAsString);
        // Note: newer commons-io versions deprecate this overload in favor of
        // IOUtils.toString(url, StandardCharsets.UTF_8)
        String result = IOUtils.toString(url);
        return result;
    }
    catch (Exception e)
    {
        return null;
    }
}
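Putting the two helpers together, a hypothetical sketch (the URL is a placeholder):

String html = getUrlContentsAsString("https://example.com/");
if (html != null) {
    for (String link : extractUrlsFromString(html)) {
        System.out.println(link);
    }
}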


import java.io.*;
import java.net.*;

public class NameOfProgram {
    public static void main(String[] args) {
        URL url;
        InputStream is = null;
        BufferedReader br;
        String line;

        try {
            url = new URL("http://www.stackoverflow.com");
            is = url.openStream();  // throws an IOException
            br = new BufferedReader(new InputStreamReader(is));

            while ((line = br.readLine()) != null) {
                // crude filter: print any line that mentions an href attribute
                if (line.contains("href="))
                    System.out.println(line.trim());
            }
        } catch (MalformedURLException mue) {
             mue.printStackTrace();
        } catch (IOException ioe) {
             ioe.printStackTrace();
        } finally {
            try {
                if (is != null) is.close();
            } catch (IOException ioe) {
                // ignore the failure to close the stream
            }
        }
    }
}


You would probably need to use regular expressions to match the HTML link tags <a href=...> and </a>.
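A minimal sketch of that idea, essentially a simpler variant of the regex answers above (pageSource is a hypothetical String holding the HTML):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Match href="..." attributes anywhere in the page source.
Pattern hrefPattern = Pattern.compile("href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
Matcher hrefMatcher = hrefPattern.matcher(pageSource); // pageSource: the HTML as a String (assumed)
while (hrefMatcher.find()) {
    System.out.println(hrefMatcher.group(1)); // the URL inside the quotes
}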
