
HTML Parser fetch link text

I'm using HTML Parser to fetch links from a web page. I need to store the URL, link text and the URL to the parent page containing the link. I have managed to get the link URL as well as the parent URL.

I still need to get the link text.

<a href="url">link text</a>

Unfortunately I'm having a hard time figuring it out. Any help would be greatly appreciated.

public static List<LinkContainer> findUrls(String resource) {
    String[] tagNames = {"A", "AREA"};
    List<LinkContainer> urls = new ArrayList<LinkContainer>();
    Tag tag;
    String url;
    String sourceUrl;

    try {

        for (String tagName : tagNames) {

            Parser parser = new Parser(resource);
            NodeList nodes = parser.parse(new TagNameFilter(tagName));

            NodeIterator i = nodes.elements();

            while (i.hasMoreNodes()) {
                tag = (Tag) i.nextNode();
                url = tag.getAttribute("href");
                sourceUrl = tag.getPage().getUrl();

                if (RegexUtil.verifyUrl(url)) {
                    urls.add(new LinkContainer(url, null, sourceUrl));
                }
            }
        }

    } catch (ParserException pe) {
        pe.printStackTrace();
    }

    return urls;
}


Have you tried ((LinkTag) tag).getLinkText()? Personally I prefer an HTML parser that produces XML according to a widely used standard, e.g. Xerces or similar. This is what you get from using, for example, http://nekohtml.sourceforge.net/.
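To illustrate the standards-based route: once a parser hands you an org.w3c.dom.Document, the link text is available through the regular DOM API. The sketch below is only an approximation of that setup. It uses the JDK's own XML parser on a well-formed fragment as a stand-in for a tolerant HTML parser, and the class and method names (DomLinkText, links, parseXhtml) are made up for the example. A real HTML parser may normalise tag-name case differently, so the name passed to getElementsByTagName may need adjusting.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DomLinkText {

    /** Returns one {href, link text} pair per <a> element in the document. */
    static List<String[]> links(Document doc) {
        List<String[]> out = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            // getTextContent() concatenates the text of all descendants,
            // so nested markup such as <b> inside the link is handled too.
            out.add(new String[] {a.getAttribute("href"), a.getTextContent()});
        }
        return out;
    }

    /** Stand-in for the HTML parser: only accepts well-formed markup. */
    static Document parseXhtml(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseXhtml("<p><a href=\"url\">link <b>text</b></a></p>");
        for (String[] link : links(doc)) {
            System.out.println(link[0] + " -> " + link[1]); // prints: url -> link text
        }
    }
}
```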


You would need to check the children of each A Tag. If you assume that your A tags only have a single child (the text itself), you can use the getFirstChild() method. This should be an instance of TextNode, and you can call getText() on this to get the link text.
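A minimal sketch of how this could look with HTML Parser itself, assuming the org.htmlparser library is on the classpath. It tries the LinkTag cast first and only then falls back to the first child. The class LinkTextSketch and its methods linkText and linkTexts are hypothetical helpers, not part of the library, and the sketch parses an in-memory string rather than a URL to keep it short.

```java
import java.util.ArrayList;
import java.util.List;
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.Text;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

class LinkTextSketch {

    /** Best-effort link text for a node returned by the tag-name filter. */
    static String linkText(Node node) {
        if (node instanceof LinkTag) {
            // A tags: let the library collect the text of all children.
            return ((LinkTag) node).getLinkText();
        }
        // Other tags: use the first child if it is a plain text node,
        // otherwise fall back to the node's plain-text rendering.
        Node first = node.getFirstChild();
        return (first instanceof Text) ? first.getText() : node.toPlainTextString();
    }

    /** Collects the text of every A tag in the given HTML string. */
    static List<String> linkTexts(String html) throws ParserException {
        Parser parser = Parser.createParser(html, "UTF-8");
        NodeList nodes = parser.parse(new TagNameFilter("A"));
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.size(); i++) {
            texts.add(linkText(nodes.elementAt(i)));
        }
        return texts;
    }

    public static void main(String[] args) throws ParserException {
        System.out.println(linkTexts("<a href=\"url\">link text</a>"));
    }
}
```

In the loop from the question, the same helper could presumably replace the null argument: new LinkContainer(url, linkText(tag), sourceUrl). Note that AREA tags carry no text content, so an empty string is the likely result for them.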
