htmlparser problem

2023-02-21 20:22

Using htmlparser (http://htmlparser.sourceforge.net/), I have been trying to extract information (Content1 + link) from an HTML table.

Sample HTML:

<td class="xx">
    <a href="http://link">Content1</a>
</td>

Java code:

CssSelectorNodeFilter cssFilter = new CssSelectorNodeFilter("td[class=\"xx\"]");
NodeList nodes = parser.parse(cssFilter);

resultSet = new String[nodes.size()][2];

for (int i=0;i<nodes.size();i++) {
    resultSet[i][0]=nodes.elementAt(i).toPlainTextString().trim();

    LinkTag tag = (LinkTag) (nodes.elementAt(i));
    resultSet[i][1]=tag.getLink();
}

I can extract the first part (the Content1 string) with no problems, but I am having trouble getting the link. Depending on the variant I try, the cast to LinkTag fails, I get a NullPointerException, or the result is null. Each attempt is listed below, with its result stated before the code.

Attempt 1 (as above). Result: TableColumn cannot be cast to LinkTag

LinkTag tag = (LinkTag) (nodes.elementAt(i));
resultSet[i][1]=tag.getLink();

Attempt 2. Result: TextNode cannot be cast to LinkTag

LinkTag tag = (LinkTag) (nodes.elementAt(i).getFirstChild());
resultSet[i][1]=tag.getLink();

Attempt 3. Result: NullPointerException

LinkTag tag = (LinkTag) (nodes.elementAt(i).getFirstChild().getFirstChild());
resultSet[i][1]=tag.getLink();

Attempt 4. Result: returns null

Tag tag = (Tag) (nodes.elementAt(i));
resultSet[i][1]=tag.getAttribute("href");

Thanks for any ideas/solutions =)

If you print out the contents of the <TD> tag, you get:

Tag (27[2,8],42[2,23]): td class="xx"
  Txt (42[2,23],56[3,12]): \n
  Tag (56[3,12],75[3,31]): a href="foo.html"
    Txt (75[3,31],78[3,34]): bar
    End (78[3,34],82[3,38]): /a
  Txt (82[3,38],92[4,8]): \n
  End (92[4,8],97[4,13]): /td

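Therefore, what you want is the next sibling of the TD's first child: the first child is the whitespace text node (the line break after the opening tag), and the link tag follows it. Relying on a fixed position like this leaves you at the mercy of whatever formatting is in the table.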

To find the first link in the table data, you can use this code:

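import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.filters.CssSelectorNodeFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;

public class FirstLinkInCell { // class name chosen for illustration
    public static void main(String[] args) throws Exception {
        Parser parser = new Parser("file:test.html");
        CssSelectorNodeFilter cssFilter = new CssSelectorNodeFilter("td[class=\"xx\"]");
        NodeList nodes = parser.parse(cssFilter);
        String[][] resultSet = new String[nodes.size()][2];
        for (int i = 0; i < nodes.size(); i++) {
            Node n = nodes.elementAt(i);
            System.out.println(n); // DEBUG remove me!
            resultSet[i][0] = n.toPlainTextString().trim();
            resultSet[i][1] = null; // stays null if the cell contains no link
            // Walk the direct children of the TD and stop at the first link tag,
            // skipping the whitespace text nodes around it.
            Node c = n.getFirstChild();
            while (c != null) {
                if (c instanceof LinkTag) {
                    resultSet[i][1] = ((LinkTag) c).getLink();
                    break;
                }
                c = c.getNextSibling();
            }

            System.out.println(i + " text :" + resultSet[i][0]); // DEBUG remove me!
            System.out.println(i + " link :" + resultSet[i][1]); // DEBUG remove me!
        }
    }
}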
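
The loop above only looks at the direct children of the TD, so it would miss a link wrapped in another tag (for example <td><b><a href="...">). One way around that is to search the whole subtree instead of just the sibling chain. The following is a minimal sketch of that idea, not htmlparser code: SimpleNode, findFirstLink, flatCell and nestedCell are hypothetical stand-ins invented for illustration so that the example runs without the library.

```java
import java.util.ArrayList;
import java.util.List;

public class FirstLinkSketch {

    // Hypothetical stand-in for a parsed node; NOT the htmlparser Node interface.
    static final class SimpleNode {
        final String name;  // tag name, or "#text" for a text node
        final String href;  // link target for "a" nodes, otherwise null
        final List<SimpleNode> children = new ArrayList<>();

        SimpleNode(String name, String href) {
            this.name = name;
            this.href = href;
        }

        SimpleNode add(SimpleNode child) {
            children.add(child);
            return this; // allows chained construction
        }
    }

    // Depth-first, document-order search: returns the href of the first "a"
    // node that has one, at any depth below (or at) the given node, else null.
    static String findFirstLink(SimpleNode node) {
        if ("a".equals(node.name) && node.href != null) {
            return node.href;
        }
        for (SimpleNode child : node.children) {
            String found = findFirstLink(child);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    // Same shape as the dump above: text, link, text directly under the td.
    static SimpleNode flatCell() {
        return new SimpleNode("td", null)
            .add(new SimpleNode("#text", null))
            .add(new SimpleNode("a", "foo.html").add(new SimpleNode("#text", null)))
            .add(new SimpleNode("#text", null));
    }

    // The link is one level deeper, where a sibling-only walk would not see it.
    static SimpleNode nestedCell() {
        return new SimpleNode("td", null)
            .add(new SimpleNode("#text", null))
            .add(new SimpleNode("b", null).add(new SimpleNode("a", "foo.html")));
    }

    public static void main(String[] args) {
        System.out.println(findFirstLink(flatCell()));
        System.out.println(findFirstLink(nestedCell()));
        System.out.println(findFirstLink(new SimpleNode("td", null)));
    }
}
```

With htmlparser itself, the same recursion can presumably be written using the getFirstChild() and getNextSibling() calls already used above, descending into each child before moving on to its sibling.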
