Parse HTML in Android

I am trying to parse HTML for specific data, but I am running into problems that I think are caused by return characters. Because I know beforehand what I am looking for, I use a simple substring method to take the HTML apart.

Here is my parse method:

public static void parse(String response, String[] hashItem, String[][] startEnd) throws Exception
{
    for (int i = 0; i < hashItem.length; i++)
    {
        // Everything after the start marker startEnd[i][0]
        String part = response.substring(response.indexOf(startEnd[i][0]) + startEnd[i][0].length());
        // The value runs up to the end marker startEnd[i][1]
        String value = part.substring(0, part.indexOf(startEnd[i][1]));
        DATABASE.setHash(hashItem[i], value);
    }
}

Here is a sample of the HTML that is giving me issues:

<table cellspacing=0 cellpadding=2 class=smallfont>
<tr onclick="lu();" onmouseover="style.cursor='hand'">
<td class=bodybox nowrap>&nbsp;     21,773,177,147 $&nbsp;</td><td></td>
<td class=bodybox nowrap>&nbsp;        629,991,926 F&nbsp;</td><td></td>
<td class=bodybox nowrap>&nbsp;             24,537 P&nbsp;</td><td></td>
<td class=bodybox nowrap>&nbsp;                  0 T&nbsp;</td>
<td></td><td class=bodybox nowrap>&nbsp;RT&nbsp;</td>

The HTML contains hidden return characters, but when I try to include them in the marker strings I search for, it works poorly, if at all. Is there a method, or perhaps a better approach, for stripping hidden characters from the HTML to make it easier to parse? Any help is greatly appreciated, as always.


If you want to make parsing very easy, try Jsoup:

This example downloads the page, parses it, and reads the text of each matching cell.

Document doc = Jsoup.connect("http://jsoup.org").get();

Elements tds = doc.select("td.bodybox");

for (Element td : tds) {
  String tdText = td.text();
}


You can try XmlPullParser, which is available on Android. You can use a StringBuffer to append the characters that appear between tags.


Try using a regex to extract the information you want: http://java.sun.com/developer/technicalArticles/releases/1.4regex/

You could even use it to remove the hidden characters. Or maybe use String.replace to remove the newline characters?
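
As a rough sketch of that idea, the snippet below uses java.util.regex to pull out the text of each bodybox cell from the sample HTML, and String.replace / replaceAll to collapse &nbsp; entities and any hidden return or newline characters into single spaces. The class and method names (CellExtractor, normalize, extractCells) are made up for illustration, and the pattern assumes the cells look like the ones in the question; regexes on HTML are brittle, so treat it as a starting point rather than a general parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CellExtractor {
    // Captures the content of each <td class=bodybox ...> cell.
    // DOTALL lets '.' match across \r and \n inside a cell.
    private static final Pattern CELL = Pattern.compile(
            "<td\\s+class=bodybox[^>]*>(.*?)</td>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Replaces &nbsp; with a space, then collapses all whitespace runs
    // (including \r and \n) into one space and trims the ends.
    public static String normalize(String html) {
        return html.replace("&nbsp;", " ").replaceAll("\\s+", " ").trim();
    }

    // Returns the cleaned text of every bodybox cell, in document order.
    public static List<String> extractCells(String html) {
        List<String> cells = new ArrayList<>();
        Matcher m = CELL.matcher(html);
        while (m.find()) {
            cells.add(normalize(m.group(1)));
        }
        return cells;
    }

    public static void main(String[] args) {
        String html =
                "<td class=bodybox nowrap>&nbsp;     21,773,177,147 $&nbsp;</td><td></td>\r\n"
              + "<td class=bodybox nowrap>&nbsp;        629,991,926 F&nbsp;</td><td></td>\r\n";
        // prints [21,773,177,147 $, 629,991,926 F]
        System.out.println(extractCells(html));
    }
}
```

Because the whitespace is normalized after the match, the start and end markers no longer need to contain the return characters themselves.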


As far as I know, you can also parse the HTML file using an XMLReader, for example. Check this article: http://www.ibm.com/developerworks/xml/library/x-andbene1/
