SAX Parser : Retrieving HTML tags from XML

I have an XML document to parse, shown below:

<feed>
    <feed_id>12941450184d2315fa63d6358242</feed_id>
    <content> <fieldset><table cellpadding='0'  border='0'  cellspacing='0'  style="clear :both"><tr valign='top' ><td width='35' ><a href='http://mypage.rediff.com/android/32868898'  class='space' onmousedown="return enc(this,'http://track.rediff.com/click?url=___http%3A%2F%2Fmypage.rediff.com%2Fandroid%2F32868898___&service=mypage_feeds&clientip=202.137.232.113&pos=0&feed_id=12941450184d2315fa63d6358242&prc_id=32868898&rowid=674061088')" ><div style='width:25px;height:25px;overflow:hidden;'><img src='http://socialimg04.rediff.com/image.php?uid=32868898&type=thumb'  width='25'  vspace='0'  /></div></a></td> <td><span><a href='http://mypage.rediff.com/android/32868898'  class="space" onmousedown="return enc(this,'http://track.rediff.com/click?url=___http%3A%2F%2Fmypage.rediff.com%2Fandroid%2F32868898___&service=mypage_feeds&clientip=202.137.232.113&pos=0&feed_id=12941450184d2315fa63d6358242&prc_id=32868898&rowid=674061088')" >Android </a> </span><span style='color:#000000 !important;'>testing</span><div class='divtext'></div></td></tr><tr><td height='5' ></td></tr></table></fieldset><br/></content>
    <action>status updated</action>
</feed>

The <content> tag holds HTML, which contains the data I need. I am using a SAX parser. Here is what I am doing:

private Timeline timeLine; //Object
private String tempStr;

public void characters(char[] ch, int start, int length)
        throws SAXException {
    tempStr = new String(ch, start, length);
}

public void endElement(String uri, String localName, String qName)
        throws SAXException {
    if (localName.equalsIgnoreCase("content")) {
        if (timeLine != null) {
            timeLine.setContent(tempStr);
        }
    }
}

Will this logic work? If not, how should I extract embedded HTML data from XML using a SAX parser?


You can parse the HTML with SAX as long as it is well-formed XML (that is, XHTML). There is a similar question on Stack Overflow that you can try: How to parse the html content in android using SAX PARSER


On start element: if the element is content, initialize your tempStr buffer. Otherwise, if content has already started, capture the current start element with its attributes and append it to the tempStr buffer.

On characters: if content has started, append the characters to the current string buffer.

On end element: if content has started, capture the end tag and append it to the string buffer.

My assumption:

The XML has only one content tag.
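The three steps above could be sketched roughly as follows. This is only an illustration under the same assumption (a single, non-nested content tag). The class name ContentMarkupHandler and the helpers escape, getContent and extract are made up for the example. It matches on qName, which the JDK's default (non-namespace-aware) parser populates; depending on how your parser is configured you may need localName instead.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: rebuilds the markup nested inside <content> as a string.
class ContentMarkupHandler extends DefaultHandler {
    private StringBuilder buffer;   // non-null only while inside <content>
    private String content;         // finished markup, set when </content> is seen

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (buffer == null) {
            // Outside <content>: start buffering only when <content> opens.
            if ("content".equals(qName)) {
                buffer = new StringBuilder();
            }
            return;
        }
        // Inside <content>: write the start tag and its attributes back out.
        buffer.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            buffer.append(' ').append(atts.getQName(i))
                  .append("=\"").append(escape(atts.getValue(i))).append('"');
        }
        buffer.append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per text node, so always append.
        if (buffer != null) {
            buffer.append(escape(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (buffer == null) {
            return;
        }
        if ("content".equals(qName)) {
            content = buffer.toString();
            buffer = null;
        } else {
            // Empty elements such as <br/> come back out as <br></br>.
            buffer.append("</").append(qName).append('>');
        }
    }

    // The parser hands us decoded text, so re-escape the markup characters.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    String getContent() {
        return content;
    }

    // Convenience wrapper for the example: parse a string and return the inner markup.
    static String extract(String xml) throws Exception {
        ContentMarkupHandler handler = new ContentMarkupHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getContent();
    }
}
```

Calling ContentMarkupHandler.extract(xml) on a well-formed feed returns the markup between the content tags, with attribute values re-quoted using double quotes. It only works if the embedded HTML is itself well-formed XML.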


If the html is actually xhtml, you can parse it using SAX and extract the xhtml contents of the <content> tag, but not nearly this easily.

You would have to make your handler actually respond to the events that will be raised by all the xhtml tags inside the <content> tag, and either build something resembling a DOM structure, which you could then serialize back out to xml form, or on-the-fly directly write into an xml string buffer replicating the contents.

If you modify your XML so that the HTML inside the content tag is wrapped in a CDATA section, as suggested in How to parse the html content in android using SAX PARSER, something not too far from your code should indeed work.

But you can't just assign the contents to your String tempStr variable in the characters method as you're doing, because characters may be called several times for a single element and each call would overwrite the previous chunk. You'll need a startElement method that initializes a buffer for the string on seeing the <content> tag, append to that buffer in the characters method, and store the result somewhere in endElement for the <content> tag.
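A minimal sketch of that buffer-based approach, assuming the feed has been changed so the HTML sits inside a CDATA section. The class name ContentTextHandler and the helpers getContent and extract are invented for the example, and it matches on qName as reported by the JDK's default parser. CDATA also sidesteps the unescaped & characters in the sample's URLs, which a conforming XML parser would otherwise reject.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the character data of <content> into one string.
class ContentTextHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private boolean inContent;
    private String content;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("content".equals(qName)) {
            inContent = true;
            buffer.setLength(0);   // start with an empty buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy the chunk now; the array must not be kept after this call returns.
        if (inContent) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("content".equals(qName)) {
            content = buffer.toString();
            inContent = false;
        }
    }

    String getContent() {
        return content;
    }

    // Convenience wrapper for the example: parse a string and return the CDATA text.
    static String extract(String xml) throws Exception {
        ContentTextHandler handler = new ContentTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getContent();
    }
}
```

In the question's code, the equivalent change would be to call timeLine.setContent(...) with the buffer's contents in endElement instead of with the last chunk seen by characters.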


I found a solution this way:

Note: In this solution I want to get the HTML content between the <chapter> tags (<chapter> ... html content ... </chapter>).

DefaultHandler handler = new DefaultHandler() {

    boolean chap = false;

    public char[] temp;
    int chapterStart;
    int chapterEnd;

    public void startElement(String uri, String localName,
            String qName, Attributes attributes)
            throws SAXException {

        System.out.println("Start Element :" + qName);

        if (qName.equalsIgnoreCase("chapter")) {
            chap = true;
        }
    }

    public void endElement(String uri, String localName,
            String qName) throws SAXException {

        if (qName.equalsIgnoreCase("chapter")) {
            System.out.println(new String(temp, chapterStart, chapterEnd - chapterStart));
        }
        System.out.println("End Element :" + qName);
    }

    public void characters(char ch[], int start, int length)
            throws SAXException {

        if (chap) {
            temp = ch;
            chapterStart = start;
            chap = false;
        }
        chapterEnd = start + length;
    }

};

Update:

My code has a bug: the parser may pass a different char[] (with a different length) on each call to characters, so keeping a reference to ch across callbacks is not reliable.
