XML Parsing from Non-XML Document

In a file that may or may not be XML, there can be XML blocks that I need to parse and replace with some other string. The scenario is something like this:

Some Text
<cnt:use name="abc" call="xyz">
   <cnt:param name="x" value="2" />
</cnt:use>
Some Text

There is no guarantee that the document is proper XML (there may be unclosed tags, or other common mistakes that people make while typing HTML), so I can't use SAX or DOM. I can't even pass it to XSLT (am I right?). So what's the best way to extract the <cnt:*> parts from the non-XML document, read them, and replace them with something else?


I can't even pass it to XSLT (am I right ?).

Right. XSLT operates on an XML Infoset that is the representation of a parsed tree (XML document). And this text isn't in general parsable as XML.

XSLT 2.0 has the function unparsed-text(), which can read any text, but that text must then be parsed. Until XSLT 3.0 arrives there will be no functions that even vaguely resemble such parsing, and when there are, they will fail here, because the text isn't well-formed XML.

The whole problem of extracting pieces of XML out of non-well-formed XML is ambiguous and not well defined. For example, if an ending tag is missing, how do you decide where exactly to insert it?
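
If the <cnt:*> blocks themselves can be assumed to be balanced, even though the surrounding text is not (an assumption the question does not guarantee), the ambiguity disappears for that subset. A minimal sketch under that assumption: locate each block with a regular expression, parse only that fragment as XML, and substitute a replacement. The helper names replaceCntBlocks() and parseCntFragment() and the namespace URI urn:example:cnt are made up for illustration. The pattern does not handle nested elements of the same name or a ">" inside an attribute value.

```php
<?php
// Hypothetical helper: replace every balanced <cnt:NAME ...>...</cnt:NAME>
// block, or self-closing <cnt:NAME ... /> tag, with $handler's return value.
function replaceCntBlocks(string $text, callable $handler): string
{
    $pattern = '~<cnt:([\w.-]+)(?=[\s/>])[^>]*?(?:/>|>.*?</cnt:\1\s*>)~s';
    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($handler) {
            return $handler($m[0]); // $m[0] is the whole matched block
        },
        $text
    );
    return $result ?? $text; // on a regex error, fall back to the input
}

// Hypothetical helper: parse one extracted block as XML. The wrapper element
// declares the cnt prefix (the URI is a placeholder) so the parser accepts it.
function parseCntFragment(string $fragment): ?DOMElement
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect errors quietly
    $ok = $doc->loadXML('<root xmlns:cnt="urn:example:cnt">' . $fragment . '</root>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$ok) {
        return null; // the fragment was not well formed after all
    }
    $first = $doc->documentElement->firstChild;
    return $first instanceof DOMElement ? $first : null;
}

$text = <<<TEXT
Some Text <b>unclosed
<cnt:use name="abc" call="xyz">
   <cnt:param name="x" value="2" />
</cnt:use>
Some Text
TEXT;

$output = replaceCntBlocks($text, function (string $fragment): string {
    $element = parseCntFragment($fragment);
    if ($element === null) {
        return $fragment; // leave unparsable blocks untouched
    }
    return '[' . $element->getAttribute('name') . ' -> ' . $element->getAttribute('call') . ']';
});

echo $output, "\n";
```

Here the unclosed <b> in the surrounding text is never parsed, so it does no harm, and the whole cnt:use block is replaced by [abc -> xyz]. A block whose own closing tag is missing is simply not matched, which sidesteps the "where to insert the end tag" question instead of answering it.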


Hmm. The problem is that I have to implement it in PHP :( . Super sad. So, taking ideas from TagSoup, as mentioned in Mads Hansen's answer, I've made a mini SAX framework on PHP 5.3: https://github.com/neel/SuSAX/blob/master/sax.php

I am keeping it close to SAX, while also tracking tag nesting and keeping a parse tree. There is a setNsFocus() method that specifies which tags to follow (here, only those with the cnt prefix).

<?php
error_reporting(255);
ini_set('display_errors','On');
header('Content-Type: text/plain');
class MyParser extends \SuSAX\AbstractParser{
    public function open($tag){
        echo ">> open ".$tag->ns().':'.$tag->name().'/'.$this->indentation().($this->parent() ? $this->parent()->name() : '')."\n";
        return "OO";
    }
    public function close($tag){
        echo ">> close ".$tag->ns().':'.$tag->name().'/'.$this->indentation()."\n";
    }
    public function standalone($tag){
        echo ">> standalone ".$tag->ns().':'.$tag->name().'/'.$this->indentation()."\n";
    }
}
$text = <<<TEXT
Hallo <b>W<html:i>o</html:i>rld</b>
<cnt:tag x="2" y="1">
<cnt:taga x="2" y="1"></cnt:taga>
</cnt:tag>
I am Here
TEXT;
$parser = new \SuSAX\Parser(new MyParser);
$parser->setNsFocus('cnt');
$parser->setText($text);
$text_ = $parser->parse();
var_dump($text_);
?>


TagSoup - Just Keep On Truckin'

You could use TagSoup to ensure that all of the documents are well-formed.

...a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short.

TagSoup is designed for people who have to process this stuff using some semblance of a rational application design.

By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.

If you are using Saxon, you can make TagSoup your parser by adding the following option:

...you can use the standard Saxon -x org.ccil.cowan.tagsoup.Parser option, after making sure that TagSoup is on your Java classpath.

There is also Taggle, a C++ version of TagSoup.


Actually, you can try DOMDocument::loadHTML, since that method accepts non-well-formed markup.

http://php.net/domdocument.loadhtml
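
A minimal sketch of that suggestion, using only plain HTML tags: the input has an unclosed <p> and <b>, and loadHTML() still builds a tree that can be queried. How the underlying HTML parser represents prefixed tags such as cnt:use can vary, so this example deliberately avoids them; check that behaviour on your own PHP/libxml build before relying on it.

```php
<?php
// Malformed input: neither <p> nor <b> is ever closed.
$html = '<p>Hello <b>World';

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The parser closes the open elements itself, so the tree is usable.
$bold = $doc->getElementsByTagName('b');
$boldCount = $bold->length;
$boldText = $bold->item(0)->textContent;

echo $boldCount, ' <b> element, text: ', $boldText, "\n";
```

The repaired tree can then be serialized with saveHTML() or walked with the usual DOM methods.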
