XML Parsing from Non-XML Document

2023-02-02 04:54 Q&A author:

In an XML or non-XML file there may be some XML blocks that I need to parse and replace with some other string. The scenario is something like this:

Some Text
<cnt:use name="abc" call="xyz">
   <cnt:param name="x" value="2" />
</cnt:use>
Some Text

There is no guarantee that the document is proper XML (there may be unclosed tags, or other common mistakes that careless people make while typing HTML), so I can't use SAX or DOM. I can't even pass it to XSLT (am I right?). So what is the best way to extract the <cnt:*> part from the non-XML document, read it, and then replace it with something else?

I can't even pass it to XSLT (am I right?).

Right. XSLT operates on an XML Infoset, which is the representation of a parsed tree (an XML document), and this text isn't, in general, parsable as XML.

In XSLT 2.0 there is a function, unparsed-text(), that can read any text, but that text must then be parsed. Until XSLT 3.0 arrives there will be no functions that even vaguely resemble such parsing, and when there are, they will fail here, because the text isn't well-formed XML.

The whole problem of extracting pieces of XML out of non-well-formed XML is ambiguous and not well-defined. For example, if an ending tag is missing, how do you decide where exactly to insert it?
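To make the ambiguity concrete, here is a minimal regex-based sketch in PHP (the language the asker ends up using). The function name replaceNsBlocks is hypothetical and not part of any library. It assumes tags of the same name are not nested and that attribute values contain no ">" character. It replaces only blocks whose closing tag is present, plus self-closing tags, and leaves an unclosed opening tag untouched, precisely because there is no obviously right place to end it.

```php
<?php
/**
 * Replace every complete <prefix:name ...>...</prefix:name> block and every
 * self-closing <prefix:name ... /> tag with the callback's return value.
 * An opening tag without a matching closing tag is left as-is.
 * Hypothetical helper for illustration only.
 */
function replaceNsBlocks(string $text, string $prefix, callable $replace): string
{
    $p = preg_quote($prefix, '~');
    // \1 refers back to the tag name, so the closing tag must match the opening one.
    $pattern = '~<' . $p . ':([\w.-]+)\b[^>]*?(?:/>|>.*?</' . $p . ':\1\s*>)~s';

    return preg_replace_callback($pattern, function (array $m) use ($replace) {
        // $m[1] is the tag name, $m[0] the whole matched block.
        return $replace($m[1], $m[0]);
    }, $text);
}

$input = "Some Text\n"
       . "<cnt:use name=\"abc\" call=\"xyz\">\n"
       . "   <cnt:param name=\"x\" value=\"2\" />\n"
       . "</cnt:use>\n"
       . "Some Text <cnt:use name=\"broken\"> trailing";

echo replaceNsBlocks($input, 'cnt', function (string $name, string $block): string {
    return '[' . $name . ']';
}), "\n";
```

Note that a self-closing tag inside an unclosed block would still be matched and replaced on its own; whether that is desirable is exactly the kind of policy decision the format leaves open.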

Hmm. The problem is that I have to implement it in PHP :( . Super sad. So, taking ideas from TagSoup, as mentioned in Mads Hansen's answer, I've made a mini SAX framework for PHP 5.3: https://github.com/neel/SuSAX/blob/master/sax.php

I am keeping it close to SAX, while also tracking the tag nesting and keeping a parse tree. There is a setNsFocus() method that restricts the parser to the tags it should follow.

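<?php
// Assumption: sax.php from the SuSAX repository sits next to this script.
require_once 'sax.php';

error_reporting(E_ALL);
ini_set('display_errors', 'On');
header('Content-Type: text/plain');

// Handler that logs every event for the focused namespace.
class MyParser extends \SuSAX\AbstractParser{
    // Called for an opening tag; the returned string replaces the tag in the output.
    public function open($tag){
        echo ">> open ".$tag->ns().':'.$tag->name().'/'.$this->indentation().($this->parent() ? $this->parent()->name() : '')."\n";
        return "OO";
    }
    // Called for a closing tag.
    public function close($tag){
        echo ">> close ".$tag->ns().':'.$tag->name().'/'.$this->indentation()."\n";
    }
    // Called for a self-closing tag.
    public function standalone($tag){
        echo ">> standalone ".$tag->ns().':'.$tag->name().'/'.$this->indentation()."\n";
    }
}

// Mixed input: ordinary HTML plus the cnt: tags we care about.
$text = <<<TEXT
Hallo <b>W<html:i>o</html:i>rld</b>
<cnt:tag x="2" y="1">
<cnt:taga x="2" y="1"></cnt:taga>
</cnt:tag>
I am Here
TEXT;

$parser = new \SuSAX\Parser(new MyParser);
$parser->setNsFocus('cnt');   // follow only tags with the cnt prefix
$parser->setText($text);
$text_ = $parser->parse();    // text with the handled tags replaced
var_dump($text_);
?>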
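The idea described above (SAX-like events, a nesting stack, and a focus on one namespace prefix) can be sketched without any library. This is not the SuSAX implementation; the function name scanNsTags and the event format are hypothetical. One policy is assumed here for missing end tags: when a closing tag arrives, any inner tags still open are closed implicitly, and a closing tag with no matching opener is ignored.

```php
<?php
/**
 * Scan arbitrary text for tags with the given prefix and return SAX-like
 * events as [event, name, depth] triples. Everything else is skipped.
 * Hypothetical helper for illustration only.
 */
function scanNsTags(string $text, string $prefix): array
{
    $p = preg_quote($prefix, '~');
    // Group 1: "/" on a closing tag, group 2: tag name, group 3: "/" on a standalone tag.
    $pattern = '~<(/?)' . $p . ':([\w.-]+)\b[^<>]*?(/?)>~';
    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

    $events = [];
    $stack  = [];   // names of the tags that are currently open
    foreach ($matches as $m) {
        [, $closing, $name, $selfClosing] = $m;
        if ($closing === '/') {
            // Find the innermost open tag with this name.
            $idx = -1;
            for ($i = count($stack) - 1; $i >= 0; $i--) {
                if ($stack[$i] === $name) { $idx = $i; break; }
            }
            if ($idx < 0) {
                continue;   // stray closing tag: ignore it
            }
            // Close it, implicitly closing anything left open inside it.
            while (count($stack) > $idx) {
                $open = array_pop($stack);
                $events[] = ['close', $open, count($stack)];
            }
        } elseif ($selfClosing === '/') {
            $events[] = ['standalone', $name, count($stack)];
        } else {
            $events[] = ['open', $name, count($stack)];
            $stack[] = $name;
        }
    }
    return $events;
}

$sample = "Hallo <b>W<html:i>o</html:i>rld</b>\n"
        . "<cnt:tag x=\"2\" y=\"1\">\n"
        . "<cnt:taga x=\"2\" y=\"1\"></cnt:taga>\n"
        . "</cnt:tag>\n"
        . "I am Here";

foreach (scanNsTags($sample, 'cnt') as [$event, $name, $depth]) {
    echo str_repeat('  ', $depth), $event, ' cnt:', $name, "\n";
}
```

Tags still open at the end of the input are simply left on the stack here; a fuller version would have to decide whether to close or discard them.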

TagSoup - Just Keep On Truckin'

You could use TagSoup to ensure that all of the documents are well-formed.

...a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short.

TagSoup is designed for people who have to process this stuff using some semblance of a rational application design.

By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.

If you are using Saxon, you can make TagSoup your parser by adding the following option:

...you can use the standard Saxon -x org.ccil.cowan.tagsoup.Parser option, after making sure that TagSoup is on your Java classpath.

Also: Taggle, a TagSoup in C++, is available now.

Actually, you can try DOMDocument::loadHTML, since that method accepts non-well-formed markup.

http://php.net/domdocument.loadhtml
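A minimal sketch of that suggestion, assuming the PHP DOM extension is installed. The helper name repairHtml is hypothetical. It only shows that loadHTML tolerates an unclosed tag and that the serialized result is balanced; how the HTML parser treats prefixed, non-HTML tags such as <cnt:use> may vary, so check that on your own PHP and libxml versions before relying on it.

```php
<?php
/**
 * Parse possibly broken HTML with DOMDocument::loadHTML and serialize it
 * back, letting the parser close whatever was left open.
 * Hypothetical helper for illustration only.
 */
function repairHtml(string $html): string
{
    // Collect parser warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // saveHTML() returns the whole document, including the doctype and the
    // html/body wrappers that the parser adds.
    return $doc->saveHTML();
}

// The <b> element is never closed in the input.
echo repairHtml('<p>Hello <b>World</p>');
```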

Tags: design-patterns php xml xml-parsing
