XML Parsing from Non-XML Document
In an XML or non-XML file there may be an XML block that I need to parse and replace with some other string. The scenario is something like this:
Some Text
<cnt:use name="abc" call="xyz">
<cnt:param name="x" value="2" />
</cnt:use>
Some Text
There is no guarantee that the document is a proper XML document (there may be unclosed tags, or other common mistakes that people make while typing HTML), so I can't use SAX or DOM. I can't even pass it to XSLT (am I right?). So what's the best way to extract the <cnt:*> part from the non-XML document, read it, and then replace it with something else?
I can't even pass it to XSLT (am I right ?).
Right. XSLT operates on an XML Infoset that is the representation of a parsed tree (XML document). And this text isn't in general parsable as XML.
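To see the same failure from PHP, here is a minimal sketch that feeds the sample text from the question to the strict XML parser behind DOMDocument::loadXML. The variable names are only illustrative; the point is that the input is rejected before any tree exists for XSLT (or DOM) to work on.

```php
<?php
// The sample from the question: free text around a <cnt:*> block.
$text = <<<TEXT
Some Text
<cnt:use name="abc" call="xyz">
<cnt:param name="x" value="2" />
</cnt:use>
Some Text
TEXT;

// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$ok  = $doc->loadXML($text);   // false: the text is not well-formed XML

// Print whatever the parser reported (wording depends on the libxml version).
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), "\n";
}
libxml_clear_errors();

var_dump($ok);
```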
In XSLT 2.0 there is a function unparsed-text() that can read any text, but then this text must still be parsed, and until XSLT 3.0 arrives there will be no functions that even vaguely resemble such parsing. Even when there are, they will fail, because the text isn't well-formed XML.
The whole problem of extracting pieces of XML out of non-well-formed XML is ambiguous and not well-defined. For example, if an ending tag is missing, how do you decide where exactly to insert it?
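One way to make the problem well-defined is to pick an explicit policy. The sketch below is an assumption about what such a policy could look like, not a general solution: it only replaces <cnt:*> blocks that are self-closing or have a matching end tag, and an unclosed <cnt:*> start tag is left in place. The function name replaceCntBlocks and its callback signature are hypothetical. A regular expression like this does not cope with a cnt tag nested inside another cnt tag of the same name, or with a ">" inside an attribute value.

```php
<?php
/**
 * Replace every top-level <cnt:*> block in $text with the string returned
 * by $callback. Hypothetical helper, not part of any library.
 */
function replaceCntBlocks($text, $callback)
{
    // <cnt:NAME ...> followed by either "/>" (self-closing) or
    // "> ... </cnt:NAME>" (shortest body up to the first matching end tag).
    $pattern = '~<cnt:([\w-]+)\b[^>]*?(?:/>|>.*?</cnt:\1\s*>)~s';

    return preg_replace_callback($pattern, function ($m) use ($callback) {
        // $m[0] is the whole block, $m[1] the tag name without the prefix.
        return call_user_func($callback, $m[1], $m[0]);
    }, $text);
}

$text = "Some Text\n"
      . "<cnt:use name=\"abc\" call=\"xyz\">\n"
      . "<cnt:param name=\"x\" value=\"2\" />\n"
      . "</cnt:use>\n"
      . "Some Text";

// Replace each block with its tag name in square brackets.
$result = replaceCntBlocks($text, function ($name, $block) {
    return '[' . $name . ']';
});

echo $result, "\n";
```

Here the inner <cnt:param /> is consumed as part of the outer <cnt:use> block, so the callback runs once and receives the whole block as its second argument.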
Hmm. The problem is that I have to implement it in PHP :( Super sad.
So, taking ideas from TagSoup, as mentioned in Mads Hansen's answer, I've made a mini SAX framework for PHP 5.3: https://github.com/neel/SuSAX/blob/master/sax.php.
I am keeping it SAX-like, while also tracking the tag nesting and keeping a parse tree. I've added a setNsFocus() method that specifies which tags to follow.
<?php
error_reporting(255);
ini_set('display_errors', 'On');
header('Content-Type: text/plain');

// Assumption: sax.php from the SuSAX repository linked above sits next to this script.
require_once 'sax.php';

class MyParser extends \SuSAX\AbstractParser{
    // Handler for an opening tag: prints namespace, name, indentation and parent name.
    public function open($tag){
        echo ">> open ".$tag->ns().':'.$tag->name().'/'.$this->indentation().($this->parent() ? $this->parent()->name() : '')."\n";
        return "OO";
    }

    // Handler for a closing tag.
    public function close($tag){
        echo ">> close ".$tag->ns().':'.$tag->name().'/'.$this->indentation()."\n";
    }

    // Handler for a standalone (self-closing) tag.
    public function standalone($tag){
        echo ">> standalone ".$tag->ns().':'.$tag->name().'/'.$this->indentation()."\n";
    }
}

$text = <<<TEXT
Hallo <b>W<html:i>o</html:i>rld</b>
<cnt:tag x="2" y="1">
<cnt:taga x="2" y="1"></cnt:taga>
</cnt:tag>
I am Here
TEXT;

$parser = new \SuSAX\Parser(new MyParser);
$parser->setNsFocus('cnt');   // follow only tags with the cnt prefix
$parser->setText($text);
$text_ = $parser->parse();
var_dump($text_);
?>
TagSoup - Just Keep On Truckin'
You could use TagSoup to ensure that all of the documents are well-formed.
...a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short.
TagSoup is designed for people who have to process this stuff using some semblance of a rational application design.
By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
If you are using Saxon, you can make TagSoup your parser by adding the following option:
...you can use the standard Saxon
-x org.ccil.cowan.tagsoup.Parser
option, after making sure that TagSoup is on your Java classpath.
Also, Taggle, a TagSoup in C++, available now
Actually, you can try DOMDocument::loadHTML, since that method accepts non-well-formed markup.
http://php.net/domdocument.loadhtml
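A minimal sketch of that suggestion, using the sample text from the question. It assumes the HTML parser keeps the prefixed name (for example "cnt:use") as the element name, which is why it matches on name() in XPath rather than on a real namespace; how unknown prefixed tags are reported may differ between libxml versions, so treat the result as something to check on your own setup.

```php
<?php
$text = <<<TEXT
Some Text
<cnt:use name="abc" call="xyz">
<cnt:param name="x" value="2" />
</cnt:use>
Some Text
TEXT;

// The HTML parser complains about unknown tags such as cnt:use;
// collect those messages instead of printing warnings.
libxml_use_internal_errors(true);

$doc    = new DOMDocument();
$loaded = $doc->loadHTML($text);
libxml_clear_errors();

// Assumption: elements keep their prefixed name, so select them by name().
$xpath = new DOMXPath($doc);
$names = array();
foreach ($xpath->query("//*[starts-with(name(), 'cnt:')]") as $element) {
    $names[] = $element->nodeName;
}

var_dump($loaded);
print_r($names);
```

Note that loadHTML wraps the input in html and body elements and normalizes the markup, so saving the document back does not reproduce the original text. This makes it better suited to reading the <cnt:*> blocks than to replacing them in place.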