Is there a lax, permissive XML parser for PHP?

I'm looking for a parser that will let me successfully parse broken XML, taking a "best guess" approach. For instance:

    <thingy>
       <description>
           something <b>with</b> bogus<br> 
           markup not wrapped in CDATA
       </description>
    </thingy>

Ideally, it will yield a thingy object, with a description property and whatever tag soup inside.

Other suggestions on how to attack the problem (other than having valid markup to start with) welcome.

Non-PHP solutions (Beautiful Soup (Python), for instance) are not beyond the pale, but I'd prefer to stick to the prevailing skill set in the company.

Thanks!


You could use DOMDocument::loadHTML() (or DOMDocument::loadHTMLFile()) to convert your broken XML to proper XML. If you don't like dealing with DOMDocument objects, then use saveXML() and load the resulting XML string with SimpleXML.

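    $dom = new DOMDocument();
    // loadHTMLFile() returns false on failure; call it on an instance, not statically.
    if (!$dom->loadHTMLFile($filepath))
    {
        throw new Exception("Could not load the lax XML file");
    }
    // Now you can work with your XML file using the $dom object.

    // If you'd rather use SimpleXML, do the following.
    // Note: the HTML parser wraps the content in <html><body> elements.
    $xml = simplexml_load_string($dom->saveXML());
    unset($dom);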

I've tried this script:

    <?php
    $dom = new DOMDocument();
    if (!$dom->loadHTMLFile('badformatted.xml'))
    {
        die('error');
    }
    $nodes = $dom->getElementsByTagName('description');
    for ($i = 0; $i < $nodes->length; $i++)
    {
        echo "Node content: ".$nodes->item($i)->textContent."\n";
    }

The output when executing this from the CLI:

    carlos@marmolada:~/xml$ php test.php

    Warning: DOMDocument::loadHTMLFile(): Tag thingy invalid in badformatted.xml, line: 1 in /home/carlos/xml/test.php on line 3

    Warning: DOMDocument::loadHTMLFile(): Tag description invalid in badformatted.xml, line: 2 in /home/carlos/xml/test.php on line 3
    Node content:
                    something with bogus
                    markup not wrapped in CDATA

    carlos@marmolada:~/xml$

Edit: some minor corrections and error handling.

Edit 2: changed to a non-static call to avoid the E_STRICT error, and added a test case.
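
The "Tag ... invalid" warnings in the output above come from libxml objecting to tags that are not HTML. A minimal sketch, assuming you would rather collect those messages than have them printed, and that you want a SimpleXML object for the thingy element directly: it uses libxml_use_internal_errors() and simplexml_import_dom(). The `$broken` sample is copied from the question, and the variable names are only illustrative.

```php
<?php
// Sample input from the question, inlined so the sketch is self-contained.
$broken = <<<'XML'
<thingy>
   <description>
       something <b>with</b> bogus<br>
       markup not wrapped in CDATA
   </description>
</thingy>
XML;

// Collect libxml messages internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($broken);

// Keep the collected messages around in case you want to log them.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

if (!$loaded) {
    throw new Exception('Could not load the lax XML');
}

// The HTML parser wraps everything in <html><body>, so find the element by name.
$thingyNode = $dom->getElementsByTagName('thingy')->item(0);
$thingy = simplexml_import_dom($thingyNode);

// Child elements are now reachable as properties.
echo trim((string) $thingy->description->b), "\n";
echo $thingy->description->asXML(), "\n";
```

Importing the node rather than round-tripping through saveXML() skips the html and body wrappers, so `$thingy` is the thingy element itself.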


One alternative is to use the Tidy HTML library (PHP has a binding for it, the tidy extension) to clean the markup first. It survives quite a lot of fairly hideous input, and I've seen people use it for scraping rather ropey HTML before.
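
A rough sketch of the Tidy route, assuming the tidy extension is installed. `repairWithTidy()` is a hypothetical helper name, not part of the extension. Because Tidy parses the input as HTML here, the custom tags from the question are declared through the `new-blocklevel-tags` option so that they are kept.

```php
<?php
/**
 * Hypothetical helper: repairs tag soup with Tidy and returns XHTML.
 * Returns null when the tidy extension is missing or the repair fails.
 */
function repairWithTidy(string $markup): ?string
{
    if (!function_exists('tidy_repair_string')) {
        return null;
    }

    $config = [
        // Declare the non-HTML tags so Tidy does not reject them.
        'new-blocklevel-tags' => 'thingy, description',
        // Ask for well-formed XHTML output.
        'output-xhtml'        => true,
    ];

    $repaired = tidy_repair_string($markup, $config, 'utf8');

    return $repaired === false ? null : $repaired;
}

$broken = '<thingy><description>something <b>with</b> bogus<br> markup</description></thingy>';
$clean = repairWithTidy($broken);

echo $clean ?? "tidy extension not installed\n";
```

The result is a complete XHTML document, so the thingy element ends up inside the body element and any later lookup has to account for that.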
