
Loop over DOMDocument

I am following the suggestion from the question "Robust, Mature HTML Parser for PHP" to parse possibly malformed HTML with DOMDocument.

Is there an easy way to loop over the parsed document? I would like to loop over HTML like this:

$html='<ul>
         <li>value1</li>
         <li>value1</li>
         <li>value3
            <p>subvalue</p>
         </li>
        </ul>
        <p>hello world</p>';

$doc = new DOMDocument();
$doc->loadHTML($html);
???
foreach (??? as $node)
{
  print $node->nodeName.':'.$node->nodeValue;
}

And get results somewhat like this:

 ul:
 li:value1
 li:value2
 li:value3
 p:subvalue
 p:hello world

Using $doc->childNodes by itself doesn't do what I want, since it doesn't descend into the lower branches of the tree. I used the code suggested by halfdan and got results like this:

html:
html:value1
         value1
         value3
            subvalue

        hello world


Try this:

$doc = new DOMDocument();
$doc->loadHTML($html);
showDOMNode($doc);

function showDOMNode(DOMNode $domNode) {
    foreach ($domNode->childNodes as $node)
    {
        print $node->nodeName.':'.$node->nodeValue;
        if($node->hasChildNodes()) {
            showDOMNode($node);
        }
    }    
}
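
The output shown in the question comes from nodeValue, which on an element returns the concatenated text of all its descendants. Below is a minimal sketch of a variant that visits element nodes only and prints just the text that sits directly inside each one. The function name collectElements and the compacted $html string are my own, not part of the original answer. Note that loadHTML() wraps the fragment in html and body elements, so those show up as well.

```php
<?php
// Hypothetical helper: collects "name:text" lines for element nodes only.
function collectElements(DOMNode $node, array &$lines): void
{
    foreach ($node->childNodes as $child) {
        if ($child->nodeType !== XML_ELEMENT_NODE) {
            continue; // skip text, comment and doctype nodes
        }
        // Gather only the text that belongs directly to this element,
        // not the text of its descendants.
        $text = '';
        foreach ($child->childNodes as $grandChild) {
            if ($grandChild->nodeType === XML_TEXT_NODE) {
                $text .= $grandChild->nodeValue;
            }
        }
        $lines[] = $child->nodeName . ':' . trim($text);
        collectElements($child, $lines); // descend into the subtree
    }
}

// Assumed sample input, compacted from the question's HTML.
$html = '<ul><li>value1</li><li>value2</li><li>value3<p>subvalue</p></li></ul><p>hello world</p>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$lines = [];
collectElements($doc, $lines);
echo implode("\n", $lines), "\n";
```

If the html and body lines are unwanted, one option is to start the walk from the body element instead of from $doc.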


I was having issues with elements that contained character data: even elements that had no child elements reported that they had children.

I am not sure why that was.

The workaround I found was to change

if($node->hasChildNodes()) {
    showDOMNode($node);
}

to

if($node->childNodes->length != 1) {
    showDOMNode($node);
}

And the code now works perfectly.
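
A likely explanation is that in the DOM the text inside an element is itself a child node (a text node), so hasChildNodes() returns true for something like <li>value1</li>. The sketch below shows this and an alternative check based on nodeType. The helper name hasElementChildren is hypothetical. One caveat with the length != 1 test is that it would also skip an element whose only child is another element, such as a ul holding a single li.

```php
<?php
// Hypothetical helper: true only if $node has at least one element child.
function hasElementChildren(DOMNode $node): bool
{
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_ELEMENT_NODE) {
            return true;
        }
    }
    return false;
}

$doc = new DOMDocument();
$doc->loadHTML('<ul><li>value1</li></ul>');

$ul = $doc->getElementsByTagName('ul')->item(0);
$li = $doc->getElementsByTagName('li')->item(0);

// The li holds no elements, but its text counts as a child node.
var_dump($li->hasChildNodes());
var_dump($li->firstChild->nodeType === XML_TEXT_NODE);

// Checking for element children distinguishes the two cases.
var_dump(hasElementChildren($li));
var_dump(hasElementChildren($ul));
```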


You could use PHP Simple HTML DOM Parser with the following code:

<?php
require_once 'simplehtmldom/simple_html_dom.php';

function iterateHtmlElements($html)
{
    $dom = str_get_html($html);
    $dom->set_callback('handleElement');
    $dom->__toString();
    echo "\n";
}

function handleElement(simple_html_dom_node $elem)
{
    if($elem->tag == 'text') {
        echo $elem->innertext();
    }
    else {
        echo "\n" . $elem->tag . ": ";
    }
}

$html='<ul>
         <li>value1</li>
         <li>value1</li>
         <li>value3
            <p>subvalue</p>
         </li>
        </ul>
        <p>hello world</p>';
iterateHtmlElements($html);

It works exactly as expected. I checked it with the input you provided and got the following results:

> php test2.php

ul:
li: value1
li: value1
li: value3
p: subvalue
p: hello world


One way is to walk the tree as follows:

function next_node($node)
{
    if($node->firstChild != null)
    {
        return $node->firstChild;
    }

    if($node->nextSibling != null)
    {
        return $node->nextSibling;
    }

    for($node = $node->parentNode; $node != null; $node = $node->parentNode)
    {
        if($node->nextSibling != null)
        {
            return $node->nextSibling;
        }
    }

    return null;
}

for($node = $doc; $node != null; $node = next_node($node))
{
    // Handle the node here. This loop is safe for read-only access; if you
    // need to modify the tree, first save all the nodes in an array and
    // then iterate over that array.
    ...
}

This works for most documents; however, at times parentNode appears not to be set correctly, and next_node() then returns the wrong node.
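
If only the elements are of interest, a different technique from the manual walk above is to let DOMXPath do the traversal. This is a sketch under the assumption that visiting elements in document order is enough; the //* expression selects every element, and the compacted $html string is my own sample input.

```php
<?php
// Assumed sample input, compacted from the question's HTML.
$html = '<ul><li>value1</li><li>value2</li><li>value3<p>subvalue</p></li></ul><p>hello world</p>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

// Collect the name of every element in the document.
$names = [];
foreach ($xpath->query('//*') as $element) {
    $names[] = $element->nodeName;
}

echo implode(' ', $names), "\n";
```

Because the query result is a node list built up front, there is no parentNode or nextSibling bookkeeping to get wrong while iterating.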
