Removing inline elements when importing HTML into DOMDocument or SimpleXML?

2022-12-17 03:50 Q&A author:

I have an external HTML source that I want to scrape and either transform into a local XML file or add to a MySQL DB.

The external source is mostly normalized and (somewhat) semantic, so all I need to do is use XPath to get all td content or all li content, etc. The problem is that occasionally these items use <strong>, <b>, or <i> tags to style the elements I need.

This is technically semantic, since the point is to add emphasis to the specific text, and the developer might want to use CSS that isn't the browser default.

The problem is that the actual content I am trying to grab is considered a child of this inline element, so PHP extensions like SimpleXML or DOMDocument and DOMNode treat it as such. For example:

<table>
<tr><td>Thing 1</td><td>Thing 2</td></tr>
<tr><td>Thing 3</td><td>Thing 4</td></tr>
<tr><td><strong>Thing 5</strong></td><td><strong>Thing 6</strong></td></tr>
</table>

Will result in:

 [table] =>
    [tr] =>
        [td] => Thing 1
        [td] => Thing 2
    [tr] =>
        [td] => Thing 3
        [td] => Thing 4
    [tr] =>
        [td] => 
            [strong] => Thing 5
        [td] => 
            [strong] => Thing 6

Obviously the above is not quite what simplexml returns, but the above reflects the general problem.

So is there a way, using either a parameter already built into DOMDocument or a more sophisticated XPath query, to get the contents of the td element with any children (if there are any) stripped of their descendant status, so that all content is treated as the text of the queried element?

Right now, the only solutions I have are to either:

a) have a foreach loop that checks each result, like:


$result_text = ($result->strong) ? $result->strong : $result;

b) use a regex to strip any <strong> tags out of the HTML string before importing it into any pre-built classes like SimpleXML or DOMDocument.

Can't you just use strip_tags() to remove the extra markup?

$table = simplexml_load_string(
    '<table>
        <tr><td>Thing 1</td><td>Thing 2</td></tr>
        <tr><td>Thing 3</td><td>Thing 4</td></tr>
        <tr><td><strong>Thing 5</strong></td><td><strong>Thing 6</strong></td></tr>
    </table>'
);

foreach ($table->xpath('//td') as $td)
{
    $content = strip_tags($td->asXML());
    echo $content, "\n";
}
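A variation on the same idea, as a minimal sketch: because asXML() returns serialized markup, strip_tags() leaves entities such as &amp; escaped. Converting each SimpleXMLElement with dom_import_simplexml() and reading textContent avoids that. The sample table below, including the mixed-content cell and the "A &amp; B" cell, is hypothetical and only there to illustrate the behaviour.

```php
<?php
// Hypothetical sample markup: plain, wrapped, mixed-content and entity cells.
$table = simplexml_load_string(
    '<table>
        <tr><td>Thing 1</td><td><strong>Thing 5</strong></td></tr>
        <tr><td>Thing <b>7</b> and <i>8</i></td><td>A &amp; B</td></tr>
    </table>'
);

$texts = [];
foreach ($table->xpath('//td') as $td) {
    // dom_import_simplexml() gives the DOMElement behind the SimpleXMLElement;
    // textContent concatenates the text of the node and all of its descendants,
    // with entities already decoded.
    $texts[] = dom_import_simplexml($td)->textContent;
}

print_r($texts);
```

This needs both the SimpleXML and DOM extensions, which are enabled in most PHP builds.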

Please read the first answer to this before parsing HTML with a regex, if only for amusement's sake. XPath is the answer: get the text of the td instead of continuing to parse it. So you'll just search for something like //td and take the results of that completely (instead of continuing the tree building so that you have leaves that say strong or whatever).
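One way to read "take the results of that completely" is to ask XPath for the string-value of each td, which is the concatenated text of all its descendants. The sketch below assumes DOMDocument and DOMXPath, and uses normalize-space(.) to also trim and collapse whitespace; the sample markup is hypothetical.

```php
<?php
// Hypothetical sample markup with a mixed-content cell and stray whitespace.
$html = '<table>
    <tr><td>Thing 1</td><td><strong>Thing 5</strong></td></tr>
    <tr><td>  Thing <b>7</b>   and <i>8</i> </td></tr>
</table>';

// Collect parser warnings internally instead of emitting them,
// since scraped HTML is rarely perfectly valid.
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$texts = [];
foreach ($xpath->query('//td') as $td) {
    // evaluate() returns a typed result; for normalize-space() that is a string.
    // The second argument makes $td the context node, so "." refers to it.
    $texts[] = $xpath->evaluate('normalize-space(.)', $td);
}

print_r($texts);
```

If the original whitespace matters, string(.) can be used in place of normalize-space(.).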

If you're using DOMDocument, once you've selected a DOMNode, the property textContent should contain only the text of the node and all its children... exactly what you asked for.

$table = '<table>
        <tr><td>Thing 1</td><td>Thing 2</td></tr>
        <tr><td>Thing 3</td><td>Thing 4</td></tr>
        <tr><td><strong>Thing 5</strong></td><td><strong>Thing 6</strong></td></tr>
    </table>';

$dom = new DOMDocument;
$dom->loadHTML($table);
$xpath = new DOMXPath($dom);

$els = $xpath->query('//td');
echo $els->item(4)->textContent; //Thing 5

Alternatively, depending on the type of node, you can check nodeValue as well. I can't recall exactly the difference, but textContent is what you want.
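On the nodeValue question, a small check may help. As far as the PHP manual describes it, nodeValue on a DOMElement returns the same string as textContent (unlike the W3C DOM, where it would be null), so for element nodes like td the two should be interchangeable. The snippet below is only a sketch to verify that on your own PHP version.

```php
<?php
$dom = new DOMDocument;
$dom->loadHTML('<table><tr><td><strong>Thing 5</strong></td></tr></table>');

// The td element wraps a strong element, which wraps the text node.
$td = $dom->getElementsByTagName('td')->item(0);

// Compare the two properties on the element node.
$same = ($td->textContent === $td->nodeValue);

// For a text node, nodeValue is simply the text itself.
$textNode = $td->firstChild->firstChild;

var_dump($same);
var_dump($textNode->nodeValue);
```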
