PHP DOMDocument - get html source of BODY

2022-12-21 04:41

I'm using PHP's DOMDocument to parse and normalize user-submitted HTML: the loadHTML method parses the content, and saveHTML returns a well-formed result:

$dom= new DOMDocument();
$dom->loadHTML('<div><p>Hello World');
$well_formed= $dom->saveHTML(); 
echo($well_formed);

This does a beautiful job of parsing the fragment and adding the appropriate closing tags. The problem is that I'm also getting a bunch of tags I don't want, such as <!DOCTYPE>, <html>, <head> and <body>. I understand that every well-formed HTML document needs these tags, but the HTML fragment I'm normalizing is going to be inserted into an existing valid document.

The quick solution to your problem is to use an XPath expression to grab the body.

$dom= new DOMDocument();
$dom->loadHTML('<div><p>Hello World');      
$xpath = new DOMXPath($dom);
$body = $xpath->query('/html/body');
echo($dom->saveXml($body->item(0)));

A word of warning here. loadHTML will sometimes emit a warning when it encounters certain poorly formed HTML documents. If you're parsing that kind of HTML document, you'll need to find a better HTML parser [self link warning].
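If the warnings themselves are the main nuisance, one option is to have libxml collect them instead of emitting PHP warnings. The following is a minimal sketch using the standard libxml_use_internal_errors, libxml_get_errors and libxml_clear_errors functions; the sample input with a stray </b> is only an assumption for illustration, and which errors get reported depends on the libxml2 version in use.

```php
<?php
// Collect libxml parse errors internally instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
// Illustrative input: the stray </b> may be reported as a parse error.
$dom->loadHTML('<div><p>Hello World</b>');

// Fetch whatever libxml recorded, then restore the previous setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}
```

The document is still parsed either way; this only changes how the problems are reported.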

In your case, you do not want to work with an HTML document but with an HTML fragment, a portion of HTML code, which means DOMDocument is not quite what you need.

Instead, I would rather use something like HTMLPurifier (quoting):

HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications.

And, if you try your portion of code:

<div><p>Hello World

using the demo page of HTMLPurifier, you get this clean HTML as output:

<div><p>Hello World</p></div>

Much better, isn't it? ;-)

(Note that HTMLPurifier supports a wide range of options, and that taking a look at its documentation might not hurt.)

Faced with the same problem, I've created a wrapper around DOMDocument called SmartDOMDocument to overcome this and some other shortcomings (such as encoding problems).

You can find it here: http://beerpla.net/projects/smartdomdocument

This was taken from another post and worked perfectly for my use:

$layout = preg_replace('~<(?:!DOCTYPE|/?(?:html|head|body))[^>]*>\s*~i', '', $layout);
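One caveat with this pattern: because "head" is followed by [^>]*, it also matches tags whose names merely start with html, head or body, such as <header>. The sketch below compares the original pattern with an assumed variant that adds a \b word boundary; the variable names $loose and $strict and the sample $layout string are illustrative, not from the original post.

```php
<?php
// Original pattern from the answer above.
$loose = '~<(?:!DOCTYPE|/?(?:html|head|body))[^>]*>\s*~i';
// Assumed variant: \b stops "head" from matching the start of "header".
$strict = '~<(?:!DOCTYPE|/?(?:html|head|body)\b)[^>]*>\s*~i';

// Illustrative input containing an HTML5 <header> element.
$layout = '<!DOCTYPE html><html><body><header>Hi</header></body></html>';

$looseResult  = preg_replace($loose, '', $layout);
$strictResult = preg_replace($strict, '', $layout);

echo $looseResult, "\n";   // Hi
echo $strictResult, "\n";  // <header>Hi</header>
```

Either way, a regular expression only strips the wrapper tags as text; it does not inspect the parsed tree.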

TL;DR: $dom->saveHTML($dom->documentElement->lastChild);
Here $dom->documentElement->lastChild is the body node, but it could be any other DOMNode in the document.

Actually, the DOMDocument::saveHTML method itself is capable of doing what you want. It takes a DOMNode object as its first argument and outputs only that subset of the document.

$dom = new DOMDocument();
$dom->loadHTML('<div><p>Hello World');
$well_formed= $dom->saveHTML($dom->documentElement->lastChild); 
echo($well_formed);

There are several ways of retrieving the body node. Here are two:

$bodyNode = $dom->documentElement->lastChild;
$bodyNode = $dom->getElementsByTagName('body')->item(0);
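Note that passing the body node to saveHTML still includes the <body> wrapper itself in the output. To get only what is inside it, one approach is to serialize each child of the body node in turn. A minimal sketch follows; normalizeFragment is a hypothetical helper name, not part of DOMDocument.

```php
<?php
// Hypothetical helper: returns the normalized markup found inside <body>.
function normalizeFragment(string $html): string
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // loadHTML wraps the fragment in <html><body>; look up the body node.
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize each child of <body> so the wrapper tags are left out.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo normalizeFragment('<div><p>Hello World');
// <div><p>Hello World</p></div>
```

Iterating over childNodes also covers fragments with several top-level elements, since each child is serialized and concatenated.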

From the PHP Manual

public DOMDocument::saveHTML(?DOMNode $node = null): string|false
Parameters
node
Optional parameter to output a subset of the document.

https://www.php.net/manual/en/domdocument.savehtml.php
