DOMDocument and HTML entities

2023-03-31 20:03

I'm trying to parse some HTML that includes HTML entities, such as &#215; (the multiplication sign ×):

$str = '<a href="http://example.com/"> A &#215; B</a>';

$dom = new DOMDocument();
$dom->substituteEntities = false;
$dom->loadHTML($str);

$link = $dom->getElementsByTagName('a')->item(0);
$fullname = $link->nodeValue;
$href = $link->getAttribute('href');

echo "
fullname: $fullname \n
href: $href\n";

but DOMDocument turns the text into A Ã— B.

Is there some way to keep it from treating the & as the start of an HTML entity, so that it just leaves the text alone? I tried setting substituteEntities to false, but it doesn't do anything.

From the docs:

The DOM extension uses UTF-8 encoding.
Use utf8_encode() and utf8_decode() to work with texts in ISO-8859-1 encoding or Iconv for other encodings.

Assuming you're using Latin-1 (ISO-8859-1), try:

<?php
header('Content-Type: text/html; charset=iso-8859-1');

// Convert the ISO-8859-1 input to UTF-8, which the DOM extension expects.
$str = utf8_encode('<a href="http://example.com/"> A &#215; B</a>');

$dom = new DOMDocument();
$dom->substituteEntities = false;
$dom->loadHTML($str);

$link = $dom->getElementsByTagName('a')->item(0);
// Convert the UTF-8 result back to ISO-8859-1 to match the declared charset.
$fullname = utf8_decode($link->nodeValue);
$href = $link->getAttribute('href');

echo "
fullname: $fullname \n
href: $href\n";
?>

This is not a direct answer to the question, but you could use UTF-8 instead, which lets you store glyphs like ÷ or × directly. Using UTF-8 with PHP DOM, on the other hand, needs a little hack.
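The answer does not say which hack it means. One widely used workaround, shown here only as a sketch and not necessarily the one the author had in mind, is to prepend an XML encoding declaration so that loadHTML() reads the input as UTF-8. The sample string and variable names are illustrative.

```php
<?php
// Assumption: the input is UTF-8. "\xC3\x97" is the UTF-8 byte sequence for ×,
// written as escapes so this file's own encoding does not matter.
$html = "<a href=\"http://example.com/\"> A \xC3\x97 B</a>";

$dom = new DOMDocument();
// Without a charset hint, loadHTML() does not assume UTF-8 input.
// The prepended declaration tells the parser how to decode the bytes.
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);

$link = $dom->getElementsByTagName('a')->item(0);
$fullname = $link->nodeValue; // returned as UTF-8

echo $fullname, "\n";
```

An alternative with the same intent is to include a meta charset tag in the markup before the content.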

Also, if you are trying to display mathematical formulas (as A × B suggests) have a look at MathML.

Are you sure the & is being substituted with &amp;? If that were the case, you'd see the exact entity as text, not the garbled output you're getting.

My guess is that the entity is converted to the actual character, encoded as UTF-8, and that you're viewing the page with a Latin-1 charset. The two UTF-8 bytes are then displayed as two separate characters, hence the garbled output.
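To check this guess, a small sketch can dump the bytes that nodeValue returns and then deliberately misread them as a single-byte encoding. It assumes the iconv extension is available and uses CP1252, which is how browsers usually treat a page declared as ISO-8859-1.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<a href="http://example.com/"> A &#215; B</a>');
$text = $dom->getElementsByTagName('a')->item(0)->nodeValue;

// The entity has become the character × encoded as UTF-8 (bytes c3 97).
echo bin2hex($text), "\n"; // 204120c3972042

// Misread each UTF-8 byte as one CP1252 character, as a Latin-1 page would.
$garbled = iconv('CP1252', 'UTF-8', $text);
echo $garbled, "\n"; // byte c3 shows as Ã, followed by a dash for byte 97
```

If the hex dump contains c397, the DOM side is fine and only the declared output charset needs to change.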

If I render your example, my output is:

fullname:  A × B 

href: http://example.com/

When viewing this in latin1/iso-8859-1, I see the output you're describing. But when I set the charset to UTF-8, the output is fine.
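A minimal sketch of that fix, assuming the script's output goes to a browser: declare UTF-8 in the Content-Type header and leave the string returned by DOMDocument untouched. The htmlspecialchars() calls are an addition for safe output and are not part of the original example.

```php
<?php
// Declare UTF-8 so the browser decodes the bytes from DOMDocument correctly.
header('Content-Type: text/html; charset=utf-8');

$dom = new DOMDocument();
$dom->loadHTML('<a href="http://example.com/"> A &#215; B</a>');

$link = $dom->getElementsByTagName('a')->item(0);
$fullname = $link->nodeValue;          // UTF-8 text containing ×
$href = $link->getAttribute('href');

// Escape for HTML output; the explicit UTF-8 argument keeps × intact.
echo 'fullname: ', htmlspecialchars($fullname, ENT_QUOTES, 'UTF-8'), "<br>\n";
echo 'href: ', htmlspecialchars($href, ENT_QUOTES, 'UTF-8'), "\n";
```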

I fixed my problem with broken entities by converting the input from plain UTF-8 to UTF-8 with a BOM.
