PHP DOMDocument nodeValue dumps literal UTF-8 characters instead of encoded
I am experiencing an issue similar to this question:
nodeValue from DomDocument returning weird characters in PHP
The root cause, as far as I have found, can be reproduced with mb_convert_encoding().
In my unit tests, this finally caught the issue:
$test = mb_convert_encoding('é', "UTF-8");
$this->assertTrue(mb_check_encoding($test,'UTF-8'),'data is UTF-8');
$this->assertTrue($this->rw->checkEncoding($test,'UTF-8'),'data is UTF-8');
$this->assertIdentical($test,html_entity_decode('&eacute;',ENT_QUOTES,'UTF-8'),'values match');
The raw UTF-8 bytes appear to come through unchanged, and the default code page of the system PHP is running on is most likely not UTF-8.
All the way up until parsing (with an HTML5lib implementation that dumps to DOMDocument) the strings stay clean and UTF-8 friendly. Only at the point of pulling data using $span->nodeValue do I see a failure in encoding stability.
My guess is that the htmlentities catch for the domdocument export to nodeValue uses an encoding converter, but disregards the inline encoding value.
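For reference, DOMDocument works in UTF-8 internally, so nodeValue hands back the decoded text as UTF-8 bytes rather than as HTML entities. A minimal sketch, assuming the dom extension is loaded and that entity-escaped output is the goal; the variable names are illustrative and the node is built by hand here, not by HTML5lib:

```php
<?php
// "\xC3\xA9" is the UTF-8 byte sequence for é, written as an escape so the
// example does not depend on the encoding this file is saved in.
$doc  = new DOMDocument('1.0', 'UTF-8');
$span = $doc->createElement('span');
$span->appendChild($doc->createTextNode("\xC3\xA9"));
$doc->appendChild($span);

// nodeValue returns the text content as UTF-8, not as an entity.
$raw = $span->nodeValue;

// Escaping is a separate, explicit step. The encoding is named so PHP does
// not fall back to its configured default.
$escaped = htmlentities($raw, ENT_QUOTES, 'UTF-8');

echo bin2hex($raw), "\n"; // c3a9
echo $escaped, "\n";      // &eacute;
```

If the value has to leave PHP in a non-UTF-8 page, converting or escaping at that output boundary is usually easier to reason about than converting right after nodeValue.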
Given that my issue is with HTML5, I figured it would be directly related to the newness of the implementation, but it appears to be a broader issue. I haven't been able to find any information on this issue specific to DOMDocument via searches, other than the question mentioned at the beginning.
UPDATE
In the name of moving forward, I have switched from HTML5lib and DOMDocument to Simple HTML DOM, which exports cleanly escaped HTML that I can then parse back into the correct UTF-8 entities.
Also, one function I did not try was utf8_decode, so that may be a solution for anyone else experiencing this issue. It solved a related issue I was experiencing with AJAX/PHP; the solution came from this 2009 blog post: Overcoming AJaX UTF-8 Encoding Limitation (in PHP)
I just used utf8_decode on a nodeValue and it did partly work; I had the same problem with special characters not displaying correctly.
However, some characters still remain problematic, such as the simple quote ' and a few others (œ for example).
So using $element->nodeValue on its own will not work, but utf8_decode($element->nodeValue) will, PARTLY.
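The partial result is consistent with utf8_decode targeting ISO-8859-1, which has no code point for œ or for typographic quotes. A small sketch, assuming the mbstring extension is available and that the problematic quote was a curly one (U+2019), which is a guess; Windows-1252 appears only as an illustrative target that does contain those characters:

```php
<?php
$oe    = "\xC5\x93";     // œ (U+0153) in UTF-8
$quote = "\xE2\x80\x99"; // right single quotation mark (U+2019) in UTF-8

// ISO-8859-1 cannot represent œ, so the character is replaced during the
// conversion and cannot be recovered afterwards.
$latin1    = mb_convert_encoding($oe, 'ISO-8859-1', 'UTF-8');
$roundTrip = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
var_dump($roundTrip === $oe); // bool(false)

// Windows-1252 does have both characters.
$oeCp1252    = mb_convert_encoding($oe, 'Windows-1252', 'UTF-8');
$quoteCp1252 = mb_convert_encoding($quote, 'Windows-1252', 'UTF-8');
echo bin2hex($oeCp1252), "\n";    // 9c
echo bin2hex($quoteCp1252), "\n"; // 92
```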
The functions utf8_decode and utf8_encode are not very well named. They literally convert from utf-8 to iso-8859-1 and from iso-8859-1 to utf-8 respectively.
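The same two conversions can be written with both encodings spelled out, which avoids the misleading names. A sketch assuming mbstring is available; utf8_encode and utf8_decode are deprecated as of PHP 8.2, so the explicit form also holds up better on newer versions:

```php
<?php
$utf8   = "\xC3\xA9"; // é in UTF-8 (two bytes)
$latin1 = "\xE9";     // é in ISO-8859-1 (one byte)

// Equivalent of utf8_decode(): UTF-8 -> ISO-8859-1
$decoded = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// Equivalent of utf8_encode(): ISO-8859-1 -> UTF-8
$encoded = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

echo bin2hex($decoded), "\n"; // e9
echo bin2hex($encoded), "\n"; // c3a9
```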
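mb_convert_encoding, when called with just utf-8 as its argument, will normally behave like utf8_encode. (Normally meaning unless you changed the internal encoding, which you probably, hopefully, didn't. Note that since PHP 5.6 the default internal encoding is UTF-8 rather than iso-8859-1, so on current versions it is safer to pass the source encoding explicitly.)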
Most of PHP's functions expect strings to be iso-8859-1 encoded. However, libxml (which is the underlying library of PHP's XML parsing libraries) expects strings to be utf-8. As such, you can easily end up with mangled encodings if you aren't cautious.
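One common way to end up with mangled output is converting a string that is already UTF-8 a second time. A sketch of that failure, assuming mbstring is available; the variable names are illustrative:

```php
<?php
$utf8 = "\xC3\xA9"; // é, already valid UTF-8

// Treating the UTF-8 bytes as ISO-8859-1 and converting again turns each
// byte into its own character (the familiar "Ã©").
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
echo bin2hex($double), "\n"; // c383c2a9

// The result is still valid UTF-8, so mb_check_encoding() cannot detect it.
var_dump(mb_check_encoding($double, 'UTF-8')); // bool(true)

// Reversing the extra step restores the original bytes.
$fixed = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
var_dump($fixed === $utf8); // bool(true)
```

This is why a passing mb_check_encoding() test does not prove that a string went through the pipeline untouched.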
As for your test, the first line may be deceptive. Since you have a literal é in your script, the test would change depending on which encoding you have saved the file in. Check your text editor for that.
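To take the file encoding out of the picture, the bytes can be written as escape sequences. A sketch using plain var_dump calls instead of the test framework from the question, and assuming mbstring is available:

```php
<?php
$utf8   = "\xC3\xA9"; // é as UTF-8
$latin1 = "\xE9";     // é as ISO-8859-1

var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)

// Decoding the entity with an explicit charset yields the UTF-8 bytes.
$decoded = html_entity_decode('&eacute;', ENT_QUOTES, 'UTF-8');
var_dump($decoded === $utf8); // bool(true)
```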
Hope that clarifies a bit.