UTF-8 encoding problem with XSLT via PHP

2023-01-15 01:53 问答作者：

I'm facing a nasty encoding issue when transforming XML via XSLT through PHP.

The pro开发者_如何学Goblem can be summarised/dumbed down as follows: when I copy a (UTF-8 encoded) XHTML file with an XSLT stylesheet, some characters are displayed wrong. When I just show the same XHTML file, all characters come out correctly.

Following files illustrate the problem:

XHTML

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
        <title>encoding test</title>
    </head>
    <body>
        <p>This is how we d&#239;&#223;&#960;&#955;&#509; &#145;special characters&#146;</p>
    </body>
</html>

XSLT

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">

    <xsl:output method="xml" encoding="UTF-8"/>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

PHP

<?php
  $xml_file = 'encoding_test.xml';
  $xsl_file = 'encoding_test.xsl';

  $xml_doc = new DOMDocument('1.0', 'utf-8');
  $xml_doc->load($xml_file);

  $xsl_doc = new DOMDocument('1.0', 'utf-8');
  $xsl_doc->load($xsl_file);

  $xp = new XsltProcessor();
  $xp->importStylesheet($xsl_doc);

  // alllow to bypass XSLT transformation with bypass=true request parameter
  if ($bypass = $_GET['bypass']) {
    echo file_get_contents($xml_file);
  }
  else {
    echo $xp->transformToXML($xml_doc);
  }
?>

When this script is invoked as such (via e.g. http://localhost/encoding_test/encoding_test.php), all characters in the transformed XHTML document come out ok, except for the  and  character entities (they're opening and closing single quotation marks). I'm not a Unicode expert, but two things strike me:

all other character entities are interpreted correctly (which could imply something about the UTF-8-ness of  and )
yet, when the XHTML file is displayed unmediated (via e.g. http://localhost/encoding_test/encoding_test.php?bypass=true), all characters are displayed properly.

I think I've declared UTF-8 encoding for the output anywhere I could. Do others perhaps see what's wrong and can be righted?

Thanks in advance!

Ron Van den Branden

 and  are no visible Unicode characters.

They are old HTML character references¹ for single quotes, but when you process them using an XSLT processor the processor doesn't see single quotes but the Unicode characters with decimal codes 145 and 146, i.e. U+0090 and U+0091.

These characters are private use (i.e. the usage is not defined by the Unicode consortium) C1 control codes.

The solution is to use the correct Unicode characters ‘ and ’.

¹_{In fact, these are codes that map to Windows-1252 encoding. They are displayed by browsers but they are actually not valid in HTML:}

NOTE -- the above SGML declaration, like that of HTML 2.0, specifies the character numbers 128 to 159 (80 to 9F hex) as UNUSED. This means that numeric character references within that range (e.g. ’) are illegal in HTML. Neither ISO 8859-1 nor ISO 10646 contain characters in that range, which is reserved for control characters.

继续阅读：encoding php utf-8 xslt

UTF-8 encoding problem with XSLT via PHP

XHTML

XSLT

PHP

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

Best solution for private video database [closed]

imessage会显示已读吗？

XHTML

XSLT

PHP

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生 新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

Best solution for private video database [closed]

imessage会显示已读吗？

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？