UTF-8 encoding problem with XSLT via PHP

I'm facing a nasty encoding issue when transforming XML via XSLT through PHP.

The problem can be summarised/dumbed down as follows: when I copy a (UTF-8 encoded) XHTML file with an XSLT stylesheet, some characters are displayed wrong. When I just show the same XHTML file, all characters come out correctly.

The following files illustrate the problem:

XHTML

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
        <title>encoding test</title>
    </head>
    <body>
        <p>This is how we d&#239;&#223;&#960;&#955;&#509; &#145;special characters&#146;</p>
    </body>
</html>

XSLT

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">

    <xsl:output method="xml" encoding="UTF-8"/>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

PHP

<?php
  $xml_file = 'encoding_test.xml';
  $xsl_file = 'encoding_test.xsl';

  $xml_doc = new DOMDocument('1.0', 'utf-8');
  $xml_doc->load($xml_file);

  $xsl_doc = new DOMDocument('1.0', 'utf-8');
  $xsl_doc->load($xsl_file);

  $xp = new XsltProcessor();
  $xp->importStylesheet($xsl_doc);

  // allow bypassing the XSLT transformation with the bypass=true request parameter
  // (!empty() avoids an "undefined index" notice when the parameter is absent)
  if (!empty($_GET['bypass'])) {
    echo file_get_contents($xml_file);
  }
  else {
    echo $xp->transformToXML($xml_doc);
  }
?>

When this script is invoked as such (via e.g. http://localhost/encoding_test/encoding_test.php), all characters in the transformed XHTML document come out ok, except for the &#145; and &#146; character entities (they're opening and closing single quotation marks). I'm not a Unicode expert, but two things strike me:

  1. all other character entities are interpreted correctly (which could imply something about the UTF-8-ness of &#145; and &#146;)
  2. yet, when the XHTML file is displayed unmediated (via e.g. http://localhost/encoding_test/encoding_test.php?bypass=true), all characters are displayed properly.

I think I've declared UTF-8 encoding for the output everywhere I could. Can anyone see what's wrong and how it can be fixed?

Thanks in advance!

Ron Van den Branden


&#145; and &#146; are not visible Unicode characters.

They are old HTML character references [1] for single quotes, but when you process them with an XSLT processor, the processor doesn't see single quotes but the Unicode characters with decimal codes 145 and 146, i.e. U+0091 and U+0092.

These characters are C1 control codes designated for private use (PU1 and PU2), i.e. their usage is not defined by the Unicode Consortium.
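
To see what the parser actually hands to the XSLT processor, here is a minimal sketch. It assumes PHP 7 or later with the DOM extension; the one-element test document and the variable names are made up for illustration. It parses the two references and dumps the resulting UTF-8 bytes next to those of the intended quotation marks.

```php
<?php
// Parse a tiny (hypothetical) document containing the two references
// from the question.
$doc = new DOMDocument();
$doc->loadXML('<p>&#145;&#146;</p>');

// DOM strings are UTF-8; dump the bytes the parser produced.
$seen = bin2hex($doc->documentElement->textContent);

// For comparison: the UTF-8 bytes of the intended quotation marks.
$wanted = bin2hex("\u{2018}\u{2019}");

echo "parsed: $seen\n";   // c291c292     = U+0091 U+0092 (C1 controls)
echo "wanted: $wanted\n"; // e28098e28099 = U+2018 U+2019
```

The transformation itself is therefore faithful: it copies the two control characters it was given. The bypass path looks fine only because the browser, not an XML parser, interprets the references there.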

The solution is to use the correct Unicode characters &#x2018; and &#x2019;.
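
If the source files cannot be corrected by hand, the references could also be rewritten before the document is loaded. The following is only a sketch under assumptions: `fixC1References` is a hypothetical helper, not part of PHP or libxml, and its table covers only the byte values that Windows-1252 defines in the 128 to 159 range. Because it works on the raw text, it would also touch matching strings inside comments or CDATA sections.

```php
<?php
// Hypothetical helper: rewrite numeric character references in the C1 range
// (128-159) to the code points Windows-1252 assigns to the same byte values.
function fixC1References(string $xml): string
{
    // Windows-1252 byte => Unicode code point (undefined bytes are omitted).
    $cp1252 = [
        0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
        0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
        0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
        0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
        0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
        0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
        0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
    ];

    return preg_replace_callback(
        // Group 1: hexadecimal digits, group 2: decimal digits.
        '/&#(?:x([0-9a-fA-F]{1,6})|([0-9]{1,7}));/',
        function (array $m) use ($cp1252): string {
            $codePoint = $m[1] !== '' ? hexdec($m[1]) : (int) $m[2];
            // Leave every reference outside the table untouched.
            return isset($cp1252[$codePoint])
                ? sprintf('&#x%04X;', $cp1252[$codePoint])
                : $m[0];
        },
        $xml
    );
}

echo fixC1References('d&#239;&#223; &#145;special characters&#146;'), "\n";
// d&#239;&#223; &#x2018;special characters&#x2019;
```

In the script from the question, this would mean replacing `$xml_doc->load($xml_file)` with `$xml_doc->loadXML(fixC1References(file_get_contents($xml_file)))`.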

[1] In fact, these are codes that map to the Windows-1252 encoding. They are displayed by browsers, but they are actually not valid in HTML:

NOTE -- the above SGML declaration, like that of HTML 2.0, specifies the character numbers 128 to 159 (80 to 9F hex) as UNUSED. This means that numeric character references within that range (e.g. &#146;) are illegal in HTML. Neither ISO 8859-1 nor ISO 10646 contain characters in that range, which is reserved for control characters.
