XSLT Transform with Character 8221

I'm transforming an XML document using javax.xml.transform.Transformer and XSLT. The document contains the characters “ and ” (Java integer codes 8220 and 8221). These are not the normal quotation marks.

When I transform the document, these characters are transformed into “ and ”. My struggle now is how to convert them back into something that people can read. I tried reading the document with DOMReader and SAXReader using the encodings utf-8, utf-16, ascii, etc. No luck.

Your help is very much appreciated. Max.


These are the Unicode characters U+201C and U+201D. Are you transforming to HTML? If so, and your XSLT specifies HTML output, I'd expect it to output &ldquo; and &rdquo;, as they're character entity references: http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references Quote from the XSLT spec:

"The html output method may output a character using a character entity reference, if one is defined for it in the version of HTML that the output method is using."

http://www.w3.org/TR/xslt#section-HTML-Output-Method
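To make the numbers concrete, here is a minimal Java sketch showing that 8220 and 8221 are the code points U+201C and U+201D, and what bytes they become in UTF-8. The class and method names are made up for illustration; they are not from the question.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper; class and method names are hypothetical.
final class QuoteCodePoints {

    // Formats a code point in the usual U+XXXX notation.
    static String toUPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    // Returns the UTF-8 bytes of the text as space-separated hex pairs.
    static String utf8Hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Build the two quotation marks from the integer codes in the question.
        String left = new String(Character.toChars(8220));
        String right = new String(Character.toChars(8221));
        System.out.println(toUPlus(8220) + " -> " + utf8Hex(left));
        System.out.println(toUPlus(8221) + " -> " + utf8Hex(right));
    }
}
```

Each character takes three bytes in UTF-8, so a reader that decodes those bytes with a single-byte charset will show three junk characters per quotation mark.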


This input:

<p> “ and ” </p>

With this stylesheet (just identity rule):

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="xml" encoding="utf-8" omit-xml-declaration="yes"/>
    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()" />
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>

Output:

<p> “ and ” </p>

Only Xalan, with the html serialization method, outputs:

<p> &ldquo; and &rdquo; </p>

So, if you want proper rendering, you need to output a proper HTML document...

This stylesheet:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="html" encoding="utf-8"/>
    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()" />
        </xsl:copy>
    </xsl:template>
    <xsl:template match="/">
        <html>
            <head>
                <title>Test</title>
            </head>
            <body>
                <xsl:apply-templates/>
            </body>
        </html>
    </xsl:template>
</xsl:stylesheet>

Output:

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
        <title>Test</title>
    </head>
    <body>
        <p> “ and ” </p>
    </body>
</html>

Note the proper charset encoding declaration in the output.
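For the Java side, the following is a rough sketch of running the identity stylesheet through javax.xml.transform and then decoding the result. It assumes the JDK's built-in XSLT processor; the class name, method name, and inline strings are illustrative only. The point it tries to show is that the serialized bytes must be decoded with the same encoding the stylesheet declares (utf-8 here); decoding them with a single-byte charset is one common way to end up with unreadable quotation marks.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Illustrative demo; class and method names are hypothetical.
final class IdentityTransformDemo {

    // The identity stylesheet from the answer, with method="xml" and encoding="utf-8".
    static final String IDENTITY_XSLT =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"xml\" encoding=\"utf-8\" omit-xml-declaration=\"yes\"/>"
        + "<xsl:template match=\"@* | node()\">"
        + "<xsl:copy><xsl:apply-templates select=\"@* | node()\"/></xsl:copy>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Runs the stylesheet over the input and returns the serialized bytes.
    static byte[] transformToBytes(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            // Writing to a byte stream lets the serializer apply the declared encoding.
            transformer.transform(new StreamSource(new StringReader(xml)),
                                  new StreamResult(out));
            return out.toByteArray();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = transformToBytes("<p> \u201C and \u201D </p>", IDENTITY_XSLT);
        // Decoded with the encoding the stylesheet declared.
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
        // Decoded with the wrong charset: each quotation mark becomes three junk characters.
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1));
    }
}
```

What the console actually displays also depends on the console's own encoding, so comparing the strings in code or inspecting the bytes is more reliable than eyeballing the printed output.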


You need to understand that an XSL transformation is applied not to the XML document per se but to a tree representation of the document. Text nodes hold the same Unicode characters regardless of how those characters were written in the input document (literally, in whatever encoding, or as character references); once the tree is built, they are the same. During the transformation you just create another tree, which is then serialized.
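The tree view described above can be checked with a small Java sketch. It uses the JDK's DOM parser rather than an XSLT processor, and the class and method names are made up for illustration. Two inputs that write the quotation marks differently, once literally and once as numeric character references, should produce identical text nodes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Illustrative demo; class and method names are hypothetical.
final class TreeViewDemo {

    // Parses UTF-8 encoded XML and returns the text content of the root element.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
    }

    public static void main(String[] args) {
        // Same characters, written literally and as numeric character references.
        String literal = rootText("<p>\u201C and \u201D</p>");
        String references = rootText("<p>&#8220; and &#x201D;</p>");
        System.out.println(literal.equals(references));
        System.out.println((int) literal.charAt(0));
    }
}
```

How the characters come out again is decided only at serialization time, by the output method and encoding.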

Some characters, like the ones you mentioned, are treated differently depending on the destination format you choose: the serializer may write them literally or as character or entity references, depending on the output method and the output encoding. This is why the first answer gives you a workaround.

Escaping can also be controlled per instruction with the "disable-output-escaping" attribute of xsl:value-of and xsl:text (XSLT 1.0). Its default is "no" for every output method, and processors are not required to support it.

So in order to fix your issue without changing the whole serialization method you could write something like this when you're copying some value which might contain "special" characters:

<xsl:value-of select="/my/node/text()" disable-output-escaping="yes"/>

P.S. In XSLT 2.0, the preferred way to do this kind of thing is the xsl:character-map declaration.
