Leave entity intact in XML + XSLT
I transform XML to (sort of) HTML with XSL stylesheets (using Apache Xalan). The XML can contain entity references such as &mdash;, which must be left as is. At the beginning of the XML file I have a DOCTYPE that declares these entities. What should I do so that the entity is left unchanged?
<!DOCTYPE article [
<!ENTITY mdash "&mdash;"><!-- em dash -->
]>
This gives me a SAXParseException: Recursive entity expansion, 'mdash' when encountering &mdash; in the XML text.
The way to define and use the entity is:
<!DOCTYPE xsl:stylesheet [<!ENTITY mdash "&#x2014;">]>
<t>Hello &mdash; World!</t>
When processed with the simplest possible XSLT stylesheet:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
</xsl:stylesheet>
The correct output (containing the em dash character) is produced:
Hello — World!
Important:
In XSLT 2.0 it is possible to use the <xsl:character-map> declaration so that specified characters are serialized as entity references. In this particular case:
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>
<xsl:output omit-xml-declaration="yes"
use-character-maps="mdash"/>
<xsl:character-map name="mdash">
<xsl:output-character character="&#x2014;" string="&amp;mdash;" />
</xsl:character-map>
</xsl:stylesheet>
When the above transformation is applied to the same XML document (already shown above), the output is:
Hello &mdash; World!
What should I do for entity to be left unchanged?
You can't. Entity references are by necessity resolved to their content by the XML parser before processing with XSLT occurs, because they may contain elements and other content that XPath needs to match. Messing with the DOCTYPE will have no effect.
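One way to check this for yourself is the small stylesheet sketched below. It assumes a source document whose root element is named t, like the sample in the other answer, and simply asks XPath whether the text contains the character U+2014. It should print true whether the source spelled the dash as &mdash; (with the entity declared in the DOCTYPE), as &#8212;, or as the literal character, because the stylesheet only ever sees the parsed character.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- "t" is the sample root element; adjust the match for a real document -->
  <xsl:template match="t">
    <!-- By the time XPath runs, the entity reference is already the character U+2014 -->
    <xsl:value-of select="contains(., '&#x2014;')"/>
  </xsl:template>
</xsl:stylesheet>
```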
However, if you set <xsl:output encoding="us-ascii">, the result document should be serialised in the ASCII character set, and so the em dash would have to be encoded as &#8212;.
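A minimal sketch of that approach is an identity transform with ASCII output. Nothing here is specific to the asker's stylesheet; the identity template is only an assumption made to keep the example self-contained.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- ASCII output: characters outside ASCII must be written as character references -->
  <xsl:output method="xml" encoding="us-ascii"/>

  <!-- Identity template: copy every node and attribute unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

Applied to a document containing &mdash; (with the entity declared in its DOCTYPE), the result should carry a numeric character reference such as &#8212; in place of the dash, not the named entity. Whether the serializer writes the decimal or the hexadecimal form may vary between processors.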
XSLT 2.0 offers “character maps”, which would allow you to specify that every — character has to be encoded as &mdash; or any other sequence, but it couldn't distinguish between a — that was originally &mdash; in the source and one that was entered as the literal character or as &#8212;. If you don't have XSLT 2.0, you could always try a simple string-replace hack on the output document to replace &#8212; with &mdash;. This is dodgy, but OK as long as you know &#8212; will only ever be used in text and attribute value content.
The stipulation “must be left as is” is usually pretty doubtful. It is a poor HTML parser indeed that cannot accept plain Unicode characters; in the worst case, where the encoding information gets lost, it should at least be able to cope with the numeric character reference.