
Unicode to Windows-1251 Conversion with XML(HTML)-escaping

I have an XML file and need to produce an HTML file in Windows-1251 encoding by applying an XSL transformation. The problem is that Unicode characters from the XSL file are not converted to HTML numeric character references like "&#1171;" during the XSL transformation; only "?" signs are written instead of them. How can I ask the XslCompiledTransform.Transform method to do this conversion? Or is there any method to write an HTML string into a Windows-1251 HTML file while HTML-escaping all the Unicode characters the encoding cannot represent (something like Convert("ғ") returning "&#1171;")? Then I could perform the XSL transformation to a string and use that method to write the result to a file.

XmlReader xmlReader = XmlReader.Create(new StringReader("<Data><Name>The Wizard of Wishaw</Name></Data>"));

XslCompiledTransform xslTrans = new XslCompiledTransform();
xslTrans.Load("sheet.xsl");

using (XmlTextWriter xmlWriter = new XmlTextWriter("result.html", Encoding.GetEncoding("Windows-1251")))
{
    xslTrans.Transform(xmlReader, xmlWriter); // writes a Windows-1251 HTML file but does not escape Unicode characters; "?" signs are written instead
}

Thanks all for the help!

UPDATE

The output configuration tag in my XSL file:

<xsl:output method="xml" indent="yes" omit-xml-declaration="yes" />

By now I do not even hope that XSLT will satisfy my needs. But I am surprised that there is no method to check whether a character is acceptable to a specified encoding. Something like

Char.IsEncodable('ғ', Encoding.GetEncoding("Windows-1251"))
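
There is no such method in the framework, but it can be approximated with an exception fallback. A minimal sketch (EncodingProbe and IsEncodable are hypothetical names, not framework APIs):

using System.Text;

static class EncodingProbe
{
    // Returns true if the character can be encoded; an exception fallback
    // throws EncoderFallbackException for characters the target encoding
    // has no representation for.
    public static bool IsEncodable(char c, string encodingName)
    {
        Encoding enc = Encoding.GetEncoding(encodingName,
            new EncoderExceptionFallback(), new DecoderExceptionFallback());
        try
        {
            enc.GetBytes(new[] { c });
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }
}

// EncodingProbe.IsEncodable('ғ', "windows-1251") returns false;
// EncodingProbe.IsEncodable('ж', "windows-1251") returns true.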

My current solution is to convert all characters greater than 127 (c > 127) to &#dddd; escape strings, but my chief is not satisfied with the solution, because the source of the generated HTML file is not readable.


Do note that XML is both a data model and a serialization format. The data can use a different character set than the serialization of that data.

It looks like the key reason for your problem is that your serialization process is trying to limit the character set of the data model, whereas you would like to set the character set of the serialization format. Let's have an example: <band>Motörhead</band> and <band>Mot&#246;rhead</band> are equal XML documents. They have the same structure and exactly the same data. Because of the heavy metal umlaut, the character set of the data is Unicode (or something bigger than ASCII), but because of the use of the character reference &#246;, the character set of the latter serialization of the document is ASCII. In order to process this data, your XML tools still need to be Unicode aware in both cases, but when the latter serialization is used, the I/O and file transfer tools do not need to be.

My guess is that by telling the XmlTextWriter to use Windows-1251 encoding, you are in practice asking it to limit the character set of the data to the characters contained in Windows-1251, and it does so by discarding every character outside this character set and writing a ? character instead.
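
In .NET terms this is the encoder's replacement fallback; the same substitution can be observed directly with a standalone check (not part of the original post):

using System;
using System.Text;

// With a replacement fallback, every character that has no Windows-1251
// byte, such as 'ғ' (U+0493), is encoded as the replacement string "?".
Encoding win1251 = Encoding.GetEncoding("windows-1251",
    new EncoderReplacementFallback("?"), new DecoderReplacementFallback("?"));
byte[] bytes = win1251.GetBytes("ғ");
Console.WriteLine(win1251.GetString(bytes)); // prints "?"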

However, since you produce your XML document by an XSL transformation, you can control the character set of the serialization directly in your XSLT document. This is done by adding an encoding attribute to the xsl:output element. Modify it to look like this:

<xsl:output method="xml" indent="yes" omit-xml-declaration="yes" encoding="windows-1251"/>

Now the XSLT processor takes care of the serialization to the reduced character set and outputs a character reference for every character in the data that is not included in windows-1251.
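
One caveat (an assumption about XslCompiledTransform rather than part of this answer): xsl:output is honored when the processor controls the serialization. A sketch of the original code, adjusted so the writer is created from the stylesheet's own output settings:

using System.IO;
using System.Xml;
using System.Xml.Xsl;

XslCompiledTransform xslTrans = new XslCompiledTransform();
xslTrans.Load("sheet.xsl");

// OutputSettings mirrors the stylesheet's xsl:output element (including
// encoding="windows-1251"); a writer created from it emits numeric
// character references for characters the encoding cannot represent.
using (XmlReader xmlReader = XmlReader.Create(
    new StringReader("<Data><Name>The Wizard of Wishaw</Name></Data>")))
using (XmlWriter xmlWriter = XmlWriter.Create("result.html", xslTrans.OutputSettings))
{
    xslTrans.Transform(xmlReader, xmlWriter);
}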

If changing the character set of the data is really what you need, then you need to process your data with a suitable character conversion library that can guess the most suitable replacement characters (like ö -> o).
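
For accented Latin characters, Unicode decomposition approximates such a replacement without a dedicated library. A rough sketch (it leaves characters like 'ғ', which have no decomposition, untouched):

using System.Globalization;
using System.Linq;
using System.Text;

static class Transliterate
{
    // Decompose (ö -> o + combining diaeresis), drop the combining
    // marks, then recompose the surviving characters.
    public static string StripDiacritics(string text)
    {
        string decomposed = text.Normalize(NormalizationForm.FormD);
        char[] kept = decomposed
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray();
        return new string(kept).Normalize(NormalizationForm.FormC);
    }
}

// Transliterate.StripDiacritics("Motörhead") returns "Motorhead".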


Try complementing your XSL file with replacement rules à la

<xsl:value-of select="replace(.,'&#1171;','&amp;#1171;')"/>

You may wish to do this with a regex pattern instead (note that replace() is an XPath 2.0 function, so an XSLT 2.0 processor is required):

<xsl:value-of select="replace(.,'&#(\d+);','&amp;#$1;')"/>

Your problem originates with the XML parser, which substitutes the numeric entity references with the corresponding Unicode characters before the transformation takes place. Thus the unknown characters (or rather "?") end up in your converted document.

hope this helps,

best regards,

carsten


The correct solution would be to write the file in a Unicode encoding (such as UTF-8) and forget about CP-1251 and all other legacy encodings.

But I will assume that this is not an option for some reason.

The best alternative that I can devise is to do the character replacements in the string before handing it to the XmlReader. You can use the Encoding class to convert the string to an array of bytes in CP-1251, with your own encoder fallback mechanism. The fallback mechanism can then insert the XML escape sequences. This way you are guaranteed to handle all (and exactly those) characters that are not in CP-1251.

Then you can convert the array of bytes (in CP-1251) back into a normal .NET string (in UTF-16) and hand it to your XmlReader. The values that need to be escaped will already be escaped, so the final file should be written correctly.
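
A minimal sketch of such an encoder fallback, assuming the goal is to replace every unencodable character with a numeric character reference (the class names here are illustrative, not framework types):

using System.Text;

class XmlCharRefFallback : EncoderFallback
{
    // Longest possible replacement is "&#1114111;" (10 characters).
    public override int MaxCharCount { get { return 10; } }

    public override EncoderFallbackBuffer CreateFallbackBuffer()
    {
        return new XmlCharRefFallbackBuffer();
    }
}

class XmlCharRefFallbackBuffer : EncoderFallbackBuffer
{
    string replacement = "";
    int position;

    // Called when a single character cannot be encoded: queue "&#dddd;".
    public override bool Fallback(char charUnknown, int index)
    {
        replacement = "&#" + (int)charUnknown + ";";
        position = 0;
        return true;
    }

    // Called when a surrogate pair cannot be encoded.
    public override bool Fallback(char charUnknownHigh, char charUnknownLow, int index)
    {
        replacement = "&#" + char.ConvertToUtf32(charUnknownHigh, charUnknownLow) + ";";
        position = 0;
        return true;
    }

    public override char GetNextChar()
    {
        return position < replacement.Length ? replacement[position++] : '\0';
    }

    public override bool MovePrevious()
    {
        if (position == 0) return false;
        position--;
        return true;
    }

    public override int Remaining { get { return replacement.Length - position; } }

    public override void Reset() { replacement = ""; position = 0; }
}

Usage, following the thread's example character:

Encoding win1251 = Encoding.GetEncoding("windows-1251",
    new XmlCharRefFallback(), DecoderFallback.ReplacementFallback);
string escaped = win1251.GetString(win1251.GetBytes("<Name>ғ</Name>"));
// escaped is "<Name>&#1171;</Name>", ready to hand to the XmlReader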

UPDATE

I just realized the flaw of this method. The XmlWriter will further escape the & characters as &amp;, so the escapes themselves will appear in the final document rather than the characters they represent.

This may require a very complicated solution!

ANOTHER UPDATE

Ignore that last update. Since you are reading the string in as XML, the escapes should be interpreted correctly. This is what I get for trying to post quickly rather than thinking through the problem!

My proposed solution should work fine.


Have you tried specifying the encoding in the xsl:output? (http://www.w3schools.com/xsl/el_output.asp)


The safest and most interoperable way to do this is to specify encoding="us-ascii" in your xsl:output element. Most XSLT processors support writing this encoding.

US-ASCII is a completely safe encoding as it is a compatible subset of UTF-8 (you may elect to label the emitted XML as having a "utf-8" encoding, as this will also be true: this can be done by specifying omit-xml-declaration="yes" for your xsl:output and manually prepending an "<?xml version='1.0' encoding='utf-8'?>" declaration to your output).
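
Applied to the stylesheet from the question, the output element would become:

<xsl:output method="xml" indent="yes" omit-xml-declaration="yes" encoding="us-ascii"/>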

This approach works because when using US-ASCII encoding, a serializer is forced to use XML's escaping mechanism for characters beyond U+007F, and so will emit them as numeric character references (the "&#.....;" form).

When dealing with environments in which non-standard encodings are required, it is generally a good defensive technique to produce this kind of XML as it is completely conformant and works in practice with even some buggy consuming software.
