Enhancing an ASCII protocol with multilingual fields

I am enhancing a piece of software that implements a simple ASCII based protocol.

The protocol is simple. Here is roughly what the messages look like (this is not the real protocol, which I can't show you):

AUTH 1 1 200<CR><LF>

To which we get a response looking similar to

230 DEVICE 1 STATE AUTH 200 OUTPUT 1 NAME "Photo Black"<CR><LF>

The name "Photo Black" comes from a SQLite database. I need to enhance it to support foreign languages, so I've been thinking that the field "Photo Black" needs to be "optionally" encoded as a UTF-8 string between the quotes. I'm wondering if there is a standard for this, so that the client application can look at the string in the quotes and immediately recognize it as either UTF-8 or plain ASCII. I'm not willing to rewrite the protocol; that would be too much work. I just want to slip in some kind of encoding that lets clients recognize some Spanish or Swedish names.

I don't want the field to always be interpreted as UTF-8 either; long story there. You know how in C++ I can type 0xFF and the compiler knows that this is a hexadecimal literal... is there an equivalent marker for UTF-8? Sorry, I may be jumping the gun, but I'm not that familiar with UTF-8 encoding or internationalization in general.


Do you have control over both the server and the client? If not, you can't change the protocol, so you won't be able to do it. When you say you're "not willing to rewrite the protocol", you're going to have to do so at least to some extent. Whatever you do, you will be changing the protocol.

I'm not sure why you wouldn't want to always interpret the data as UTF-8 either - if it's currently only ASCII, then it would be completely backward compatible to always interpret it as UTF-8, as all ASCII is encoded the same way in UTF-8. Perhaps if you could give more information, we could provide more help.
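To make the backward-compatibility point concrete, here is a minimal sketch in Python. The first message is the sample response from the question; the Swedish name is a made-up example, not something from the real protocol.

```python
# The sample response from the question, as the raw bytes an ASCII-only server sends.
legacy = b'230 DEVICE 1 STATE AUTH 200 OUTPUT 1 NAME "Photo Black"\r\n'

# Every byte is below 0x80, so the ASCII and UTF-8 decoders give the same text.
as_ascii = legacy.decode("ascii")
as_utf8 = legacy.decode("utf-8")
same = as_ascii == as_utf8

# A hypothetical Swedish name, as an upgraded server might send it in UTF-8.
swedish = 'NAME "Fotosvart blå"'.encode("utf-8")
has_high_bytes = any(b >= 0x80 for b in swedish)  # only the non-ASCII letter uses them
round_trip = swedish.decode("utf-8")
```

A client that always decodes as UTF-8 therefore reads old messages exactly as before, and only the new names contain bytes of 0x80 and above.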

You could introduce a prefix for UTF-8-encoded strings, e.g. U:

230 DEVICE 1 STATE AUTH 200 OUTPUT 1 NAME U"Photo UTF-8 stuff here Black"<CR><LF>

Would that help?
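On the client side, handling such a prefix could look roughly like the sketch below. The `U` marker is just the convention proposed in this answer, not any standard, and `parse_name` and `NAME_FIELD` are hypothetical names. It also assumes names never contain an escaped double quote, which the question does not specify.

```python
import re

# Matches NAME "..." or NAME U"..." at the byte level. The U prefix is the
# hypothetical marker suggested in this answer, not part of any standard.
NAME_FIELD = re.compile(rb'NAME (U?)"([^"]*)"')

def parse_name(line: bytes) -> str:
    """Return the NAME field of a response line as text."""
    match = NAME_FIELD.search(line)
    if match is None:
        raise ValueError("no NAME field in response")
    prefix, raw = match.groups()
    # U"..." is decoded as UTF-8; an unprefixed string must be plain ASCII,
    # so stray high bytes in a legacy field raise UnicodeDecodeError.
    return raw.decode("utf-8" if prefix == b"U" else "ascii")
```

Scanning for the closing quote on raw bytes is safe here because every byte of a multi-byte UTF-8 sequence is 0x80 or above, so it can never be mistaken for the ASCII quote character.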

Do you actually have an 8-bit data path? If something is going to mangle the top bit of every byte, then you'll need to consider options like Punycode instead of UTF-8.
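As a rough illustration of the difference, the sketch below simulates a link that clears the top bit of every byte and compares UTF-8 with Python's built-in `punycode` codec. The helper `strip_top_bit` and the Swedish name are assumptions made up for this example.

```python
def strip_top_bit(data: bytes) -> bytes:
    """Simulate a 7-bit data path that clears the high bit of every byte."""
    return bytes(b & 0x7F for b in data)

name = "Fotosvart blå"  # hypothetical Swedish name

# UTF-8 needs bytes >= 0x80 for the non-ASCII letter, so the 7-bit path mangles it.
utf8_wire = name.encode("utf-8")
utf8_survives = strip_top_bit(utf8_wire) == utf8_wire

# Punycode output is pure ASCII, so it passes through unchanged.
puny_wire = name.encode("punycode")
puny_survives = strip_top_bit(puny_wire) == puny_wire
restored = puny_wire.decode("punycode")
```

The cost is that a Punycode-encoded name is not readable on the wire, and the client still needs some way to know that the field is encoded at all.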


Read up on the concept of ASCII Compatible Encoding, or ACE. Internationalized domain names (IDNA, which uses Punycode) are an example. So is/was UTF-7.

Here's the master speaking.

You really can't code-switch in and out of UTF-8. For a nightmare, look up ISO-2022, which attempted to support that sort of thing. Also keep in mind that UTF-8 includes ASCII, but not Latin-1.
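The Latin-1 point is easy to check. In this small Python sketch the Spanish word is a made-up example: the same text encoded as Latin-1 is not valid UTF-8, so a strict UTF-8 decoder rejects it rather than silently accepting it.

```python
name = "Señal"  # hypothetical Spanish name

latin1 = name.encode("latin-1")  # ñ becomes the single byte 0xF1
utf8 = name.encode("utf-8")      # ñ becomes the two bytes 0xC3 0xB1

# A strict UTF-8 decoder refuses the Latin-1 bytes: 0xF1 would have to start
# a multi-byte sequence, but the byte after it is a plain ASCII letter.
try:
    latin1.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False
```

So if any existing data in the database is actually Latin-1 rather than pure ASCII, it has to be converted before the field can be treated as UTF-8.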


Why don't you want the field to be "always interpreted as UTF-8"? You don't say.

If you do have the client interpret the protocol as UTF-8 encoded text, all of the existing output will still work correctly, since UTF-8 is a proper superset of ASCII.
