
Is there a programming language with full and correct Unicode support?

Most programming languages have some support for Unicode, but all have more or less documented corner cases where things won't work correctly.


Examples

Java: reverse() in StringBuilder/StringBuffer works correctly. But length(), charAt(), etc. in String do not if a character needs more than 16 bits to encode.

C#: I didn't find a correct reverse method; Length and indexed access return wrong results.

Perl: Same problem.

PHP: Has no notion of Unicode at all; mbstring provides some better-working replacements.


I wonder if there is a programming language which has full and correct Unicode support. What compromises had to be made there to achieve such a thing?

  • More complex algorithms?
  • Higher memory consumption?
  • Slower performance?

How was it implemented internally?

  • Array of Ints, Linked Lists, etc.
  • Additional buffering

I saw that Python 3 had some pretty big changes in this area. How close is Python 3 now to a correct implementation?


The Java implementation is correct in the sense that it does not violate the Unicode standard; there is no prescription that string indexing work on code points instead of code units, and the behavior is documented. The Unicode standard gives implementors great freedom concerning optimizations, as long as no invalid string is leaked.

Concerning "full support", that's even harder to define. The Unicode standard generally doesn't require that certain features be implemented to be Unicode-compatible; only that the features that are implemented are implemented according to the standard. Huge parts concerning script processing belong to fonts or the operating system, which programming systems cannot control.

If you want to judge the Unicode support of certain technologies, you can start by testing the following (subjective and non-exhaustive) list of topics; a small sketch of a few of these checks follows the list:

  • Does the system have a string datatype that uses a Unicode encoding?
  • Are all Unicode (UTF) encodings supported that are described in the standard?
  • Normalization
  • The Bidirectional Algorithm
  • Is UpperCase("ß") = "SS"?
  • Is upper-casing locale sensitive? (e.g. in Turkish, UpperCase("i") = "İ")
  • Are there functions to work with code points instead of code units?
  • Unicode regular expressions
  • Does the system raise exceptions when invalid code unit sequences are encountered during decoding?
  • Access to Unicode Database properties?
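As an illustration, here is a minimal Swift sketch probing a few of these items. Swift is used only as an example language here; the printed values assume a current toolchain with up-to-date Unicode data:

    import Foundation

    let flag = "\u{1F1E9}\u{1F1EA}"       // 🇩🇪 — one grapheme cluster, two code points
    print(flag.count)                     // 1 — grapheme clusters ("characters")
    print(flag.unicodeScalars.count)      // 2 — code points
    print(flag.utf16.count)               // 4 — UTF-16 code units

    // Is UpperCase("ß") = "SS"?
    print("ß".uppercased())               // "SS"

    // Locale-sensitive upper-casing (Turkish dotted capital I)
    print("i".uppercased(with: Locale(identifier: "tr_TR")))   // "İ"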

I think the Java and .NET answers to these questions are mostly "yes", while the Python 3.x answers are almost always "no".


Go, the new language developed at Google by Ken Thompson and Rob Pike, and the C dialect in Plan 9 from Bell Labs were built with Unicode in mind (UTF-8 was invented there, at Bell Labs, by Ken Thompson).


It looks like Perl 6 gets good Unicode support:

perlgeek.de/en/article/5-to-6#post_17

For instance, it provides three different length methods:

  • bytes (number of bytes)
  • codes (number of code points)
  • graphs (number of graphemes)

This gets integrated into Perl's regular expressions as well.
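The same three levels can be sketched compactly in Swift (a hedged analog, not Perl 6 code; the counts assume the combining-character literal below):

    let s = "e\u{0301}"               // "é" written as 'e' + combining acute accent
    print(s.utf8.count)               // 3 — bytes (UTF-8)
    print(s.unicodeScalars.count)     // 2 — code points
    print(s.count)                    // 1 — graphemes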

Looks like a step in the right direction to me.


Though this is a 10-year-old question...

Yes. Swift does.

  • The basic string type String performs all character handling at the Unicode grapheme cluster level. Therefore you are forced to perform every text-mutating operation in a "Unicode-correct" manner, at the "human-perceived character" level.

  • The String type is an abstract data type and does not expose its internal representation, but it has interfaces to access Unicode scalar values and Unicode code units for the UTF-8, UTF-16, and UTF-32 encodings.

  • It also stores breadcrumbs to provide offset conversion between UTF-8 and UTF-16 in amortized O(1) time.

  • The Character type also provides decomposition into Unicode scalar values.

  • The Character type has multiple character-classification methods based on Unicode semantics. For example, Character.isNewline returns true for all newline strings defined in the Unicode standard, including LF, VT, FF, CR, CR-LF, and NEL, among others.

  • Though it's abstracted, Swift 5.x internally stores strings in UTF-8 encoded form by default. That representation can be accessed in strict O(1) time, so you can use UTF-8-based functions without sacrificing performance.

  • "Unicode" in Swift covers "all" characters defined in Unicode standard and not limited to BMP.

  • String, Character, and all of their derived view types like UTF8View, UTF16View, and UnicodeScalarView conform to the BidirectionalCollection protocol, so you can iterate components bidirectionally at every supported segmentation level. They all share the same index type, so indices obtained from one view can be used on another view if they point at correct grapheme cluster boundaries (see the sketch below).
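A minimal sketch of these views and shared indices, using only the standard library (the printed counts assume the family-emoji literal below, one human-perceived character composed of several scalars):

    let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}"   // 👨‍👩‍👧 man + ZWJ + woman + ZWJ + girl
    print(family.count)                  // 1  — grapheme cluster
    print(family.unicodeScalars.count)   // 5  — scalars (3 people + 2 joiners)
    print(family.utf16.count)            // 8  — UTF-16 code units
    print(family.utf8.count)             // 18 — UTF-8 code units

    // Indices are shared across views: an index found in one view can be
    // reused in another when it lands on a valid boundary.
    let s = "héllo"
    if let i = s.firstIndex(of: "l") {
        print(s.utf8.distance(from: s.utf8.startIndex, to: i))   // 3 — byte offset of "l"
    }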


In Python 3, strings are always Unicode (there is a separate bytes type for encoded or binary data). I'm not aware of any built-ins not working correctly with them. There may be some, but considering it has been out for quite a while, I figure everything needed day-to-day works.

Of course Unicode has higher memory consumption (not really with UTF-8 if you stay within the ASCII range, but otherwise...), and I can imagine variable-length encodings are a pain to handle internally. I don't know anything about the implementation, though, except that it can't be a linked list, since it has O(1) random access.


The .NET Framework stores char and string data using the UTF-16 encoding. If you assume that all your text lies within the Basic Multilingual Plane, then everything will just work without any special code.

If you regard user-entered strings as blobs and don't try to manipulate them (e.g. most text fields in CRUD apps), then your code will appear to handle characters outside the BMP correctly, because UTF-16 stores them as surrogate pairs. As long as you don't fiddle with the surrogate pairs, then all will be fine.

However, if you want to analyse and manipulate strings while also handling characters outside the BMP correctly, then you have to explicitly code for that possibility. See the StringInfo class for methods to help you process surrogate pairs.
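To make the surrogate-pair issue concrete, here is a small sketch (written in Swift rather than C#, purely for illustration; .NET strings use the same UTF-16 representation):

    let clef = "\u{1D11E}"                            // 𝄞 MUSICAL SYMBOL G CLEF, outside the BMP
    print(clef.utf16.map { String($0, radix: 16) })   // ["d834", "dd1e"]
    // One user-visible character, but two UTF-16 code units — naive
    // indexing or slicing between them corrupts the string.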

I would guess that Microsoft designed it this way to achieve a balance between performance and correctness. The alternatives would be:

  • Store strings as UTF-32 - poor performance in terms of memory use
  • Make all string functions handle surrogate pairs - very poor performance for manipulation

.NET also contains full support for culture-aware case conversion, comparisons and sorting.


I believe that any language supported on the .NET Framework has correct Unicode (UTF-16) support.

Also, similar question here


Digital Mars D has the datatype dstring, which uses UTF-32 code points; that should be enough for most cases.
