Objective-C How to get unicode character

2023-02-05 03:30 问答作者：

I want to get unicode code point for a given unicode character in Objective-C. NSString said it internal use UTF-16 encoding and said,

The NSString class has two primitive methods—length and characterAtIndex:—that provide the basis for all other methods in its interface. The length method returns the total number of Unicode characters in the string. characterAtIndex: gives access to each character in the string by index, with index values starting at 0.

That seems assume characterAtIndex method is unicode aware. However it return unichar is a 16 bits unsigned int type.

- (unichar)characterAtIndex:(NSUInteger)index

The questions are:

Q1: How it present unicode code point above UFFFF?
Q2: If Q1 make sense, is there method to get unicode code point for a giv开发者_如何转开发en unicode character in Objective-C.

Thx.

The short answer to "Q1: How it present unicode code point above UFFFF?" is: You need to be UTF16 aware and correctly handle Surrogate Code Points. The info and links below should give you pointers and example code that allow you to do this.

The NSString documentation is correct. However, while you said "NSString said it internal use UTF-16 encoding", it's more accurate to say that the public / abstract interface for NSString is UTF16 based. The difference is that this leaves the internal representation of a string a private implementation detail, but the public methods such as characterAtIndex: and length are always in UTF16.

The reason for this is it tends to strike the best balance between older ASCII-centric and Unicode aware strings, largely due to the fact that Unicode is a strict superset of ASCII (ASCII uses 7 bits, for 128 characters, which are mapped to the first 128 Unicode Code Points).

To represent Unicode Code Points that are > U+FFFF, which obviously exceeds what can be represented in a single UTF16 Code Unit, UTF16 uses special Surrogate Code Points to form a Surrogate Pair, which when combined together form a Unicode Code Point > U+FFFF. You can find details about this at:

Unicode UTF FAQ - What are surrogates?
Unicode UTF FAQ - What’s the algorithm to convert from UTF-16 to character codes?
Although the official Unicode UTF FAQ - How do I write a UTF converter? now recommends the use of International Components for Unicode, it used to recommend some code officially sanctioned and maintained by Unicode. Although no longer directly available from Unicode.org, you can still find copies of the "no longer official" example code in various open-source projects: ConvertUTF.c and ConvertUTF.h. If you need to roll your own, I'd strongly recommend examining this code first, as it is well tested.

From the documentation of length:

The number returned includes the individual characters of composed character sequences, so you cannot use this method to determine if a string will be visible when printed or how long it will appear.

From this, I would infer that any characters above U+FFFF would be counted as two characters and would be encoded as a Surrogate Pair (see the relevant entry at http://unicode.org/glossary/).

If you have a UTF-32 encoded string with the character you wish to convert, you could create a new NSString with initWithBytesNoCopy:length:encoding:freeWhenDone: and use the result of that to determine how the character is encoded in UTF-16, but if you're going to be doing much heavy Unicode processing, your best bet is probably to get familiar with ICU (http://site.icu-project.org/).

继续阅读：objective-c unicode

Objective-C How to get unicode character

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？