ICU Unicode Normal vs Fullwidth

2022-12-18 17:42 Q&A author:

I am somewhat new to Unicode and Unicode strings. I'm trying to determine the difference between a "fullwidth" symbol and a normal one.

Take these two for example:

Normal: http://www.fileformat.info/info/unicode/char/20a9/index.htm

Fullwidth: http://www.fileformat.info/info/unicode/char/ffe6/index.htm

I notice that the fullwidth symbol is defined as U+20A9, and coincidentally 20A9 is the normal one. So what is the value of U?

When using libraries like ICU, is there a way to specify that they should always return the normal form rather than the fullwidth one?

Thanks,

U+number is a notational convention for a Unicode code point. There is no 'value' of U.

U+0020, for example, is a space. The value in memory is 32 decimal, 20 hex.
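
To make the point concrete, here is a minimal Python sketch (Python is used only for illustration; the question is about ICU). The helper name code_point_value is hypothetical. It treats "U+" purely as a prefix and reads the rest as a hexadecimal number:

```python
def code_point_value(notation: str) -> int:
    """Return the integer value of a code point written as 'U+XXXX'.

    Hypothetical helper for illustration: the 'U+' prefix carries no value,
    it only marks the hex digits that follow as a Unicode code point.
    """
    if not notation.upper().startswith("U+"):
        raise ValueError(f"not in U+ notation: {notation!r}")
    return int(notation[2:], 16)


space = code_point_value("U+0020")
print(space)              # 32 (decimal)
print(hex(space))         # 0x20
print(chr(space) == " ")  # True: code point 32 is the space character
```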

Full width characters are a whole other story.

Back in the days of the 3270, Hanzi took up two positions in memory in the display. So they also took up two columns on the screen. To make things line up neatly, IBM defined a set of 'full-width' (better would have been 'double-width') letters and numbers.

If some ICU API is delivering full-width characters, you can use the Normalizer with a compatibility form (NFKC or NFKD) to map them to their normal equivalents. You might also file a ticket in ICU's ticket system; this seems odd.
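
As a rough sketch of what that normalization does, the snippet below uses Python's standard unicodedata module as a stand-in for ICU (an assumption made for brevity; ICU offers the same NFKC form through its own normalization API). The function name to_normal_width is hypothetical:

```python
import unicodedata


def to_normal_width(text: str) -> str:
    """Apply NFKC compatibility normalization.

    Fullwidth characters carry a <wide> compatibility mapping to their
    ordinary counterparts, so NFKC replaces them with those counterparts.
    """
    return unicodedata.normalize("NFKC", text)


fullwidth = "\uFFE6\uFF11\uFF10\uFF10"  # FULLWIDTH WON SIGN + fullwidth digits 1, 0, 0
normal = to_normal_width(fullwidth)

print(normal)                 # ₩100
print(normal == "\u20A9100")  # True
```

Note that NFKC is not limited to width: it also folds other compatibility characters such as ligatures and superscripts, so it may change more than intended if the text must otherwise be preserved exactly.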

The 'U' in "U+20A9" just denotes that "20A9" is a Unicode code point, the value of the won sign character in the Unicode codespace. It's a notation used in the Unicode Standard. The "U+" is followed by a hexadecimal number of at least 4 digits, such as "U+1234" or "U+10FFFD".
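
Going the other way, a character can be printed in this notation by formatting its code point as uppercase hex, zero-padded to at least four digits. A small Python sketch, with u_notation as a hypothetical helper name:

```python
def u_notation(ch: str) -> str:
    """Format a single character as 'U+XXXX' (hex, padded to at least 4 digits)."""
    return f"U+{ord(ch):04X}"


print(u_notation(" "))           # U+0020
print(u_notation("\u20A9"))      # U+20A9
print(u_notation("\uFFE6"))      # U+FFE6
print(u_notation("\U0010FFFD"))  # U+10FFFD
```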

U+20A9 (₩) is the WON SIGN
U+FFE6 (￦) is the FULLWIDTH WON SIGN

This is a legacy of older character encodings. The "width" affected layout. The Unicode spec says:

Compatibility variants are a subset of compatibility characters, and have the further characteristic that they represent variants of existing, ordinary, Unicode characters. For example, compatibility variants might represent various presentation or styled forms of basic letters: superscript or subscript forms, variant glyph shapes, or vertical presentation forms. They also include halfwidth or fullwidth characters from East Asian character encoding standards, Arabic contextual form glyphs from pre-existing Arabic code pages, Arabic ligatures and ligatures from other scripts, and so on. Compatibility variants also include CJK compatibility ideographs, many of which are minor glyph variants of an encoded unified CJK ideograph.

Including these forms in Unicode allows the conversion of text from (and to) the older encodings without loss of meaning.
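
The relationship between the two won signs can be inspected in the Unicode Character Database. The sketch below uses Python's unicodedata module (an illustration only, not ICU) to print each character's name, its East Asian Width property, and its compatibility decomposition:

```python
import unicodedata

for ch in ("\u20A9", "\uFFE6"):
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.east_asian_width(ch),     # East Asian Width property (Annex #11)
        repr(unicodedata.decomposition(ch)),  # empty string if there is no mapping
    )

# U+20A9 WON SIGN H ''
# U+FFE6 FULLWIDTH WON SIGN F '<wide> 20A9'
```

The "<wide> 20A9" entry is the compatibility mapping: it records that U+FFE6 is a wide variant of U+20A9, which is what compatibility normalization relies on.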

References:

General Structure
Southeast Asian Scripts
Annex #11: East Asian Width
