C++ UTF-8 lightweight & permissive code?
Anyone know of a more permissive license (MIT / public domain) version of this:
http://library.gnome.org/devel/glibmm/unstable/classGlib_1_1ustring.html
('drop-in' replacement for std::string that's UTF-8 aware)
Lightweight, does everything I need and even more (doubt I'll even use the UTF-XX conversions)
I really don't want to be carrying ICU around with me.
- std::string is fine for UTF-8 storage.
- If you need to analyze the text itself, UTF-8 awareness will not help you much, as too many things in Unicode do not work on a per-code-point basis.
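For example (a minimal illustration; the string literal is just a made-up sample):

```cpp
#include <iostream>
#include <string>

int main() {
    // UTF-8 text stored in a plain std::string; the bytes pass through untouched.
    std::string s = "na\xc3\xafve";   // "naïve": 5 characters, 6 bytes
    std::cout << s.size() << "\n";    // prints 6: size() counts bytes, not characters
}
```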
Take a look at the Boost.Locale library (it uses ICU under the hood):
- Reference http://cppcms.sourceforge.net/boost_locale/html/
- Tutorial http://cppcms.sourceforge.net/boost_locale/html/tutorial.html
- Download https://sourceforge.net/projects/cppcms/files/
It is not lightweight, but it allows you to handle Unicode correctly, and it uses std::string as storage.
If you expect to find a lightweight, Unicode-aware library to deal with strings, you won't find one, because Unicode is not lightweight. Even relatively "simple" operations like uppercase/lowercase conversion or Unicode normalization require complex algorithms and access to the Unicode character database.
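For instance, a minimal Boost.Locale sketch (the locale name and the sample word are assumptions for illustration, and Boost.Locale must be built with the ICU backend):

```cpp
#include <boost/locale.hpp>
#include <iostream>
#include <string>

int main() {
    // The generator builds a std::locale backed by ICU's Unicode tables.
    boost::locale::generator gen;
    std::locale loc = gen("en_US.UTF-8");

    std::string s = "gr\xc3\xbc\xc3\x9f" "e";   // "grüße" as explicit UTF-8 bytes
    // Full Unicode case mapping: "ß" becomes "SS", which a per-code-point
    // toupper() cannot do.
    std::cout << boost::locale::to_upper(s, loc) << "\n";   // GRÜSSE
}
```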
If you need the ability to iterate over code points (which, BTW, are not characters), take a look at http://utfcpp.sourceforge.net/
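A minimal sketch of code point iteration with it (assuming its single header is available as "utf8.h", and the sample string is made up):

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include "utf8.h"   // the utfcpp header-only library

int main() {
    std::string s = "a\xc3\xa9\xe2\x82\xac";   // "aé€": 3 code points, 6 bytes
    auto it = s.begin();
    while (it != s.end()) {
        // utf8::next decodes one code point and advances the iterator;
        // it throws on malformed UTF-8.
        std::uint32_t cp = utf8::next(it, s.end());
        std::cout << std::hex << cp << "\n";   // 61, e9, 20ac
    }
}
```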
Answer to comment:
1) Find file formats for files included by me
std::string::find is perfectly fine for this.
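A minimal sketch (the file name and the ".png" marker are just made-up examples); note that UTF-8 is self-synchronizing, so a valid UTF-8 needle can never match in the middle of another character:

```cpp
#include <iostream>
#include <string>

int main() {
    // File names and format markers are plain bytes inside the UTF-8 text.
    std::string name = "r\xc3\xa9sum\xc3\xa9.png";   // "résumé.png"
    if (name.find(".png") != std::string::npos)
        std::cout << "looks like a PNG\n";
}
```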
2) Line break detection
This is not a simple issue. Have you ever tried to find a line break in Chinese or Japanese text? Probably not, as spaces do not separate words there. So line-break detection is a hard job. (I don't think even glib does this correctly; I think only Pango has something like that.)
And of course Boost.Locale does this, and does it correctly.
If you only need to do this for European languages, just search for spaces or punctuation marks, and std::string::find
is more than fine.
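If you do need real line-break opportunities, a rough sketch of Boost.Locale's boundary analysis looks like this (the locale name and sample text are assumptions, the source file is assumed to be saved as UTF-8, and Boost.Locale must be built with the ICU backend):

```cpp
#include <boost/locale.hpp>
#include <iostream>
#include <string>

namespace blb = boost::locale::boundary;

int main() {
    boost::locale::generator gen;
    std::locale loc = gen("ja_JP.UTF-8");

    std::string text = "こんにちは、世界。";   // Japanese text with no spaces
    // segment_index over boundary::line enumerates the chunks between
    // legal line-break opportunities, using ICU's rules.
    blb::ssegment_index lines(blb::line, text.begin(), text.end(), loc);
    for (blb::ssegment_index::iterator it = lines.begin(); it != lines.end(); ++it)
        std::cout << "[" << *it << "]\n";
}
```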
3) Character (or now, code point) counting. Looking at utfcpp, thx
Characters are not code points. For example, the Hebrew word Shalom, "שָלוֹם", consists of 4 characters but 6 code points, where two of the code points are used for vowels. The same applies to European languages, where a single character can be represented with two code points; for example, "ü" can be represented as "u" plus a combining "¨" (two code points).
So if you are aware of these issues, then utfcpp will be fine; otherwise you will not find anything simpler.
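With that caveat in mind, counting code points with utfcpp is a one-liner; a sketch, again assuming the "utf8.h" header (the byte escapes below spell the Hebrew word from the example above):

```cpp
#include <iostream>
#include <string>
#include "utf8.h"

int main() {
    // "שָלוֹם": 4 visible characters, but 6 code points (2 of them are vowel marks).
    std::string shalom = "\xd7\xa9\xd6\xb8\xd7\x9c\xd7\x95\xd6\xb9\xd7\x9d";
    std::cout << utf8::distance(shalom.begin(), shalom.end()) << "\n";   // 6 code points
    std::cout << shalom.size() << "\n";                                  // 12 bytes
}
```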
I never used it, but I stumbled upon this UTF-8 CPP library a while ago and had a good enough feeling about it to bookmark it. It is released under a BSD-like license, IIUC.
It still relies on std::string
for strings and provides lots of utility functions to help check that a string is really UTF-8, to count the number of characters, to go back or forward by one character… It is really small and lives only in header files: looks really good!
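A tiny sketch of those utilities (again assuming the header is available as "utf8.h" and a made-up sample string):

```cpp
#include <iostream>
#include <string>
#include "utf8.h"

int main() {
    std::string s = "caf\xc3\xa9";   // "café"
    std::cout << utf8::is_valid(s.begin(), s.end()) << "\n";   // 1: well-formed UTF-8
    std::cout << utf8::distance(s.begin(), s.end()) << "\n";   // 4 code points
    auto it = s.end();
    // utf8::prior steps the iterator back over one whole character and decodes it.
    std::cout << std::hex << utf8::prior(it, s.begin()) << "\n";   // e9 ('é')
}
```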
You might be interested in the Flexible and Economical UTF-8 Decoder by Björn Höhrmann, but by no means is it a drop-in replacement for std::string.