
Using boost::format %s specifier with UTF-8 strings

We are adding UTF-8 support to an existing application with a large code base. The application uses boost::format(), and output containing non-ASCII characters does not align properly. Specifically, with the %{width}.{length}s specifier, boost::format() counts chars (bytes), which does not "do the right thing" with UTF-8 strings. I think it should be possible to change the string length code (which is probably string::size()) to use utf8len() or something analogous, based on ... something?
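To make the mismatch concrete, here is a minimal sketch of the kind of utf8len() replacement the question has in mind. The name utf8_length is hypothetical, the input is assumed to be valid UTF-8, and the function counts code points, which is still only an approximation of display width (combining marks and double-width CJK characters are not handled).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 code points in `s`.
// Every byte that is NOT a continuation byte (10xxxxxx) starts a new code point.
// Assumes `s` holds valid UTF-8; no validation is performed.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For a string such as "héllo", std::string::size() reports 6 bytes while this helper reports 5 code points, which is exactly the one-column drift seen in the misaligned output.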

In this case, it is not practical to change the existing code base to use UCS2 (or UCS4, or UTF-16, etc), but it is possible to modify boost::format() if necessary. I was hoping someone else had run across this need, and can point me to a possible solution.

Note: I found some web pages on using locales with UTF-8, but most of that seemed more applicable to converting between UTF-8 and UCS4 in streams.


This is probably too late for you, but maybe it will help someone else. boost::format accepts a std::locale as an optional constructor argument (see http://www.boost.org/doc/libs/1_55_0/libs/format/doc/format.html). If you pass it a Unicode-aware locale, such as one generated by Boost.Locale for "en_US.UTF-8", you should get the desired behavior.

Instead of passing a locale to the boost::format constructor each time, you could also set the default (global) locale of your application, which might help you avoid other problems. If you take this route, I would recommend a Boost.Locale locale over a plain std::locale, as Boost.Locale locales won't modify your numeric formatting unless you explicitly ask them to (see the Boost.Locale documentation).

In general, this is a go-to approach for making a C++ application work nicely with Unicode. If the functionality can use a locale (std::regex, std::sort, boost::format), give it a Unicode-aware locale, and you should be safe (and if you aren't, please tell me, I want to know).

If you are making a small, lightweight application and only care about the 80% case, you may not want to pay the price of including ICU (International Components for Unicode), which is the default engine Boost.Locale wraps to provide Unicode support. In that case, build Boost against your OS's or POSIX Unicode support instead. Your application will remain small and light, but you will miss some Unicode features, such as multiple collation levels.

For the problem you are describing, POSIX support is likely sufficient.


AFAIK Boost Format measures everything in code units even when a UTF-8 based locale is used.
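If switching libraries is not an option, one possible workaround is to do the truncation and padding yourself and hand the result to a plain %s. The sketch below is not part of Boost; fit_utf8 is a hypothetical helper that emulates %{width}.{precision}s on code points instead of bytes, assuming valid UTF-8 input and one display column per code point.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper emulating "%{width}.{precision}s" on UTF-8 code points.
// Keeps at most `precision` code points, then left-pads with spaces up to
// `width` code points (printf-style %s is right-aligned by default).
// Assumes `s` is valid UTF-8; display width of wide or combining
// characters is NOT taken into account.
std::string fit_utf8(const std::string& s, std::size_t width, std::size_t precision) {
    std::size_t points = 0;  // code points kept so far
    std::size_t end = 0;     // byte offset just past the kept code points
    while (end < s.size()) {
        const bool starts_code_point =
            (static_cast<unsigned char>(s[end]) & 0xC0) != 0x80;
        if (starts_code_point) {
            if (points == precision) {
                break;  // the next code point would exceed the limit
            }
            ++points;
        }
        ++end;
    }
    std::string out = s.substr(0, end);
    if (points < width) {
        out = std::string(width - points, ' ') + out;
    }
    return out;
}
```

With this in place, a call such as boost::format("|%s|") % fit_utf8(name, 10, 10) (where name is whatever string you are formatting) would replace "%10.10s" at the call site, at the cost of touching each affected format string.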

If you can switch to another library, then consider C++20 std::format or the {fmt} formatting library, which count width in display width units (similarly to wcswidth) so the alignment is correct. For example:

fmt::print("┌{0:─^{2}}┐\n"
           "│{1: ^{2}}│\n"
           "└{0:─^{2}}┘\n", "", "Hello, world!", 20);

prints:

┌────────────────────┐
│   Hello, world!    │
└────────────────────┘

Disclaimer: I'm the author of {fmt} and C++20 std::format
