
Polish chars in std::string

I have a problem. I'm writing an app in Polish (with, of course, Polish characters) for Linux, and I get 80 warnings when compiling. They are all either "warning: multi-character character constant" or "warning: case label value exceeds maximum value for type". I'm using std::string.

How do I replace the std::string class?

Please help. Thanks in advance. Regards.


std::string does not define a particular encoding, so you can store any sequence of bytes in it. There are subtleties to be aware of:

  1. .c_str() will return a null-terminated buffer. If your character set allows null bytes, don't pass this string to functions that take a const char* parameter without a length, or your data will be truncated.
  2. A char does not represent a character, but a byte. IMHO, this is the most problematic nomenclature in computing history. Note that wchar_t does not necessarily hold a full character either: in UTF-16 a code point outside the Basic Multilingual Plane takes two code units, and a letter with a diacritic may be stored as a base letter plus a combining mark.
  3. .size() and .length() will return the number of bytes, not the number of characters.
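
To make rules (1) and (3) concrete, here is a minimal sketch. The helper names utf8_length and c_api_visible_bytes are made up for this illustration, and the counting helper assumes the string already holds well-formed UTF-8. It counts code points, which is still not the same thing as user-perceived characters when combining marks are involved.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: counts UTF-8 code points by skipping
// continuation bytes (those of the form 10xxxxxx).
// Assumes well-formed UTF-8; it does not validate anything.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Hypothetical helper: how many bytes a C API that only receives
// c_str() would see. strlen() stops at the first null byte.
std::size_t c_api_visible_bytes(const std::string& s) {
    return std::strlen(s.c_str());
}
```

For the UTF-8 bytes of "żółw" (written as "\xC5\xBC\xC3\xB3\xC5\x82w" to stay independent of the source-file encoding), .size() reports 7 while utf8_length reports 4. A string built as std::string("ab\0cd", 5) has .size() 5, but only 2 bytes are visible through c_str().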

[edit] The warnings about case labels are related to issue (2). You are using a switch statement on multi-byte characters with type char, which cannot hold more than one byte.[/edit]
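
In a UTF-8 source file, a label such as case 'ą': puts two bytes between the single quotes, which is what triggers both warnings. One way around it, sketched below, is to decode the bytes into code points first and switch on those. This needs C++11 for char32_t; next_code_point and count_polish_vowels_with_diacritics are names invented for this example, and the decoder is deliberately minimal (it assumes mostly well-formed input and does not reject overlong forms).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes the UTF-8 sequence starting at s[pos]
// and advances pos past it. Returns U+FFFD for bytes it cannot interpret.
char32_t next_code_point(const std::string& s, std::size_t& pos) {
    unsigned char lead = static_cast<unsigned char>(s[pos++]);
    int extra = 0;        // number of continuation bytes expected
    char32_t cp = lead;   // code point being assembled
    if (lead < 0x80) {
        return cp;        // plain ASCII
    } else if ((lead & 0xE0) == 0xC0) {
        extra = 1; cp = lead & 0x1F;
    } else if ((lead & 0xF0) == 0xE0) {
        extra = 2; cp = lead & 0x0F;
    } else if ((lead & 0xF8) == 0xF0) {
        extra = 3; cp = lead & 0x07;
    } else {
        return 0xFFFD;    // stray continuation byte or invalid lead byte
    }
    while (extra-- > 0) {
        if (pos >= s.size()) return 0xFFFD;  // truncated sequence
        unsigned char c = static_cast<unsigned char>(s[pos]);
        if ((c & 0xC0) != 0x80) return 0xFFFD;
        cp = (cp << 6) | (c & 0x3F);
        ++pos;
    }
    return cp;
}

// Counts the letters ą, ę and ó in a UTF-8 encoded std::string.
int count_polish_vowels_with_diacritics(const std::string& utf8) {
    int count = 0;
    std::size_t pos = 0;
    while (pos < utf8.size()) {
        switch (next_code_point(utf8, pos)) {
            case 0x0105: // ą
            case 0x0119: // ę
            case 0x00F3: // ó
                ++count;
                break;
            default:
                break;
        }
    }
    return count;
}
```

The case labels are numeric code points, so the switch compiles cleanly regardless of how the source file itself is encoded.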

Therefore, you can use std::string in your application, provided that you respect these three rules. As a consequence, there are subtleties involving the standard algorithms, including std::find(). You need more clever string-matching algorithms to properly support Unicode because of normalization forms.
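
A small sketch of what can go wrong with byte-wise searching. contains_text and contains_byte are hypothetical wrappers around std::string::find and std::find, written only to show the effect. The letter ó can be stored precomposed (U+00F3, bytes C3 B3) or as "o" followed by the combining acute accent (U+0301, bytes CC 81), and a byte-wise search treats the two as different text.

```cpp
#include <algorithm>
#include <string>

// Byte-wise substring search. Correct for exact byte matches,
// but blind to Unicode normalization forms.
bool contains_text(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle) != std::string::npos;
}

// Searching for a single char only ever matches one byte, which may be
// just the lead byte shared by several different letters.
bool contains_byte(const std::string& haystack, char byte) {
    return std::find(haystack.begin(), haystack.end(), byte) != haystack.end();
}
```

Searching "król" for the precomposed ó succeeds when the text is precomposed ("kr\xC3\xB3l") and fails when it is decomposed ("kro\xCC\x81l"), even though both render the same. Searching for the single byte 0xC5 "finds" something in ł, ń, ś, ź and ż alike, because they all start with that byte in UTF-8.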

However, when writing applications in any language that uses non-ASCII characters (if you're paranoid, consider this anything outside [0, 128)), you need to be aware of encodings in different sources of textual data.

  1. The source-file encoding might not be specified, and might be subject to change using compiler options. Any string literal will be subject to this rule. I guess this is why you are getting warnings.
  2. You will get a variety of character encodings from external sources (files, user input, etc.). When the source specifies its encoding, or you can get it some other way (e.g. by asking the user who imports the data), this is easier. A lot of (newer) internet protocols impose ASCII or UTF-8 unless otherwise specified.

These two issues are not addressed by any particular string class. You just need to convert every external source to your internal encoding. I suggest UTF-8 all the time, but especially so on Linux because of native support. I strongly recommend placing your string literals in a message file so you can forget about issue (1) and only deal with issue (2).
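
As a rough sketch of the "convert at the boundary" idea, the function below uses the POSIX iconv API to turn bytes in a known source encoding into UTF-8. The name to_utf8 is made up for this example. It assumes a glibc-style iconv() prototype taking char** for the input pointer, and that the encoding name (here "ISO-8859-2") is one your iconv implementation recognizes. Stateful source encodings would need extra care that is left out here.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts `input` from the encoding named `from`
// to UTF-8. Throws if the encoding is unknown or the input is invalid.
std::string to_utf8(const std::string& input, const char* from) {
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1) {
        throw std::runtime_error("unsupported source encoding");
    }

    std::string output;
    char buffer[256];
    // iconv() does not modify the input bytes; the cast only satisfies
    // the (assumed) char** prototype.
    char* in = const_cast<char*>(input.data());
    std::size_t in_left = input.size();

    while (in_left > 0) {
        char* out = buffer;
        std::size_t out_left = sizeof buffer;
        std::size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
        // Keep whatever was converted in this round.
        output.append(buffer, sizeof buffer - out_left);
        // E2BIG only means the output buffer filled up; loop again.
        if (rc == (std::size_t)-1 && errno != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("invalid or truncated input");
        }
    }

    iconv_close(cd);
    return output;
}
```

With this, the ISO-8859-2 bytes BF F3 B3 77 ("żółw") become the UTF-8 bytes C5 BC C3 B3 C5 82 77, and the rest of the program only ever has to deal with UTF-8.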

I don't suggest using std::wstring on Linux because virtually all native APIs use function signatures with const char* and have direct support for UTF-8. If you use any string class based on wchar_t, you will need to convert to/from std::wstring non-stop and will eventually get something wrong, on top of making everything slow(er).

If you were writing an application for Windows, I'd suggest exactly the opposite because all native APIs use const wchar_t* signatures. The ANSI versions of such functions perform an internal conversion to/from const wchar_t*.

Some "portable" libraries/languages use different representations based on the platform. They use UTF-8 with char on Linux and UTF-16 with wchar_t on Windows. I recall reading about that trick in the Python reference implementation, but the article was quite old. I'm not sure if that is true anymore.


On Linux you should use the multibyte string class provided by the framework you use.

I'd recommend Glib::ustring, from the glibmm framework, which stores strings in UTF-8 encoding. If your source files are in UTF-8, then using a multibyte string literal in code is as easy as:

ustring alphabet("aąbcćdeęfghijklłmnńoóprsśtuwyzźż");

But you cannot build a switch/case statement on multibyte characters using char. I'd recommend using a series of ifs. You can use glibmm's gunichar, but it's not very readable (you can get the correct Unicode values for the characters from the table in the Wikipedia article on the Polish alphabet):

#include <glibmm.h>
#include <iostream>

using namespace std;

int main()
{
        // This literal is only correct if the source file is saved as UTF-8.
        Glib::ustring alphabet("aąbcćdeęfghijklłmnńoóprsśtuwyzźż");
        int small_polish_vowels_with_diacritics_count = 0;
        // size() counts characters here, not bytes; operator[] yields a gunichar.
        for ( Glib::ustring::size_type i=0; i<alphabet.size(); i++ ) {
                switch (alphabet[i]) {
                        case 0x0105: // ą
                        case 0x0119: // ę
                        case 0x00f3: // ó
                                small_polish_vowels_with_diacritics_count++;
                                break;
                        default:
                                break;
                }
        }
        cout << "There are " << small_polish_vowels_with_diacritics_count
                << " small Polish vowels with diacritics in this string.\n";
        return 0;
}

You can compile this using:

g++ progname.cc -o progname `pkg-config --cflags --libs glibmm-2.4`


std::string is for ASCII strings. Since your Polish strings don't fit in, you should use std::wstring.
