How can I avoid encoding mixups of strings in a C/C++ API?
I'm working on implementing different APIs in C and C++ and wondered what techniques are available to keep clients from getting the encoding wrong when receiving strings from the framework or passing them back. For instance, imagine a simple plugin API in C++ which customers can implement to influence translations. It might feature a function like this:
const char *getTranslatedWord( const char *englishWord );
Now, let's say that I'd like to enforce that all strings are passed as UTF-8. Of course I'd document this requirement, but I'd like the compiler to enforce the right encoding, maybe by using dedicated types. For instance, something like this:
class Word {
public:
static Word fromUtf8( const char *data ) { return Word( data ); }
const char *toUtf8() const { return m_data; }
private:
Word( const char *data ) : m_data( data ) { }
const char *m_data;
};
I could now use this specialized type in the API:
Word getTranslatedWord( const Word &englishWord );
Unfortunately, it's easy to make this very inefficient. The Word class lacks proper copy constructors, assignment operators, etc., and I'd like to avoid unnecessary copying of data as much as possible. Also, I see the danger that Word gets extended with more and more utility functions (like length, fromLatin1, or substr) and I'd rather not write Yet Another String Class. I just want a little container which avoids accidental encoding mixups.
I wonder whether anybody else has some experience with this and can share some useful techniques.
EDIT: In my particular case, the API is used on Windows and Linux using MSVC 6 - MSVC 10 on Windows and gcc 3 & 4 on Linux.
You could pass around a std::pair instead of a char*:
struct utf8_tag_t{} utf8_tag;
std::pair<const char*,utf8_tag_t> getTranslatedWord(std::pair<const char*,utf8_tag_t> englishWord);
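The generated machine code should be close to identical on a decent modern compiler. Note, however, that std::pair keeps first and second as ordinary data members, so the empty base class optimization does not apply to it and the pair is usually larger than a bare pointer.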
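If the size overhead matters, a thinner variant of the same tagging idea is a class that holds nothing but the pointer. The following is only a sketch under assumptions: the names Utf8Ptr and TaggedPair are hypothetical and not part of the original answer, and the wrapper does not own the string, so the caller has to keep the data alive. It sticks to C++98 so that it should also build on older compilers.

```cpp
#include <utility>

struct utf8_tag_t {};

// Hypothetical thin wrapper: holds only a pointer, so copying it is as cheap
// as copying a const char*. The compiler-generated copy constructor and
// assignment operator are sufficient because no ownership is involved.
class Utf8Ptr {
public:
    // explicit: a plain const char* does not convert silently, so callers
    // have to state at the call site that the data is UTF-8.
    explicit Utf8Ptr(const char *data) : m_data(data) {}
    const char *get() const { return m_data; }
private:
    const char *m_data;
};

// The std::pair variant from the answer, for size comparison.
typedef std::pair<const char *, utf8_tag_t> TaggedPair;

// Example signature using the wrapper (declaration only).
Utf8Ptr getTranslatedWord(Utf8Ptr englishWord);
```

On common implementations sizeof(Utf8Ptr) equals sizeof(const char *), while sizeof(TaggedPair) is larger because the empty tag still occupies a member slot.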
I don't bother with this though. I'd just use char*s and document that the input has to be UTF-8. If the data could come from an untrusted source, you're going to have to check the encoding at runtime anyway.
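For the runtime check, a well-formedness test over the bytes is usually enough. Here is a minimal sketch, assuming NUL-terminated input; the function name isValidUtf8 is hypothetical and not taken from any library. It rejects stray continuation bytes, truncated sequences, overlong encodings, surrogates, and values above U+10FFFF.

```cpp
// Hypothetical helper: returns true if the NUL-terminated byte string is
// well-formed UTF-8.
bool isValidUtf8(const char *s)
{
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes;
    // anything below that is an overlong encoding.
    static const unsigned long minValue[4] = { 0, 0x80, 0x800, 0x10000 };

    const unsigned char *p = reinterpret_cast<const unsigned char *>(s);
    while (*p) {
        unsigned char c = *p++;
        int trailing;
        unsigned long cp;

        if (c < 0x80) {
            continue;                        // plain ASCII byte
        } else if ((c & 0xE0) == 0xC0) {
            trailing = 1; cp = c & 0x1F;     // lead byte of 2-byte sequence
        } else if ((c & 0xF0) == 0xE0) {
            trailing = 2; cp = c & 0x0F;     // lead byte of 3-byte sequence
        } else if ((c & 0xF8) == 0xF0) {
            trailing = 3; cp = c & 0x07;     // lead byte of 4-byte sequence
        } else {
            return false;                    // stray continuation or invalid lead byte
        }

        for (int i = 0; i < trailing; ++i) {
            unsigned char t = *p;
            if ((t & 0xC0) != 0x80)
                return false;                // also catches a premature NUL
            cp = (cp << 6) | (t & 0x3F);
            ++p;
        }

        if (cp < minValue[trailing]) return false;            // overlong
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;       // surrogate
        if (cp > 0x10FFFF) return false;                      // out of range
    }
    return true;
}
```

A plugin host could call this on every string returned by a plugin and reject or log the ones that fail, which catches a Latin-1 string such as "\xE9" that a type tag alone would let through.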
I suggest that you use std::wstring. Check out this other question for details.
The ICU project provides a Unicode support library for C++.