How can I avoid encoding mixups of strings in a C/C++ API?

2022-12-31 18:00 Q&A author:

I'm working on implementing different APIs in C and C++ and wondered what techniques are available to keep clients from getting the encoding wrong when they receive strings from the framework or pass them back. For instance, imagine a simple plugin API in C++ which customers can implement to influence translations. It might feature a function like this:

const char *getTranslatedWord( const char *englishWord );

Now, let's say that I'd like to enforce that all strings are passed as UTF-8. Of course I'd document this requirement, but I'd like the compiler to enforce the right encoding, maybe by using dedicated types. For instance, something like this:

class Word {
public:
  static Word fromUtf8( const char *data ) { return Word( data ); }
  const char *toUtf8() const { return m_data; }

private:
  Word( const char *data ) : m_data( data ) { }

  const char *m_data;
};

I could now use this specialized type in the API:

Word getTranslatedWord( const Word &englishWord );

Unfortunately, it's easy to make this very inefficient. The Word class lacks proper copy constructors, assignment operators, etc., and I'd like to avoid unnecessary copying of data as much as possible. Also, I see the danger that Word gets extended with more and more utility functions (like length, fromLatin1, or substr), and I'd rather not write Yet Another String Class. I just want a little container which avoids accidental encoding mixups.
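As a rough illustration of the "little container" idea, here is a minimal sketch of a non-owning wrapper. It assumes the caller keeps the underlying buffer alive, so the compiler-generated copy constructor and assignment operator only copy one pointer. The lookup inside getTranslatedWord is a hypothetical stand-in for a real plugin, and the code avoids C++11 features with the older compilers in mind, though it has not been tested on them.

```cpp
#include <cassert>
#include <cstring>

// Non-owning wrapper: stores only a pointer, so copying a Word copies
// one pointer. The caller must keep the underlying buffer alive.
class Word {
public:
  // The only way in: the caller states explicitly that data is UTF-8.
  static Word fromUtf8( const char *data ) {
    assert( data != 0 );
    return Word( data );
  }

  // The only way out: the caller asks explicitly for UTF-8.
  // Declared const so it can be called through a const Word &.
  const char *toUtf8() const { return m_data; }

private:
  explicit Word( const char *data ) : m_data( data ) { }

  const char *m_data;
};

// Hypothetical plugin implementation; a real one would consult a dictionary.
Word getTranslatedWord( const Word &englishWord ) {
  if ( std::strcmp( englishWord.toUtf8(), "hello" ) == 0 )
    return Word::fromUtf8( "hallo" );
  return englishWord;  // unknown word: hand it back unchanged
}
```

A call then reads getTranslatedWord( Word::fromUtf8( "hello" ) ).toUtf8(), so the encoding is named at every boundary, while passing a bare const char * does not compile because the constructor is private.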

I wonder whether anybody else has some experience with this and can share some useful techniques.

EDIT: In my particular case, the API is used on Windows and Linux using MSVC 6 - MSVC 10 on Windows and gcc 3 & 4 on Linux.

You could pass around a std::pair instead of a char*:

struct utf8_tag_t{} utf8_tag;
std::pair<const char*,utf8_tag_t> getTranslatedWord(std::pair<const char*,utf8_tag_t> englishWord);

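The generated machine code should be close to that of a plain char*, but it is unlikely to be identical: std::pair has to expose both first and second as data members, so it cannot apply the empty base class optimization, and on common implementations the pair is larger than a bare pointer. A struct with a single pointer member and the tag as a template parameter stays pointer-sized.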
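A minimal sketch of the tag idea with a dedicated struct template instead of std::pair is shown below. TaggedCString, latin1_tag_t and the utf8 helper are hypothetical names introduced for illustration. The tag lives only in the type, so the struct holds nothing but the pointer.

```cpp
#include <cstring>
#include <utility>

struct utf8_tag_t {};
struct latin1_tag_t {};

// The tag is a template parameter only; no tag object is stored.
template <class Tag>
struct TaggedCString {
  const char *data;
};

typedef TaggedCString<utf8_tag_t> Utf8CString;
typedef TaggedCString<latin1_tag_t> Latin1CString;

// Helper that makes the encoding claim explicit at the call site.
inline Utf8CString utf8( const char *data ) {
  Utf8CString s = { data };
  return s;
}

// Hypothetical plugin function that simply echoes its input.
Utf8CString getTranslatedWord( Utf8CString englishWord ) {
  return englishWord;
}

// Neither of these would compile, which is the point of the exercise:
//   Latin1CString other = { "x" };
//   getTranslatedWord( other );  // no conversion between the tagged types
//   getTranslatedWord( "x" );    // no conversion from a bare const char *
```

With this layout, sizeof( Utf8CString ) equals sizeof( const char * ), whereas sizeof( std::pair<const char *, utf8_tag_t> ) is larger on the common standard library implementations.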

I don't bother with this, though. I'd just use char*s and document that the input has to be UTF-8. If the data could come from an untrusted source, you're going to have to check the encoding at runtime anyway.
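For the runtime check mentioned above, a minimal sketch of a validator for NUL-terminated input might look like this. isValidUtf8 is a hypothetical helper name, not part of any library. It rejects stray or missing continuation bytes, overlong encodings, surrogate code points and values above U+10FFFF.

```cpp
// Returns true if the NUL-terminated string s is well-formed UTF-8.
bool isValidUtf8( const char *s ) {
  const unsigned char *p = reinterpret_cast<const unsigned char *>( s );
  while ( *p ) {
    unsigned long cp;     // code point being decoded
    unsigned long minCp;  // smallest code point allowed for this length
    int extra;            // number of continuation bytes expected
    if ( *p < 0x80 )                  { cp = *p;        extra = 0; minCp = 0; }
    else if ( ( *p & 0xE0 ) == 0xC0 ) { cp = *p & 0x1F; extra = 1; minCp = 0x80; }
    else if ( ( *p & 0xF0 ) == 0xE0 ) { cp = *p & 0x0F; extra = 2; minCp = 0x800; }
    else if ( ( *p & 0xF8 ) == 0xF0 ) { cp = *p & 0x07; extra = 3; minCp = 0x10000; }
    else return false;  // stray continuation byte or invalid lead byte
    ++p;
    for ( int i = 0; i < extra; ++i, ++p ) {
      // A NUL terminator fails this test too, so a truncated sequence
      // is rejected without reading past the end of the string.
      if ( ( *p & 0xC0 ) != 0x80 ) return false;
      cp = ( cp << 6 ) | ( *p & 0x3F );
    }
    if ( cp < minCp ) return false;                     // overlong encoding
    if ( cp >= 0xD800 && cp <= 0xDFFF ) return false;   // surrogate
    if ( cp > 0x10FFFF ) return false;                  // beyond Unicode range
  }
  return true;
}
```

A framework could call such a function once at the API boundary and reject or report invalid input there, rather than trusting the documentation alone.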

I suggest that you use std::wstring.

Check out this other question for details.

The ICU project provides a Unicode support library for C++.
