Code to strip diacritical marks using ICU
Can somebody please provide some sample code to strip diacritical marks from a UnicodeString using the ICU library in C++? By that I mean replacing characters that have accents, umlauts, etc., with their unaccented equivalents, e.g., every accented é would become a plain ASCII e. E.g.:
UnicodeString strip_diacritics( UnicodeString const &s ) {
    UnicodeString result;
    // ...
    return result;
}
Assume that s has already been normalized. Thanks.
ICU lets you transliterate a string using a specific rule. My rule is NFD; [:M:] Remove; NFC: decompose, remove diacritics, recompose. The following code takes a UTF-8 std::string as input and returns another UTF-8 std::string:
#include <memory>
#include <string>
#include <unicode/utypes.h>
#include <unicode/unistr.h>
#include <unicode/translit.h>

std::string desaxUTF8(const std::string& str) {
    // UTF-8 std::string -> UTF-16 UnicodeString
    icu::UnicodeString source = icu::UnicodeString::fromUTF8(icu::StringPiece(str));

    // Create the transliterator; the caller owns the returned object,
    // so hold it in a unique_ptr to avoid leaking it
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::Transliterator> accentsConverter(
        icu::Transliterator::createInstance(
            "NFD; [:M:] Remove; NFC", UTRANS_FORWARD, status));
    if (U_FAILURE(status) || !accentsConverter) {
        // Minimal error handling: return the input unchanged
        return str;
    }

    // Transliterate UTF-16 UnicodeString in place
    accentsConverter->transliterate(source);

    // UTF-16 UnicodeString -> UTF-8 std::string
    std::string result;
    source.toUTF8String(result);
    return result;
}
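To make the middle step of that rule more concrete, here is a rough, ICU-free sketch of what "remove the marks" amounts to once the text is decomposed. The helper name strip_combining_marks is made up for illustration, and it only covers the Combining Diacritical Marks block (U+0300..U+036F), which is just a subset of what [:M:] matches. It also assumes the input is valid UTF-8 that is already in NFD, so treat it as an illustration rather than a replacement for the transliterator:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of ICU): drops code points in the
// Combining Diacritical Marks block (U+0300..U+036F) from UTF-8 text.
// Assumes valid UTF-8 that is already decomposed (NFD); precomposed
// characters such as U+00E9 are left untouched.
std::string strip_combining_marks(const std::string& utf8) {
    std::string out;
    out.reserve(utf8.size());
    for (std::size_t i = 0; i < utf8.size(); ) {
        const unsigned char b0 = static_cast<unsigned char>(utf8[i]);

        // Derive the sequence length from the lead byte.
        std::size_t len = 1;
        if (b0 >= 0xF0) len = 4;
        else if (b0 >= 0xE0) len = 3;
        else if (b0 >= 0xC0) len = 2;

        // A truncated trailing sequence is copied through as is.
        if (i + len > utf8.size()) len = utf8.size() - i;

        // Every code point in U+0300..U+036F is a two-byte sequence.
        bool is_mark = false;
        if (len == 2) {
            const unsigned char b1 = static_cast<unsigned char>(utf8[i + 1]);
            const unsigned cp = ((b0 & 0x1Fu) << 6) | (b1 & 0x3Fu);
            is_mark = (cp >= 0x0300u && cp <= 0x036Fu);
        }

        if (!is_mark) out.append(utf8, i, len);
        i += len;
    }
    return out;
}
```

Given a decomposed "e" followed by U+0301 (bytes 0x65 0xCC 0x81), this returns a plain "e". A precomposed é (bytes 0xC3 0xA9) passes through unchanged, which is why the rule above decomposes with NFD first.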
After more searching elsewhere:
UErrorCode status = U_ZERO_ERROR;
UnicodeString result;
// 's16' is the UTF-16 string to have diacritics removed
Normalizer::normalize( s16, UNORM_NFKD, 0, result, status );
if ( U_FAILURE( status ) ) {
    // complain
}
// code to convert the normalized UTF-16 'result' to UTF-8 std::string 's8' elided
string buf8;
buf8.reserve( s8.length() );
for ( string::const_iterator i = s8.begin(); i != s8.end(); ++i ) {
    char const c = *i;
    // keep only ASCII bytes; the cast avoids passing a negative value to isascii()
    if ( isascii( static_cast<unsigned char>( c ) ) )
        buf8.push_back( c );
}
// result is in buf8
which is O(n).
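One caveat worth spelling out: the filtering loop keeps only ASCII bytes, so it drops everything that is not ASCII after decomposition, not just the combining marks. The sketch below isolates that step so the effect is easy to see. The name keep_ascii is made up here, and the input is assumed to be UTF-8 that has already gone through NFKD:

```cpp
#include <string>

// Hypothetical helper: keeps only bytes below 0x80, i.e. the ASCII part
// of a UTF-8 string. Assumes the input was already decomposed (NFKD), so
// that accented Latin letters appear as base letter + combining mark.
std::string keep_ascii(const std::string& utf8) {
    std::string out;
    out.reserve(utf8.size());
    for (unsigned char c : utf8) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        }
    }
    return out;
}
```

For decomposed Latin text this gives the expected result, since only the combining marks are non-ASCII. For text in other scripts (Cyrillic, Greek, CJK, and so on) every byte is non-ASCII, so the whole string is removed, and letters with no decomposition such as ß or ø vanish as well. If that matters, the transliterator-based approach is probably the safer choice.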