Char with accent to char without accent in C

Hey guys. Simple question: how do I remove accents from a char? For example, ã -> a and é -> e. I asked in another question how to convert UTF-8 to ASCII, but that turns out to be unnecessary, since I only need to handle these cases.

I tried:

char comando;
if (comando == 'ç' || comando == 'Ç') {
    comando = 'c';
    return comando;
}

But it gives me this error: "comparison is always false due to limited range of data type".

I can't be certain which version of GCC my teacher will use to compile my program, but she will run it on Linux (probably Ubuntu). And I can't use the standard lib. :(

Thanks!


In supplement to the other answers, try this for size:

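#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    /* calloc zero-fills, so both buffers start out null-terminated. */
    wchar_t* x = calloc(100, sizeof(wchar_t));
    char*    y = calloc(100, sizeof(char));

    if (x == NULL || y == NULL)
        return 1;

    /* mbstowcs() decodes according to the current locale; without this
       call the program stays in the "C" locale. */
    setlocale(LC_ALL, "");

    printf("Input something: ");
    fread(y, 1, 99, stdin);   /* at most 99 bytes, so the terminator survives */

    mbstowcs(x, y, 99);       /* leave room for the terminating wide null */

    if ( x[0] == L'è' )       /* == compares; = would assign and always be true */
    {
        printf("Ohhh, french character!\n");
    }

    free(y); free(x);

    return 0;
}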

This code shows you how to convert a multi-byte string you have read in into a wide character string. From there, you can handle nearly every character that exists (theoretically at least).

Having done this, you simply need a map from each character to its replacement, so that you can walk the string and map each character to something else. See the other answers for this.

Some notes: firstly, I've deliberately used fread() on stdin; press ctrl+D when you are done typing input. fread() reads at most the number of bytes you ask for, which avoids the buffer overflow you would be vulnerable to with an unbounded scanf("%s") if you passed the result on to a function (see NOP sled).

Secondly, I have blindly assumed that the input in y will be mostly single-byte characters. If the multi-byte string uses two bytes per character, 100 chars decode to only 50 wchar_t characters. I could check lengths and so on too, but that's beyond the scope of this example.
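
For what it's worth, the length check mentioned above could look roughly like the sketch below. The function name to_wide is made up for illustration. It relies on mbstowcs() accepting a null destination to report the required length, which is what POSIX and glibc document; treat that as an assumption on other platforms. It also assumes the caller has already set the locale.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a multibyte string to a newly allocated
   wide string that is exactly large enough. Returns NULL if the input is
   not valid in the current locale or if allocation fails. The caller
   must free() the result. */
wchar_t *to_wide(const char *mb)
{
    /* With a null destination, mbstowcs() only counts the wide
       characters needed (POSIX behaviour, assumed here). */
    size_t n = mbstowcs(NULL, mb, 0);
    if (n == (size_t)-1)
        return NULL;

    wchar_t *w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return NULL;

    mbstowcs(w, mb, n + 1);   /* n characters plus the terminator */
    return w;
}
```

Sizing the buffer from the first call means you no longer have to guess how many bytes each character takes.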


The C standard says that character constants such as 'ç' are integer constants:

§6.4.4.4/9

An integer character constant has type int. The value of an integer character constant containing a single character that maps to a single-byte execution character is the numerical value of the representation of the mapped character interpreted as an integer.

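If the char type is signed on your machine (it is on x86 Linux with GCC), then when comando contains 'ç' and is promoted to int, it becomes a negative integer, whereas the constant 'ç' is a positive integer. Hence the warning from the compiler. Note also that if your source file is saved as UTF-8, 'ç' occupies two bytes, so GCC treats it as a multi-character constant whose value no char can hold at all.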
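
A small sketch of the promotion issue, assuming a single-byte encoding such as Latin-1 where 'ç' is the byte 0xE7. Both helper names are invented for the illustration.

```c
/* Hypothetical helper: store a byte in a plain char and let it promote
   to int. Where char is signed, bytes above 0x7F typically come back
   negative (the conversion is implementation-defined). */
int as_plain_char(unsigned char byte)
{
    char c = (char)byte;
    return c;
}

/* Hypothetical helper: go through unsigned char first, so the result is
   always in the range 0..255 regardless of the signedness of char. */
int as_unsigned_char(char c)
{
    return (unsigned char)c;
}
```

On a machine with a signed 8-bit char, as_plain_char(0xE7) usually yields -25, while as_unsigned_char((char)0xE7) yields 231 everywhere. Comparing against a numeric value after the unsigned char conversion sidesteps the warning.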


For an 8-bit character set, by far the fastest way to do such an operation is to create a table of 256 bytes, where each position contains the unaccented version of the character.

int unaccented(int c)
{
     static const char map[256] =
     {
          '\x00', '\x01', ...
          ...
          '0',    '1',    '2', ...
          ...
          'A',    'B',    'C', ...
          ...
          'a',    'b',    'c', ...
          ...
          'A',    'A',    'A', ... // 0xC0 onwards...
          ...
          'a',    'a',    'a', ... // 0xE0 onwards...
          ...
     };
     if (c < 0 || c > 255)
         return EOF;
     else
         return map[c];
}

Of course, you'd write a program - probably a script - to generate the table of data, rather than doing it manually. In the range 0..127, the character at position x is the character with code x (so map['A'] == 'A').

If you are allowed to exploit C99, you can improve the table by using designated initializers:

static const char map[] =
{
    ['\x00'] = '\x00', ...
    ['A']    = 'A', ...
    ['a']    = 'a', ...
    ['å']    = 'a', ...
    ['Å']    = 'A', ...
    ['ÿ']    = 'y', ...
};
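
Here is a compilable sketch of the same idea, assuming Latin-1 (ISO 8859-1) input. It differs from the fragment above in two ways that are my own choices: the indices are written as numeric codes, because an index such as ['å'] depends on the source file encoding and can be negative where char is signed, and unlisted entries are left at zero to mean "return the input unchanged". The name unaccent_latin1 is hypothetical and only a subset of letters is listed.

```c
#include <stdio.h>   /* for EOF */

/* Sketch: map a Latin-1 byte to its unaccented ASCII letter.
   Entries not listed are zero-initialized, meaning "no mapping". */
int unaccent_latin1(int c)
{
    static const char map[256] = {
        [0xC0] = 'A', [0xC1] = 'A', [0xC2] = 'A',
        [0xC3] = 'A', [0xC4] = 'A', [0xC5] = 'A',
        [0xC7] = 'C',
        [0xC8] = 'E', [0xC9] = 'E', [0xCA] = 'E', [0xCB] = 'E',
        [0xE0] = 'a', [0xE1] = 'a', [0xE2] = 'a',
        [0xE3] = 'a', [0xE4] = 'a', [0xE5] = 'a',
        [0xE7] = 'c',
        [0xE8] = 'e', [0xE9] = 'e', [0xEA] = 'e', [0xEB] = 'e',
        [0xFF] = 'y',
    };

    if (c < 0 || c > 255)
        return EOF;

    /* Fall back to the input when the table has no entry. */
    return map[c] != 0 ? map[c] : c;
}
```

Call it as unaccent_latin1((unsigned char)ch) so that a signed char never produces a negative index.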

It isn't entirely clear what you should do with letters such as 'æ' or 'ß', which have no single-character ASCII equivalent; however, the simple rule of 'when in doubt, do not change it' can be applied sensibly. They aren't accented characters, but neither are they ASCII characters.

This does not work so well for UTF-8. For that, you need more specialized tables driven from data in the Unicode standard.
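
If the input only contains accented Latin letters from the range U+00C0..U+00FF, a much smaller sketch may be enough, and it needs no library calls. In UTF-8 every one of those characters is the lead byte 0xC3 followed by one byte in 0x80..0xBF. The function name strip_accents_utf8 and the 64-entry table are assumptions of this example, not data taken from the Unicode standard; anything outside that range is copied through untouched.

```c
/* Sketch: rewrite a UTF-8 string in place, replacing accented letters
   in U+00C0..U+00FF with plain ASCII. The output is never longer than
   the input, so no extra buffer is needed. */
void strip_accents_utf8(char *s)
{
    /* Entry i is the replacement for code point 0xC0 + i;
       0 means "leave this character as it is". */
    static const char fold[64] = {
        'A','A','A','A','A','A', 0 ,'C','E','E','E','E','I','I','I','I',
         0 ,'N','O','O','O','O','O', 0 , 0 ,'U','U','U','U','Y', 0 , 0 ,
        'a','a','a','a','a','a', 0 ,'c','e','e','e','e','i','i','i','i',
         0 ,'n','o','o','o','o','o', 0 , 0 ,'u','u','u','u','y', 0 ,'y'
    };
    unsigned char *in  = (unsigned char *)s;
    unsigned char *out = (unsigned char *)s;

    while (*in != 0) {
        /* Two-byte sequence 0xC3 0x80..0xBF encodes U+00C0..U+00FF. */
        if (in[0] == 0xC3 && in[1] >= 0x80 && in[1] <= 0xBF) {
            char ascii = fold[in[1] & 0x3F];
            if (ascii != 0) {
                *out++ = (unsigned char)ascii;
                in += 2;
                continue;
            }
        }
        *out++ = *in++;   /* copy everything else byte for byte */
    }
    *out = 0;
}
```

For example, a buffer holding the UTF-8 bytes of "ação" ends up holding "acao", while 'ß' and 'æ' stay as they were.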

Also note that you should coerce any 'char' value to 'unsigned char' before calling this. That said, the code could also attempt to deal with callers who pass a plain (possibly negative) char. However, it is hard to distinguish 'ÿ' (0xFF) from EOF when people are not careful in calling the function. The C standard character test macros are required to support all valid character values (when converted to unsigned char) and EOF as inputs; this function follows that design.

§7.4/1

In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.


You mentioned in another similar question that this was easy enough to do in other languages that you know. If I were in your position, had to do this in C, and couldn't find a good way to do it with available C code, I would write a program in another language to generate a C function that does the conversion. As long as you can cycle through all the characters this shouldn't be too difficult, though the generated code may be large. I'd probably do this for UTF-16, and just have a simple wrapper function that takes UTF-8, converts it to UTF-16, and calls the conversion function.

The conversion function would just be made of a very large switch/case statement, and the default case would be for characters that didn't convert.
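
To give an idea of the shape, here is a hand-written sample of what such a generated function might look like. The name remove_accent is hypothetical, the handful of cases are just examples, and a real generated version would have far more of them.

```c
/* Sketch of a generated conversion function. Takes one UTF-16 code
   unit (or code point) and returns the unaccented letter, or the input
   itself when there is no conversion. */
unsigned remove_accent(unsigned code_unit)
{
    switch (code_unit) {
    case 0x00C7:            /* Ç */
    case 0x0106:            /* Ć */
        return 'C';
    case 0x00E7:            /* ç */
    case 0x0107:            /* ć */
        return 'c';
    case 0x00C9:            /* É */
        return 'E';
    case 0x00E9:            /* é */
        return 'e';
    case 0x00E3:            /* ã */
        return 'a';
    case 0x0160:            /* Š */
        return 'S';
    case 0x0161:            /* š */
        return 's';
    default:                /* characters that don't convert */
        return code_unit;
    }
}
```

The compiler will typically turn a dense switch like this into a jump table or a series of range checks, so you don't need to hand-optimize it.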
