Fastest way to filter punctuation in C

I need to filter punctuation from UTF-8 strings quickly in C. The strings could be long and they are quite numerous. The function I'm using currently seems very inefficient:

char *filter(char *mystring){
    char *p;
    while ((p = strchr(mystring,'.')) != NULL)
        strcpy(p, p+1);
    while ((p = strchr(mystring,',')) != NULL)
        ...etc etc etc...
    ...etc...
    return mystring;
}

As you can see it iterates through the string for each punctuation mark. Is there a simple library function that can complete this efficiently for all punctuation marks?


A more efficient algorithm is:

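#include <ctype.h>

/* Remove punctuation in place with a single pass over the string. */
char *filter(char *mystring)
{
    char *in = mystring;   /* read position */
    char *out = mystring;  /* write position, never ahead of in */

    do {
        /* Cast to unsigned char: passing a negative char to ispunct() is undefined. */
        if (!ispunct((unsigned char)*in))
            *out++ = *in;
    } while (*in++);       /* the terminating '\0' is copied too */

    return mystring;
}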

It isn't UTF-8 specific, though: ispunct() classifies single bytes according to the current locale. (Your original wasn't UTF-8 specific either.)

If you wish to make it UTF-8 aware, you could replace ispunct() with a function that takes a char * and determines whether it starts with a (potentially multi-byte) UTF-8 character that's some kind of punctuation mark (and call it with in instead of *in). The loop would then also have to copy or skip the whole multi-byte sequence rather than a single byte.
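
A minimal sketch of that idea, under some loud assumptions: the names filter_utf8, utf8_decode and is_punct_cp are made up for illustration, and is_punct_cp only recognises ASCII punctuation plus a small hand-picked set of Unicode code points (a few Latin-1 marks, U+2010 to U+2027, U+3001 to U+3003). It is not a complete Unicode punctuation test. Malformed or truncated sequences are copied through one byte at a time rather than rejected.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Decode the code point starting at s and store its byte length in *len.
   Malformed or truncated sequences yield U+FFFD with a length of 1, so the
   caller always advances and never reads past the terminating '\0'. */
static unsigned long utf8_decode(const unsigned char *s, size_t *len)
{
    unsigned long cp;
    size_t n, i;

    if (s[0] < 0x80)                { *len = 1; return s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { n = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; cp = s[0] & 0x07; }
    else                            { *len = 1; return 0xFFFD; }

    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) { *len = 1; return 0xFFFD; }
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    *len = n;
    return cp;
}

/* Hypothetical classifier: ASCII via ispunct(), plus an illustrative
   (deliberately incomplete) set of non-ASCII punctuation. */
static int is_punct_cp(unsigned long cp)
{
    if (cp < 0x80)
        return ispunct((int)cp) != 0;
    return cp == 0x00A1 || cp == 0x00AB || cp == 0x00BB || cp == 0x00BF
        || (cp >= 0x2010 && cp <= 0x2027)   /* dashes, quotes, ellipsis */
        || (cp >= 0x3001 && cp <= 0x3003);  /* CJK comma, full stop, ditto */
}

/* Single-pass, in-place filter that works on whole UTF-8 sequences. */
char *filter_utf8(char *mystring)
{
    unsigned char *in = (unsigned char *)mystring;
    unsigned char *out = in;

    while (*in) {
        size_t len;
        unsigned long cp = utf8_decode(in, &len);

        if (!is_punct_cp(cp)) {
            memmove(out, in, len);  /* regions may overlap, so not memcpy */
            out += len;
        }
        in += len;
    }
    *out = '\0';
    return mystring;
}
```

Like the byte-wise version, this modifies its argument, so it must be called on a writable buffer (a char array, not a string literal). For real coverage of Unicode punctuation, is_punct_cp would need to be backed by proper character-property data instead of the hard-coded ranges above.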


The ICU libraries have C bindings, and include a regex library that correctly handles Unicode \pP punctuation.
