
In C, how can one convert HTML strings to C strings?

Is there a common routine or library available?

e.g. &#39; has to become '.


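This isn't particularly hard, assuming you only care about &#xx; style entities, where xx is exactly two hex digits. (Standard HTML reads &#39; as decimal and writes hex references as &#x27;, so adjust the digit handling if you need to decode real HTML.) The bare-bones, let-everyone-else-worry-about-the-memory-management, mechanical, what's-a-regex way: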

int hex_to_value(char hex) {
    if (hex >= '0' && hex <= '9') { return hex - '0'; }
    if (hex >= 'A' && hex <= 'F') { return hex - 'A' + 10; }
    if (hex >= 'a' && hex <= 'f') { return hex - 'a' + 10; }
    return -1;
}

void unescape(char* dst, const char* src) {
    // Write the translated version of the text at 'src', to 'dst'.
    // All sequences of '&#xx;', where x is a hex digit, are replaced
    // with the corresponding single byte.
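    enum { NONE, AND, AND_HASH, AND_HASH_EX, AND_HASH_EX_EX } m = NONE;
    char first_hex = 0, second_hex = 0;
    // int rather than char, so the -1 "not a hex digit" result compares
    // correctly even where plain char is unsigned.
    int translated = 0;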
    while (*src) {
        char c = *src++;
        switch (m) {
            case NONE:
            if (c == '&') { m = AND; }
            else { *dst++ = c; m = NONE; }
            break;

            case AND:
            if (c == '#') { m = AND_HASH; }
            else { *dst++ = '&'; *dst++ = c; m = NONE; }
            break;

            case AND_HASH:
            translated = hex_to_value(c);
            if (translated != -1) { first_hex = c; m = AND_HASH_EX; }
            else { *dst++ = '&'; *dst++ = '#'; *dst++ = c; m = NONE; }
            break;

            case AND_HASH_EX:
            translated = hex_to_value(c);
            if (translated != -1) {
                second_hex = c;
                translated = hex_to_value(first_hex) << 4 | translated;
                m = AND_HASH_EX_EX;
            } else {
                *dst++ = '&'; *dst++ = '#'; *dst++ = first_hex; *dst++ = c;
                m = NONE;
            }
            break;

            case AND_HASH_EX_EX:
            if (c == ';') { *dst++ = translated; }
            else { 
                *dst++ = '&'; *dst++ = '#';
                *dst++ = first_hex; *dst++ = second_hex; *dst++ = c;
            }
            m = NONE;
            break;
        }
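    }
    // Input ended in the middle of a sequence: emit what was buffered.
    if (m != NONE) {
        *dst++ = '&';
        if (m != AND) { *dst++ = '#'; }
        if (m == AND_HASH_EX || m == AND_HASH_EX_EX) { *dst++ = first_hex; }
        if (m == AND_HASH_EX_EX) { *dst++ = second_hex; }
    }
    *dst = '\0';  // terminate the output string
}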

Tedious, and way more code than seems reasonable, but not hard :)
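
For comparison, here is a shorter sketch that scans ahead instead of tracking states, and that follows the usual HTML convention: &#NN; is decimal and &#xHH; is hex. The names unescape_numeric and digit_value are made up for this example, and it assumes you only want code points 1 to 255, each written as a single byte. Anything else (named entities such as &amp;, larger code points, malformed references) is copied through unchanged.

```c
#include <stddef.h>

/* Value of digit c in the given base (10 or 16), or -1 if c is not a digit. */
static int digit_value(char c, int base) {
    if (c >= '0' && c <= '9') return c - '0';
    if (base == 16 && c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (base == 16 && c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: copy src to dst, replacing "&#NN;" and "&#xHH;"
 * with the single byte they name. dst needs room for strlen(src) + 1
 * bytes, since the output is never longer than the input. */
void unescape_numeric(char *dst, const char *src) {
    while (*src) {
        if (src[0] == '&' && src[1] == '#') {
            const char *p = src + 2;
            int base = 10;
            int value = 0, ndigits = 0, d;

            if (*p == 'x' || *p == 'X') { base = 16; p++; }

            /* Stop accumulating once the value no longer fits in a byte. */
            while (value <= 255 && (d = digit_value(*p, base)) >= 0) {
                value = value * base + d;
                ndigits++;
                p++;
            }

            if (ndigits > 0 && *p == ';' && value >= 1 && value <= 255) {
                *dst++ = (char)value;
                src = p + 1;   /* skip past the ';' */
                continue;
            }
        }
        /* Not a reference we handle: copy the character as-is. */
        *dst++ = *src++;
    }
    *dst = '\0';
}
```

Calling unescape_numeric(out, "it&#39;s") with a large enough out buffer should leave "it's" in out. Rejecting values above 255 is a choice made for this sketch; real HTML references can name any Unicode code point, which would need a multi-byte encoding such as UTF-8 on output.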


I'd parse the digits out of the string, convert them to a number with atoi, and then cast the result to a character.

This is something I wrote in ~20 seconds so it's completely contrived:

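  char html[] = "&#39;";
  char* pch = &html[2];  /* skip "&#"; pch now points at "39;" */
  int n = 0;
  char c = 0;

  pch[2] = '\0';         /* overwrite ';' so only "39" remains */
  n = atoi(pch);         /* atoi is declared in <stdlib.h> */
  c = n;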

Now c is ' (the apostrophe, character code 39). I don't know HTML strings well, though, so I may be missing something.
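
The same idea can be sketched with strtol instead of atoi, which makes it possible to check that the digits really end at a ';' and leaves the input string unmodified. parse_decimal_ref is a hypothetical name, and the 1 to 255 range check is an assumption that the result must fit in a single byte.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical helper: if s starts with a decimal reference such as
 * "&#39;", store the corresponding byte in *out and return 1.
 * Otherwise return 0 and leave *out untouched. s is not modified. */
int parse_decimal_ref(const char *s, char *out) {
    char *end;
    long n;

    if (s[0] != '&' || s[1] != '#') return 0;

    /* strtol would also accept leading spaces or a sign, so insist
     * that the first character after "&#" is a digit. */
    if (!isdigit((unsigned char)s[2])) return 0;

    n = strtol(s + 2, &end, 10);

    /* The digits must be followed by ';' and fit in one byte. */
    if (*end != ';' || n < 1 || n > 255) return 0;

    *out = (char)n;
    return 1;
}
```

With char c; a call like parse_decimal_ref("&#39;", &c) should return 1 and set c to the apostrophe, while input such as "&#39" (no semicolon) returns 0.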


There is GNU recode, which is both a command-line program and a library: http://recode.progiciels-bpi.ca/index.html

Among other things it can encode/decode HTML characters.
