In C, how can one convert HTML strings to C strings?
Is there a common routine or library available?
e.g. '
has to become '
开发者_开发技巧.
This isn't particularly hard, assuming you only care about &#xx;
style entities. The bare-bones, let-everyone-else-worry-about-the-memory-management, mechanical, what's-a-regex way:
int hex_to_value(char hex) {
if (hex >= '0' && hex <= '9') { return hex - '0'; }
if (hex >= 'A' && hex <= 'F') { return hex - 'A' + 10; }
if (hex >= 'a' && hex <= 'f') { return hex - 'f' + 10; }
return -1;
}
void unescape(char* dst, const char* src) {
// Write the translated version of the text at 'src', to 'dst'.
// All sequences of '&#xx;', where x is a hex digit, are replaced
// with the corresponding single byte.
enum { NONE, AND, AND_HASH, AND_HASH_EX, AND_HASH_EX_EX } mode;
char first_hex, second_hex, translated;
mode m = NONE;
while (*src) {
char c = *src++;
switch (m) {
case NONE:
if (c == '&') { m = AND; }
else { *dst++ = c; m = NONE; }
break;
case AND:
if (c == '#') { m = AND_HASH; }
else { *dst++ = '&'; *dst++ = c; m = NONE; }
break;
case AND_HASH:
translated = hex_to_value(c);
if (translated != -1) { first_hex = c; m = AND_HASH_EX; }
else { *dst++ = '&'; *dst++ = '#'; *dst++ = c; m = NONE; }
break;
case AND_HASH_EX:
translated = hex_to_value(c);
if (translated != -1) {
second_hex = c;
translated = hex_to_value(first_hex) << 4 | translated;
m = AND_HASH_EX_EX;
} else {
*dst++ = '&'; *dst++ = '#'; *dst++ = first_hex; *dst++ = c;
m = NONE;
}
break;
case AND_HASH_EX_EX:
if (c == ';') { *dst++ = translated; }
else {
*dst++ = '&'; *dst++ = '#';
*dst++ = first_hex; *dst++ = second_hex; *dst++ = c;
}
m = NONE;
break;
}
}
}
Tedious, and way more code than seems reasonable, but not hard :)
I'd try to parse the number out from the string and then convert it to a number using atoi
and then cast it to a character.
This is something I wrote in ~20 seconds so it's completely contrived:
char html[] = "'";
char* pch = &html[2];
int n = 0;
char c = 0;
pch[2] = '\0';
n = atoi(pch);
c = n;
now c is '
. Also I don't really know about html strings... so I might be missing something
There is "GNU recode" - command line program and a library. http://recode.progiciels-bpi.ca/index.html
Among other things it can encode/decode HTML characters.
精彩评论