Removing bytes in a dump or UTF-8 in C

2023-03-19 07:51 Q&A author:

I have a C program in my fire station that captures incoming packets to the station printer. The program then scans each packet and sends an audible alert for which apparatus is due on the call. The county recently started using UTF-8 packets, and the C program cannot deal with all the extra "00" bytes in the data flow. I need to either ignore the 00 bytes or set the program up to handle UTF-8. I have looked for days, and there is nothing concrete on how to handle UTF-8 that a novice such as myself can follow. Below is the interpreting part of the program.

72 00 65 00 61 00 74 00 68 00 69 00 6e 00 67 00 later in packet

43 4f 44 45 53 45 54 3d 55 54 46 38 0a 40 50 4a beginning of packet

void compressUtf16 (char *buff, size_t count) {
    int i;
    for (i = 0; i < count; i++)
        buff[i] = buff[i*2];     // for xx 00 xx 00 xx 00 ...
}

{
u_int i=0;
char *searcher = 0;
char c;
int j;
int locflag;
static int locationtripped = 0;

static char currentline[256];
static int currentlinepos = 0;
static char lastdispatched[256];
static char dispatchstring[256];

char betastring[256];

static int a = 0;
static int e = 0;
static int pe = 0; 
static int md = 0;

static int pulse = 0;

static char location[128];
static char type[16];
static char station[16]; 

static FILE *fp;
static int printoutscanning = 0;
static char printoutID[20];
static char printoutfileID[32];

static FILE *dbg;

if(pulse) {
    if(pulse == 80) {
        sprintf(betastring, "beta a a a");
        printf("betastring: \"%s\"\n", betastring);
        system(betastring);
        pulse = 0; 
    } else
        pulse++;
}

if(header->len > 96) {
    for(i=55; (i < header->caplen + 1 ) ; i++) {
        c = pkt_data[i-1];

        if(c == 13 || c == 10) {
            currentline[currentlinepos] = 0;
            currentlinepos = 0;
            j = strlen(currentline);
            if(j && (j > 1)) { 
                if(strlen(printoutfileID) && printoutscanning) {
                    dbg = fopen(printoutfileID, "a");
                    fprintf(dbg, "%s\n", currentline); 
                    fclose(dbg);
                }

                if(!printoutscanning) {
                    searcher = 0;
                    searcher = strstr(currentline, "INCIDENT HISTORY DETAIL:"); 
                    if(searcher) {
                        searcher = searcher + 26;
                        strncpy(printoutID, searcher, 9);
                        printoutID[9] = 0;
                        printoutscanning = 1; 
                        a = 0;
                        e = 0;
                        pe = 0;
                        md = 0;
                        for(j = 0; j < 128; j++)
                            location[j] = 0;
                        for(j = 0; j < 16; j++) {
                            type[j] = 0;
                            station[j] = 0;
                        }
                        sprintf(printoutfileID, "calls/%s %.6d.txt", printoutID, (int)header->ts.tv_usec);   /* cast: %d expects int */
                        dbg = fopen(printoutfileID, "a");
                        fprintf(dbg, "%s\n", currentline);
                        fclose(dbg);
                    }

UTF-8, except for the zero code point itself, will not have any zero bytes in it. The first byte of every multi-byte encoding (that is, of every non-ASCII code point) always starts with the bit pattern 11, and every subsequent byte always starts with the bit pattern 10.

As you can see from the following table, U+0000 is the only code point that can give you a zero byte in UTF-8.

+----------------+----------+----------+----------+----------+
| Unicode        | Byte 1   | Byte 2   | Byte 3   | Byte 4   |
+----------------+----------+----------+----------+----------+
| U+0000-007F    | 0xxxxxxx |          |          |          |
| U+0080-07FF    | 110yyyxx | 10xxxxxx |          |          |
| U+0800-FFFF    | 1110yyyy | 10yyyyxx | 10xxxxxx |          |
| U+10000-10FFFF | 11110zzz | 10zzyyyy | 10yyyyxx | 10xxxxxx |
+----------------+----------+----------+----------+----------+
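To make the table above concrete, here is a minimal sketch in C that classifies a byte by its top bits and counts the zero bytes in a buffer. The names utf8_classify and count_zero_bytes are made up for illustration; they are not from the original program. If a buffer that claims to be UTF-8 text reports lots of zero bytes, it is very likely not UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper names, for illustration only. */
enum utf8_kind { UTF8_ASCII, UTF8_LEAD, UTF8_CONTINUATION };

/* Classify one byte by its top bits, following the table above.
   Note: 0xF8..0xFF also start with 11 but are not valid UTF-8 lead bytes;
   this sketch does not validate, it only classifies. */
enum utf8_kind utf8_classify(unsigned char b) {
    if ((b & 0x80) == 0x00) return UTF8_ASCII;         /* 0xxxxxxx */
    if ((b & 0xC0) == 0x80) return UTF8_CONTINUATION;  /* 10xxxxxx */
    return UTF8_LEAD;                                  /* 11xxxxxx */
}

/* Count the zero bytes in a buffer. Genuine UTF-8 text gives 0 here
   unless it contains the code point U+0000 itself. */
size_t count_zero_bytes(const unsigned char *buf, size_t len) {
    size_t i, zeros = 0;
    for (i = 0; i < len; i++)
        if (buf[i] == 0)
            zeros++;
    return zeros;
}
```

Running count_zero_bytes over the "later in packet" bytes from the question would report a zero for every second byte, which is the hint that the data is not UTF-8.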

UTF-16 will intersperse zero bytes between your otherwise ASCII bytes, but it's then a simple matter of throwing away every second byte. Whether you keep the bytes at offsets 0, 2, 4, ... or those at 1, 3, 5, ... depends on whether your UTF-16 encoding is little-endian (xx 00 xx 00 ...) or big-endian (00 xx 00 xx ...).
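If you don't know which byte order you have, one rough way to find out is to count where the zero bytes fall. The sketch below is only a heuristic: it assumes the buffer starts on a two-byte boundary and holds mostly ASCII text, and the function and enum names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical names, for illustration only. */
enum utf16_order { UTF16_UNKNOWN, UTF16_LITTLE_ENDIAN, UTF16_BIG_ENDIAN };

/* Guess the byte order of mostly-ASCII UTF-16 text by comparing how many
   zero bytes sit at even offsets with how many sit at odd offsets.
   Assumes buf starts on a code unit boundary. */
enum utf16_order guess_utf16_order(const unsigned char *buf, size_t len) {
    size_t i, zeros_even = 0, zeros_odd = 0;
    for (i = 0; i + 1 < len; i += 2) {
        if (buf[i] == 0)     zeros_even++;
        if (buf[i + 1] == 0) zeros_odd++;
    }
    if (zeros_odd > zeros_even) return UTF16_LITTLE_ENDIAN;  /* xx 00 xx 00 */
    if (zeros_even > zeros_odd) return UTF16_BIG_ENDIAN;     /* 00 xx 00 xx */
    return UTF16_UNKNOWN;  /* no zero bytes, or evenly split */
}
```

A buffer with no zero bytes at all comes back as UTF16_UNKNOWN, which is what you would expect from plain ASCII or real UTF-8.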

I see from your sample that your data stream does indicate UTF-8 (43 4f 44 45 53 45 54 3d 55 54 46 38 translates to the text CODESET=UTF8) but I'll guarantee you it's lying :-)

The segment 72 00 65 00 61 00 74 00 68 00 69 00 6e 00 67 00 is UTF-16 (little-endian, since each zero byte follows its character byte) for "reathing", presumably a fragment of a longer word since I'm not familiar with that word (in English, anyway).
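Since the packet seems to mix a plain ASCII header with UTF-16 text, the other option raised in the question (simply ignoring the 00 bytes) can be sketched as below. This is a rough workaround that assumes every character in the stream is ASCII; any genuinely non-ASCII UTF-16 character would be mangled. The name strip_zero_bytes is made up for illustration.

```c
#include <stddef.h>

/* Copy len bytes from src to dst, dropping every 0x00 byte.
   Returns the number of bytes written. dst needs room for up to len bytes
   and may be the same buffer as src (the write position never passes the
   read position). Hypothetical helper, ASCII-only assumption. */
size_t strip_zero_bytes(const unsigned char *src, size_t len, unsigned char *dst) {
    size_t i, out = 0;
    for (i = 0; i < len; i++)
        if (src[i] != 0)
            dst[out++] = src[i];
    return out;
}
```

In a byte-at-a-time loop like the one in the question, the same effect would come from skipping any byte that equals 0 before it is examined. It works regardless of byte order, which helps when only part of the packet is UTF-16.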

I would suggest you clarify with whoever is generating that data since it's clearly erroneous. As to how you process the UTF-16, I've covered that above. Provided it's ASCII data in there (the alternate bytes are always zero), you can just throw away those alternates with something like:

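// Process a UTF-16 buffer containing ASCII-only characters.
// buff is the buffer, count is the number of UTF-16 characters
// (so buff must hold at least count*2 bytes).
// Overwrites the first count bytes of buff in place.

#include <stddef.h>   // size_t

void compressUtf16 (char *buff, size_t count) {
    size_t i;          // unsigned, to match count
    for (i = 0; i < count; i++)
        buff[i] = buff[i*2];     // for xx 00 xx 00 xx 00 ...
}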

And if your UTF-16 is in the other byte order (big-endian), simply change:

buff[i] = buff[i*2];     // for xx 00 xx 00 xx 00 ...

into:

buff[i] = buff[i*2+1];   // for 00 xx 00 xx 00 xx ...
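If you would rather not trust that every alternate byte really is zero, a slightly more defensive variant is possible. The sketch below is only one way to do it, not part of the original program: it writes to a separate output buffer, handles either byte order through a flag, and replaces anything outside the ASCII range with '?'. The name utf16_to_ascii and its parameters are hypothetical.

```c
#include <stddef.h>

/* Convert count UTF-16 code units from src into dst as a NUL-terminated
   ASCII string. big_endian is non-zero for 00 xx data, zero for xx 00 data.
   Code units above 0x7F are replaced with '?'.
   src must hold count*2 bytes and dst at least count + 1 bytes.
   Returns how many code units had to be replaced. */
size_t utf16_to_ascii(const unsigned char *src, size_t count,
                      int big_endian, char *dst) {
    size_t i, replaced = 0;
    for (i = 0; i < count; i++) {
        unsigned int first  = src[2 * i];
        unsigned int second = src[2 * i + 1];
        /* Rebuild the 16-bit code unit according to the byte order. */
        unsigned int unit = big_endian ? ((first << 8) | second)
                                       : ((second << 8) | first);
        if (unit <= 0x7F) {
            dst[i] = (char)unit;
        } else {
            dst[i] = '?';   /* not representable as ASCII */
            replaced++;
        }
    }
    dst[count] = '\0';
    return replaced;
}
```

A non-zero return value tells you the ASCII-only assumption did not hold for that buffer, which is worth logging before you rely on the text.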
