How can I display Unicode strings while debugging on Linux?
I have been working for some years now as a C++ developer using MS Visual Studio as my working platform. Since I privately prefer Linux, I recently took the chance to move my working environment to Linux as well. Because I have been optimizing my Windows environment for several years, it of course turns out that several things are missing or not working as expected. Thus I have some questions for which I could not come up with useful answers yet.
Let's start with the following problem; different questions will probably follow later. It is something I have already stumbled upon several times, whenever I was forced to debug platform-specific bugs on non-Windows platforms.
Simply speaking: How can I display Unicode (UCS-2 encoded) strings while debugging on Linux?
Now some more details I have figured out so far. Our lib internally uses a Unicode-based string class, which encodes every character as a 16-bit Unicode value (we do not support multi-word encodings, so we can basically only use the UCS-2 encodable subset of UTF-16, but this covers nearly all scripts in use anyway).
This already poses one problem: most platforms (i.e. Linux/Unix) define wchar_t as 4 bytes, while on Windows it is only 2 bytes. Thus I cannot simply cast the internal string buffer to wchar_t*, and I am not sure whether that would really help any debugger anyway.
For gdb I have figured out that I can call functions from the debugged code to print debug messages. Thus I inserted a special function into our lib that can arbitrarily transform the string data and write it to a new buffer. Currently I transcode our internal buffer to UTF-8, since I expect this to be the most likely encoding to work.
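A minimal sketch of such a gdb-callable helper, assuming the internal buffer is a NUL-terminated array of 16-bit UCS-2 units (the function name, signature, and buffer layout are placeholders, not the questioner's actual API):

#include <stddef.h>

/* Transcode NUL-terminated UCS-2 into a static UTF-8 buffer, so gdb
   can print it with: call debug_utf8(myString.buffer)               */
const char *debug_utf8(const unsigned short *ucs2)
{
    static char out[4096];
    size_t o = 0;
    for (; *ucs2 && o + 4 < sizeof out; ++ucs2) {
        unsigned c = *ucs2;                      /* UCS-2: no surrogates */
        if (c < 0x80) {                          /* 1-byte sequence */
            out[o++] = (char)c;
        } else if (c < 0x800) {                  /* 2-byte sequence */
            out[o++] = (char)(0xC0 | (c >> 6));
            out[o++] = (char)(0x80 | (c & 0x3F));
        } else {                                 /* 3-byte sequence */
            out[o++] = (char)(0xE0 | (c >> 12));
            out[o++] = (char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (c & 0x3F));
        }
    }
    out[o] = '\0';
    return out;
}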
But so far this solves the problem only partially: if the string is Latin, I now get readable output (whereas one cannot directly print the Latin data while it is 16-bit encoded), but I also have to deal with other scripts (e.g. CJK (a.k.a. Hanzi / Kanji), Cyrillic, Greek, ...), and by dealing I mean I have to specifically debug data using such scripts, since the scripts used directly influence the control flow. Of course, in these cases I only see the ISO characters that correspond to the multiple bytes making up a UTF-8 character, which makes debugging CJK data even more cryptic than correctly displayed strings would be.
Generally, gdb allows setting several host and target encodings, so it should be possible to send a correctly encoded UTF-8 data stream to the console.
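For example, assuming a UTF-8 terminal, the charsets can be aligned like this (exact charset names depend on the gdb build):

(gdb) set host-charset UTF-8
(gdb) set target-charset UTF-8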
But of course I'd prefer to use an IDE for debugging. Currently I am trying to make friends with Eclipse and CDT, but for debugging I have also tested KDbg. In both applications I could so far only obtain incorrectly decoded UTF-8 data. On the other hand, I once debugged a Java project in Eclipse on a Windows platform and all internal strings were displayed correctly (though that application was not using our lib and the corresponding strings), so at least in some situations Eclipse can display Unicode characters correctly.
The most annoying point for me is that so far I could not even find proof that displaying true Unicode data (i.e. non-ISO characters) works in any setup on Linux (even the gdb scripts for QStrings I have found seem to only display Latin characters and skip the remainder). But nearly every Linux application seems to support Unicode data, so there must be people around who debug true Unicode data on Linux platforms, and I really cannot imagine that they are all reading hex codes instead of directly displaying Unicode strings.
Thus any pointers to setups that allow debugging of Unicode strings, based on any other string classes (e.g. QString) and/or IDEs, would also be appreciated.
The simple script wchar.gdb mentioned in Charles Salvia's answer has helped me, but a few years later it was hard to find (the link in the article is broken), therefore I'll paste it here. The script also demonstrates some barely known macro capabilities built into gdb.
# Walk the wide string element by element, truncating each element to
# its low byte, until a NUL terminator is reached.
define wchar_print
  echo "
  set $i = 0
  while (1 == 1)
    set $c = (char)(($arg0)[$i++])
    if ($c == '\0')
      loop_break
    end
    printf "%c", $c
  end
  echo "\n
end
document wchar_print
wchar_print <wstr>
Print ASCII part of <wstr>, which is a wide character string of type wchar_t*.
end
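Usage, assuming the script was saved as wchar.gdb and the debugged program has a wchar_t* variable named wstr (both names are placeholders):

(gdb) source wchar.gdb
(gdb) wchar_print wstr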
Most Linux distros tend to have excellent Unicode support. However, I would say that using UTF-16 on Linux is a mistake. I realize this would be natural, coming from a Windows environment, but it will just make things more difficult for you on Linux.
As long as your locale is set to Unicode, it's trivial to output UTF-32 strings (wchar_t strings) using wprintf or wcout, and of course you can output UTF-8 strings using normal output facilities. However, with UTF-16 you are essentially limited to building a custom string class that uses int16_t, which, as you've discovered, is going to be difficult to print in a debugger.
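A minimal self-contained illustration of that wide-output path, assuming a UTF-8 locale and a UTF-8 encoded source file:

#include <locale.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");            /* adopt the user's UTF-8 locale */
    wprintf(L"%ls\n", L"Škoda αβγ");  /* wchar_t is UTF-32 on Linux */
    return 0;
}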
You mentioned that you created a function which translates the UTF-16 to UTF-8 for the purposes of debugging, but the variable-length characters make it difficult to deal with. Why not simply make a function that translates the UTF-16 to UTF-32, so that each Unicode codepoint is one character? This way you can use wide character output to read the strings; a sketch follows below. GDB doesn't allow you to output wide-character strings by default, but you can fix that using this simple script.
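A minimal sketch of that conversion, assuming the internal buffer is NUL-terminated UCS-2 without surrogate pairs (as the question states); the function name and buffer layout are hypothetical:

#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Widen NUL-terminated UCS-2 into a static UTF-32 buffer, so gdb can
   print the result as a plain wchar_t* (32 bit on Linux), e.g. with:
   call debug_utf32(myString.buffer)                                  */
const wchar_t *debug_utf32(const uint16_t *ucs2)
{
    static wchar_t out[1024];
    size_t i = 0;
    for (; ucs2[i] && i + 1 < sizeof out / sizeof out[0]; ++i)
        out[i] = (wchar_t)ucs2[i];   /* one UCS-2 unit == one codepoint */
    out[i] = L'\0';
    return out;
}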
I assume you are under X? Are the proper fonts installed?
If on the console, are you using a framebuffer as the terminal device? A VGA text mode can only show 256/512 characters max (the 512-character case, IIRC, eating up a bit of the color space).
Current gdb versions can display 16-bit wide character data directly. If your program does not use the (32-bit) wchar_t data type at all, e.g. it uses the ICU libraries (International Components for Unicode) with their 16-bit wide type UChar, you can set the gcc option -fshort-wchar, which defines wchar_t and wide literals (L"abc", L'd') as unsigned short (16 bit). As a consequence, no glibc wchar_t functions may be called. If at least one wchar_t dummy variable is defined in the target program, gdb can display wchar_t (16-bit) character data. Example gdb session:
short-wchar.c:
#include <stdio.h>
#include <wchar.h>
wchar_t wchr;   /* dummy, so the 16-bit wchar_t type is in the debug info */
int main(void) { printf("sizeof(L'a') = %d\n", (int)sizeof(L'a')); return 0; }
gcc -g -fshort-wchar short-wchar.c -o short-wchar
# terminal session encoding utf-8 assumed
gdb short-wchar
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
(gdb) show charset
The host character set is "auto; currently UTF-8".
The target character set is "auto; currently UTF-8".
The target wide character set is "auto; currently UTF-32".
(gdb) set target-wide-charset UTF-16
(gdb) p L"Škoda"
$1 = L"Škoda"
(gdb) p (wchar_t*) (some UChar string)
....
One reason for using a 16-bit wchar_t on all platforms is cross-platform consistency; see ICU, OCI (Oracle Call Interface in wide mode), and the Java data type char.