When did C++ compilers start considering more than two hex digits in string literal character escapes?
I've got a (generated) literal string in C++ that may contain characters that need to be escaped using the \x notation. For example:
char foo[] = "\xABEcho";
However, g++ (version 4.1.2 if it matters) throws an error:
test.cpp:1: error: hex escape sequence out of range
The compiler appears to be considering the Ec characters as part of the preceding hex number (because they look like hex digits). Since a four-digit hex number won't fit in a char, an error is raised. Obviously for a wide string literal L"\xABEcho" the first character would be U+ABEC, followed by L"ho".
It seems this has changed sometime in the past couple of decades and I never noticed. I'm almost certain that old C compilers would only consider two hex digits after \x, and not look any further.
I can think of one workaround for this:
char foo[] = "\xAB""Echo";
but that's a bit ugly. So I have three questions:
When did this change?
Why doesn't the compiler accept >2-digit hex escapes only for wide string literals?
Is there a workaround that's less awkward than the above?
GCC is only following the standard. #877: "Each [...] hexadecimal escape sequence is the longest sequence of characters that can constitute the escape sequence."
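As a quick illustration of that rule (my own example, not from the standard text): the escape keeps consuming characters until it hits something that is not a hex digit, so whether a literal compiles depends entirely on what follows the escape.
char ok[]  = "\xABgood";   // 'g' is not a hex digit, so the escape is just 0xAB
char bad[] = "\xABEcho";   // 'E' and 'c' are hex digits, so the escape becomes 0xABEC
                           // and g++ reports "hex escape sequence out of range"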
I have found answers to my questions:
C++ has always been this way (I checked Stroustrup 3rd edition; I didn't have an earlier one). K&R 1st edition did not mention \x at all (the only character escapes available at that time were octal). K&R 2nd edition states:
'\xhh' where hh is one or more hexadecimal digits (0...9, a...f, A...F).
So it appears this behaviour has been around since ANSI C.
While it might be possible for the compiler to accept >2-digit hex escapes only in wide string literals, this would unnecessarily complicate the grammar.
There is indeed a less awkward workaround:
char foo[] = "\u00ABEcho";
The \u escape always takes exactly four hex digits.
Update: The use of \u isn't quite applicable in all situations, because most ASCII characters are (for some reason) not permitted to be specified using \u. Here's a snippet from GCC:
/* The standard permits $, @ and ` to be specified as UCNs.  We use
   hex escapes so that this also works with EBCDIC hosts.  */
else if ((result < 0xa0
          && (result != 0x24 && result != 0x40 && result != 0x60))
         || (result & 0x80000000)
         || (result >= 0xD800 && result <= 0xDFFF))
  {
    cpp_error (pfile, CPP_DL_ERROR,
               "%.*s is not a valid universal character",
               (int) (str - base), base);
    result = 1;
  }
I'm pretty sure that C++ has always been this way. In any case, CHAR_BIT may be greater than 8, in which case '\xABE' or '\xABEc' could be valid.
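A quick sketch of that point (my own example, assuming a hypothetical wide-char platform): 0xABE needs 12 bits and 0xABEC needs 16, so those escapes only fit in a char where CHAR_BIT is at least that large.
#include <climits>
#include <cstdio>

int main() {
    // On common platforms CHAR_BIT is 8 and UCHAR_MAX is 255, so '\xABE' (0xABE)
    // cannot fit in a char; on a hypothetical CHAR_BIT >= 12 platform it could.
    std::printf("CHAR_BIT = %d, UCHAR_MAX = %u\n", CHAR_BIT, (unsigned)UCHAR_MAX);
    return 0;
}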
I solved this by escaping the following character with \xnn as well. Unfortunately, you have to keep doing this for as long as the following characters are valid hex digits (0...9, a...f, A...F). For example, "\xnneceg" becomes "\xnn\x65\x63\x65g".
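Applied to the string from the question (my own restatement of this technique), that gives:
char foo[] = "\xAB\x45\x63ho";   // 'E' (0x45) and 'c' (0x63) written as escapes;
                                 // 'h' is not a hex digit, so "ho" can stay as-is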
These are wide-character literals.
char foo[] = "\x00ABEcho";
Might be better.
Here's some information; it's not about gcc, but it still seems to apply:
http://publib.boulder.ibm.com/infocenter/iadthelp/v7r0/index.jsp?topic=/com.ibm.etools.iseries.pgmgd.doc/cpprog624.htm
That link includes the important line:
Specifying \xnn in a wchar_t string literal is equivalent to specifying \x00nn.
This may also be helpful:
http://www.gnu.org/s/hello/manual/libc/Extended-Char-Intro.html#Extended-Char-Intro
I also ran into this problem. I found that I could add a space after the second hex digit and then remove it again by following the space with a backspace '\b'. Not exactly desirable, but it seemed to work:
"Julius C\xE6sar the conqueror of the fran\xE7 \bais"