When did C++ compilers start considering more than two hex digits in string literal character escapes?

I've got a (generated) literal string in C++ that may contain characters that need to be escaped using the \x notation. For example:

char foo[] = "\xABEcho";

However, g++ (version 4.1.2 if it matters) throws an error:

test.cpp:1: error: hex escape sequence out of range

The compiler appears to be treating the Ec characters as part of the preceding hex escape (because they look like hex digits). Since a four-digit hex number won't fit in a char, an error is raised. Obviously, for a wide string literal L"\xABEcho", the first character would be U+ABEC, followed by L"ho".

It seems this has changed sometime in the past couple of decades and I never noticed. I'm almost certain that old C compilers would only consider two hex digits after \x, and not look any further.

I can think of one workaround for this:

char foo[] = "\xAB""Echo";

but that's a bit ugly. So I have three questions:

  • When did this change?

  • Why doesn't the compiler accept >2-digit hex escapes only for wide string literals?

  • Is there a workaround that's less awkward than the above?


GCC is only following the standard. #877: "Each [...] hexadecimal escape sequence is the longest sequence of characters that can constitute the escape sequence."
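
A minimal sketch of that maximal-munch rule in action (byte values assume an ASCII execution character set):

char ok[] = "\x41Good";    // 'G' is not a hex digit, so the escape stops: "AGood"
// char bad[] = "\x41bad"; // 'b', 'a', 'd' are all hex digits, so this is the
//                         // single escape \x41bad -- out of range, as above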


I have found answers to my questions:

  • C++ has always been this way (I checked Stroustrup's 3rd edition; I didn't have anything earlier). K&R 1st edition did not mention \x at all (the only character escapes available at that time were octal). K&R 2nd edition states:

    '\xhh'
    

    where hh is one or more hexadecimal digits (0...9, a...f, A...F).

    So it appears this behaviour has been around since ANSI C.

  • While it might be possible for the compiler to accept >2-digit hex escapes only in wide string literals, this would unnecessarily complicate the grammar.

  • There is indeed a less awkward workaround:

    char foo[] = "\u00ABEcho";
    

    The \u escape always takes exactly four hex digits, so no following character can be absorbed into the escape; see the sketch below.
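
A minimal example (with one caveat, which is an observation of mine rather than from the original answer: in a narrow string literal, \u00AB is translated into the execution character set, so under Latin-1 it is the single byte 0xAB, while under a UTF-8 execution charset it becomes the two bytes 0xC2 0xAB):

char foo[] = "\u00ABEcho";  // U+00AB, then 'E' 'c' 'h' 'o'; the 'E' cannot
                            // extend the escape, which stops at four digits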

Update: The use of \u isn't quite applicable in all situations because most ASCII characters are (for some reason) not permitted to be specified using \u. Here's a snippet from GCC:

/* The standard permits $, @ and ` to be specified as UCNs.  We use
     hex escapes so that this also works with EBCDIC hosts.  */
  else if ((result < 0xa0
            && (result != 0x24 && result != 0x40 && result != 0x60))
           || (result & 0x80000000)
           || (result >= 0xD800 && result <= 0xDFFF))
    {
      cpp_error (pfile, CPP_DL_ERROR,
                 "%.*s is not a valid universal character",
                 (int) (str - base), base);
      result = 1;
    }
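
In practice this check means an ordinary ASCII letter can't be written as a UCN at all (a small sketch; the error text is the one from the GCC code above):

char dollar[] = "\u0024llar";  // OK: $ is one of the three permitted ASCII UCNs
// char abc[] = "\u0041BC";    // rejected: 0x41 ('A') is below 0xA0 and is not
//                             // $, @ or ` -- "\u0041 is not a valid universal character"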


I'm pretty sure that C++ has always been this way. In any case, CHAR_BIT may be greater than 8, in which case '\xABE' or '\xABEc' could be valid.


I solved this by escaping the following character with \xnn too. Unfortunately, you have to keep doing this for as long as the following characters are hex digits (0...9, a...f, A...F). For example, "\xnneceg" becomes "\xnn\x65\x63\x65g". A sketch of this idea follows below.
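
A sketch of how a generator might automate this (escape_run is a hypothetical helper of mine, not from the original post):

#include <cctype>
#include <cstdio>
#include <string>

// Hypothetical helper: append `byte` as \xnn, then keep escaping the
// following characters for as long as they are hex digits, so that none
// of them can be absorbed into the preceding escape.
std::string escape_run(unsigned char byte, const std::string& rest) {
    char buf[8];
    std::string out;
    std::snprintf(buf, sizeof buf, "\\x%02X", static_cast<unsigned>(byte));
    out += buf;
    std::string::size_type i = 0;
    while (i < rest.size() && std::isxdigit(static_cast<unsigned char>(rest[i]))) {
        std::snprintf(buf, sizeof buf, "\\x%02X",
                      static_cast<unsigned>(static_cast<unsigned char>(rest[i])));
        out += buf;            // e.g. 'e' -> \x65
        ++i;
    }
    out += rest.substr(i);     // from the first non-hex-digit on, keep as-is
    return out;
}

// escape_run(0xAB, "eceg") yields "\xAB\x65\x63\x65g", as in the example above.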


These are wide-character literals, so something like

wchar_t foo[] = L"\x00ABEcho";

might be better.

Here's some information; it's not about gcc, but it still seems to apply.

http://publib.boulder.ibm.com/infocenter/iadthelp/v7r0/index.jsp?topic=/com.ibm.etools.iseries.pgmgd.doc/cpprog624.htm

This link includes the important line:

Specifying \xnn in a wchar_t string literal is equivalent to specifying \x00nn
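
A short sketch of that equivalence (note that \x in a wide literal is still maximal-munch, so the digit-absorbing issue from the question applies there too):

wchar_t a[] = L"\xAB";            // the single character U+00AB
wchar_t b[] = L"\x00AB";          // identical: the leading zeroes change nothing
wchar_t c[] = L"\x00AB" L"Echo";  // splitting is still needed before the 'E'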

This may also be helpful.

http://www.gnu.org/s/hello/manual/libc/Extended-Char-Intro.html#Extended-Char-Intro


I also ran into this problem. I found that I could add a space after the second hex digit and then remove it again with a backspace '\b'. Not exactly desirable, but it seemed to work:

"Julius C\xE6sar the conqueror of the fran\xE7 \bais"

(Note that the space and the backspace both remain in the array, so this only looks right when the string is sent to something that interprets '\b'.)
