Semantics of comparison of char objects
While I was reading through some old code today, I noticed the following assert line:
assert(('0' <= hexChar && hexChar <= '9')
|| ('A' <= hexChar && hexChar <= 'F')
|| ('a' <= hexChar && hexChar <= 'f'));
The purpose is to assert that hexChar is a hexadecimal digit ([0-9A-Fa-f]). It does this by relying on an ASCII-like ordering of the char objects representing 'A', 'B', ..., 'F' and 'a', 'b', ..., 'f'.
I began wondering whether this always does what I intended, given that the execution character set is implementation-defined.
The C++ standard in Section 2.3, Character sets, mentions:
The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. The execution character set and the execution wide-character set are implementation-defined supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets and the sets of additional members are locale-specific.
I interpret this to mean that ('0' <= hexChar && hexChar <= '9') is okay because '0', '1', ..., '9' are digits and each has a value one greater than the previous. However, the order of other basic source characters with respect to one another is still implementation-defined.
Is this a correct statement? Knowing nothing about the C++ compiler (and so nothing about the implementation details), do I need to rewrite the assert as the following?
assert(('0' <= hexChar && hexChar <= '9')
|| ('A' == hexChar || 'B' == hexChar || 'C' == hexChar || 'D' == hexChar || 'E' == hexChar || 'F' == hexChar)
|| ('a' == hexChar || 'b' == hexChar || 'c' == hexChar || 'd' == hexChar || 'e' == hexChar || 'f' == hexChar));
The first line, the comparison against the values of '0' and '9', is 100% portable. Both the C and C++ standards guarantee that the decimal digits have consecutive values, so it behaves identically on all implementations.
The second and third lines are in principle implementation-defined, but there has never been, and never will be, an implementation where their behavior differs. The only non-ISO646-compatible character encoding that has ever been used with the C language (and the only reason C allows non-ISO646-compatible encodings) is EBCDIC, which places the letters 'A' through 'F' exactly where they should fall as hexadecimal values (in general the letters are discontiguous in EBCDIC, but A-F are one contiguous group).
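If you would rather have the compiler confirm this assumption than take it on faith, one option is to check the contiguity of A-F at compile time. The following is only a sketch: it assumes a C++11 or later compiler, and the helper name isHexDigitByRange is made up for illustration and is not part of the original code.

```cpp
#include <cassert>

// Compile-time checks that the letters used as hex digits are contiguous
// in the execution character set. These hold for ASCII and EBCDIC; on any
// other encoding the build fails instead of the range test silently
// misbehaving.
static_assert('B' == 'A' + 1 && 'C' == 'A' + 2 && 'D' == 'A' + 3 &&
              'E' == 'A' + 4 && 'F' == 'A' + 5,
              "'A'..'F' are not contiguous in this character set");
static_assert('b' == 'a' + 1 && 'c' == 'a' + 2 && 'd' == 'a' + 3 &&
              'e' == 'a' + 4 && 'f' == 'a' + 5,
              "'a'..'f' are not contiguous in this character set");

// Hypothetical helper wrapping the range test from the question.
bool isHexDigitByRange(char hexChar) {
    return ('0' <= hexChar && hexChar <= '9')
        || ('A' <= hexChar && hexChar <= 'F')
        || ('a' <= hexChar && hexChar <= 'f');
}
```

With the static_asserts in place, the original assert can stay as written, for example assert(isHexDigitByRange(hexChar)).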
With that said, unless you need to support legacy mainframes, there is no value in trying to handle basic character encoding "portably" in C. char is 8 bits, the values 0-127 are ASCII, and the values 128-255 are part of a locale- or data-specific multibyte character encoding which we'll someday be able to assume is always UTF-8.
To your first question: yes.
To your second question: perhaps, but you should probably consider using the C library isxdigit function or a C++ locale variant of it.
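As a minimal sketch of the isxdigit suggestion, assuming the standard <cctype> header; the wrapper name isHexDigit is hypothetical and only there to make the cast explicit:

```cpp
#include <cassert>
#include <cctype>

// Hypothetical wrapper around std::isxdigit.
bool isHexDigit(char c) {
    // Convert to unsigned char first: passing a negative value (other
    // than EOF) to the <cctype> functions is undefined behavior, and
    // plain char may be signed.
    return std::isxdigit(static_cast<unsigned char>(c)) != 0;
}
```

The assert from the question would then read assert(isHexDigit(hexChar)), with no dependence on how the letters are ordered in the execution character set.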
Technically, it's entirely legal for a C++ compiler to use some other character encoding. However, the reality is that you almost certainly won't find a platform where this code doesn't work. This is especially true since the dominant character encodings today are Unicode-based, like UTF-16, and Unicode assigns every ASCII character the same value it has in ASCII. The only reason this is implementation-defined is for very, very old legacy platforms that still existed when this part of the Standard was written, and you'd have to substantially refactor your code to run on any non-ASCII platform anyway.