Issues with text parsing, Character looks like a longer 'hyphen' and has 3 ASCII values
Here is the devilish character ‐; inspecting it, I got 3 ASCII values:
ASCII code 226 128 144
Now I want to somehow use this character in my regular expression.
None of those is an ASCII value, because the ASCII range is 0 through 127, and nothing higher. Code point U+2010 HYPHEN in UTF-8 is written with the three byte values you list there, as revealed by:
$ perl -CS -e 'print "\x{2010}"' | perl -C0 -ne 'printf "%vd\n",$_'
226.128.144
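The same round trip can be sketched in another language. The snippet below assumes Python 3, since the question does not name a language, and uses only built-ins; the variable names are illustrative.

```python
# The three byte values printed by the Perl one-liner above.
raw = bytes([226, 128, 144])

# Every value is above 127, so none of them is an ASCII code.
print(all(b > 127 for b in raw))   # True

# Decoded as UTF-8, the three bytes form one character.
char = raw.decode("utf-8")
print(len(raw), len(char))         # 3 1
print(f"U+{ord(char):04X}")        # U+2010

# Encoding the code point again gives back the same three bytes.
print(list("\u2010".encode("utf-8")))  # [226, 128, 144]
```

The point is that the three numbers are one code point in its UTF-8 form, not three separate characters.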
You can get the name and character properties of that code point using the uniprops script:
$ uniprops U+2010
U+2010 ‹‐› \N{ HYPHEN }:
\pP \p{Pd}
All Any Assigned InGeneralPunctuation Common Zyyy Dash Dash_Punctuation Pd P General_Punctuation Gr_Base Grapheme_Base Graph GrBase Hyphen Punct Pat_Syn Pattern_Syntax PatSyn Print Punctuation
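If the uniprops script is not at hand, a rough equivalent is available elsewhere. As a sketch, assuming Python 3: the standard unicodedata module reports the name and general category of a code point, though not the full property list shown above (it has no accessor for Dash, for instance).

```python
import unicodedata

char = "\u2010"

# Official Unicode name of the code point.
print(unicodedata.name(char))      # HYPHEN

# General category; Pd is "Punctuation, dash", matching \p{Pd} above.
print(unicodedata.category(char))  # Pd
```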
Other common code points with the Unicode Dash property include these, shown by the unichars script:
$ unichars '\p{Dash}'
- 45 002D HYPHEN-MINUS
‐ 8208 2010 HYPHEN
‑ 8209 2011 NON-BREAKING HYPHEN
‒ 8210 2012 FIGURE DASH
– 8211 2013 EN DASH
— 8212 2014 EM DASH
― 8213 2015 HORIZONTAL BAR
⁓ 8275 2053 SWUNG DASH
⁻ 8315 207B SUPERSCRIPT MINUS
₋ 8331 208B SUBSCRIPT MINUS
− 8722 2212 MINUS SIGN
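To use these characters in a regular expression when the engine has no \p{Dash} support, one option is an explicit character class. The sketch below assumes Python 3, whose standard re module does not understand \p{...}; the code points are copied from the listing above, and DASH_RE and normalize_dashes are hypothetical names chosen for illustration.

```python
import re

# Code points copied from the unichars listing above.
DASH_CODE_POINTS = [
    0x002D, 0x2010, 0x2011, 0x2012, 0x2013, 0x2014,
    0x2015, 0x2053, 0x207B, 0x208B, 0x2212,
]

# re.escape keeps HYPHEN-MINUS literal inside the character class.
DASH_RE = re.compile(
    "[" + "".join(re.escape(chr(cp)) for cp in DASH_CODE_POINTS) + "]"
)

def normalize_dashes(text: str) -> str:
    """Replace every listed dash-like character with HYPHEN-MINUS."""
    return DASH_RE.sub("-", text)

sample = "pages 10\u201320, well\u2010known, x \u2212 y"
print(normalize_dashes(sample))  # pages 10-20, well-known, x - y
```

Matching on the general category Pd alone would miss some entries in the list, such as MINUS SIGN, which is why the class is spelled out here.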
It's probably Unicode. The right answer is to use Unicode throughout. You'll ultimately get into a lot of trouble if you try to treat Unicode strings as ASCII.
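A small illustration of the kind of trouble meant here, assuming Python 3 and an input that is UTF-8 encoded: a pattern applied to raw bytes sees three separate units where decoded text has one character.

```python
import re

# "well‐known" with U+2010 HYPHEN, as it would arrive in UTF-8 bytes.
raw = "well\u2010known".encode("utf-8")

# On bytes, "." matches a single byte, so it cannot cover the 3-byte hyphen.
print(re.fullmatch(rb"well.known", raw))                # None

# After decoding, the hyphen is one character and "." matches it.
text = raw.decode("utf-8")
print(re.fullmatch(r"well.known", text) is not None)    # True
```

Decoding once at the input boundary and working with text from then on avoids this class of mismatch.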