Encoding differences between tr (ubuntu) and tr (mac)
echo – | tr ’ X
On my mac: correctly prints –
On Ubuntu 10.10: prints XX�
Why?
For reference, the first character is 'EN DASH' (U+2013), the second is 'RIGHT SINGLE QUOTATION MARK' (U+2019)
EDIT: I now use Perl instead. The magic incantation that I use is:
perl -CSD -p -Mutf8 -e 'tr/A-ZÁÉÍÓÚ开发者_如何学CÀÈÌÒÙÄËÏÖÜÂÊÎÔÛÇa-zäëïöüáéíóúàèìòùâèìòùñçßœ/ /cs'
It means: except for all letters I could think of, replace any character by a space, and squeeze the spaces.
GNU coreutils tr
has issues with multi-byte characters - it considers them a sequence of single byte characters instead.
Taking this into account, what you see on Linux is "expected":
$ echo – | od -t x1
0000000 e2 80 93 0a
0000004
$ echo ’ | od -t x1
0000000 e2 80 99 0a
0000004
$ echo – | tr ’ X | od -t x1
0000000 58 58 93 0a
0000004
EDIT:
For a set of UTF-8 compatible utilities, you might want to have a look at the Heirloom Toolchest. It contains a tr
implementation that apparently has UTF-8 support, although I have not tested it.
精彩评论