开发者

Encoding differences between tr (ubuntu) and tr (mac)

echo – | tr ’ X

On my mac: correctly prints –

On Ubuntu 10.10: prints XX�

Why?

For reference, the first character is 'EN DASH' (U+2013), the second is 'RIGHT SINGLE QUOTATION MARK' (U+2019)

EDIT: I now use Perl instead. The magic incantation that I use is:

perl -CSD -p -Mutf8 -e 'tr/A-ZÁÉÍÓÚ开发者_如何学CÀÈÌÒÙÄËÏÖÜÂÊÎÔÛÇa-zäëïöüáéíóúàèìòùâèìòùñçßœ/ /cs'

It means: except for all letters I could think of, replace any character by a space, and squeeze the spaces.


GNU coreutils tr has issues with multi-byte characters - it considers them a sequence of single byte characters instead.

Taking this into account, what you see on Linux is "expected":

$ echo – | od -t x1
0000000 e2 80 93 0a
0000004
$ echo ’ | od -t x1
0000000 e2 80 99 0a
0000004
$ echo – | tr ’ X | od -t x1
0000000 58 58 93 0a
0000004

EDIT:

For a set of UTF-8 compatible utilities, you might want to have a look at the Heirloom Toolchest. It contains a tr implementation that apparently has UTF-8 support, although I have not tested it.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜