Consistent implementation of tr?

2023-01-13 05:32 问答作者：

I have a ksh script that generates a long, random string using /dev/urandom and tr:

STRING="$(cat /dev/urandom|tr -dc 'a-zA-Z0-9-_'|fold -w 64 |head -1)"

On the Linux and AIX servers where I used this it resulted in 64 characters of upper and lower case alpha chars, digits, dash and underscore characters. Example:

W-uch3_4fbnk34u2nc08w_nj23n089023ncNjxz979823n23-n88h30pmLCxkMKj

When I used the script on Solaris the ranges were interpreted as literals and it resulted in strings from the set aAzZ09-_. Example:

AA0z9_aZ-a-z00aZ9_azAZa0zZza9-Az0-_za-9aa0az_a0z-0a0z000-A9Z_0a

Oddly, on this Solaris server the man page for tr indicates that the syntax used should have produced the desired result.

The idea is to use /dev/urandom to produce a pseudo-random string from which we extract characters so that the result a) does not contain spaces and b) does not contain s开发者_如何学Chell special characters. The string will be used on the command line as an argument later on in the script. We don't want to use classes like :alnum: because locale can convert these into multi-byte values that don't work on the command line. This ksh one-liner did the trick perfectly on a great many installations until we got to Solaris.

We have temporarily converted this to a somewhat nasty Perl regex. Is there a syntax for tr or some other utility or ksh built-in that will perform this task consistently across UNIX variants and is universally installed? Doesn't have to be a one-liner but simplicity is appreciated.

Update: We tried the Locale settings with no luck. Waiting on results of using xpg6 version.

$ uname -a
SunOS hostname 5.10 Generic_142900-04 sun4u sparc SUNW,SPARC-Enterprise
$ cat /dev/urandom | tr -dc "a-zA-Z0-9-_" | fold -w 64 | head -1 | sed 's/^-/_/'
0-a9-z9a_zzZAa_a_0az-9_z0a_90Z_9az09aZzZAa-9aa_-__za0ZA9_ZzzZazA
$ set | grep '^L[AC]'
LANG=C
LC_ALL=C
LC_COLLATE=en_US
LC_CTYPE=en_US
LC_MESSAGES=en_US
LC_MONETARY=en_US
LC_NUMERIC=en_US
LC_TIME=en_US
$ export LC_CTYPE="$LC_ALL" LC_MESSAGES="$LC_ALL"
$ set | grep '^L[AC]'
LANG=C
LC_ALL=C
LC_COLLATE=en_US
LC_CTYPE=C
LC_MESSAGES=C
LC_MONETARY=en_US
LC_NUMERIC=en_US
LC_TIME=en_US
$ cat /dev/urandom | tr -dc "a-zA-Z0-9-_" | fold -w 64 | head -1 | sed 's/^-/_/'
0900z9az99_a0za09__0zA0_Z--Z_-Aa-AaA9zAZz-Aa90A00z__ZzA9A-Z0aA_-
$ unset LC_ALL; export LC_COLLATE=C LC_NUMERIC=C LC_TIME=C
$ set | grep '^L[AC]'
LANG=C
LC_COLLATE=C
LC_CTYPE=C
LC_MESSAGES=C
LC_MONETARY=en_US
LC_NUMERIC=C
LC_TIME=C
$ cat /dev/urandom | tr -dc "a-zA-Z0-9-_" | fold -w 64 | head -1 | sed 's/^-/_/'
_AA9aA_Za-A0-AZa_A-0ZA--a_za-a9zZZz__a0az_-0A-9-0aA-0za00A-__9-0
$ unset LANG LC_COLLATE LC_NUMERIC LC_TIME
$ set | grep '^L[AC]'
LC_CTYPE=C
LC_MESSAGES=C
LC_MONETARY=en_US
$ cat /dev/urandom | tr -dc "a-zA-Z0-9-_" | fold -w 64 | head -1 | sed 's/^-/_/'
_-_9zz9Z-Z-Z-Z_0_a9zzzZZaAa--9_zAZaaAZz-ZaAZ09Z-_z-zz09ZZAzAz0Z0
$ unset LC_CTYPE LC_MESSAGES LC_MONETARY
$ set | grep '^L[AC]'
$ cat /dev/urandom | tr -dc "a-zA-Z0-9-_" | fold -w 64 | head -1 | sed 's/^-/_/'
_0aAa9_Z_a_Z--_Az-aa0ZA0ZzZ-9Aa9-Z0--0A_Z0Zaz-AA_Zz0z---Z_99z_a9
$ export LANG=C LC_ALL=C LC_COLLATE=C LC_CTYPE=C LC_MESSAGES=C LC_MONETARY=C LC_NUMERIC=C LC_TIME=C
$ set | grep '^L[AC]'
LANG=C
LC_ALL=C
LC_COLLATE=C
LC_CTYPE=C
LC_MESSAGES=C
LC_MONETARY=C
LC_NUMERIC=C
LC_TIME=C
$ cat /dev/urandom | tr -dc "a-zA-Z0-9-_" | fold -w 64 | head -1 | sed 's/^-/_/'
Za_000z9aa--aA00zAAZza0AA90090--z0a00_zZ9ZA0_---aZZ09a0ZA0_0zZaa
$ cat /dev/urandom | tr -dc "[a-z][A-Z][0-9]-_" | fold -w 64 | head -1 | sed 's/^-/_/'
x7dni9gIXVF6AHQc3B-H6hjnBVHChJ9zM-z5EQ5UEruATI_NNFaCoVLOqM6gVaT5
$

Of course, on Linux that last version spits out square brackets.

If you set your path to /usr/xpg6/bin/ then it'll work as expected The locale seems to have no affect here. A cross platform hack is:

tr -dc '[a-z][A-Z][0-9]_-' < /dev/urandom | tr -d '][' | fold -w64 | head -n1

What you've observed is not a different between operating systems, but different machines having different locale settings. Your Solaris machine has LC_COLLATE set to a non-default value, which is a sure recipe for the kind of problems you have.

Locale settings are set from the environment as follows:

If the environment variable LC_ALL is set, its value is used for all categories.
Otherwise, if LC_FOO is set, its value is used for category LC_FOO.
Otherwise, if LANG is set, its value is used for categories that weren't explicitly set.
The default locale is called C. On Unix systems, POSIX is a synonym for C.

The main locale categories are:

LC_CTYPE indicates the character set and encoding used for file names, file contents and terminal I/O. You should carefully preserve this setting unless you know it's inaccurate (e.g. because a particular file format specifies a particular encoding).
LC_MESSAGES is the language of the messages that the user sees. You should preserve this setting. If you really need to parse an error message, set LC_MESSAGES=C.
LC_COLLATE indicates the sorting order of characters. It's nearly always undesirable in scripts. Most values other than C cause trouble, such as A-Z matching lowercase letters.
Occasionally LC_NUMERIC may cause trouble because numbers may be printed with different punctuation, and LC_TIME influences the way some commands show a date and time. The other categories hardly ever matter in scripts.

Here's a reasonable strategy for scripts (warning, typed directly into the browser):

unset LANGUAGE  # a GNU-specific setting
if [ -n "$LC_ALL" ]; then
  export LC_CTYPE="$LC_ALL" LC_MESSAGES="$LC_ALL"
  unset LC_ALL
elif [ -n "$LANG" ]; then
  export LC_COLLATE=C LC_NUMERIC=C LC_TIME=C
else
  unset LC_COLLATE LC_NUMERIC LC_TIME
fi

Standard shell utilities obey the locale settings. Perl doesn't unless you tell it to.

Try:

LANG=C tr -dc 'a-zA-Z0-9-_'

also try specifying the full path to tr (and compare the results from /usr/bin/tr to the xpg version).

What is the difference between -c ("values") and -C ("characters") on Solaris? On Linux they're the same.

An aside: Are you able to use head -c 64 to replace fold -w 64 |head -1? Also, you can eliminate cat: tr ... < /dev/urandom | ...

Ultimately, depending on availability one of these may work for you (but the character set may be a little different than what you want):

base64 /dev/urandom | head -c 64

uuencode /dev/urandom | head -c 64

继续阅读：locale scripting shell

Consistent implementation of tr?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？