Using find or grep to locate filenames with accented characters from a different encoding system (Windows to Linux)

2023-01-28 19:36 Q&A author:

I tried adding a late follow-up to a question similar to mine (Find Non-UTF8 Filenames on Linux File System) to elicit further replies, with no luck so far, so here goes again...

I have the same problem as the OP in the link above, and convmv is a great tool for fixing one's own filesystem. My question is therefore academic, but I find it unsatisfactory (in fact, I can't believe) that find is unable to locate non-standard ASCII characters.

Does anyone know which combination of options makes find locate filenames containing non-standard characters on what seems to be a Unicode filesystem? In my case the characters appear to be 8-bit extended ASCII rather than Unicode: the files come from a Windows machine (ISO-8859-1) and I regularly need to fetch them. I'd love to see how find and/or grep can do the same as convmv.

Sample files:

> ls
Abc�def ÉÈéèáà-rest everest éverest

> ls -b
Abc\251def  ÉÈéèáà-rest  everest  éverest

The first file comes from Windows (it can be simulated with touch "$(printf "Abc\xA9def")").
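For anyone wanting to reproduce the listing above, here is a minimal sketch that recreates the four sample files in a scratch directory. The helper name make_samples is made up for illustration, and the octal escapes assume the accented names are UTF-8 encoded while the first name carries the single byte 0xA9 (the \251 shown by ls -b).

```shell
# Recreate the four sample files in a directory of your choice.
# make_samples is a hypothetical helper name, not part of any tool.
make_samples() {
    dir=$1
    mkdir -p "$dir" || return 1
    # Octal escapes keep the script pure ASCII, whatever encoding it is saved in.
    # "Abc" + lone byte 0xA9 + "def": valid ISO-8859-1, invalid as UTF-8.
    : > "$dir/$(printf 'Abc\251def')"
    # UTF-8 bytes for the accented letters, followed by "-rest".
    : > "$dir/$(printf '\303\211\303\210\303\251\303\250\303\241\303\240-rest')"
    : > "$dir/everest"
    # UTF-8 bytes for e-acute, followed by "verest".
    : > "$dir/$(printf '\303\251verest')"
}

# Usage: build the samples in a temporary directory and count them.
samples=$(mktemp -d)
make_samples "$samples"
ls "$samples" | wc -l
```

This needs a filesystem that accepts arbitrary bytes in names, which is the usual case on Linux.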

> find . -regex '.*[^a-zA-Z./].*'
./ÉÈéèáà-rest

> ls | egrep '[^a-zA-Z]'
ÉÈéèáà-rest

This misses almost all of them (only the hyphen saved that one file, as coloured grep output shows). Whatever is happening here is not what I would expect: neither find nor grep treats an accented letter as being outside the range given in [^a-zA-Z./].

> find . -regex '.*é.*'
./éverest
./ÉÈéèáà-rest

> ls | egrep 'é'
ÉÈéèáà-rest
éverest

> ls | egrep '[é]'
ÉÈéèáà-rest
éverest

> find . -regex '.*[é].*'
./éverest
./ÉÈéèáà-rest

Bizarrely, both pick up an accented letter when it is given literally (including inside a bracket expression). Any find or grep attempt with \xA9, \0251 or \o251 fails (no match).
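As an aside, dumping the raw bytes may help show why a literal é matches the other names but never the first one. This is a small sketch, assuming the accented names are stored as UTF-8; hex_bytes is a made-up helper name.

```shell
# Print the bytes of a string as one run of hex digits.
# hex_bytes is a hypothetical helper name.
hex_bytes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

hex_bytes "$(printf 'Abc\251def')"   # prints 416263a9646566 (lone byte a9)
hex_bytes "$(printf '\303\251')"     # prints c3a9 (UTF-8 e-acute, two bytes)
```

In UTF-8 the byte a9 only ever appears as the second half of a two-byte sequence, so a lone a9 is not a valid character there. That is probably why a pattern built around é, or around a bracket expression, has nothing to match in that name under a UTF-8 locale.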

> ls | fgrep e
Abc�def
ÉÈéèáà-rest
everest
éverest

Searching for an ordinary character with grep lists all four files, as I would have expected.

> find . -regex '.*e.*'
./éverest
./ÉÈéèáà-rest
./everest

> find . -name '*e*'
./éverest
./ÉÈéèáà-rest
./everest

find, however, is very discriminating: even when searching for an ordinary character, it seems to drop filenames containing bytes that are not valid in the filesystem's name encoding.

As far as I am concerned if the file is in the filesystem, then find should find it, right? But maybe there's a feature I don't know about?

Any insights would be very much appreciated.

Jander answered the same question, which I also posted on Super User.

Jander's answer does the job perfectly. For those interested in getting more out of it, here is one more tip.

With LANG=C, find prints non-ASCII bytes as question marks when its output goes to a terminal. To get the names displayed normally again, just pipe the output through cat: find is then no longer writing to a terminal, so it emits the raw bytes unchanged.

> LANG=C find . -regex '.*[^a-zA-Z./-].*'
./??verest
./????????????-rest
./Abc?def

> LANG=C find . -regex '.*[^a-zA-Z./-].*' | cat
./éverest
./ÉÈéèáà-rest
./Abc�def
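
The same idea can be wrapped into small reusable functions. This is only a sketch under a few assumptions: the names find_odd_names and grep_odd_names are invented here, LC_ALL=C is used instead of LANG=C so that other locale variables cannot override it, and -regex needs a find that supports it (GNU find does). In the C locale every byte counts as one character and a-z / A-Z cover only the ASCII letters, so accented or invalid bytes fall outside the bracket expression.

```shell
# List names containing any byte outside ASCII letters, '.', '/' and '-'.
# find_odd_names and grep_odd_names are hypothetical helper names.

# find variant: run from inside the directory so the regex only sees "./name".
find_odd_names() {
    ( cd "${1:-.}" && LC_ALL=C find . -regex '.*[^a-zA-Z./-].*' )
}

# grep variant: same test applied to ls output (breaks on names with newlines).
grep_odd_names() {
    ls "${1:-.}" | LC_ALL=C grep '[^a-zA-Z.-]'
}

# Usage on a throwaway directory with one plain and two "odd" names.
demo=$(mktemp -d)
: > "$demo/everest"
: > "$demo/$(printf '\303\251verest')"   # UTF-8 e-acute + "verest"
: > "$demo/$(printf 'Abc\251def')"       # lone byte 0xA9, invalid as UTF-8
find_odd_names "$demo" | wc -l
grep_odd_names "$demo" | wc -l
rm -rf "$demo"
```

Both functions should report the two odd names and skip everest. Because the output goes through a pipe, the names arrive as raw bytes rather than question marks.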
