Using find or grep to locate filenames with accented characters from a different encoding system (Windows to Linux)
I tried to tag late onto a question similar to mine (Find Non-UTF8 Filenames on Linux File System) to elicit further replies, with no luck so far, so here goes again...
I have the same problem as the OP in the link above and convmv is a great tool to fix one's own filesystem. My question is therefore academic, but I find it unsatisfactory (in fact I can't believe) that 'find' is not able to find non standard ascii characters.
Does anyone know what combination of options finds filenames that contain non-standard characters on what seems to be a Unicode filesystem? In my case the characters appear to be 8-bit extended ASCII rather than Unicode: the files come from a Windows machine (ISO-8859-1) and I regularly need to fetch them. I'd love to see how find and/or grep can do the same as convmv.
Sample files:
> ls
Abc�def ÉÈéèáà-rest everest éverest
> ls -b
Abc\251def ÉÈéèáà-rest everest éverest
The first file comes from Windows (or can be simulated with touch $(printf "Abc\xA9def")).
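To reproduce the listing above in a scratch directory, something like the following should work on a typical Linux filesystem that stores names as raw bytes. The helper name make_samples is made up for this sketch, and the bytes are written as octal escapes so the script itself stays pure ASCII.

```shell
# make_samples DIR: create the four sample files in DIR (hypothetical helper).
make_samples() {
    mkdir -p "$1" || return 1
    # \251 is the single ISO-8859-1 byte 0xA9; on its own it is not valid UTF-8.
    touch "$1/$(printf 'Abc\251def')"
    # \303\251 is the two-byte UTF-8 encoding of e-acute.
    touch "$1/$(printf '\303\251verest')"
    touch "$1/everest"
    # E-acute, E-grave, e-acute, e-grave, a-acute, a-grave, all in UTF-8.
    touch "$1/$(printf '\303\211\303\210\303\251\303\250\303\241\303\240-rest')"
}
```

Running make_samples ./samples and then ls -b ./samples should show the lone byte as \251, as in the listing above. Filesystems that enforce UTF-8 names may refuse to create the first file.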
> find . -regex '.*[^a-zA-Z./].*'
./ÉÈéèáà-rest
> ls | egrep '[^a-zA-Z]'
ÉÈéèáà-rest
Missing almost all of them (the hyphen saved that file, can be seen with coloured grep). Whatever is happening here is not what I would expect: neither find nor grep is able to take an accented letter as being outside the range provided [^a-zA-Z./].
> find . -regex '.*é.*'
./éverest
./ÉÈéèáà-rest
> ls | egrep 'é'
ÉÈéèáà-rest
éverest
> ls | egrep '[é]'
ÉÈéèáà-rest
éverest
> find . -regex '.*[é].*'
./éverest
./ÉÈéèáà-rest
Bizarrely both are able to pick up a standard accent when provided (including in the range). Any find or grep trial with \xA9, \0251 or \o251 fails (no match).
> ls | fgrep e
Abc�def
ÉÈéèáà-rest
everest
éverest
Looking for a non-controversial character shows all files with grep, as I would have expected.
> find . -regex '.*e.*'
./éverest
./ÉÈéèáà-rest
./everest
> find . -name '*e*'
./éverest
./ÉÈéèáà-rest
./everest
find, however, is very discriminatory: even when looking up a normal character, it seems to eliminate filenames that contain characters which are not valid in the filesystem's name encoding scheme.
As far as I am concerned if the file is in the filesystem, then find should find it, right? But maybe there's a feature I don't know about?
Any insights would be very much appreciated.
Jander answered the same question, which I also posted on Super User.
Jander's answer does the job perfectly. For those interested in getting more out of this, here is one more tip.
With LANG=C, find displays non-ASCII characters as question marks. find only makes that substitution when it writes to a terminal, so to get the names back in their normal form, just pipe the output to cat.
LANG=C find . -regex '.*[^a-zA-Z./-].*'
./??verest
./????????????-rest
./Abc?def
LANG=C find . -regex '.*[^a-zA-Z./-].*' | cat
./éverest
./ÉÈéèáà-rest
./Abc�def
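For scripts, the same idea can be wrapped in two small helpers. This is only a sketch and not Jander's answer: the function names are made up, LC_ALL is used instead of LANG because it overrides the other locale variables, and the second helper assumes that an iconv binary is installed and that no path contains a newline.

```shell
# list_non_ascii DIR: print paths whose base name contains a byte outside
# printable ASCII (hypothetical helper).
list_non_ascii() {
    # In the C locale find compares names byte by byte, so bytes that are
    # invalid in a UTF-8 locale no longer prevent a match.
    # [! -~] means "any byte not between space and tilde".
    LC_ALL=C find "${1:-.}" -name '*[! -~]*'
}

# list_invalid_utf8 DIR: of those paths, print only the ones whose base name
# is not well-formed UTF-8 (hypothetical helper, needs iconv).
list_invalid_utf8() {
    list_non_ascii "$1" | while IFS= read -r path; do
        # iconv exits non-zero when its input is not valid UTF-8.
        if ! printf '%s' "${path##*/}" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
            printf '%s\n' "$path"
        fi
    done
}
```

On the sample directory, list_non_ascii . should list the three names containing accents or the stray 0xA9 byte, while list_invalid_utf8 . should print only the file that came from Windows, which is roughly the set convmv would want to rename.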