Replacing all special characters but the dot in a string also replaces the dot
I am trying to replace the special characters in a given string (a badly formatted file path) with underscores, but I cannot get it to work.
Here is the code:
import string, re
from unidecode import unidecode
punc = string.punctuation
punc = string.punctuation.replace(r'.','') # remove the dot from that string
pattern = re.compile(rf'[{punc}]')
# I also tried this as pattern; but it doesn't help:
# pattern = r'[' + punc + ']'
test_string = r"\\some\random.path${}[]~(éè%&)ç\file.txt"
test_string = unidecode(test_string) # transliterate accented letters to ASCII
print(re.sub(pattern, '_', test_string))
>: \\some\random_path_______ee___c\file_txt
Since the dot is not in the pattern string, I cannot understand why it gets replaced. (I don't want it to be replaced.)
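A quick check that may explain this. It assumes the cause is the way a hyphen is read between two characters inside a character set: once the dot is removed, `,`, `-` and `/` become neighbours in the punctuation string, and `,-/` inside `[...]` is treated as a range whose code points include the dot. This is only a diagnostic sketch, not a fix:

```python
import re
import string

punc = string.punctuation.replace('.', '')

# After removing the dot, ',', '-' and '/' sit next to each other.
i = punc.index('-')
neighbours = punc[i - 1:i + 2]
print(neighbours)  # ,-/

# Inside [...], ',-/' means "every character from ',' (0x2C) to '/' (0x2F)".
in_range = [chr(c) for c in range(ord(','), ord('/') + 1)]
print(in_range)  # [',', '-', '.', '/']

# So the set still matches a dot even though the dot is not written in it.
dot_matches = bool(re.fullmatch(rf'[{punc}]', '.'))
print(dot_matches)  # True
```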
More strangely, if I shuffle the punctuation string:
from random import shuffle
punc = string.punctuation
punc = string.punctuation.replace(r'.','') # remove the dot
# shuffle punctuation:
punc = list(punc)
shuffle(punc)
punc = ''.join(punc)
pattern = re.compile(rf'[{punc}]')
it sometimes raises an error such as:
Traceback (most recent call last):
File "/tmp/ipykernel_3429192/3014469097.py", line 1, in <cell line: 1>
pattern = re.compile(rf'[' + punc +']')
File "/usr/lib/python3.10/re.py", line 251, in compile
return _compile(pattern, flags)
File "/usr/lib/python3.10/re.py", line 303, in _compile
p = sre_compile.compile(pattern, flags)
File "/usr/lib/python3.10/sre_compile.py", line 788, in compile
p = sre_parse.parse(p, flags)
File "/usr/lib/python3.10/sre_parse.py", line 969, in parse
raise source.error("unbalanced parenthesis")
error: unbalanced parenthesis
Or, after another shuffle that doesn't raise the above error, I get:
print(re.sub(pattern, '_', test_string))
>: \\some\random.path${}[]~(ee%&)c\file.txt
pattern
>: re.compile(r'[|)_&{;=^\'-~]@,["><$:/!}*\#+(?%`]', re.UNICODE)
Here nothing is replaced at all.
As mentioned in the first code block, I also tried skipping re.compile() and passing the string pattern = r'[' + punc + ']' directly, but that doesn't help.
This may also be interesting:
for i in range(len(punc)):
    punc = punc[:-1]
    pattern = r'[' + punc + ']'
    print(f'{i}: pattern: {pattern} replaced_str: ', re.sub(pattern, '_', test_string))
0: pattern: [!"#$%&'()*+,-/:;<=>?@[\]^_`{|}] replaced_str: \\some\random_path_____~_ee___c\file_txt
1: pattern: [!"#$%&'()*+,-/:;<=>?@[\]^_`{|] replaced_str: \\some\random_path__}__~_ee___c\file_txt
2: pattern: [!"#$%&'()*+,-/:;<=>?@[\]^_`{] replaced_str: \\some\random_path__}__~_ee___c\file_txt
3: pattern: [!"#$%&'()*+,-/:;<=>?@[\]^_`] replaced_str: \\some\random_path_{}__~_ee___c\file_txt
4: pattern: [!"#$%&'()*+,-/:;<=>?@[\]^_] replaced_str: \\some\random_path_{}__~_ee___c\file_txt
5: pattern: [!"#$%&'()*+,-/:;<=>?@[\]^] replaced_str: \\some\random_path_{}__~_ee___c\file_txt
6: pattern: [!"#$%&'()*+,-/:;<=>?@[\]] replaced_str: \\some\random_path_{}__~_ee___c\file_txt
Traceback (most recent call last):
File "/tmp/ipykernel_3401192/4037584865.py", line 4, in <cell line: 1>
print(f'{i}: pattern: {pattern} replaced_str: ', re.sub(pattern, '_', test_string))
File "/usr/lib/python3.10/re.py", line 209, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "/usr/lib/python3.10/re.py", line 303, in _compile
p = sre_compile.compile(pattern, flags)
File "/usr/lib/python3.10/sre_compile.py", line 788, in compile
p = sre_parse.parse(p, flags)
File "/usr/lib/python3.10/sre_parse.py", line 955, in parse
p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
File "/usr/lib/python3.10/sre_parse.py", line 444, in _parse_sub
itemsappend(_parse(source, state, verbose, nested + 1,
File "/usr/lib/python3.10/sre_parse.py", line 550, in _parse
raise source.error("unterminated character set",
error: unterminated character set
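The last traceback can be reproduced in isolation. The sketch below assumes the failing iteration is the one where the trimmed punctuation string ends with a backslash, so the closing `]` is read as the escaped literal `\]` and nothing closes the set. The second part shows the opposite effect, which may be what happens after shuffling: an unescaped `]` in the middle closes the set early, and the rest of the string is matched as a literal sequence. The variable names are illustrative only:

```python
import re
import string

punc = string.punctuation.replace('.', '')

# Keep everything up to and including the backslash, as in the failing iteration.
trimmed = punc[:punc.index('\\') + 1]
pattern = '[' + trimmed + ']'
print(pattern)  # [!"#$%&'()*+,-/:;<=>?@[\]

try:
    re.compile(pattern)
    message = None
except re.error as exc:
    # The trailing '\]' is an escaped bracket, so the set is never closed.
    message = str(exc)
print(message)

# An unescaped ']' closes the set early: this pattern means
# "one '~', then the literal '@,', then one '$'", not "any of these characters".
early = r'[~]@,[$]'
unchanged = re.sub(early, '_', 'a~b$')
collapsed = re.sub(early, '_', '~@,$')
print(unchanged)  # a~b$
print(collapsed)  # _
```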
In addition to the why, how could I achieve that goal properly?
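One possible approach, sketched under the assumption that the goal is "replace every ASCII punctuation character except a chosen few": pass the characters through `re.escape` before putting them inside `[...]`, so that `-`, `]`, `\` and `^` are taken literally. The helper name `replace_special` and its `keep` and `repl` parameters are hypothetical, and the input below is the already transliterated string, so `unidecode` is not needed here:

```python
import re
import string


def replace_special(text, keep='.', repl='_'):
    """Replace ASCII punctuation in text with repl, except the characters in keep."""
    chars = ''.join(c for c in string.punctuation if c not in keep)
    # re.escape neutralises '-', ']', '\\' and '^', so the set is purely literal.
    pattern = re.compile('[' + re.escape(chars) + ']')
    return pattern.sub(repl, text)


test_string = r"\\some\random.path${}[]~(ee%&)c\file.txt"

# Backslash is punctuation too, so by default the path separators are replaced.
print(replace_special(test_string))
# __some_random.path_______ee___c_file.txt

# Keeping both the dot and the backslash preserves the path structure.
print(replace_special(test_string, keep='.\\'))
# \\some\random.path_______ee___c\file.txt
```

Note that in the original pattern the backslash was only preserved by accident, because `\]` was read as an escaped bracket; with an escaped set it has to be listed in `keep` explicitly.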
Tested with Python 3.9, 3.10 and 3.11.
Ref: https://docs.python.org/3/library/string.html
This looks useful (not tested yet; I'll come back later to edit and share my results): "Best way to strip punctuation from a string". However, it removes the special characters rather than replacing them, and it doesn't explain why my solution behaves so strangely.