Replacing all special characters but the dot in a string also replaces the dot

I am trying to replace every special character except the dot with an underscore in a given string (a badly formatted file path), but I cannot get it to work.

Here is the code:

import string, re
from unidecode import unidecode

punc = string.punctuation
punc = string.punctuation.replace(r'.','') # remove the dot from that string
pattern = re.compile(rf'[{punc}]')
# I also tried this as pattern; but it doesn't help:
# pattern = r'[' + punc + ']' 

test_string = r"\\some\random.path${}[]~(éè%&)ç\file.txt"
test_string = unidecode(test_string) # kick off accented letters

print(re.sub(pattern, '_', test_string))
>: \\some\random_path_______ee___c\file_txt

Since the dot is not in the pattern string, I cannot understand why it has been replaced. I don't want it to be replaced.
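A likely explanation, sketched here rather than stated as definitive: inside a character class an unescaped `-` between two characters defines a range. Once the dot is removed, `string.punctuation` contains the run `,-/`, which the regex engine reads as "every character from `,` to `/`", and the dot lies inside that range. The same reading explains why the backslashes in the path survive: `\]` is parsed as an escaped `]`, so the backslash itself is never part of the set. The variable names `covered` and `unescaped` below are only illustrative.

```python
import re
import string

punc = string.punctuation.replace('.', '')

# With the dot gone, "," "-" "/" sit next to each other in the string.
print(',-/' in punc)  # True

# Inside [...], ",-/" is a range covering code points 44 to 47.
covered = [chr(c) for c in range(ord(','), ord('/') + 1)]
print(covered)  # [',', '-', '.', '/']

# The unescaped class therefore still matches the dot ...
unescaped = re.compile(f'[{punc}]')
print(unescaped.search('.') is not None)   # True

# ... but not the backslash, because "\]" is read as an escaped "]".
print(unescaped.search('\\') is not None)  # False
```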

More strangely, if I shuffle the punctuation string:

from random import shuffle

punc = string.punctuation
punc = string.punctuation.replace(r'.','') # remove the dot

# shuffle punctuation:
punc = list(punc)
shuffle(punc)
punc = ''.join(punc)

pattern = re.compile(rf'[{punc}]')

it sometimes raises an error such as:

Traceback (most recent call last):

  File "/tmp/ipykernel_3429192/3014469097.py", line 1, in <cell line: 1>
    pattern = re.compile(rf'[' + punc +']')

  File "/usr/lib/python3.10/re.py", line 251, in compile
    return _compile(pattern, flags)

  File "/usr/lib/python3.10/re.py", line 303, in _compile
    p = sre_compile.compile(pattern, flags)

  File "/usr/lib/python3.10/sre_compile.py", line 788, in compile
    p = sre_parse.parse(p, flags)

  File "/usr/lib/python3.10/sre_parse.py", line 969, in parse
    raise source.error("unbalanced parenthesis")

error: unbalanced parenthesis

Or, after another shuffle that doesn't raise the above error, I get:

print(re.sub(pattern, '_', test_string))
>: \\some\random.path${}[]~(ee%&)c\file.txt

pattern 
>: re.compile(r'[|)_&{;=^\'-~]@,["><$:/!}*\#+(?%`]', re.UNICODE)

Here it doesn't seem to work at all: nothing is replaced.
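One plausible reading of the shuffled results: when an unescaped `]` lands early in the string, it closes the character class right there, and everything after it is parsed as ordinary regex syntax. The pattern then only matches a longer, very specific sequence, or fails to compile if a stray `)` follows. The shortened patterns below are hypothetical stand-ins for a shuffled `punc`, not the exact string shown above.

```python
import re

# Hypothetical shuffle in which "]" comes early: the class is just [|)_],
# followed by the literals "@" and "," and a second class [$%].
pattern = re.compile(r'[|)_]@,[$%]')

# Single punctuation characters no longer match on their own.
print(pattern.sub('_', 'a|b$c'))  # a|b$c

# Only the full four-character sequence matches.
print(pattern.sub('_', '|@,$'))   # _

# If a ")" ends up after the premature "]", compilation fails instead.
message = None
try:
    re.compile(r'[|_]@)$')
except re.error as exc:
    message = exc.msg
print(message)  # unbalanced parenthesis
```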

As mentioned in the first code block, I also tried skipping re.compile() and passing the string directly, with pattern = r'[' + punc + ']', but it doesn't help.

This may also be interesting:

for i in range(len(punc)):
    punc = punc[:-1]
    pattern = r'[' + punc + ']'
    print(f'{i}: pattern: {pattern} replaced_str: ',  re.sub(pattern, '_', test_string))
    
0: pattern: [!"#$%&'()*+,-/:;<=>?@[\]^_`{|}] replaced_str:  \\some\random_path_____~_ee___c\file_txt
1: pattern: [!"#$%&'()*+,-/:;<=>?@[\]^_`{|] replaced_str:  \\some\random_path__}__~_ee___c\file_txt
2: pattern: [!"#$%&'()*+,-/:;<=>?@[\]^_`{] replaced_str:  \\some\random_path__}__~_ee___c\file_txt
3: pattern: [!"#$%&'()*+,-/:;<=>?@[\]^_`] replaced_str:  \\some\random_path_{}__~_ee___c\file_txt
4: pattern: [!"#$%&'()*+,-/:;<=>?@[\]^_] replaced_str:  \\some\random_path_{}__~_ee___c\file_txt
5: pattern: [!"#$%&'()*+,-/:;<=>?@[\]^] replaced_str:  \\some\random_path_{}__~_ee___c\file_txt
6: pattern: [!"#$%&'()*+,-/:;<=>?@[\]] replaced_str:  \\some\random_path_{}__~_ee___c\file_txt
Traceback (most recent call last):

  File "/tmp/ipykernel_3401192/4037584865.py", line 4, in <cell line: 1>
    print(f'{i}: pattern: {pattern} replaced_str: ',  re.sub(pattern, '_', test_string))

  File "/usr/lib/python3.10/re.py", line 209, in sub
    return _compile(pattern, flags).sub(repl, string, count)

  File "/usr/lib/python3.10/re.py", line 303, in _compile
    p = sre_compile.compile(pattern, flags)

  File "/usr/lib/python3.10/sre_compile.py", line 788, in compile
    p = sre_parse.parse(p, flags)

  File "/usr/lib/python3.10/sre_parse.py", line 955, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)

  File "/usr/lib/python3.10/sre_parse.py", line 444, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,

  File "/usr/lib/python3.10/sre_parse.py", line 550, in _parse
    raise source.error("unterminated character set",

error: unterminated character set
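The final failure in the loop can probably be explained the same way: after trimming, the pattern ends in `\]`, which the parser treats as an escaped literal `]`, so the class opened by the first `[` is never closed. A small sketch reproducing this, with `truncated` and `fixed` as illustrative names only:

```python
import re

# The pattern from the failing iteration: it ends with "\]".
truncated = r'''[!"#$%&'()*+,-/:;<=>?@[\]'''

message = None
try:
    re.compile(truncated)
except re.error as exc:
    message = exc.msg
print(message)  # unterminated character set

# Escaping the characters first and adding the brackets afterwards avoids this.
fixed = re.compile('[' + re.escape(truncated[1:]) + ']')
print(fixed.search(']') is not None)  # True
```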

In addition to the why, how could I achieve that goal properly?
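One possible approach, offered as a sketch rather than the definitive fix: pass the punctuation through `re.escape()` before placing it inside the brackets, so that `-`, `]`, `\` and `^` are treated literally. The helper name `replace_punctuation` and its `keep` and `repl` parameters are made up for illustration, and the test string is assumed to be already unaccented (the `unidecode` step is left out to keep the example self-contained). Note that once the class is escaped correctly, the backslash counts as punctuation too, so it has to be listed in `keep` if path separators should survive.

```python
import re
import string

def replace_punctuation(text, keep='.', repl='_'):
    """Replace every ASCII punctuation character except those in `keep`."""
    chars = ''.join(c for c in string.punctuation if c not in keep)
    if not chars:
        return text  # nothing left to replace; "[]" would not compile
    # re.escape makes "-", "]", "\" and "^" literal inside the class.
    pattern = re.compile(f'[{re.escape(chars)}]')
    return pattern.sub(repl, text)

test_string = r"\\some\random.path${}[]~(ee%&)c\file.txt"

# Dots are kept; backslashes are punctuation and get replaced as well.
print(replace_punctuation(test_string))
# __some_random.path_______ee___c_file.txt

# Keep the backslashes too by listing them in `keep`.
print(replace_punctuation(test_string, keep='.\\'))
# \\some\random.path_______ee___c\file.txt
```

Because the characters are escaped, the result no longer depends on the order of the punctuation string, so shuffling it should neither change the output nor raise an error.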

Tested with Python 3.9, 3.10 and 3.11.

Ref: https://docs.python.org/3/library/string.html

This looks useful (not tested yet; I'll come back later to edit and share my results): "Best way to strip punctuation from a string". However, it removes the special characters rather than replacing them, and it doesn't explain why my solution behaves in such a weird way.
