How to completely sanitize a string of illegal characters in Python?

2022-12-13 23:12 Q&A author:

My program has a feature where the user can upload a CSV file, which the program reads and uses as input. One user is complaining that his input throws an error. The error is caused by an illegal character that is encoded incorrectly. The character is below:

�

Sometimes it appears as a diamond with a "?" in the middle, sometimes it appears as a double diamond with "?" in the middle, sometimes it appears as "\xa0", and sometimes it appears as "\xa0\xa0".

In my program, if I do:

print str_with_weird_char

The string will show up in my terminal with the diamond "?" in place of the weird character. If I copy+paste that string into ipython, it will exit with this message:

In [1]: g="blah��blah"
WARNING: 
********
You or a %run:ed script called sys.stdin.close() or sys.stdout.close()!
Exiting IPython!

Notice how the diamond "?" is doubled now. For some reason, copy+paste doubles it...

On the Django traceback page, it looks like this:

UnicodeDecodeError at /chris/import.html
('ascii', 'blah \xa0 BLAH', 14, 15, 'ordinal not in range(128)')

The thing that trips me up is that I can't do anything with this string without it throwing an exception. I tried unicode(), str(), .encode(), and .encode("utf-8"); no matter what, it throws an error.

What can I do to get this thing to be a working string?

You can pass "ignore" as the error handler to .encode/.decode to skip invalid characters, for example "ILLEGA\xa0L".decode("utf8", "ignore"):

>>> "ILLEGA\xa0L".decode("utf8")
...
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 6: unexpected code byte

>>> "ILLEGA\xa0L".decode("utf8","ignore")
u'ILLEGAL'
>>>
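
For readers on Python 3, where byte strings and text are separate types, the same idea can be sketched roughly as follows. This is an illustration rather than part of the original answer: the helper name decode_lenient and the fallback to cp1252 are assumptions (0xa0 is a non-breaking space in Latin-1 and cp1252, which is one plausible origin for a stray \xa0 in a CSV export).

```python
def decode_lenient(raw, encodings=("utf-8", "cp1252")):
    """Decode bytes, trying each encoding strictly, then dropping bad bytes.

    Hypothetical helper: the name and the encoding list are assumptions.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: drop anything that is not valid UTF-8.
    return raw.decode("utf-8", "ignore")


raw = b"ILLEGA\xa0L"
print(repr(raw.decode("utf-8", "ignore")))   # bad byte is dropped
print(repr(raw.decode("utf-8", "replace")))  # bad byte becomes U+FFFD
print(repr(decode_lenient(raw)))             # cp1252 reads 0xa0 as U+00A0
```

Whether to drop, replace, or reinterpret the byte depends on what the data is for; "ignore" silently loses information, so "replace" or a fallback encoding may be preferable when the text is shown back to the user.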

Declare the source encoding on the first or second line of your script. It has to be one of those two lines, so with a shebang line present it goes on the second. For example:

#!/usr/bin/python
# coding=utf-8

This might be enough to solve your problem all by itself. If not, see str.encode('utf-8') and str.decode('utf-8').
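
Note that the coding declaration only governs how literals in the source file itself are read; bytes arriving from an uploaded file still have to be decoded explicitly. A minimal Python 3 sketch of that step, assuming the upload is available as a bytes object (the function name read_csv_rows and the sample data are hypothetical):

```python
import csv
import io


def read_csv_rows(raw_bytes, encoding="utf-8"):
    """Parse CSV bytes, replacing undecodable bytes with U+FFFD.

    Hypothetical helper; the name and the default encoding are assumptions.
    """
    text = raw_bytes.decode(encoding, errors="replace")
    # newline="" lets the csv module handle line endings itself.
    return list(csv.reader(io.StringIO(text, newline="")))


sample = b"name,city\nblah \xa0 BLAH,Paris\n"
for row in read_csv_rows(sample):
    print(row)
```

Decoding once at the boundary means the rest of the program only ever sees text, so later calls cannot trip over a raw 0xa0 byte.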

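You can also use:

python3 -c "import sys, urllib.parse; print(urllib.parse.quote_plus(sys.stdin.buffer.read()))"

adapted from https://wiki.python.org/moin/Powerful%20Python%20One-Liners

** P.S. The wiki gives the Python 2 form, print urllib.quote_plus(sys.stdin.read()). The command above is the Python 3 equivalent: quote_plus lives in urllib.parse there, print is a function, and reading sys.stdin.buffer passes the raw bytes through, so an invalid byte is percent-encoded instead of raising a decode error.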

Another way to do it (at least in Python 2), if the value is already a unicode object, is to use unicodedata.normalize:

unicodedata.normalize('NFKD', text).encode('utf-8', 'ignore')

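If the value is already a unicode object, decode('utf-8', 'ignore') will just raise an exception, because Python 2 first tries to encode it to ASCII implicitly.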
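
A minimal Python 3 sketch of the normalization approach, assuming the input is already text rather than bytes (the function name to_ascii is hypothetical). NFKD applies compatibility decompositions, so a non-breaking space (U+00A0) becomes an ordinary space; encoding to ASCII with "ignore" then drops whatever still has no ASCII form. The sketch targets ASCII rather than UTF-8 as in the snippet above, on the assumption that the goal is to strip characters: UTF-8 can encode nearly every character, so "ignore" would remove almost nothing there.

```python
import unicodedata


def to_ascii(text):
    """Fold text to plain ASCII, dropping characters with no ASCII form."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(repr(to_ascii("blah\u00a0blah")))  # NBSP folds to a plain space
print(repr(to_ascii("caf\u00e9")))       # accent is dropped, base letter kept
print(repr(to_ascii("bad\ufffdchar")))   # U+FFFD has no ASCII form, so it is removed
```

This is lossy by design: it suits identifiers or keys that must be ASCII, but not text that should keep its accents or non-Latin characters.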
