How to replace Unicode values using re in Python?
How to replace unicode values using re in Python ? I'm looking for something like this:
line.replace('Ã','')
line.replace('¢','')
line.replace('â','')
Or is there any way which will replace all the non-ASCII characters from a file. Actually I converted PDF fil开发者_运维百科e to ASCII, where I'm getting some non-ASCII characters [e.g. bullets in PDF]
Please help me.
Edit after feedback in comments.
Another solution would be to check the numeric value of each character and see if they are under 128, since ascii goes from 0 - 127. Like so:
# coding=utf-8
def removeUnicode():
text = "hejsanäöåbadasd wodqpwdk"
asciiText = ""
for char in text:
if(ord(char) < 128):
asciiText = asciiText + char
return asciiText
import timeit
start = timeit.Timer("removeUnicode()", "from __main__ import removeUnicode")
print "Time taken: " + str(start.timeit())
Here's an altered version of jd
's answer with benchmarks:
# coding=utf-8
def removeUnicode():
text = u"hejsanäöåbadasd wodqpwdk"
if(isinstance(text, str)):
return text.decode('utf-8').encode("ascii", "ignore")
else:
return text.encode("ascii", "ignore")
import timeit
start = timeit.Timer("removeUnicode()", "from __main__ import removeUnicode")
print "Time taken: " + str(start.timeit())
Output first solution using a str
string as input:
computer:~ Ancide$ python test1.py
Time taken: 5.88719677925
Output first solution using a unicode
string as input:
computer:~ Ancide$ python test1.py
Time taken: 7.21077990532
Output second solution using a str
string as input:
computer:~ Ancide$ python test1.py
Time taken: 2.67580914497
Output second solution using a unicode
string as input:
computer:~ Ancide$ python test1.py
Time taken: 1.740680933
Conclusion
Encoding is the faster solution and encoding the string is less code; Thus the better solution.
Why you want to replace if you have
title.decode('latin-1').encode('utf-8')
or if you want to ignore
unicode(title, errors='replace')
You have to encode your Unicode string to ASCII, ignoring any error that occurs. Here's how:
>>> u'uéa&à'.encode('ascii', 'ignore')
'ua&'
Try to pass re.UNICODE flag to params. Like this:
re.compile("pattern", re.UNICODE)
For more info see manual page.
精彩评论