How to remove extended ASCII using Python?
In trying to fix up a PML (Palm Markup Language) file, it appears as if my test file has non-ASCII characters which are causing MakeBook to complain. The solution would be to strip out all the non-ASCII chars in the PML.
So in attempting to fix this in python, I have
import unicodedata, fileinput
for line in fileinput.input():
print unicodedata.normalize('NFKD', line).encode('ascii','ignore')
However, this results in an error saying that line must be "unicode, not str". Here's a file fragment:
\B1a\B \tintense, disordered and often destructive rage†.†.†.\t
Not quite sure how to properly pass line in to be processed at this point.
Try:
print line.decode('iso-8859-1').encode('ascii', 'ignore')
That should be much closer to what you want.
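In case it helps, here is a minimal sketch of that suggestion dropped into the original fileinput loop (this assumes the PML file really is ISO-8859-1; substitute the correct codec if it isn't):

import fileinput

for line in fileinput.input():
    # Strip the trailing newline so print doesn't double-space the output,
    # decode the raw byte string, then drop anything that isn't ASCII.
    print line.rstrip('\n').decode('iso-8859-1').encode('ascii', 'ignore')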
You would like to treat line as ASCII-encoded data, so the answer is to decode it to text using the ascii codec:
line.decode('ascii')
This will raise errors for data that is not in fact ASCII-encoded. This is how to ignore those errors:
line.decode('ascii', 'ignore')
This gives you text, in the form of a unicode instance. If you would rather work with (ASCII-encoded) data rather than text, you may re-encode it to get back a str or bytes instance (depending on your version of Python):
line.decode('ascii', 'ignore').encode('ascii')
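A tiny self-contained illustration of that round trip (the byte string here is made up purely for demonstration):

# A str containing non-ASCII bytes (0x86), invented for this example.
raw = 'destructive rage\x86.\x86.\x86.'
text = raw.decode('ascii', 'ignore')   # unicode; the \x86 bytes are dropped
data = text.encode('ascii')            # back to a plain ASCII str
print repr(text)   # u'destructive rage...'
print repr(data)   # 'destructive rage...'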
To drop non-ASCII characters use:
line.decode(your_file_encoding).encode('ascii', 'ignore')
But you'd probably be better off using PML escape sequences for them:
import re

def escape_unicode(m):
    return '\\U%04x' % ord(m.group())

non_ascii = re.compile(u'[\x80-\uFFFF]', re.U)

line = u'\\B1a\\B \\tintense, disordered and often destructive rage\u2020.\u2020.\u2020.\\t'
print non_ascii.sub(escape_unicode, line)
This outputs:
\B1a\B \tintense, disordered and often destructive rage\U2020.\U2020.\U2020.\t
Dropping non-ASCII and control characters with a regular expression is easy too (this can be safely used after escaping):
regexp = re.compile('[^\x09\x0A\x0D\x20-\x7F]')
regexp.sub('', line)
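Roughly how the escaping and stripping steps might be combined when cleaning a whole file (the filename and the ISO-8859-1 source encoding below are only placeholders):

import re

def escape_unicode(m):
    return '\\U%04x' % ord(m.group())

non_ascii = re.compile(u'[\x80-\uFFFF]', re.U)
control = re.compile('[^\x09\x0A\x0D\x20-\x7F]')

for line in open('book.pml'):                    # placeholder filename
    text = line.decode('iso-8859-1')             # assumed source encoding
    text = non_ascii.sub(escape_unicode, text)   # escape non-ASCII as \Uxxxx
    print control.sub('', text).rstrip('\n')     # strip remaining control chars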
When reading from a file in Python you're getting byte strings, aka "str" in Python 2.x and earlier. You need to convert these to the "unicode" type using the decode method, e.g.:
line = line.decode('latin1')
Replace 'latin1' with the correct encoding.
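For example, the original script could be adjusted like this, keeping the NFKD normalization step ('latin1' is only a guess at the real encoding):

import unicodedata, fileinput

for line in fileinput.input():
    text = line.rstrip('\n').decode('latin1')   # str -> unicode first
    # normalize() now accepts the unicode value without complaint
    print unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')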