How do I remove hex values in a python string with regular expressions?

2023-02-15 01:27 问答作者：

I have a cell array in matlab

columns = {'MagX', 'MagY', 'MagZ', ...
           'AccelerationX',  'AccelerationX',  'AccelerationX', ...
           'AngularRateX', 'AngularRateX', 'AngularRateX', ...
           'Temperature'}

I use these scripts which make use of matlab's hdf5write function to save the array in the hdf5 format.

I then read in the the hdf5 file into python using pytables. The cell array comes in as a numpy array of strings. I convert to a list and this is the output:

>>>columns
['MagX\x00\x00\x00\x08\x01\x008\xe6\x7f',
 'MagY\x00\x7f\x00\x00\x00\xee\x0b9\xe6\x7f',
 'MagZ\x00\x00\x00\x00\x001',
 'AccelerationX',
 'AccelerationY',
 'AccelerationZ',
 'AngularRateX',
 'AngularRateY',
 'AngularRateZ',
 'Temperature']

These hex values pop into the strings from somewhere and I'd like to remove them. They don't always appear on the first three items of the list and I need a nice way to deal with them or to find out why they are there in the first place.

>>>print columns[0]
Mag8�
>>>columns[0]
'MagX\x00\x00\x00开发者_JS百科\x08\x01\x008\xe6\x7f'
>>>repr(columns[0])
"'MagX\\x00\\x00\\x00\\x08\\x01\\x008\\xe6\\x7f'"
>>>print repr(columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'

I've tried using a regular expression to remove the hex values but have little luck.

>>>re.sub('(\w*)\\\\x.*', '\1', columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>re.sub('(\w*)\\\\x.*', r'\1', columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>re.sub(r'(\w*)\\x.*', '\1', columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>re.sub('([A-Za-z]*)\x00', r'\1', columns[0])
'MagX\x08\x018\xe6\x7f'
>>>re.sub('(\w*?)', '\1', columns[0])
'\x01M\x01a\x01g\x01X\x01\x00\x01\x00\x01\x00\x01\x08\x01\x01\x01\x00\x018\x01\xe6\x01\x7f\x01'

Any suggestions on how to deal with this?

You can remove all non-word characters in the following way:

>>> re.sub(r'[^\w]', '', 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f')
'MagX8'

The regex [^\w] will match any character that is not a letter, digit, or underscore. By providing that regex in re.sub with an empty string as a replacement you will delete all other characters in the string.

Since there may be other characters you want to keep, a better solution might be to specify a larger range of characters that you want to keep that excludes control characters. For example:

>>> re.sub(r'[^\x20-\x7e]', '', 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f')
'MagX8'

Or you could replace [^\x20-\x7e] with the equivalent [^ -~], depending on which seems more clear to you.

To exclude all characters after this first control character just add a .*, like this:

>>> re.sub(r'[^ -~].*', '', 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f')
'MagX'

They're not actually in the string: you have unescaped control characters, which Python displays using the hexadecimal notation - that's why you see a unusual symbol when you print the value.

You should simply be able to remove the extra levels of quoting in your regular expression but you might also simply rely on something like the regexp module's generic whitespace class, which will match whitespace characters other than tabs and spaces:

>>> import re
>>> re.sub(r'\s', '?', "foo\x00bar")
'foo\x00bar'
>>> print re.sub(r'\s', '?', "foo\x00bar")
foobar

I use this one a bit to replace all input whitespace runs, including non-breaking space characters, with a single space:

>>> re.sub(r'[\xa0\s]+', ' ', input_str)

You can also do this without importing re. E.g. if you're content to keep only ascii characters:

good_string = ''.join(c if ord(c) < 129 else '?' for c in bad_string)

继续阅读：hdf5 pytables python string

How do I remove hex values in a python string with regular expressions?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？