How do I remove a \ from a string in python

I'm having trouble getting a replace() to work

I've tried my_string.replace('\\', '') and re.sub('\\', '', my_string), but neither one works.

I thought \\ was the escape code for a backslash. Am I wrong?

The string in question looks like

'<2011315123.04C6DACE618A7C2763810@\x82\xb1\x82\xea\x82\xa9\x82\xe7\x8c\xa9\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4>'

or print my_string <2011315123.04C6DACE618A7C2763810@???ꂩ?猩???邾?낤>

Yes, it's supposed to look like garbage, but I'd rather get '<2011315123.04C6DACE618A7C2763810@82b182ea82a982e78ca982a682e982be82eb82a4>'


You don't have any backslashes in your string. What you don't have, you can't remove.

Consider what you are showing as '\x82' ... this is a one-byte string.

>>> s = '\x82'
>>> len(s)
1
>>> ord(s)
130
>>> hex(ord(s))
'0x82'
>>> print s
é # my sys.stdout.encoding is 'cp850'
>>> print repr(s)
'\x82'
>>>

What you'd "rather get" ('x82') is meaningless.
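
The same point can be checked directly. A minimal sketch, assuming Python 3 (where this kind of data is a bytes object, unlike the Python 2 session above): the backslash exists only in the text that repr() produces, not in the data.

```python
# Assumes Python 3; the transcript above is Python 2, where a plain str holds bytes.
s = b'\x82'

print(len(s))            # 1: the value is a single byte
print(b'\\' in s)        # False: no backslash byte (0x5c) in the data
print('\\' in repr(s))   # True: the backslash only appears in the printable form
```

So replace() has nothing to act on until the data has first been turned into its repr() text.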

Update: The "non-ascii" part of the string (bounded by @ and >) is actually Japanese text, written mostly in Hiragana and encoded using shift_jis. Transcript of an IDLE session:

>>> y = '\x82\xb1\x82\xea\x82\xa9\x82\xe7\x8c\xa9\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4'
>>> print y.decode('shift_jis')
これから見えるだろう

Google Translate produces "Can not you see the future" as the English translation.
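
For readers on Python 3, a rough equivalent of the session above, assuming the data arrives as a bytes object. It prints lengths and a comparison rather than the text itself, so it does not depend on the console encoding.

```python
# Assumes Python 3, where raw encoded data is bytes and decoded text is str.
y = b'\x82\xb1\x82\xea\x82\xa9\x82\xe7\x8c\xa9\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4'

text = y.decode('shift_jis')

print(len(y), len(text))                 # 20 bytes decode to 10 characters
print(text == 'これから見えるだろう')      # True
```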

In a comment on another answer, you say:

I just need ascii

and

What I'm doing with it is seeing how far apart the two strings are using nltk.edit_distance(), so this will give me a multiple of the true distance. Which is good enough for me.

Why do you think you need ASCII? Edit distance is defined quite independently of any alphabet.

For a start, doing nonsensical transformations of your strings won't give you a consistent or predictable multiple of the true distance. Secondly, out of the following:

x
repr(x)
repr(x).replace('\\', '')
repr(x).replace('\\x', '') # if \ is noise, so is x
x.decode(whatever_the_encoding_is)

why do you choose the third?
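
To make the comparison concrete, here is a sketch that counts how many symbols each candidate produces for the Japanese part of the string. It assumes Python 3, so x is a bytes object, repr(x) carries a b'...' wrapper that is stripped here, and bytes.hex() stands in for the hex-digits-only form; the dictionary labels are only illustrative.

```python
# Assumes Python 3; x is the Shift JIS encoded part of the string from the question.
x = b'\x82\xb1\x82\xea\x82\xa9\x82\xe7\x8c\xa9\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4'

escaped = repr(x)[2:-1]  # drop the leading b' and the trailing ' added by Python 3

candidates = {
    'x': x,
    'repr(x)': escaped,
    'backslashes removed': escaped.replace('\\', ''),
    'backslash-x removed': escaped.replace('\\x', ''),
    'decoded': x.decode('shift_jis'),
}

for name, value in candidates.items():
    print(name, len(value))
# x 20
# repr(x) 80
# backslashes removed 60
# backslash-x removed 40
# decoded 10
```

The decoded text has 10 characters, so the five forms use 2, 8, 6, 4 and 1 symbols per character respectively.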

Update 2 in response to comments:

(1) You still haven't said why you think you need "ascii". nltk.edit_distance doesn't require "ascii" -- the args are said to be "strings" (whatever that means) but the code will work with any 2 sequences of objects for which != works. In other words, why not just use the first of the above 5 options?

(2) Accepting up to 100% inflation of the edit distance is somewhat astonishing. Note that your currently chosen method will use 4 symbols (hex digits) per Japanese character. repr(x) uses 8 symbols per character. x (the first option) uses 2.

(3) You can mitigate the inflation effect by normalising your edit distance. Instead of comparing distance(s1, s2) with a number_of_symbols threshold, compare distance(s1, s2) / float(max(len(s1), len(s2))) with a fraction threshold. Note normalisation is usually used anyway ... the rationale being that the dissimilarity between 20-symbol strings with an edit distance of 4 is about the same as that between 10-symbol strings with an edit distance of 2, not twice as much.

(4) nltk.edit_distance is the most shockingly inefficient pure-Python implementation of edit_distance that I've ever seen. This implementation by Magnus Lie Hetland is much better, but still capable of improvement.
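
Points (1) and (3) can be sketched together. The following is a plain Levenshtein implementation written for illustration (it is neither nltk's code nor Hetland's), plus the normalisation from point (3). It assumes Python 3, and the second string s2 is a made-up variant used only for the demonstration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, keeping only one row in memory."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,           # deletion
                               current[j - 1] + 1,        # insertion
                               previous[j - 1] + cost))   # substitution
        previous = current
    return previous[-1]


def normalised_distance(a, b):
    """Edit distance divided by the longer length, giving a fraction in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return edit_distance(a, b) / float(longest)


s1 = 'これから見えるだろう'      # the decoded text from the question
s2 = 'これから見えるでしょう'    # hypothetical second string, for illustration only

# Works on decoded text and on raw bytes alike; only != between items is needed.
print(edit_distance(s1, s2))
print(edit_distance(s1.encode('shift_jis'), s2.encode('shift_jis')))
print(normalised_distance(s1, s2))
```

Comparing the decoded strings counts edits in characters, which is the "true" distance; the byte-level figure is already larger, and hex or repr() forms can only inflate it further.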


This works, I think, if you really just want to strip the "\"

>>> a = '<2011315123.04C6DACE618A7C2763810@\x82\xb1\x82\xea\x82\xa9\x82\xe7\x8c\xa9\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4>'
>>> repr(a).replace("\\","")[1:-1]
'<2011315123.04C6DACE618A7C2763810@x82xb1x82xeax82xa9x82xe7x8cxa9x82xa6x82xe9x82xbex82xebx82xa4>'
>>> 

But like the answer above, what you get is pretty much meaningless.
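
If the exact output asked for in the question is still wanted, it can be built from the byte values instead of from repr() text. A minimal sketch, assuming Python 3 and a bytes input; hex_escape_non_ascii is a name invented here, not a library function.

```python
def hex_escape_non_ascii(data):
    """Keep ASCII bytes as characters; write every other byte as two hex digits."""
    return ''.join(chr(b) if b < 0x80 else format(b, '02x') for b in data)


a = (b'<2011315123.04C6DACE618A7C2763810@'
     b'\x82\xb1\x82\xea\x82\xa9\x82\xe7\x8c\xa9\x82\xa6\x82\xe9\x82\xbe\x82\xeb\x82\xa4>')

print(hex_escape_non_ascii(a))
# <2011315123.04C6DACE618A7C2763810@82b182ea82a982e78ca982a682e982be82eb82a4>
```

The result cannot be reversed reliably, because the generated hex digits are indistinguishable from digits and letters that were already in the string.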
