Special character problem in Python regular expressions
I apply regular expressions to an XML file to find and replace values. Normally it works. (I can hear the voices saying "use an XML parser"; for now I cannot.) But if a value contains a special character, it ruins everything.
Suppose I have an XML file like the one below:
<fieldset>
<idle1>
<value>something\\n</value>
</idle1>
<idle2>
<value>blabla</value>
</idle2>
</fieldset>
If I try to replace the value in the "<idle2><value>" node, the value of the "<idle1><value>" node becomes "something\n". And when the result is written to the file, the XML becomes:
<fieldset>
<idle1>
<value>something
</value>
</idle1>
<idle2>
<value>blabla</value>
</idle2>
</fieldset>
In both the search and the replacement I use the "r" string literal prefix, but it does not seem to work. I found a workaround: for every search and replace, I replace each "\n" with "\\n" and then write the result to the file. But that is not an efficient approach.
Is there something I am missing? I just want to write "\\n" to the file. Is that too much to ask?
Edit: here are my regexes.
For search:
self.searchPattern=(<fieldset>)(.*?)(<idle2>)(.*?)(<value>)(.*?)(</value>)(.*?)(</idle2>)(.*?)(</fieldset>)
For replace:
self.replacePattern=`\g<1>\g<2>\g<3>\g<4><value>denemeasdasd\\\\n</value>\g<8>\g<9>\g<10>\g<11>`
This is the Python code for the search:
self.pattern = re.compile(r''''''+self.searchPattern+'''''', flags = re.S | re.U)
And this is for the replacement:
outtext = self.pattern.sub(r''''''+self.replacePattern+'''''',r''''''+self.match.group(0)+'''''')
I don't understand your explanations.
Personally, I wrote this:
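import re

# Sample input, copied from the repr shown in the result below.
# The value in <idle1> ends with a literal backslash followed by 'n'.
ch = ('<fieldset>\n <idle1>\n <value>something\\n</value>\n </idle1>\n'
      ' <idle2>\n <value>blabla</value>\n </idle2>\n</fieldset>')

# Group 1: the indented <idle2> opening tag up to and including <value>.
# The lookahead requires </value> followed by </idle2> at the same indentation.
RE = ('(^([ \t]+)<(idle2)>(?:\n|\r\n?)[ \t]+<value>)'
      '(.*?)'
      '(?=</value>(?:\n|\r\n?)\\2</\\3>)')

print(repr(ch), '\n')
print(ch)
print('\n-------------------------------------------------')
print(repr(re.sub(RE, '\\1AAA', ch, flags=re.M)), '\n')
print(re.sub(RE, '\\1-----HHHHHHXXXXXXX-------', ch, flags=re.M))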
Result:
'<fieldset>\n <idle1>\n <value>something\\n</value>\n </idle1>\n <idle2>\n <value>blabla</value>\n </idle2>\n</fieldset>'
<fieldset>
<idle1>
<value>something\n</value>
</idle1>
<idle2>
<value>blabla</value>
</idle2>
</fieldset>
-------------------------------------------------
'<fieldset>\n <idle1>\n <value>something\\n</value>\n </idle1>\n <idle2>\n <value>AAA</value>\n </idle2>\n</fieldset>'
<fieldset>
<idle1>
<value>something\n</value>
</idle1>
<idle2>
<value>-----HHHHHHXXXXXXX-------</value>
</idle2>
</fieldset>
Is this what you want?
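One thing that may explain the lost backslash in the question: when the replacement given to re.sub is a string, Python processes backslash escapes in it, so two backslashes in the template come out as one. The r prefix only affects literals typed in the source code, not strings put together at runtime. Below is a minimal sketch, not the asker's actual code; the variable new_value and the simplified pattern are assumptions made for illustration. Passing a function as the replacement inserts its return value verbatim.

```python
import re

xml = '<idle2>\n<value>blabla</value>\n</idle2>'

# Hypothetical new value: "deneme", two backslash characters, then "n".
new_value = 'deneme\\\\n'

# Simplified pattern (an assumption, not the pattern from the question).
pattern = re.compile(r'(<idle2>.*?<value>)(.*?)(</value>)', flags=re.S)

# String template: re.sub processes escapes, so the two backslashes
# in new_value collapse into one in the output.
as_template = pattern.sub(r'\g<1>' + new_value + r'\g<3>', xml)

# Function replacement: the returned string is inserted as-is,
# so both backslashes survive.
as_function = pattern.sub(
    lambda m: m.group(1) + new_value + m.group(3), xml)

print(repr(as_template))
print(repr(as_function))
```

If a string template has to be kept, doubling every backslash in the inserted text first (for example with new_value.replace('\\', '\\\\')) should have the same effect.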
When dealing with unpredictable data sources, I find it best to whitelist valid characters. So, alongside whatever other regular-expression replacements you have going on, remove anything that is not whitelisted, e.g. keep only a-z 0-9 : , . -
Look at your data and determine the appropriate whitelist for your task.
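The whitelist idea can be sketched as a negated character class. This is only an illustration: the function name keep_whitelisted and the exact character set are assumptions based on the example above, and the set would need extending if characters such as the backslash have to survive.

```python
import re

# Hypothetical whitelist: lowercase letters, digits, and : , . -
# The negated class matches every character that is NOT whitelisted.
NOT_ALLOWED = re.compile(r'[^a-z0-9:,.\-]')

def keep_whitelisted(text):
    """Remove every character that is not in the whitelist."""
    return NOT_ALLOWED.sub('', text)

# The backslash is not whitelisted here, so it is dropped.
print(keep_whitelisted('something\\n'))
print(keep_whitelisted('a-b:c,d.9'))
```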