Removing strings in brackets for unicode lines - python
I've got some problems with my regex for removing the strings bounded by brackets.
Here's my code:
import sys, re
import codecs
reload(sys)
sys.setdefaultencoding('utf-8')
reader = codecs.open("input",'r','utf-8')
p = re.compile('s/[\[\(].+?[\]\)]//g', re.DOTALL)
# i've also tried several regex but it didn't work
# p = re.compile('\{\{*?.*?\}\}', re.DOTALL)
# p = re.compile('\{\{*.*?\}\}', re.DOTALL)
for row in reader:
    if ("(" in row) and (")" not in row):
        continue
    if row.count("(") != row.count(")"):
        continue
    else:
        row2 = p.sub('', row)
        print row2
The input text file looks something like this:
가시 돋친(신랄한)평 spinosity
가장 완전한 (같은 종류의 것 중에서) unabridged
(알코올이)표준강도(50%) 이하의 underproof
(암초 awash
치명적인(fatal) capital
열을) 전도하다 transmit
The required output should look like this:
가시 돋친평 spinosity
가장 완전한 unabridged
표준강도 이하의 underproof
치명적인 capital
Would this work for you?
# -*- coding: utf-8 -*-
import sys, re
import codecs
#reload(sys)
#sys.setdefaultencoding('utf-8')
#preparing the examples to work on
writer = codecs.open("input.txt",'w','utf-8')
examples = [u'가시 돋친(신랄한)평 spinosity',
            u'가장 완전한 (같은 종류의 것 중에서)',
            u'알코올이)표준강도(50%) 이하의 underproof',
            u'(암초 awash',
            u'치명적인(fatal) capital']
for exampl in examples:
    writer.write(exampl+"\n")
writer.close()
reader = codecs.open("input.txt",'r','utf-8')
#the order of the patterns matters:
#if the stray brackets are removed first, the first pattern won't find anything
#.+? is non-greedy, so each (...) pair on a line is removed separately
patterns_to_remove = [r"\(.+?\)",r"[\(\)]"]
#a single pattern would work just as well; the loop is just a bit clearer
#pat = r"(\(.+?\))|([\(\)])"
#for row in reader:
#    row = re.sub(pat,'',row)#,re.U)
#    print row
reader.seek(0)
for row in reader:
    for pat in patterns_to_remove:
        row = re.sub(pat,'',row)#,re.U)
    print row
reader.close()
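If the lines with unbalanced brackets should be dropped entirely, as in the required output from the question, something along these lines might be closer. This is only a sketch under a few assumptions: it targets Python 3 (io.open also exists in Python 2.6+), it only handles round parentheses that are not nested, and the names PAREN_SPAN, strip_bracketed and clean_file are made up for illustration. Note that the pattern is a plain regex; the sed-style 's/.../g' wrapper from the question is not something re.compile understands.

```python
# -*- coding: utf-8 -*-
import io
import re

# One "(...)" span with no parentheses inside it. [^()]* cannot run past
# a closing parenthesis, so two pairs on one line are matched separately.
PAREN_SPAN = re.compile(r"\([^()]*\)")


def strip_bracketed(line):
    """Return the line without its "(...)" spans, or None when the line
    has a different number of "(" and ")" (same check as in the question)."""
    if line.count("(") != line.count(")"):
        return None
    cleaned = PAREN_SPAN.sub("", line)
    # Collapse the double spaces left behind and drop the trailing newline.
    return " ".join(cleaned.split())


def clean_file(path):
    """Yield the cleaned lines of a UTF-8 text file, skipping unbalanced ones."""
    with io.open(path, encoding="utf-8") as reader:
        for row in reader:
            cleaned = strip_bracketed(row)
            if cleaned is not None:
                yield cleaned
```

For example, strip_bracketed(u'치명적인(fatal) capital') gives u'치명적인 capital', while strip_bracketed(u'(암초 awash') gives None, so that line is skipped.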