How can I turn a list of words in a text file into a regex to filter them out?
I'm trying to filter some text for certain keywords that are stored in a text file. I was thinking of parsing the keyword file line by line, taking each word, joining them with a pipe "|", and then using that string inside re.sub.
Any better or more efficient ideas are welcome.
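For concreteness, here is a minimal sketch of the approach I had in mind, assuming keywords.txt has one keyword per line (the file names are just placeholders, and re.escape is added in case a keyword contains regex metacharacters):

import re

# Read one keyword per line and escape any regex metacharacters.
with open('keywords.txt') as f:
    keywords = [re.escape(line.strip()) for line in f if line.strip()]

# Join the keywords into a single alternation and remove whole-word matches.
pattern = re.compile(r'\b(?:' + '|'.join(keywords) + r')\b')
with open('textfile.txt') as f:
    for line in f:
        print pattern.sub('', line),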
Something like this without regexp?
import string

# Load keywords into a set for fast membership tests (no regex needed).
keyset = set(open('keywords.txt').read().splitlines())

for lineno, line in enumerate(open('textfile.txt')):
    # Collect the keywords that appear in this line as whole words,
    # ignoring surrounding punctuation.
    result = [kw
              for kw in keyset
              for w in line.split()
              if kw in w and w.strip(string.punctuation) == kw]
    if result:
        print "%5s (%s): %s" % (lineno, ', '.join(result), line),
Something like the following?
import re

with open('keywords.txt', 'r') as k:
    # Escape each keyword so regex metacharacters are treated literally.
    kwords = sorted((re.escape(w) for w in k.read().split()),
                    key=lambda x: (len(x), x))
# Optional leading whitespace plus the keyword as a whole word.
searchstring = r'\s?\b(' + '|'.join(kwords) + r')\b'
with open('textfile.txt', 'r') as t:
    text = t.read()
# re.subn returns (new_text, substitution_count); strip leading blanks afterwards.
newtext, count = re.subn(searchstring, '', text)
newtext = newtext.lstrip()
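As a quick sanity check of the assembled pattern on an in-memory string (the keyword list and the sentence below are made up for illustration):

import re
# Hypothetical keywords; re.escape keeps metacharacters literal.
pattern = r'\s?\b(' + '|'.join(map(re.escape, ['foo', 'bar'])) + r')\b'
print re.sub(pattern, '', 'foo went to the bar and back')
# prints " went to the and back"

The leading \s? in the pattern swallows one space before each removed keyword so the result does not end up with doubled spaces.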