Python, loop through lines in a file; if line equals line in another file, return original line
Text file 1 has the following format:
'WORD': 1
'MULTIPLE WORDS': 1
'WORD': 2
etc.
I.e., a word separated by a colon followed by a number.
Text file 2 has the following format:
'WORD'
'WORD'
etc.
I need to extract single words (i.e., only WORD not MULTIPLE WORDS) from File 1 and, if they match a word in File 2, return the word from File 1 along with its value.
I have some poorly functioning code:
def GetCounts(file1, file2):
target_conte开发者_JAVA百科nts = open(file1).readlines() #file 1 as list--> 'WORD': n
match_me_contents = open(file2).readlines() #file 2 as list -> 'WORD'
ls_stripped = [x.strip('\n') for x in match_me_contents] #get rid of newlines
match_me_as_regex= re.compile("|".join(ls_stripped))
for line in target_contents:
first_column = line.split(':')[0] #get the first item in line.split
number = line.split(':')[1] #get the number associated with the word
if len(first_column.split()) == 1: #get single word, no multiple words
""" Does the word from target contents match the word
from match_me contents? If so, return the line from
target_contents"""
if re.findall(match_me_as_regex, first_column):
print first_column, number
#OUTPUT: WORD, n
WORD, n
etc.
Because of the use of regex, the output is shotty. The code will return 'asset, 2', for example, since re.findall() will match 'set' from match_me. I need to match the target_word with the entire word from match_me to block the bad output resulting from partial regex matches.
If file2
is not humongous, slurp them into a set:
file2=set(open("file2").read().split())
for line in open("file1"):
if line.split(":")[0].strip("'") in file2:
print line
I guess by "poorly functioning" you mean speed wise? Because I tested and it does appear to work.
You could make things more efficient by making a set
of the words in file2:
word_set = set(ls_stripped)
And then instead of findall
you'd see if it's in the set:
in_set = just_word in word_set
Also feels cleaner than a regex.
It seems like this may simply be a special case of grep. If file2 is essentially a list of patterns, and the output format is the same as file1, then you might be able to just do this:
grep -wf file2 file1
The -w
tells grep to match only whole words.
This is how I'd do this. I don't have a python interpreter on hand, so there may be a couple typos.
One of the main things you should remember when coming to Python (especially if coming from Perl) is that regular expressions are usually a bad idea: the string methods are powerful and very fast.
def GetCounts(file1, file2):
data = {}
for line in open(file1):
try:
word, n = line.rsplit(':', 1)
except ValueError: # not enough values
#some kind of input error, go to next line
continue
n = int(n.strip())
if word[0] == word[-1] == "'":
word = word[1:-1]
data[word] = n
for line in open(file2):
word = line.strip()
if word[0] == word[-1] == "'":
word = word[1:-1]
if word in data:
print word, data[word]
import re, methodcaller
re_target = re.compile(r"^'([a-z]+)': +(\d+)", re.M|re.I)
match_me_contents = open(file2).read().splitlines()
match_me_contents = set(map(methodcaller('strip', "'"), match_me_contents))
res = []
for match in re_target.finditer(open(file1).read()):
word, value = match.groups()
if word in match_me_contents:
res.append((word, value))
My two input files:
file1.txt
:
'WORD': 1
'MULTIPLE WORDS': 1
'OTHER': 2
file2.txt
:
'WORD'
'NONEXISTENT'
If file2.txt
is guaranteed not to have multiple words on a line, then there is no need to explicitly filter these from the first file. This will be done by the membership test:
# Build a set of what words we can return a count for.
with open('file2.txt', 'r') as f:
allowed_words = set(word.strip() for word in f)
# See which of them exist in the first file.
with open('file1.txt', 'r') as f:
for line in f:
word, count = line.strip().split(':')
# This assumes that strings with a space (multiple words) do not exist in
# the second file.
if word in allowed_words:
print word, count
And running this gives:
$ python extract.py
'WORD' 1
If file2.txt
might contain multiple words, simply modify the test in the loop:
# Build a set of what words we can return a count for.
with open('file2.txt', 'r') as f:
allowed_words = set(word.strip() for word in f)
# See which of them exist in the first file.
with open('file1.txt', 'r') as f:
for line in f:
word, count = line.strip().split(':')
# This prevents multiple words from being selected.
if word in allowed_words and not ' ' in word:
print word, count
Note I haven't bothered stripping the quotes from the words. I'm not sure if this is necessary - it depends on whether the input is guaranteed to have them or not. It would be trivial to add them.
Something else you should consider is case-sensitivity. If lowercase and uppercase words should be treated as the same, then you should convert all input to uppercase (or lowercase, it does not matter which) prior to doing any testing.
EDIT: It would probably be more efficient to remove multiple words from the set of allowed words, rather than doing the check on every line of file1
:
# Build a set of what words we can return a count for.
with open('file2.txt', 'r') as f:
allowed_words = set(word.strip() for word in f if not ' ' in f)
# See which of them exist in the first file.
with open('file1.txt', 'r') as f:
for line in f:
word, count = line.strip().split(':')
# Check if the word is allowed.
if word in allowed_words:
print word, count
This is what I came up with:
def GetCounts(file1, file2):
target_contents = open(file1).readlines() #file 1 as list--> 'WORD': n
match_me_contents = set(open(file2).read().split('\n')) #file 2 as list -> 'WORD'
for line in target_contents:
word = line.split(': ')[0] #get the first item in line.split
if " " not in word:
number = line.split(': ')[1] #get the number associated with the word
if word in match_me_contents:
print word, number
Changes from your version:
- Moved to set from a regex
- Went to split instead of readlines to get rid of newlines without extra processing
- Changed from splitting the word into words and checking if the length of that is one to simply checking if a space is in the "word" directly
- This could cause a bug if the "space" isn't an actual space though.This could be fixed with a regex for "\s" or equivalent instead, however with a performance penalty.
- Added a space into line.split(': ') so that that way number won't be prefixed with a space
- This could cause a bug if there is not a space before the number.
- Moved
number = line.split(': ')[1]
after the checking to see if the word contains spaces for efficiency purposes, minor though the speed difference would be (almost certainly the bulk of the time would be spent checking is a work was in the target)
The potential bugs would only occur however if the actual input is not in the format you presented.
Let's exploit the similarity of the file format to Python expression syntax:
from ast import literal_eval
with file("file1") as f:
word_values = ast.literal_eval('{' + ','.join(line for line in f) + '}')
with file("file2") as f:
expected_words = set(ast.literal_eval(line) for line in f)
word_values = {k: v for (k, v) in word_values if k in expected_words}
精彩评论