Python, loop through lines in a file; if line equals line in another file, return original line

2023-03-31 11:12 Q&A author:

Text file 1 has the following format:

'WORD': 1
'MULTIPLE WORDS': 1
'WORD': 2

etc.

I.e., a word separated by a colon followed by a number.

Text file 2 has the following format:

'WORD'
'WORD'

etc.

I need to extract single words (i.e., only WORD not MULTIPLE WORDS) from File 1 and, if they match a word in File 2, return the word from File 1 along with its value.

I have some poorly functioning code:

def GetCounts(file1, file2):
    target_contents  = open(file1).readlines()  #file 1 as list--> 'WORD': n
    match_me_contents = open(file2).readlines()   #file 2 as list -> 'WORD'
    ls_stripped = [x.strip('\n') for x in match_me_contents]  #get rid of newlines

    match_me_as_regex= re.compile("|".join(ls_stripped))   

    for line in target_contents:
        first_column = line.split(':')[0]  #get the first item in line.split
        number = line.split(':')[1]   #get the number associated with the word
        if len(first_column.split()) == 1: #get single word, no multiple words 
            """ Does the word from target contents match the word
            from match_me contents?  If so, return the line from  
            target_contents"""
            if re.findall(match_me_as_regex, first_column):  
                print first_column, number

#OUTPUT: WORD, n
         WORD, n
         etc.

Because of the regex, the output is shoddy. The code will return 'asset, 2', for example, since re.findall() will match 'set' from match_me inside 'asset'. I need to match the target word against the entire word from match_me to block the bad output caused by partial regex matches.
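To make the failure concrete, here is a small sketch of the partial-match problem. It is written for Python 3, and the sample words are invented for illustration rather than taken from the real files. It compares a substring search, re.fullmatch, and a plain set lookup:

```python
import re

# Hypothetical sample data standing in for the two files.
match_me = ["set", "word"]
targets = ["asset", "set", "words", "word"]

# Same idea as the question's code: one big alternation of all words.
pattern = re.compile("|".join(re.escape(w) for w in match_me))

# search/findall succeed on any substring hit, so 'asset' matches 'set'.
partial = [t for t in targets if pattern.search(t)]

# fullmatch only succeeds when the whole string equals one alternative.
whole = [t for t in targets if pattern.fullmatch(t)]

# A set lookup gives the same whole-word result with no regex at all.
match_set = set(match_me)
by_set = [t for t in targets if t in match_set]

print(partial)
print(whole)
print(by_set)
```

The first list contains every target, including 'asset' and 'words'; the other two keep only the exact words.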

If file2 is not humongous, slurp it into a set:

# Strip the quotes on both sides so the words compare equal.
file2 = set(w.strip("'") for w in open("file2").read().splitlines())
for line in open("file1"):
    if line.split(":")[0].strip("'") in file2:
        print line,

I guess by "poorly functioning" you mean speed wise? Because I tested and it does appear to work.

You could make things more efficient by making a set of the words in file2:

word_set = set(ls_stripped)

And then instead of findall you'd see if it's in the set:

in_set = first_column in word_set  # first_column is the word taken from the file 1 line

Also feels cleaner than a regex.
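Putting that suggestion together, a minimal Python 3 sketch might look like the following. The function name get_counts and the sample lines are my own assumptions; it takes any iterable of lines, so open file objects can be passed in place of the lists:

```python
def get_counts(target_lines, match_lines):
    """Return (word, number) pairs for single words in target_lines that
    also appear, as whole lines, in match_lines."""
    # Exact lines from file 2, newlines removed, blank lines ignored.
    word_set = {line.strip() for line in match_lines if line.strip()}
    results = []
    for line in target_lines:
        if ":" not in line:
            continue  # skip blank or malformed lines
        first_column, number = line.rsplit(":", 1)
        first_column = first_column.strip()
        # Keep single words only, then test exact membership in the set.
        if len(first_column.split()) == 1 and first_column in word_set:
            results.append((first_column, number.strip()))
    return results


# Hypothetical sample data in the format described in the question.
file1_lines = ["'WORD': 1\n", "'MULTIPLE WORDS': 1\n", "'ASSET': 2\n"]
file2_lines = ["'WORD'\n", "'SET'\n"]
print(get_counts(file1_lines, file2_lines))
```

Because the lookup compares whole strings, 'ASSET' is no longer reported just because 'SET' is in file 2.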

It seems like this may simply be a special case of grep. If file2 is essentially a list of patterns, and the output format is the same as file1, then you might be able to just do this:

grep -wf file2 file1

The -w tells grep to match only whole words.
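If grep isn't available, the whole-word idea can be roughly approximated in Python 3. This is only a sketch: grep_whole_words is a hypothetical helper, and it treats each pattern as a fixed string (closer to adding -F), which is an assumption on my part since plain grep would interpret the lines of file2 as regular expressions:

```python
import re


def grep_whole_words(patterns, lines):
    """Keep the lines that contain any pattern with no word character
    (letter, digit, underscore) directly before or after the match."""
    patterns = [p for p in patterns if p]
    if not patterns:
        return []
    alternation = "|".join(re.escape(p) for p in patterns)
    regex = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")
    return [line for line in lines if regex.search(line)]


# Hypothetical sample data.
lines = ["'WORD': 1", "'MULTIPLE WORDS': 1", "'ASSET': 2", "'SET': 3"]
print(grep_whole_words(["'WORD'", "'SET'"], lines))
print(grep_whole_words(["WORDS"], lines))
```

The second call shows a caveat: whole-word matching alone does not exclude multi-word entries, so an unquoted pattern like WORDS still hits the 'MULTIPLE WORDS' line.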

This is how I'd do this. I don't have a python interpreter on hand, so there may be a couple typos.

One of the main things you should remember when coming to Python (especially if coming from Perl) is that regular expressions are usually a bad idea: the string methods are powerful and very fast.

def GetCounts(file1, file2):
    data = {}
    for line in open(file1):
        try:
            word, n = line.rsplit(':', 1)
        except ValueError: # not enough values
            #some kind of input error, go to next line
            continue
        n = int(n.strip())
        if word[0] == word[-1] == "'":
            word = word[1:-1]
        data[word] = n

    for line in open(file2):
        word = line.strip()
        if not word:  # skip blank lines, which would break the indexing below
            continue
        if word[0] == word[-1] == "'":
            word = word[1:-1]
        if word in data:
            print word, data[word]

import re
from operator import methodcaller

re_target = re.compile(r"^'([a-z]+)': +(\d+)", re.M|re.I)
match_me_contents = open(file2).read().splitlines()
match_me_contents = set(map(methodcaller('strip', "'"), match_me_contents))

res = []
for match in re_target.finditer(open(file1).read()):
    word, value = match.groups()
    if word in match_me_contents:
        res.append((word, value))

My two input files:

file1.txt:

'WORD': 1
'MULTIPLE WORDS': 1
'OTHER': 2

file2.txt:

'WORD'
'NONEXISTENT'

If file2.txt is guaranteed not to have multiple words on a line, then there is no need to explicitly filter these from the first file. This will be done by the membership test:

# Build a set of what words we can return a count for.
with open('file2.txt', 'r') as f:
    allowed_words = set(word.strip() for word in f)

# See which of them exist in the first file.
with open('file1.txt', 'r') as f:
    for line in f:
        word, count = [part.strip() for part in line.split(':')]

        # This assumes that strings with a space (multiple words) do not exist in
        # the second file.
        if word in allowed_words:
            print word, count

And running this gives:

$ python extract.py
'WORD' 1

If file2.txt might contain multiple words, simply modify the test in the loop:

# Build a set of what words we can return a count for.
with open('file2.txt', 'r') as f:
    allowed_words = set(word.strip() for word in f)

# See which of them exist in the first file.
with open('file1.txt', 'r') as f:
    for line in f:
        word, count = [part.strip() for part in line.split(':')]

        # This prevents multiple words from being selected.
        if word in allowed_words and ' ' not in word:
            print word, count

Note I haven't bothered stripping the quotes from the words. I'm not sure if this is necessary - it depends on whether the input is guaranteed to have them or not. It would be trivial to add that step.

Something else you should consider is case-sensitivity. If lowercase and uppercase words should be treated as the same, then you should convert all input to uppercase (or lowercase, it does not matter which) prior to doing any testing.
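As a rough illustration of that normalization (a Python 3 sketch; the normalize helper and the sample lines are made up for the example), both sides are stripped of whitespace and surrounding quotes and converted to uppercase before the membership test:

```python
def normalize(word):
    """Strip whitespace and surrounding single quotes, then uppercase."""
    return word.strip().strip("'").upper()


# Hypothetical contents of the second file, with mixed case.
allowed_words = {normalize(w) for w in ["'Word'\n", "'OTHER'\n"]}

matches = []
for line in ["'WORD': 1\n", "'other': 2\n", "'MISSING': 3\n"]:
    word, count = line.rsplit(":", 1)
    # Compare normalized forms, but report the word as written in file 1.
    if normalize(word) in allowed_words:
        matches.append((word, count.strip()))

print(matches)
```

The comparison uses the normalized form only, so the output still shows each word exactly as it appears in the first file.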

EDIT: It would probably be more efficient to remove multiple words from the set of allowed words, rather than doing the check on every line of file1:

# Build a set of what words we can return a count for.
with open('file2.txt', 'r') as f:
    allowed_words = set(word.strip() for word in f if ' ' not in word.strip())

# See which of them exist in the first file.
with open('file1.txt', 'r') as f:
    for line in f:
        word, count = [part.strip() for part in line.split(':')]

        # Check if the word is allowed.
        if word in allowed_words:
            print word, count

This is what I came up with:

def GetCounts(file1, file2):
    target_contents  = open(file1).readlines()  #file 1 as list--> 'WORD': n
    match_me_contents = set(open(file2).read().split('\n'))   #file 2 as list -> 'WORD'  
    for line in target_contents:
        word = line.split(': ')[0]  #get the first item in line.split
        if " " not in word:
            number = line.split(': ')[1]   #get the number associated with the word
            if word in match_me_contents:  
                print word, number

Changes from your version:

Moved to a set from a regex
Used split('\n') on the whole file instead of readlines to get rid of newlines without extra processing
Changed from splitting the word into words and checking whether the length is one to simply checking whether a space is in the "word" directly
- This could cause a bug if the "space" isn't an actual space character. That could be fixed with a regex for "\s" or equivalent, at a performance cost.
Added a space to line.split(': ') so that number won't be prefixed with a space
- This could cause a bug if there is no space before the number.
Moved number = line.split(': ')[1] after the check for spaces in the word for efficiency, minor though the speed difference would be (almost certainly the bulk of the time would be spent checking whether a word was in the target)

The potential bugs would only occur however if the actual input is not in the format you presented.
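If the input might deviate from that format, one way to sidestep both caveats without a regex is sketched below (Python 3; parse_line is a hypothetical helper, not part of the code above). It splits on the last colon regardless of the spacing around it, and uses str.split() with no argument, which splits on any run of whitespace:

```python
def parse_line(line):
    """Return (word, number) for a single-word entry, or None otherwise."""
    word, sep, number = line.rpartition(":")
    if not sep:
        return None  # no colon at all: blank or malformed line
    word = word.strip()
    # split() with no argument splits on spaces, tabs and other whitespace,
    # so multi-word entries are caught however the words are separated.
    if len(word.split()) != 1:
        return None
    return word, number.strip()


print(parse_line("'WORD': 1\n"))
print(parse_line("'WORD':1\n"))
print(parse_line("'MULTIPLE\tWORDS': 1\n"))
```

The first two calls return the same pair whether or not a space follows the colon, and the tab-separated entry is rejected as multiple words.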

Let's exploit the similarity of the file format to Python expression syntax:

from ast import literal_eval
with open("file1") as f:
  # Blank lines are skipped so the joined text stays a valid dict literal.
  word_values = literal_eval('{' + ','.join(line for line in f if line.strip()) + '}')
with open("file2") as f:
  expected_words = set(literal_eval(line) for line in f if line.strip())
word_values = {k: v for (k, v) in word_values.items() if k in expected_words}
