Python - Optimal code to find preceding and following five words from a given point in a line

2023-02-19 18:34 问答作者：

I'm trying to write code to find the 5 words on either side of a particular phrase. Easy enough, but I have to do this on a massive volume of data, thus the code needs to be optimal!

for file in listing:
    file2 = open('//home/user/Documents/Corpus/Files/'+file,'r')
    for line in file2:
        linetrigrams = trigram_split(line)
        for trigram in linetrigrams:
            if trigram in trigrams:
                line2 = line.replace(trigram,'###').split('###')
                window = (line2[0].split()[-5:] + line2[1].split()[:5])
                for item in window:
                    if item in mostfreq:
                      开发者_StackOverflow  matrix[trigram][mostfreq[item]] += 1

Any suggestions for doing this much faster? It might be that I'm working with entirely the wrong data structures here. trigram_split() just gives all trigrams from the line (which are the units I need to create vectors for). 'Trigrams' is basically a list of about a million trigrams that I'm concerned with creating vectors for. Window gets the 5 words preceding and following the trigram (if that trigram is in the list), and then checks to see if they're in the list MostFreq (which is a dictionary of 1000 words as keys, each corresponding to an integer [0-100] as the stored value). This is then used to update the Matrix (which is a dictionary with lists ([0] * 1000) as the stored values). The corresponding value in the pseudo-matrix is incremented this way.

Several significant factors to consider when weighing various approaches:

Multiline vs single line
Length of the lines
Length of the search pattern
Rate of the search matching
What to do if there are fewer than 5 words before/after
How to handle non-word, non-space characters (newlines and punctuation)
Case insensitive?
What to do with overlapping matches? eg If the text is We are the knights who say NI! NI NI NI NI NI NI NI NI and you search for NI what do you return? Will this happen to you?
What if ### is in your data?
Would you rather miss some, or return extra wrong results? There may be some tradeoffs, particularly with messy real world data.

You could try a regular expression...

import re
zen = """Beautiful is better than ugly. \
Explicit is better than implicit. \
Simple is better than complex. \
Complex is better than complicated. \
Flat is better than nested. \
Sparse is better than dense. \
Readability counts. \
Special cases aren't special enough to break the rules. \
Although practicality beats purity. \
Errors should never pass silently. \
Unless explicitly silenced. \
In the face of ambiguity, refuse the temptation to guess. \
There should be one-- and preferably only one --obvious way to do it. \
Although that way may not be obvious at first unless you're Dutch. \
Now is better than never. \
Although never is often better than *right* now. \
If the implementation is hard to explain, it's a bad idea. \
If the implementation is easy to explain, it may be a good idea. \
Namespaces are one honking great idea -- let's do more of those!"""

searchvar = 'Dutch'
dutchre = re.compile(r"""((?:\S+\s*){,5})(%s)((?:\S+\s*){,5})""" % searchvar, re.IGNORECASE | re.MULTILINE)
print dutchre.findall(zen)
#[("obvious at first unless you're ", 'Dutch', '. Now is better than ')]

Alternate approach, which leads to worse results IMO...

def splitAndFind(text, phrase):
    text2 = text.replace(phrase, "###").split("###")
    if len(text2) > 1:
        return ((text2[0].split()[-5:], text2[1].split()[:5]))
print splitAndFind(zen, 'Dutch')
#(['obvious', 'at', 'first', 'unless', "you're"],
# ['.', 'Now', 'is', 'better', 'than'])

In iPython you can time it easily:

timeit dutchre.findall(zen)
1000 loops, best of 3: 814 us per loop

timeit 'Dutch' in zen
1000000 loops, best of 3: 650 ns per loop

timeit zen.find('Dutch')
1000000 loops, best of 3: 812 ns per loop

timeit splitAndFind(zen, 'Dutch')
10000 loops, best of 3: 18.8 us per loop

You could use the re module

import re

f = open('filepath', 'r')
txt = f.read()
# 'hey' is the search phrase
phrase = 'hey'
matches = re.findall(r'(\w+\s+\w+\s+\w+\s+\w+\s+\w+)\s+%s\s+(\w+\s+\w+\s+\w+\s+\w+\s+\w+)' % phrase, txt)

That should give you all the matches for the file. You would need to os.walk to get all the files.

Usually, massic data would be stored in files; loading such files in memory only to search them might slow your search down - so search the file on disk. However, for looking up the 5 words preceeding the phrase you will need random access, be aware of that.

As for searching, the Boyer–Moore string search algorithm is usually faster than the naive approach. If you are working on random access files, this also may avoid the need to read the whole file into memory.

Especially if your phrase changes often but the data doesn't, then consider using some full text search engine as in Full text search engine for Python - however, this is likely to be overkill for this sort of homework assignment.

I tried this script on a 18MB file:

for line in f:
    if phrase in line:
        line2 = line.replace(phrase,'###').split('###')
        result.append(line2[0].split()[-5:] + line2[1].split()[:5])

Elapsed time: 1.12602651429 s

If speed was really important I'd write a python C extension module to do the hard work. This can give huge speed improvements.

继续阅读：python search string

Python - Optimal code to find preceding and following five words from a given point in a line

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？