开发者

Python - Optimal code to find preceding and following five words from a given point in a line

I'm trying to write code to find the 5 words on either side of a particular phrase. Easy enough, but I have to do this on a massive volume of data, thus the code needs to be optimal!

for file in listing:
    file2 = open('//home/user/Documents/Corpus/Files/'+file,'r')
    for line in file2:
        linetrigrams = trigram_split(line)
        for trigram in linetrigrams:
            if trigram in trigrams:
                line2 = line.replace(trigram,'###').split('###')
                window = (line2[0].split()[-5:] + line2[1].split()[:5])
                for item in window:
                    if item in mostfreq:
                      开发者_StackOverflow  matrix[trigram][mostfreq[item]] += 1

Any suggestions for doing this much faster? It might be that I'm working with entirely the wrong data structures here. trigram_split() just gives all trigrams from the line (which are the units I need to create vectors for). 'Trigrams' is basically a list of about a million trigrams that I'm concerned with creating vectors for. Window gets the 5 words preceding and following the trigram (if that trigram is in the list), and then checks to see if they're in the list MostFreq (which is a dictionary of 1000 words as keys, each corresponding to an integer [0-100] as the stored value). This is then used to update the Matrix (which is a dictionary with lists ([0] * 1000) as the stored values). The corresponding value in the pseudo-matrix is incremented this way.


Several significant factors to consider when weighing various approaches:

  • Multiline vs single line
  • Length of the lines
  • Length of the search pattern
  • Rate of the search matching
  • What to do if there are fewer than 5 words before/after
  • How to handle non-word, non-space characters (newlines and punctuation)
  • Case insensitive?
  • What to do with overlapping matches? eg If the text is We are the knights who say NI! NI NI NI NI NI NI NI NI and you search for NI what do you return? Will this happen to you?
  • What if ### is in your data?
  • Would you rather miss some, or return extra wrong results? There may be some tradeoffs, particularly with messy real world data.

You could try a regular expression...

import re
zen = """Beautiful is better than ugly. \
Explicit is better than implicit. \
Simple is better than complex. \
Complex is better than complicated. \
Flat is better than nested. \
Sparse is better than dense. \
Readability counts. \
Special cases aren't special enough to break the rules. \
Although practicality beats purity. \
Errors should never pass silently. \
Unless explicitly silenced. \
In the face of ambiguity, refuse the temptation to guess. \
There should be one-- and preferably only one --obvious way to do it. \
Although that way may not be obvious at first unless you're Dutch. \
Now is better than never. \
Although never is often better than *right* now. \
If the implementation is hard to explain, it's a bad idea. \
If the implementation is easy to explain, it may be a good idea. \
Namespaces are one honking great idea -- let's do more of those!"""

searchvar = 'Dutch'
dutchre = re.compile(r"""((?:\S+\s*){,5})(%s)((?:\S+\s*){,5})""" % searchvar, re.IGNORECASE | re.MULTILINE)
print dutchre.findall(zen)
#[("obvious at first unless you're ", 'Dutch', '. Now is better than ')]

Alternate approach, which leads to worse results IMO...

def splitAndFind(text, phrase):
    text2 = text.replace(phrase, "###").split("###")
    if len(text2) > 1:
        return ((text2[0].split()[-5:], text2[1].split()[:5]))
print splitAndFind(zen, 'Dutch')
#(['obvious', 'at', 'first', 'unless', "you're"],
# ['.', 'Now', 'is', 'better', 'than'])

In iPython you can time it easily:

timeit dutchre.findall(zen)
1000 loops, best of 3: 814 us per loop

timeit 'Dutch' in zen
1000000 loops, best of 3: 650 ns per loop

timeit zen.find('Dutch')
1000000 loops, best of 3: 812 ns per loop

timeit splitAndFind(zen, 'Dutch')
10000 loops, best of 3: 18.8 us per loop


You could use the re module

import re

f = open('filepath', 'r')
txt = f.read()
# 'hey' is the search phrase
phrase = 'hey'
matches = re.findall(r'(\w+\s+\w+\s+\w+\s+\w+\s+\w+)\s+%s\s+(\w+\s+\w+\s+\w+\s+\w+\s+\w+)' % phrase, txt)

That should give you all the matches for the file. You would need to os.walk to get all the files.


Usually, massic data would be stored in files; loading such files in memory only to search them might slow your search down - so search the file on disk. However, for looking up the 5 words preceeding the phrase you will need random access, be aware of that.

As for searching, the Boyer–Moore string search algorithm is usually faster than the naive approach. If you are working on random access files, this also may avoid the need to read the whole file into memory.

Especially if your phrase changes often but the data doesn't, then consider using some full text search engine as in Full text search engine for Python - however, this is likely to be overkill for this sort of homework assignment.


I tried this script on a 18MB file:

for line in f:
    if phrase in line:
        line2 = line.replace(phrase,'###').split('###')
        result.append(line2[0].split()[-5:] + line2[1].split()[:5])

Elapsed time: 1.12602651429 s


If speed was really important I'd write a python C extension module to do the hard work. This can give huge speed improvements.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜