Optimization of Function with Dictionary and Zip()
I have the following function:
import os
import json
import re

def filetxt():
    word_freq = {}
    lvl1 = []
    lvl2 = []
    total_t = 0
    users = 0
    text = []
    for l in range(0, 500):
        # Open File
        if os.path.exists("C:/Twitter/json/user_" + str(l) + ".json") == True:
            with open("C:/Twitter/json/user_" + str(l) + ".json", "r") as f:
                text_f = json.load(f)
                users = users + 1
                for i in range(len(text_f)):
                    text.append(text_f[str(i)]['text'])
                    total_t = total_t + 1
        else:
            pass
    # Filter
    occ = 0
    import string
    for i in range(len(text)):
        s = text[i] # Sample string
        a = re.findall(r'(RT)', s)
        b = re.findall(r'(@)', s)
        occ = len(a) + len(b) + occ
        s = s.encode('utf-8')
        out = s.translate(string.maketrans("", ""), string.punctuation)
        # Create Wordlist/Dictionary
        word_list = text[i].lower().split(None)
        for word in word_list:
            word_freq[word] = word_freq.get(word, 0) + 1
        keys = word_freq.keys()
        numbo = range(1, len(keys) + 1)
        WList = ', '.join(keys)
        NList = str(numbo).strip('[]')
        WList = WList.split(", ")
        NList = NList.split(", ")
        W2N = dict(zip(WList, NList))
        for k in range(0, len(word_list)):
            word_list[k] = W2N[word_list[k]]
        for i in range(0, len(word_list) - 1):
            lvl1.append(word_list[i])
            lvl2.append(word_list[i + 1])
I have used the profiler and found that the greatest CPU time is spent on the zip() function and the join and split parts of the code. Is there anything I have overlooked that would let me clean up the code and make it faster? The biggest lag seems to be in how I am working with the dictionaries and zip(). Any help would be appreciated, thanks!
p.s. The basic purpose of this function is that I load in files which contain 20 or so tweets each, so I am most likely going to end up with about 20k - 50k files being sent through this function. The output is a list of all the distinct words in the tweets, followed by which words linked to which, e.g.:
1 "love"
2 "pasa"
3 "mirar"
4 "ants"
5 "kers"
6 "morir"
7 "dreaming"
8 "tan"
9 "rapido"
10 "one"
11 "much"
12 "la"
...
10 1
13 12
1 7
12 2
7 3
2 4
3 11
4 8
11 6
8 9
6 5
9 20
5 8
20 25
8 18
25 9
18 17
9 2
...
I think you want something like:
import json
import string
from collections import defaultdict

try:
    rng = xrange  # Python 2
except NameError:
    rng = range   # Python 3 has no xrange

def filetxt():
    users = 0
    total_t = 0
    occ = 0
    wordcount = defaultdict(int)
    wordpairs = defaultdict(lambda: defaultdict(int))
    for filenum in rng(500):
        try:
            with open("C:/Twitter/json/user_" + str(filenum) + ".json", 'r') as inf:
                users += 1
                tweets = json.load(inf)
                total_t += len(tweets)
                # each file holds a dict of tweets keyed by index,
                # so iterate over the values
                for txt in (r['text'] for r in tweets.itervalues()):
                    occ += txt.count('RT') + txt.count('@')
                    prev = None
                    for word in txt.encode('utf-8').translate(None, string.punctuation).lower().split():
                        wordcount[word] += 1
                        wordpairs[prev][word] += 1
                        prev = word
        except IOError:
            pass
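The two defaultdicts do the bookkeeping for you: a missing key springs into existence with a zero count, so there is no get()/KeyError dance. A quick illustration (the words are made up):

from collections import defaultdict

wordcount = defaultdict(int)
wordpairs = defaultdict(lambda: defaultdict(int))

wordcount['love'] += 1           # no need to check whether 'love' exists yet
wordpairs['love']['pasa'] += 1   # how often 'pasa' follows 'love'

print wordcount['love']          # 1
print wordcount['never seen']    # 0 -- created on first access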
I hope you don't mind that I took the liberty of modifying your code into something that I would be more likely to write.
import json
import re
import string
from itertools import izip

def filetxt():
    # keeps track of word count for each word.
    word_freq = {}
    # list of words which we've found
    word_list = []
    # mapping from word -> index in word_list
    word_map = {}
    lvl1 = []
    lvl2 = []
    total_t = 0
    users = 0
    text = []
    ####### You should replace this with a glob (see: glob module)
    for l in range(0, 500):
        # Open File
        try:
            with open("C:/Twitter/json/user_" + str(l) + ".json", "r") as f:
                text_f = json.load(f)
                users = users + 1
                # in this file there are multiple tweets so add the text
                # for each one.
                for t in text_f.itervalues():
                    text.append(t['text']) ## CHECK THIS
        except IOError:
            pass
    total_t = len(text)
    # Filter
    occ = 0
    for s in text:
        a = re.findall(r'(RT)', s)
        b = re.findall(r'(@)', s)
        occ += len(a) + len(b)
        s = s.encode('utf-8')
        out = s.translate(string.maketrans("", ""), string.punctuation)
        # make a list of the words in the cleaned-up text
        words = out.lower().split()
        for word in words:
            # try/except is quicker when we expect not to miss
            # and it will be rare for us not to have
            # a word in our list already.
            try:
                word_freq[word] += 1
            except KeyError:
                # we've never seen this word before so add it to our list
                word_freq[word] = 1
                word_map[word] = len(word_list)
                word_list.append(word)
        # little trick to get each word and the word that follows
        for curword, nextword in izip(words, words[1:]):
            lvl1.append(word_map[curword])
            lvl2.append(word_map[nextword])
What this is going to do is give you the following: lvl1 will give you a list of numbers corresponding to words in word_list, so word_list[lvl1[0]] will be the first word in the first tweet you processed. lvl2[0] will be the index of the word that follows lvl1[0], so you can say that word_list[lvl2[0]] is the word that follows word_list[lvl1[0]]. This code basically maintains word_map, word_list and word_freq as it builds this.
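To make that concrete, here is what those structures would hold after processing one made-up tweet:

# suppose the only tweet, after cleaning, splits into:
words = ["love", "pasa", "love"]

# the code above would then have built:
word_list = ["love", "pasa"]          # distinct words, in order first seen
word_map  = {"love": 0, "pasa": 1}    # word -> index into word_list
word_freq = {"love": 2, "pasa": 1}    # word -> occurrence count
lvl1      = [0, 1]                    # "love", "pasa"  (each word...)
lvl2      = [1, 0]                    # "pasa", "love"  (...and its follower)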
Please note that the way you were doing this before, specifically the way you were creating W2N, will not work properly. Dictionaries do not maintain order. Ordered dictionaries are coming in 3.1, but just forget about it for now. Basically, when you were doing word_freq.keys(), it was changing every time you added a new word, so there was no consistency. See this example:
>>> x = dict()
>>> x[5] = 2
>>> x
{5: 2}
>>> x[1] = 24
>>> x
{1: 24, 5: 2}
>>> x[10] = 14
>>> x
{1: 24, 10: 14, 5: 2}
>>>
So 5 used to be the 2nd one, but now it's the 3rd.
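If you need a numbering that never shifts, the trick (which the rewrite above already uses) is to record insertion order in a separate list instead of relying on the dict:

word_list = []   # remembers the order in which words were first seen
word_map = {}    # word -> its permanent index

for word in ["love", "pasa", "love", "mirar"]:
    if word not in word_map:
        word_map[word] = len(word_list)
        word_list.append(word)

# word_map["pasa"] stays 1 no matter how many words are added later.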
I also updated it to use a 0 index instead of a 1 index. I don't know why you were using range(1, len(...)+1) rather than just range(len(...)).
Regardless, you should get away from thinking about for loops in the traditional C/C++/Java sense, where you loop over numbers. You should consider that unless you need an index number, you don't need it.
Rule of Thumb: if you need an index, you probably also need the element at that index, and you should be using enumerate anyways. LINK
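For example (a generic illustration, the words list is made up):

words = ["love", "pasa", "mirar"]

# instead of looping over indices...
for i in range(len(words)):
    print i, words[i]

# ...enumerate hands you the index and the element together
for i, word in enumerate(words):
    print i, word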
Hope this helps...
A few things. These lines seem weird to me when put together:
WList = ', '.join(keys)
<snip>
WList = WList.split(", ")
That should just be WList = list(keys).
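To see that the join/split pair is just an expensive round trip (assuming no word contains ", "):

keys = ['love', 'pasa', 'mirar']

roundtrip = ', '.join(keys).split(", ")  # builds a string, then re-parses it
direct = list(keys)                      # same result, no string work

assert roundtrip == direct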
Are you sure you want to optimize this? I mean, is it really so slow that it's worth your time? And finally, a description of what the script should do would be great, instead of letting us decipher it from the code :)