Optimization of Function with Dictionary and Zip()
I have the following function:
import os
import json
import re

def filetxt():
    word_freq = {}
    lvl1 = []
    lvl2 = []
    total_t = 0
    users = 0
    text = []
    for l in range(0, 500):
        # Open File
        if os.path.exists("C:/Twitter/json/user_" + str(l) + ".json") == True:
            with open("C:/Twitter/json/user_" + str(l) + ".json", "r") as f:
                text_f = json.load(f)
                users = users + 1
                for i in range(len(text_f)):
                    text.append(text_f[str(i)]['text'])
                    total_t = total_t + 1
        else:
            pass
    # Filter
    occ = 0
    import string
    for i in range(len(text)):
        s = text[i] # Sample string
        a = re.findall(r'(RT)', s)
        b = re.findall(r'(@)', s)
        occ = len(a) + len(b) + occ
        s = s.encode('utf-8')
        out = s.translate(string.maketrans("", ""), string.punctuation)
        # Create Wordlist/Dictionary
        word_list = text[i].lower().split(None)
        for word in word_list:
            word_freq[word] = word_freq.get(word, 0) + 1
        keys = word_freq.keys()
        numbo = range(1, len(keys) + 1)
        WList = ', '.join(keys)
        NList = str(numbo).strip('[]')
        WList = WList.split(", ")
        NList = NList.split(", ")
        W2N = dict(zip(WList, NList))
        for k in range(0, len(word_list)):
            word_list[k] = W2N[word_list[k]]
        for i in range(0, len(word_list) - 1):
            lvl1.append(word_list[i])
            lvl2.append(word_list[i + 1])
I have used the profiler and found that the greatest CPU time is spent on the zip() function and the join and split parts of the code. Is there anything I have overlooked that would let me clean up the code and make it faster? The biggest lag seems to be in how I am working with the dictionaries and zip(). Any help would be appreciated, thanks!
p.s. The basic purpose of this function is that I load in files which contain 20 or so tweets each, so I am most likely going to end up with about 20k - 50k files being sent through this function. The output is a list of all the distinct words in the tweets, followed by which words linked to which, e.g.:
1 "love"
2 "pasa"
3 "mirar"
4 "ants"
5 "kers"
6 "morir"
7 "dreaming"
8 "tan"
9 "rapido"
10 "one"
11 "much"
12 "la"
...
10 1
13 12
1 7
12 2
7 3
2 4
3 11
4 8
11 6
8 9
6 5
9 20
5 8
20 25
8 18
25 9
18 17
9 2
...
I think you want something like:
import json
import string
from collections import defaultdict

try:
    rng = xrange  # Python 2
except NameError:
    rng = range   # Python 3 has no xrange

def filetxt():
    users = 0
    total_t = 0
    occ = 0
    wordcount = defaultdict(int)
    wordpairs = defaultdict(lambda: defaultdict(int))
    for filenum in rng(500):
        try:
            with open("C:/Twitter/json/user_" + str(filenum) + ".json", 'r') as inf:
                users += 1
                tweets = json.load(inf)
                total_t += len(tweets)
                # each file holds a dict of tweets keyed by index,
                # so iterate over the values
                for txt in (r['text'] for r in tweets.itervalues()):
                    occ += txt.count('RT') + txt.count('@')
                    prev = None
                    for word in txt.encode('utf-8').translate(None, string.punctuation).lower().split():
                        wordcount[word] += 1
                        wordpairs[prev][word] += 1
                        prev = word
        except IOError:
            pass
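The two defaultdicts do the bookkeeping for you: a missing key springs into existence with a zero count, so there is no get()/KeyError dance. A quick illustration (the words are made up):

from collections import defaultdict

wordcount = defaultdict(int)
wordpairs = defaultdict(lambda: defaultdict(int))

wordcount['love'] += 1           # no need to check whether 'love' exists yet
wordpairs['love']['pasa'] += 1   # how often 'pasa' follows 'love'

print wordcount['love']          # 1
print wordcount['never seen']    # 0 -- created on first access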
I hope you don't mind that I took the liberty of modifying your code into something that I would be more likely to write.
import json
import re
import string
from itertools import izip

def filetxt():
    # keeps track of word count for each word.
    word_freq = {}
    # list of words which we've found
    word_list = []
    # mapping from word -> index in word_list
    word_map = {}
    lvl1 = []
    lvl2 = []
    total_t = 0
    users = 0
    text = []
    ####### You should replace this with a glob (see: glob module)
    for l in range(0, 500):
        # Open File
        try:
            with open("C:/Twitter/json/user_" + str(l) + ".json", "r") as f:
                text_f = json.load(f)
                users = users + 1
                # in this file there are multiple tweets so add the text
                # for each one.
                for t in text_f.itervalues():
                    text.append(t['text']) ## CHECK THIS
        except IOError:
            pass
    total_t = len(text)
    # Filter
    occ = 0
    for s in text:
        a = re.findall(r'(RT)', s)
        b = re.findall(r'(@)', s)
        occ += len(a) + len(b)
        s = s.encode('utf-8')
        out = s.translate(string.maketrans("", ""), string.punctuation)
        # make a list of the words in the cleaned-up text
        words = out.lower().split()
        for word in words:
            # try/except is quicker when we expect not to miss
            # and it will be rare for us not to have
            # a word in our list already.
            try:
                word_freq[word] += 1
            except KeyError:
                # we've never seen this word before so add it to our list
                word_freq[word] = 1
                word_map[word] = len(word_list)
                word_list.append(word)
        # little trick to get each word and the word that follows
        for curword, nextword in izip(words, words[1:]):
            lvl1.append(word_map[curword])
            lvl2.append(word_map[nextword])
What this is going to do is give you the following: lvl1 will give you a list of numbers corresponding to words in word_list, so word_list[lvl1[0]] will be the first word in the first tweet you processed. lvl2[0] will be the index of the word that follows lvl1[0], so you can say that word_list[lvl2[0]] is the word that follows word_list[lvl1[0]]. This code basically maintains word_map, word_list and word_freq as it builds this.
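To make that concrete, here is what those structures would hold after processing one made-up tweet:

# suppose the only tweet, after cleaning, splits into:
words = ["love", "pasa", "love"]

# the code above would then have built:
word_list = ["love", "pasa"]          # distinct words, in order first seen
word_map  = {"love": 0, "pasa": 1}    # word -> index into word_list
word_freq = {"love": 2, "pasa": 1}    # word -> occurrence count
lvl1      = [0, 1]                    # "love", "pasa"  (each word...)
lvl2      = [1, 0]                    # "pasa", "love"  (...and its follower)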
Please note that the way you were doing this before, specifically the way you were creating W2N, will not work properly. Dictionaries do not maintain order. Ordered dictionaries are coming in 3.1, but just forget about it for now. Basically, when you were doing word_freq.keys(), it was changing every time you added a new word, so there was no consistency. See this example:
>>> x = dict()
>>> x[5] = 2
>>> x
{5: 2}
>>> x[1] = 24
>>> x
{1: 24, 5: 2}
>>> x[10] = 14
>>> x
{1: 24, 10: 14, 5: 2}
>>>
So 5 used to be the 2nd one, but now it's the 3rd.
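If you need a numbering that never shifts, the trick (which the rewrite above already uses) is to record insertion order in a separate list instead of relying on the dict:

word_list = []   # remembers the order in which words were first seen
word_map = {}    # word -> its permanent index

for word in ["love", "pasa", "love", "mirar"]:
    if word not in word_map:
        word_map[word] = len(word_list)
        word_list.append(word)

# word_map["pasa"] stays 1 no matter how many words are added later.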
I also updated it to use a 0 index instead of a 1 index. I don't know why you were using range(1, len(...)+1) rather than just range(len(...)).
Regardless, you should get away from thinking about for loops in the traditional C/C++/Java sense, where you loop over numbers. You should consider that unless you need an index number, you don't need it.
Rule of Thumb: if you need an index, you probably also need the element at that index, and you should be using enumerate anyways. LINK
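For example (a generic illustration, the words list is made up):

words = ["love", "pasa", "mirar"]

# instead of looping over indices...
for i in range(len(words)):
    print i, words[i]

# ...enumerate hands you the index and the element together
for i, word in enumerate(words):
    print i, word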
Hope this helps...
A few things. These lines seem weird to me when put together:
WList = ', '.join(keys)
<snip>
WList = WList.split(", ")
That should just be WList = list(keys).
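To see that the join/split pair is just an expensive round trip (assuming no word contains ", "):

keys = ['love', 'pasa', 'mirar']

roundtrip = ', '.join(keys).split(", ")  # builds a string, then re-parses it
direct = list(keys)                      # same result, no string work

assert roundtrip == direct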
Are you sure you want to optimize this? I mean, is it really so slow that it's worth your time? And finally, a description of what the script should do would be great, instead of letting us decipher it from the code :)