Randomizing a list of millions of elements in Python efficiently

I have read this answer, which is potentially the best way to randomize a list of strings in Python. I'm just wondering whether that's the most efficient way to do it, because I have a list of about 30 million elements built via the following code:

import json
from sets import Set
from random import shuffle

a = []

for i in range(0,193):
    json_data = open("C:/Twitter/user/user_" + str(i) + ".json")
    data = json.load(json_data)
    for j in range(0,len(data)):
        a.append(data[j]['su'])
new = list(Set(a))
print "Cleaned length is: " + str(len(new))

## Take Cleaned List and Randomize it for Analysis
shuffle(new)

If there is a more efficient way to do it, I'd greatly appreciate any advice on how to do it.

Thanks,


A couple of possible suggestions:

import json
from random import shuffle

a = set()
for i in range(193):
    with open("C:/Twitter/user/user_{0}.json".format(i)) as json_data:
        data = json.load(json_data)
        a.update(d['su'] for d in data)

print("Cleaned length is {0}".format(len(a)))

# Take Cleaned List and Randomize it for Analysis
new = list(a)
shuffle(new)


  • the only way to know if this is faster is to profile it!
  • do you prefer sets.Set to the built-in set() for a reason?
  • I have introduced a with clause (preferred way of opening files, as it guarantees they get closed)
  • it did not appear that you were doing anything with 'a' as a list except converting it to a set; why not make it a set from the start?
  • rather than iterate on an index, then do a lookup on the index, I just iterate on the data items...
  • which makes it easily rewriteable as a generator expression
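To act on the "profile it" point above, here is a minimal timing sketch. It assumes small synthetic data in place of the real JSON files (the sizes and the function names `list_then_set` and `set_from_start` are made up for illustration), so the numbers only hint at what you would see on 30 million elements.

```python
import timeit

# Synthetic stand-in for the parsed JSON files: 20 "files" of 5,000 records
# each, with repeated 'su' values. These sizes are invented for the sketch.
files = [
    [{"su": "user_%d" % ((f * 5000 + k) % 30000)} for k in range(5000)]
    for f in range(20)
]


def list_then_set(files):
    """Question's approach: append to a list, de-duplicate at the end."""
    a = []
    for data in files:
        for j in range(len(data)):
            a.append(data[j]["su"])
    return set(a)


def set_from_start(files):
    """Answer's approach: feed a generator expression into a set."""
    a = set()
    for data in files:
        a.update(d["su"] for d in data)
    return a


# Time each variant a few times and report the best run.
for func in (list_then_set, set_from_start):
    best = min(timeit.repeat(lambda: func(files), number=3, repeat=3))
    print("{0}: {1:.4f} s for 3 calls".format(func.__name__, best))
```

On the real data, most of the time probably goes into reading and parsing the files, so it is worth timing that part separately as well.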


If you're planning to shuffle, you're probably better off using the solution from this question. For real:

randomly mix lines of 3 million-line file

Basically, the random number generator behind shuffle has a period that is tiny compared with the number of possible orderings (meaning it can't reach every permutation of 3 million lines, let alone 30 million). If you can load the data into memory, your best bet is what they suggest there: assign a random number to each line and sort on that.
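To put rough numbers on the period argument, the sketch below compares log2(n!) (the number of bits needed to index every ordering of n items) with the 19937 bits of the Mersenne Twister period used by CPython's random module. The helper names are hypothetical, and this is only back-of-the-envelope arithmetic, not a statement about how good any particular shuffle is in practice.

```python
import math

# The Mersenne Twister has a period of 2**19937 - 1.
MT_PERIOD_BITS = 19937


def log2_factorial(n):
    """Return log2(n!) via lgamma, without building the huge integer n!."""
    return math.lgamma(n + 1) / math.log(2)


def largest_fully_shufflable_length(period_bits=MT_PERIOD_BITS):
    """Largest n whose n! orderings still fit within 2**period_bits."""
    n = 1
    while log2_factorial(n + 1) <= period_bits:
        n += 1
    return n


print("Longest list whose orderings fit in the period:",
      largest_fully_shufflable_length())
print("Bits needed for 3 million items: %.0f" % log2_factorial(3000000))
print("Bits needed for 30 million items: %.0f" % log2_factorial(30000000))
```

The first figure matches the note in the Python documentation for random.shuffle, which gives 2080 as the longest sequence whose permutations fit within the Mersenne Twister period.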

See this thread. And here, I did it for you so you didn't mess anything up (that's a joke),

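import json
import random
from operator import itemgetter

a = set()
for i in range(193):
    # 'with' closes each file as soon as it has been parsed
    with open("C:/Twitter/user/user_" + str(i) + ".json") as json_data:
        data = json.load(json_data)
    a.update(d['su'] for d in data)

# report the size of the de-duplicated set ('new' does not exist yet)
print("Cleaned length is: " + str(len(a)))

# pair each element with a random key, sort on that key, then drop the key
new = [(random.random(), el) for el in a]
new.sort(key=itemgetter(0))
new = [el for _, el in new]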


I don't know if it will be any faster but you could try numpy's shuffle.
