Python: create a distribution from a list based on number of items that fall within certain ranges
I tagged this question with poisson
as I am not sure if it will be helpful in this case.
I need to create a distribution (probably formatted as an image in the end) from a list of data.
For example:
data = [1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 10, 10, 10, 22, 30, 30, 35, 46, 58, 59, 59]
such that the data can be used to create a visual distribution. I might, for example in this case, say that the ranges are in 10 and there needs to be at least 3 items in each range to be a valid point.
With this example data, I would expect the result to be analogous to
ditribution = [1, 开发者_如何学编程2, 4, 6]
since I have > 3 items in ranges 0-9, 10-19, 30-39 and 50-59. Using that result I could generate an image that has the sections segmented out (darker color) that exist in my final distribution. An example of the type of image I am trying to create can be seen below and would have been generated with far more data. Ignore the blue line for now.
I know how to do this the brute force way of iterating over every item in the list and doing my calculation like that. But, my data set may have hundreds of thousands, or even millions of numbers. My range (10) and my required number of items (3) will likely be much larger in a real world example.
Thanks for any help.
If data
is always sorted, a compact approach might be:
import itertools as it
d = [k+1 for k, L in
((k, len(list(g))) for k, g in it.groupby(data,key=lambda x:x//10))
if L>=3]
If data
isn't sorted, or if you don't know, use sorted(data)
as the first argument to itertools.groupby
, instead of just data
.
If you prefer a less dense/compact approach, you can of course expand this, e.g. to:
def divby10(x): return x//10
distribution = []
for k, g in it.groupby(data, key=divby10):
L = len(list(g))
if L < 3: continue
distribution.append(k+1)
In either case, the mechanism is that groupby
first applies the callable passed as key=
to each item in the iterable passed as its first argument, to obtain each item's"key"; for each consecutive group of items which have the same "key", groupby
yields a tuple with two items: the value of the key, and an iterable over all items in said group.
Here, the key is obtained by dividing an item by 10 (with truncation); len(list(g))
is the number of consecutive items with that "key". Since the items must be consecutive, you need the data to be sorted (and, it's simpler to just sort it, than sort it "by value divided by 10 with truncation";-).
Since data
might be very lengthy, you may want to look into using numpy. It provides many useful functions for numerical work, it requires less memory to store data
in a numpy array than a Python list[*], and, since many of the numpy functions call C functions under the hood, you may be able to obtain some speed gains:
import numpy as np
data = np.array([1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 10, 10, 10, 22, 30, 30, 35, 46, 58, 59, 59])
hist,bins=np.histogram(data,bins=np.linspace(0,60,7))
print(hist)
# [11 3 1 3 1 3]
distribution=np.where(hist>=3)[0]+1
print(distribution)
# [1 2 4 6]
[*] -- Note: In the code above, a Python list was formed in the process of defining data
. So the maximum memory requirement here is actually greater than if you had just used a Python list. The memory should get freed however, if there are no other references to the Python list. Alternatively, if the data is stored on disk, numpy.loadtxt
can be used to read it directly into a numpy array.
This sounds like a job for some form of histogram. Presorting should not be necessary in order to accomplish this. I discuss using a variant of bucket sort to group nearby elements here, though you'll need to adjust this algorithm to suit your purposes. Note that you do not need to store the numbers themselves in the buckets in order to form a histogram
精彩评论