Python: create a distribution from a list based on number of items that fall within certain ranges

2023-01-12 14:53 问答作者：

I tagged this question with poisson as I am not sure if it will be helpful in this case.

I need to create a distribution (probably formatted as an image in the end) from a list of data.

For example:

data = [1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 10, 10, 10, 22, 30, 30, 35, 46, 58, 59, 59]

such that the data can be used to create a visual distribution. I might, for example in this case, say that the ranges are in 10 and there needs to be at least 3 items in each range to be a valid point.

With this example data, I would expect the result to be analogous to

ditribution = [1, 开发者_如何学编程2, 4, 6]

since I have > 3 items in ranges 0-9, 10-19, 30-39 and 50-59. Using that result I could generate an image that has the sections segmented out (darker color) that exist in my final distribution. An example of the type of image I am trying to create can be seen below and would have been generated with far more data. Ignore the blue line for now.

I know how to do this the brute force way of iterating over every item in the list and doing my calculation like that. But, my data set may have hundreds of thousands, or even millions of numbers. My range (10) and my required number of items (3) will likely be much larger in a real world example.

Python: create a distribution from a list based on number of items that fall within certain ranges

Thanks for any help.

If data is always sorted, a compact approach might be:

import itertools as it

d = [k+1 for k, L in
         ((k, len(list(g))) for k, g in it.groupby(data,key=lambda x:x//10))
     if L>=3]

If data isn't sorted, or if you don't know, use sorted(data) as the first argument to itertools.groupby, instead of just data.

If you prefer a less dense/compact approach, you can of course expand this, e.g. to:

def divby10(x): return x//10

distribution = []
for k, g in it.groupby(data, key=divby10):
    L = len(list(g))
    if L < 3: continue
    distribution.append(k+1)

In either case, the mechanism is that groupby first applies the callable passed as key= to each item in the iterable passed as its first argument, to obtain each item's"key"; for each consecutive group of items which have the same "key", groupby yields a tuple with two items: the value of the key, and an iterable over all items in said group.

Here, the key is obtained by dividing an item by 10 (with truncation); len(list(g)) is the number of consecutive items with that "key". Since the items must be consecutive, you need the data to be sorted (and, it's simpler to just sort it, than sort it "by value divided by 10 with truncation";-).

Since data might be very lengthy, you may want to look into using numpy. It provides many useful functions for numerical work, it requires less memory to store data in a numpy array than a Python list[*], and, since many of the numpy functions call C functions under the hood, you may be able to obtain some speed gains:

import numpy as np

data = np.array([1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 10, 10, 10, 22, 30, 30, 35, 46, 58, 59, 59])

hist,bins=np.histogram(data,bins=np.linspace(0,60,7))
print(hist)
# [11  3  1  3  1  3]

distribution=np.where(hist>=3)[0]+1
print(distribution)
# [1 2 4 6]

[*] -- Note: In the code above, a Python list was formed in the process of defining data. So the maximum memory requirement here is actually greater than if you had just used a Python list. The memory should get freed however, if there are no other references to the Python list. Alternatively, if the data is stored on disk, numpy.loadtxt can be used to read it directly into a numpy array.

This sounds like a job for some form of histogram. Presorting should not be necessary in order to accomplish this. I discuss using a variant of bucket sort to group nearby elements here, though you'll need to adjust this algorithm to suit your purposes. Note that you do not need to store the numbers themselves in the buckets in order to form a histogram

继续阅读：distribution poisson python

Python: create a distribution from a list based on number of items that fall within certain ranges

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？