Binning into timeslots - Is there a better way than using list comp?

2023-01-02 21:03 问答作者：

I have a dataset of events (tweets to be specific) that I am trying to bin / discretize. The following code seems to work fine so far (assuming 100 bins):

HOUR = timedelta(hours=1)
start = datetime.datetime(2009,01,01)
z = [dt + x*HOUR for x in xrange(1, 100)]

But then, I came across this fateful line at python docs 'This makes possible an idiom for clustering a data series into n-length groups using zip(*[iter(s)]*n)'. The zip idiom does indeed work 开发者_开发知识库- but I can't understand how (what is the * operator for instance?). How could I use to make my code prettier? I'm guessing this means I should make a generator / iterable for time that yields the time in graduations of an HOUR?

I will try to explain zip(*[iter(s)]*n) in terms of a simpler example:

imagine you have the list s = [1, 2, 3, 4, 5, 6]

iter(s) gives you a listiterator object that will yield the next number from s each time you ask for an element.

[iter(s)] * n gives you the list with iter(s) in it n times e.g. [iter(s)] * 2 = [<listiterator object>, <listiterator object>] - the key here is that these are 2 references to the same iterator object, not 2 distinct iterator objects.

zip takes a number of sequences and returns a list of tuples where each tuple contains the ith element from each of the sequences. e.g. zip([1,2], [3,4], [5,6]) = [(1, 3, 5), (2, 4, 6)] where (1, 3, 5) are the first elements from the parameters passed to zip and (2, 4, 6) are the second elements from the parameters passed to zip.

The * in front of *[iter(s)]*n converts the [iter(s)]*n from being a list into being multiple parameters being passed to zip. so if n is 2 we get zip(<listiterator object>, <listiterator object>)

zip will request the next element from each of its parameters but because these are both references to the same iterator this will result in (1, 2), it does the same again resulting in (3, 4) and again resulting in (5, 6) and then there are no more elements so it stops. Hence the result [(1, 2), (3, 4), (5, 6)]. This is the clustering a data series into n-length groups as mentioned.

The expression from the docs looks like this:

zip(*[iter(s)]*n)

This is equivalent to:

it = iter(s)
zip(*[it, it, ..., it]) # n times

The [...]*n repeats the list n times, and this results in a list that contains nreferences to the same iterator.

This is again equal to:

it = iter(s)
zip(it, it, ..., it)    # turning a list into positional parameters

The * before the list turns the list elements into positional parameters of the function call.

Now, when zip is called, it starts from left to right to call the iterators to obtain elements that should be grouped together. Since all parameters refer to the same iterator, this yields the first n elements of the initial sequence. Then that process continues for the second group in the resulting list, and so on.

The result is the same as if you had constructed the list like this (evaluated from left to right):

it = iter(s)
[(it.next(), it.next(), ..., it.next()), (it.next(), it.next(), ..., it.next()),  ...]

继续阅读：python

Binning into timeslots - Is there a better way than using list comp?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？