Delayed execution in python for big data

2022-12-15 08:49 问答作者：

I'm trying to think about how a Python API might look for large datastores like Cassandra. R, Matlab, and NumPy tend to use the "everything is a matrix" formulation and execute each operation separately. This model has proven itself effective for data that can fit in memory. However, one of the benefits of SAS for big data is that it executes row by row, doing all the row calculations before moving to the next. 开发者_StackOverflow中文版For a datastore like Cassandra, this model seems like a huge win -- we only loop through data once.

In Python, SAS's approach might look something like:

with load('datastore') as data:
  for row in rows(data):
    row.logincome = row.log(income)
    row.rich = "Rich" if row.income > 100000 else "Poor"

This is (too?) explicit but has the advantage of only looping once. For smaller datasets, performance will be very poor compared to NumPy because the functions aren't vectorized using compiled code. In R/Numpy we would have the much more concise and compiled:

data.logincome = log(data.income)
data.rich = ifelse(data.income > 100000, "Rich", Poor")

This will execute extremely quickly because log and ifelse are both compiled functions that operator on vectors. A downside, however, is that we will loop twice. For small datasets this doesn't matter, but for a Cassandra backed datastore, I don't see how this approach works.

Question: Is there a way to keep the second API (like R/Numpy/Matlab) but delay computation. Perhaps by calling a sync(data) function at the end?

Alternative ideas? It would be nice to maintain the NumPy type syntax since users will be using NumPy for smaller operations and will have an intuitive grasp of how that works.

I don't know anything about Cassandra/NumPy, but if you adapt your second approach (using NumPy) to process data in chunks of a reasonable size, you might benefit from the CPU and/or filesystem cache and therefore prevent any slowdown caused by looping over the data twice, without giving up the benefit of using optimized processing functions.

I don't have a perfect answer, just a rough idea, but maybe it is worthwhile. It centers around Python generators, in sort of a producer-consumer style combination.

For one, as you don't want to loop twice, I think there is no way around an explicit loop for the rows, like this:

for row in rows(data):
    # do stuff with row

Now, feed the row to (an arbitrary number of) consumers that are - don't choke - generators again. But you would be using the send method of the generator. As an example for such a consumer, here is a sketch of riches:

def riches():
    rich_data = []
    while True:
        row = (yield)
        if row == None: break
        rich_data.append("Rich" if row.income > 100000 else "Poor")
    yield rich_data

The first yield (expression) is just to fuel the individual rows into riches. It does its thing, here building up a result array. After the while loop, the second yield (statement) is used to actually provide the result data to the caller.

Going back to the caller loop, it could look someting like this:

richConsumer = riches()
richConsumer.next()  # advance to first yield
for row in rows(data):
    richConsumer.send(row)
    # other consumers.send(row) here
richConsumer.send(None)  # make consumer exit its inner loop
data.rich = richConsumer.next() # collect result data

I haven't tested that code, but that's how I think about it. It doesn't have the nice compact syntax of the vector-based functions. But it makes the main loop very simple and encapsulates all processing in separate consumers. Additional consumers can be nicely stacked after each other. And the API could be further polished by pushing generator managing code behind e.g. object boundaries. HTH

继续阅读：cassandra python

Delayed execution in python for big data

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？