Performance problem transforming JSON data

Posted 2023-03-27 14:22

I've got some data in JSON format that I want to do some visualization on. The data (approximately 10MB of JSON) loads pretty fast, but reshaping it into a usable form takes a couple of minutes for just under 100,000 rows. I have something that works, but I think it can be done much better.

It may be easiest to understand by starting with my sample data.

Assuming you run the following command in /tmp:

curl http://public.west.spy.net/so/time-series.json.gz \
    | gzip -dc - > time-series.json

You should be able to see my desired output (after a while) here:

require(rjson)

trades <- fromJSON(file="/tmp/time-series.json")$rows


data <- do.call(rbind,
                lapply(trades,
                       function(row)
                           data.frame(date=strptime(unlist(row$key)[2], "%FT%X"),
                                      price=unlist(row$value)[1],
                                      volume=unlist(row$value)[2])))

someColors <- colorRampPalette(c("#000099", "blue", "orange", "red"),
                               space="Lab")
smoothScatter(data, colramp=someColors, xaxt="n")

days <- seq(min(data$date), max(data$date), by = 'month')
smoothScatter(data, colramp=someColors, xaxt="n")
axis(1, at=days,
     labels=strftime(days, "%F"),
     tick=FALSE)

You can get a 40x speedup by using plyr. Here is the code and the benchmarking comparison. The date conversion can be done once on the finished data frame, so I have left it out of the code to keep the comparison apples-to-apples. I am sure a faster solution exists.

f_ramnath = function(n) plyr::ldply(trades[1:n], unlist)[,-c(1, 2)]
f_dustin  = function(n) do.call(rbind, lapply(trades[1:n], 
                function(row) data.frame(
                    date   = unlist(row$key)[2],
                    price  = unlist(row$value)[1],
                    volume = unlist(row$value)[2]))
                )
f_mrflick = function(n) as.data.frame(do.call(rbind, lapply(trades[1:n], 
               function(x){
                   list(date=x$key[2], price=x$value[1], volume=x$value[2])})))

f_mbq = function(n) data.frame(
          t(sapply(trades[1:n],'[[','key')),    
          t(sapply(trades[1:n],'[[','value')))

rbenchmark::benchmark(f_ramnath(100), f_dustin(100), f_mrflick(100), f_mbq(100),
    replications = 50)

test            elapsed    relative
f_ramnath(100)    0.144    3.692308
f_dustin(100)     6.244  160.102564
f_mrflick(100)    0.039    1.000000
f_mbq(100)        0.074    1.897436

EDIT. MrFlick's solution leads to an additional 3.5x speedup. I have updated my tests.
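To illustrate the point above about converting dates after the frame is built, here is a minimal sketch. The three-row `trades` list is a hypothetical stand-in whose structure (a two-element `key` with the timestamp second, a two-element `value` with price then volume) is assumed from the code in the question, not taken from the real file.

```r
# Hypothetical stand-in for `trades`; the structure is an assumption.
trades <- list(
  list(key = c("a", "2010-01-04T09:30:00"), value = c(10.5, 100)),
  list(key = c("a", "2010-01-04T09:31:00"), value = c(10.7, 250)),
  list(key = c("a", "2010-01-05T10:00:00"), value = c(10.6, 75))
)

# Build the frame with the date still stored as text.
raw <- data.frame(
  date   = vapply(trades, function(x) x$key[2], character(1)),
  price  = vapply(trades, function(x) x$value[1], numeric(1)),
  volume = vapply(trades, function(x) x$value[2], numeric(1)),
  stringsAsFactors = FALSE
)

# Parse the whole column in a single vectorized call instead of once per row.
raw$date <- as.POSIXct(raw$date, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")
str(raw)
```

The `tz = "UTC"` argument is also an assumption; use whatever time zone the timestamps were actually recorded in.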

I received another transformation from MrFlick on IRC that was significantly faster and worth mentioning here:

data <- as.data.frame(do.call(rbind,
                              lapply(trades,
                                     function(x) {list(date=x$key[2],
                                                   price=x$value[1],
                                                   volume=x$value[2])})))

The speedup seems to come from not building a separate data frame for every row.
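One thing to watch for with this approach: row-binding plain lists gives a list-mode matrix, so the columns of the resulting data frame are themselves lists rather than atomic vectors. The sketch below shows this on a hypothetical two-row stand-in for `trades` (structure assumed from the question's code) and one possible way to flatten the columns afterwards.

```r
# Hypothetical stand-in for `trades`; the structure is an assumption.
trades <- list(
  list(key = c("a", "2010-01-04T09:30:00"), value = c(10.5, 100)),
  list(key = c("a", "2010-01-04T09:31:00"), value = c(10.7, 250))
)

# Same transformation as above: rbind over plain lists, then as.data.frame.
wide <- as.data.frame(do.call(rbind, lapply(trades, function(x) {
  list(date = x$key[2], price = x$value[1], volume = x$value[2])
})))
str(wide)   # each column is a list

# Flatten every column to an atomic vector before plotting or parsing dates.
flat <- as.data.frame(lapply(wide, unlist), stringsAsFactors = FALSE)
str(flat)
```

Flattening once at the end stays cheap because it is a single pass per column, not one operation per row.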

You are doing vectorized operations on single elements, which is very inefficient. Price and volume can be extracted like this:

t(sapply(trades,'[[','value'))

And dates like this:

strptime(sapply(trades,'[[','key')[c(FALSE,TRUE)],'%FT%X')

With a little sugar, the complete code looks like this:

data <- data.frame(
  strptime(sapply(trades, '[[', 'key')[c(FALSE, TRUE)], '%FT%X'),
  t(sapply(trades, '[[', 'value')))
names(data) <- c('date', 'price', 'volume')

On my notebook, the whole set gets converted in about 0.7s, while the first 10k rows (10%) take about 8s using the original algorithm.
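In case the logical index in the date extraction looks cryptic, here is a small sketch of what it does. It assumes each `key` is a two-element character vector with the timestamp second; the sample rows are hypothetical.

```r
# Hypothetical stand-in for `trades`; the structure is an assumption.
trades <- list(
  list(key = c("a", "2010-01-04T09:30:00"), value = c(10.5, 100)),
  list(key = c("a", "2010-01-04T09:31:00"), value = c(10.7, 250)),
  list(key = c("a", "2010-01-05T10:00:00"), value = c(10.6, 75))
)

# sapply() simplifies the keys into a 2 x n character matrix:
# row 1 holds the first key element, row 2 the timestamp.
keys <- sapply(trades, `[[`, "key")
dim(keys)

# Indexing the matrix as a plain vector with c(FALSE, TRUE) recycles the
# pattern over all elements, so it keeps elements 2, 4, 6, ...
dates_txt <- keys[c(FALSE, TRUE)]

# Under the assumed structure this is the same as taking the second row.
same <- identical(dates_txt, keys[2, ])
```

Writing `keys[2, ]` may read more clearly if every key is known to have exactly two elements.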

Is batching an option? Perhaps process 1000 rows at a time, depending on how deep your JSON is. Do you really need to transform all the data? I am not sure about R or what exactly you are dealing with, so I am thinking of a generic approach.
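If batching were tried in R, it might look roughly like the sketch below. Everything here is illustrative: `convert_chunk` and `batch_size` are hypothetical names, and the synthetic `trades` list only mimics the structure assumed from the question's code. Whether batching actually helps would need to be measured on the real data.

```r
# Synthetic stand-in for `trades` (2500 rows); the structure is an assumption.
trades <- lapply(seq_len(2500), function(i) {
  list(key = c("a", "2010-01-04T09:30:00"), value = c(10 + i / 1000, i))
})

# Hypothetical helper: convert one batch of rows with column-wise extraction.
convert_chunk <- function(rows) {
  data.frame(
    date   = vapply(rows, function(x) x$key[2], character(1)),
    price  = vapply(rows, function(x) x$value[1], numeric(1)),
    volume = vapply(rows, function(x) x$value[2], numeric(1)),
    stringsAsFactors = FALSE
  )
}

batch_size <- 1000  # assumed batch size, tune as needed

# Split the row indices into consecutive batches of at most batch_size.
idx    <- seq_along(trades)
chunks <- split(idx, ceiling(idx / batch_size))

# Convert each batch, then bind the handful of resulting frames together.
parts  <- lapply(chunks, function(i) convert_chunk(trades[i]))
result <- do.call(rbind, unname(parts))
rownames(result) <- NULL
```

Binding a few large frames is much cheaper than binding one frame per row, which is what made the original code slow.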

Also do take a look at this: http://jackson.codehaus.org/ : A High-performance JSON processor.
