Performance problem transforming JSON data

I have some data in JSON format that I want to visualize. The data (approximately 10MB of JSON) loads pretty fast, but reshaping it into a usable form takes a couple of minutes for just under 100,000 rows. I have something that works, but I think it can be done much better.

It may be easiest to understand by starting with my sample data.

Assuming you run the following command in /tmp:

curl http://public.west.spy.net/so/time-series.json.gz \
    | gzip -dc - > time-series.json

You should be able to see my desired output (after a while) here:

require(rjson)

trades <- fromJSON(file="/tmp/time-series.json")$rows


data <- do.call(rbind,
                lapply(trades,
                       function(row)
                           data.frame(date=strptime(unlist(row$key)[2], "%FT%X"),
                                      price=unlist(row$value)[1],
                                      volume=unlist(row$value)[2])))

someColors <- colorRampPalette(c("#000099", "blue", "orange", "red"),
                               space="Lab")
smoothScatter(data, colramp=someColors, xaxt="n")

days <- seq(min(data$date), max(data$date), by = 'month')
smoothScatter(data, colramp=someColors, xaxt="n")
axis(1, at=days,
     labels=strftime(days, "%F"),
     tick=FALSE)


You can get a 40x speedup by using plyr. Here is the code and the benchmarking comparison. The conversion to date can be done once you have the data frame and hence I have removed it from the code to facilitate apples-to-apples comparison. I am sure a faster solution exists.

f_ramnath = function(n) plyr::ldply(trades[1:n], unlist)[,-c(1, 2)]
f_dustin  = function(n) do.call(rbind, lapply(trades[1:n], 
                function(row) data.frame(
                    date   = unlist(row$key)[2],
                    price  = unlist(row$value)[1],
                    volume = unlist(row$value)[2]))
                )
f_mrflick = function(n) as.data.frame(do.call(rbind, lapply(trades[1:n], 
               function(x){
                   list(date=x$key[2], price=x$value[1], volume=x$value[2])})))

f_mbq = function(n) data.frame(
          t(sapply(trades[1:n],'[[','key')),    
          t(sapply(trades[1:n],'[[','value')))

rbenchmark::benchmark(f_ramnath(100), f_dustin(100), f_mrflick(100), f_mbq(100),
    replications = 50)

test            elapsed    relative
f_ramnath(100)    0.144    3.692308
f_dustin(100)     6.244  160.102564
f_mrflick(100)    0.039    1.000000
f_mbq(100)        0.074    1.897436

EDIT. MrFlick's solution leads to an additional 3.5x speedup. I have updated my tests.


I received another transformation from MrFlick on IRC that was significantly faster and is worth mentioning here:

data <- as.data.frame(do.call(rbind,
                              lapply(trades,
                                     function(x) {list(date=x$key[2],
                                                   price=x$value[1],
                                                   volume=x$value[2])})))

The speedup seems to come from not building a data frame for each row.
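One thing to watch with this approach: rbind on plain lists produces a list matrix, so the resulting data frame has list columns rather than atomic ones. The following is a minimal sketch of flattening them afterwards. The two-row `trades` object is a hypothetical stand-in for the parsed JSON (its structure is an assumption, not the real data), and the explicit date format is used here in place of "%FT%X" only to avoid locale-dependent parsing.

```r
# Hypothetical stand-in for fromJSON(...)$rows; the real structure is assumed
trades <- list(
  list(key = c("a", "2011-01-03T09:30:00"), value = c(10.5, 100)),
  list(key = c("a", "2011-01-03T09:31:00"), value = c(10.7, 250))
)

# Same idea as above: build plain lists, not data frames, for each row
rows <- lapply(trades, function(x)
  list(date = x$key[2], price = x$value[1], volume = x$value[2]))
data <- as.data.frame(do.call(rbind, rows))

# At this point every column is a list
was_list <- is.list(data$price)

# Flatten each column to an atomic vector, then convert the dates once
data[] <- lapply(data, unlist)
data$date <- as.POSIXct(strptime(data$date, "%Y-%m-%dT%H:%M:%S", tz = "UTC"))
```

After the flattening step, price and volume are ordinary numeric columns, which is what plotting functions generally expect.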


You are doing vectorized operations on single elements, which is very inefficient. Price and volume can be extracted like this:

t(sapply(trades,'[[','value'))

And dates like this:

strptime(sapply(trades, '[[', 'key')[c(FALSE, TRUE)], '%FT%X')

With a little sugar, the complete code looks like this:

data <- data.frame(
  strptime(sapply(trades, '[[', 'key')[c(FALSE, TRUE)], '%FT%X'),
  t(sapply(trades, '[[', 'value')))
names(data) <- c('date', 'price', 'volume')

On my notebook, the whole set gets converted in about 0.7 s, while the first 10k rows (10%) alone take about 8 s using the original algorithm.
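For anyone who wants to try the column-wise idea without downloading the data, here is a small self-contained sketch. It is a variant, not the code above: it swaps sapply for vapply so the result types are checked, and it uses `[[` and unlist so it works whether the parser returns key and value as vectors or as lists. The three-row `trades` object is made up and only assumed to mirror the real structure.

```r
# Hypothetical stand-in for fromJSON(...)$rows; the real structure is assumed
trades <- list(
  list(key = c("a", "2011-01-03T09:30:00"), value = c(10.5, 100)),
  list(key = c("a", "2011-01-03T09:31:00"), value = c(10.7, 250)),
  list(key = c("a", "2011-01-03T09:32:00"), value = c(10.6, 175))
)

# Pull out the second key element (the timestamp) from every row
keys <- vapply(trades, function(x) x$key[[2]], character(1))

# vapply returns a 2 x n matrix here; transpose it to n x 2
values <- t(vapply(trades, function(x) unlist(x$value), numeric(2)))

# Convert all dates in one vectorized call and assemble the frame once
data <- data.frame(
  date   = as.POSIXct(keys, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC"),
  price  = values[, 1],
  volume = values[, 2])
```

The point is the same as in the answer: each column is extracted over the whole list at once, and the date parsing runs once on a full vector instead of once per row.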


Is batching an option? You could process, say, 1,000 rows at a time, depending on how deeply nested your JSON is. Do you really need to transform all the data? I am not sure about R or what exactly you are dealing with, so I am only suggesting a generic approach.
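In R, the batching idea might look roughly like the sketch below. Everything in it is illustrative: the `trades` object is generated synthetic data with an assumed structure, and `convert_chunk` and `chunk_size` are hypothetical names introduced here. Whether batching actually helps for this data set is not something this sketch demonstrates.

```r
# Synthetic stand-in for the parsed rows; the real structure is assumed
trades <- lapply(1:10, function(i)
  list(key = c("a", sprintf("2011-01-03T09:%02d:00", i)),
       value = c(10 + i / 10, 100 * i)))

# Hypothetical helper: convert one batch of rows into a data frame
convert_chunk <- function(rows) {
  data.frame(
    date   = vapply(rows, function(x) x$key[[2]], character(1)),
    price  = vapply(rows, function(x) unlist(x$value)[1], numeric(1)),
    volume = vapply(rows, function(x) unlist(x$value)[2], numeric(1)),
    stringsAsFactors = FALSE)
}

chunk_size <- 4  # illustrative; something like 1000 for real data

# Assign each row to a batch, split the list, convert, and stack the results
chunk_id <- ceiling(seq_along(trades) / chunk_size)
chunks <- split(trades, chunk_id)
data <- do.call(rbind, lapply(chunks, convert_chunk))
rownames(data) <- NULL
```

With 10 rows and a batch size of 4 this produces three batches, and only three data frames are built and bound together instead of one per row.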

Also take a look at Jackson, a high-performance JSON processor: http://jackson.codehaus.org/
