
R + ggplot2 - Cannot allocate vector of size 128.0 Mb

I have a file of 4.5MB (9,223,136 lines) with the following information:

0       0
0.0147938       3.67598e-07
0.0226194       7.35196e-07
0.0283794       1.10279e-06
0.033576        1.47039e-06
0.0383903       1.83799e-06
0.0424806       2.20559e-06
0.0465545       2.57319e-06
0.0499759       2.94079e-06

Each column represents a value from 0 to 100, i.e. a percentage. My goal is to draw a graph in ggplot2 to check how the percentages relate to each other (e.g. at 20% in column 1, what percentage is reached in column 2). Here is my R script:

library(ggplot2)
dataset <- read.table("~/R/datasets/cumul.txt.gz")
p <- ggplot(dataset, aes(V2, V1))
p <- p + geom_line()
p <- p + scale_x_continuous(formatter = "percent") + scale_y_continuous(formatter = "percent")
p <- p + theme_bw()
ggsave("~/R/grafs/cumul.png")

I'm having a problem: every time I run this, R runs out of memory with the error "Cannot allocate vector of size 128.0 Mb". I'm running 32-bit R on a Linux machine and I have about 4 GB of free memory.

I thought of a workaround that consists of reducing the precision of these values (by rounding them) and eliminating duplicate lines, so that the dataset has fewer lines. Could you give me some advice on how to do this?
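
Something along these lines is what I have in mind (a rough sketch re-using `dataset` from the script above; rounding to 4 decimal places is just a guess):

## round both columns, then drop the duplicate rows the rounding creates
rounded <- unique(round(dataset, 4))
nrow(rounded)   # hopefully far fewer than 9 million rows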


Are you sure you have 9 million lines in a 4.5 MB file (edit: perhaps your file is 4.5 GB?)? It must be heavily compressed -- when I create a file with one tenth as many lines, it's 115 MB ...

n <- 9e5                          # one tenth of the 9 million lines reported
set.seed(1001)
z <- rnorm(n)
z <- cumsum(z)/sum(z)             # a normalized cumulative series, like the OP's data
d <- data.frame(V1 = seq(0, 1, length = n), V2 = z)
ff <- gzfile("lgfile2.gz", "w")
write.table(d, row.names = FALSE, col.names = FALSE, file = ff)
close(ff)
file.info("lgfile2.gz")["size"]   # compressed size on disk

It's hard to tell from the information you've given what kind of "duplicate lines" you have in your data set ... unique(dataset) will extract just the unique rows, but that may not be useful. I would probably start by simply thinning the data set by a factor of 100 or 1000:

smdata <- dataset[seq(1, nrow(dataset), by = 1000), ]

and see how it goes from there.
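
The plotting code from the question should then run comfortably on the thinned data, e.g. (a sketch, re-using the same output path):

p <- ggplot(smdata, aes(V2, V1)) + geom_line() + theme_bw()
ggsave("~/R/grafs/cumul.png", plot = p)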

Graphical representations of large data sets are often a challenge. In general you will be better off:

  • summarizing the data somehow before plotting it
  • using a specialized graphical type (density plots, contours, hexagonal binning) that reduces the data (see the sketch after this list)
  • using base graphics, which uses a "draw and forget" model (unless graphics recording is turned on, e.g. in Windows), rather than lattice/ggplot/grid graphics, which save a complete graphical object and then render it
  • using raster or bitmap graphics (PNG etc.), which only record the state of each pixel in the image, rather than vector graphics, which save all objects whether they overlap or not
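
For the second point, a minimal sketch of hexagonal binning, assuming the dataset from the question (geom_hex in recent ggplot2 versions needs the hexbin package installed, and the output path is just an example):

library(ggplot2)
## collapse millions of points into a manageable number of hexagonal bins
p <- ggplot(dataset, aes(V2, V1)) + geom_hex(bins = 50) + theme_bw()
ggsave("~/R/grafs/cumul_hex.png", plot = p)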
