How to break a large CSV data file into individual data files using R?

I have a CSV file whose first row contains the variable names and whose remaining rows contain the data. What's a good way to break it up into files, each containing just one variable, in R? Will the solution be robust, e.g. if the input file is 100G in size?

The input file looks like

var1,var2,var3
1,2,hello
2,5,yay
...

I want to create 3 (or however many variables there are) files, var1.csv, var2.csv, var3.csv, so that the files resemble the following.

File1

var1
1
2
...

File2

var2
2
5
...

File3

var3
hello
yay

I got a solution in Python (How to break a large CSV data file into individual data files?), but I wonder whether R can do the same thing. Essentially, the Python code reads the CSV file line by line and writes the lines out one at a time. Can R do the same? The command read.csv reads the whole file at once, which can slow the whole process down. It also can't handle a 100G file, because R attempts to read the whole file into memory. I can't find a command in R that lets you read a CSV file line by line. Please help. Thanks!!


You can scan the input and then write to the output file(s) one line at a time.

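i <- 0
# Given a file name, scan() reopens the file on every call, so skip = i
# moves past the lines already handled. Stop once a line has fewer than 2 fields.
while({x <- scan("file.csv", sep = ",", skip = i, nlines = 1, what = "character");
       length(x) > 1}) {
  # Append each field to the output file for its column
  write(x[1], "file1.csv", sep = ",", append = TRUE)
  write(x[2], "file2.csv", sep = ",", append = TRUE)
  write(x[3], "file3.csv", sep = ",", append = TRUE)
  i <- i + 1
}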
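
One thing to keep in mind with the loop above: because scan is given a file name, every call opens the file again and skips i lines before reading, so the cost per line grows as the file gets longer. An open connection, by contrast, remembers its position between reads. Here is a minimal sketch of that behaviour using readLines on a temporary file (the temporary file and the names `tmp`, `con`, `first` and `second` are just for illustration, not part of the original data):

```r
# Build a tiny CSV in a temporary file (illustrative data only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("var1,var2,var3", "1,2,hello", "2,5,yay"), tmp)

# Open the connection once; each read continues where the last one stopped
con <- file(tmp, "r")
first  <- readLines(con, n = 1L)   # "var1,var2,var3"
second <- readLines(con, n = 1L)   # "1,2,hello"
close(con)
```

No skip argument is needed here, since the connection itself keeps track of how far it has read.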

Edit: I am using the data above, copied over 1000 times. I compared the speed of the original version against one that keeps the file connections open the whole time.

ver1 <- function() {
  i <- 0
  # Reopens file.csv on every iteration and skips the lines already handled
  while({x <- scan("file.csv", sep = ",", skip = i, nlines = 1, what = "character");
         length(x) > 1}) {
    write(x[1], "file1.csv", sep = ",", append = TRUE)
    write(x[2], "file2.csv", sep = ",", append = TRUE)
    write(x[3], "file3.csv", sep = ",", append = TRUE)
    i <- i + 1
  }
}

system.time(ver1()) # w/ close to 3K lines of data, 3 columns
##    user  system elapsed 
##   2.809   0.417   3.629 

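ver2 <- function() {
  # Open every connection once; scan() then continues from the current position
  f <- file("file.csv", "r")
  f1 <- file("file1.csv", "w")
  f2 <- file("file2.csv", "w")
  f3 <- file("file3.csv", "w")
  # Close only the connections opened here, even if an error occurs,
  # instead of calling closeAllConnections()
  on.exit({close(f); close(f1); close(f2); close(f3)})
  while({x <- scan(f, sep = ",", skip = 0, nlines = 1, what = "character");
         length(x) > 1}) {
    write(x[1], file = f1, sep = ",", append = TRUE, ncolumns = 1)
    write(x[2], file = f2, sep = ",", append = TRUE, ncolumns = 1)
    write(x[3], file = f3, sep = ",", append = TRUE, ncolumns = 1)
  }
}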

system.time(ver2())
##   user  system elapsed 
##   0.257   0.098   0.409 
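
The versions above hardcode three columns and the output names file1.csv to file3.csv. For "however many variables", the same open-connection idea could be generalized roughly as below. This is only a sketch under some assumptions: `split_csv_by_column`, `out_dir` and `chunk_lines` are names made up for illustration, the input is assumed to be non-empty, fields are assumed to contain no quoted commas or embedded newlines, and every row is assumed to have a value for every column. It reads a block of lines at a time, so memory use depends on `chunk_lines` rather than on the file size; it has not been timed on a 100G file.

```r
# Hypothetical helper, not from the original answer: writes one file per column,
# named after the header (var1.csv, var2.csv, ...).
split_csv_by_column <- function(path, out_dir = ".", chunk_lines = 10000L) {
  con <- file(path, "r")
  on.exit(close(con), add = TRUE)

  # The first line holds the variable names; open one output connection per name
  header <- strsplit(readLines(con, n = 1L), ",", fixed = TRUE)[[1]]
  out_paths <- file.path(out_dir, paste0(header, ".csv"))
  outs <- lapply(out_paths, function(p) file(p, "w"))
  on.exit(for (o in outs) close(o), add = TRUE)
  for (j in seq_along(header)) writeLines(header[j], outs[[j]])

  repeat {
    # Read the next block; an empty result means end of file
    lines <- readLines(con, n = chunk_lines)
    if (length(lines) == 0L) break
    lines <- lines[nzchar(lines)]          # ignore blank lines
    if (length(lines) == 0L) next
    # Assumption: plain comma-separated fields, no quoting
    fields <- strsplit(lines, ",", fixed = TRUE)
    for (j in seq_along(header)) {
      writeLines(vapply(fields, function(f) f[j], character(1)), outs[[j]])
    }
  }
  invisible(out_paths)
}
```

Calling `split_csv_by_column("file.csv")` on the sample input would then be expected to produce var1.csv, var2.csv and var3.csv in the working directory. If the data can contain quoted fields, the strsplit step would need to be replaced with a real CSV parser.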