How to break a large CSV data file into individual data files using R?
I have a CSV file the first row of which contains the variables names and the rest of the rows contains the data. What's a good way to break it up into files each containing just one variable in R? Is this solution going to be robust? E.g. what if the input fi开发者_JAVA技巧le is 100G in size?
The input files looks like
var1,var2,var3
1,2,hello
2,5,yay
...
I want to create 3 (or however many variables) files var1.csv, var2.csv, var3.csv so that files resemble File1
var1
1
2
...
File2
var2?
2
5
...
File3
var3
hello
yay
I got a solution in Python (How to break a large CSV data file into individual data files?) but I wonder if R can do the same thing? Essential the Python code reads the csv file line by line and then writes the lines out one at a time. Can R do the same? The command read.csv reads the whole file all at once and this can slow the whole process down. Plus it can't read a 100G file and process it as R attempts to read the whole file into memory. I can't find a command in R that let's you read a csv file line by line. Please help. Thanks!!
You can scan
and then write
to a file(s) one line at a time.
i <- 0
while({x <- scan("file.csv", sep = ",", skip = i, nlines = 1, what = "character");
length(x) > 1}) {
write(x[1], "file1.csv", sep = ",", append = T)
write(x[2], "file2.csv", sep = ",", append = T)
write(x[3], "file3.csv", sep = ",", append = T)
i <- i + 1
}
edit!! I am using the above data, copied over 1000 times. I've done a comparison of speed when we have the file connection open at all times.
ver1 <- function() {
i <- 0
while({x <- scan("file.csv", sep = ",", skip = i, nlines = 1, what = "character");
length(x) > 1}) {
write(x[1], "file1.csv", sep = ",", append = T)
write(x[2], "file2.csv", sep = ",", append = T)
write(x[3], "file3.csv", sep = ",", append = T)
i <- i + 1
}
}
system.time(ver1()) # w/ close to 3K lines of data, 3 columns
## user system elapsed
## 2.809 0.417 3.629
ver2 <- function() {
f <- file("file.csv", "r")
f1 <- file("file1.csv", "w")
f2 <- file("file2.csv", "w")
f3 <- file("file3.csv", "w")
while({x <- scan(f, sep = ",", skip = 0, nlines = 1, what = "character");
length(x) > 1}) {
write(x[1], file = f1, sep = ",", append = T, ncol = 1)
write(x[2], file = f2, sep = ",", append = T, ncol = 1)
write(x[3], file = f3, sep = ",", append = T, ncol = 1)
}
closeAllConnections()
}
system.time(ver2())
## user system elapsed
## 0.257 0.098 0.409
精彩评论