Monitoring R data loading progress when using read.table [duplicate]
I've found a lot of answers for other types of data loading, but none for showing progress when R is reading data using read.table(...). I've got a simple command:
data = read.table(file=filename,
                  sep="\t",
                  col.names=c("time","id","x","y"),
                  colClasses=c("integer","NULL","NULL","NULL"))
This loads a large amount of data in about 30 seconds or so, but a progress bar would be really nice :-D
Continuing experiments:
Construct a temporary working file:
n <- 1e7
dd <- data.frame(time=1:n,id=rep("a",n),x=1:n,y=1:n)
fn <- tempfile()
write.table(dd,file=fn,sep="\t",row.names=FALSE,col.names=FALSE)
Run 10 replications with read.table (with and without colClasses specified) and scan:
edit: corrected scan call in response to comment, updated results:
library(rbenchmark)
(b1 <- benchmark(read.table(fn,
                            col.names=c("time","id","x","y"),
                            colClasses=c("integer",
                                         "NULL","NULL","NULL")),
                 read.table(fn,
                            col.names=c("time","id","x","y")),
                 scan(fn,
                      what=list(integer(),NULL,NULL,NULL)),replications=10))
Results:
  test
2 read.table(fn, col.names = c("time", "id", "x", "y"))
1 read.table(fn, col.names = c("time", "id", "x", "y"),
      colClasses = c("integer", "NULL", "NULL", "NULL"))
3 scan(fn, what = list(integer(), NULL, NULL, NULL))
  replications elapsed relative user.self sys.self
2           10 278.064 1.857016   232.786   30.722
1           10 149.737 1.011801   141.365    2.388
3           10 143.118 1.000000   140.617    2.105
(Warning: these values are slightly cooked/inconsistent because I re-ran the benchmark and merged the results, but the qualitative result should be OK.)
read.table without colClasses is slowest (that's not surprising), but only (?) about 85% slower than scan for this example. scan is only a tiny bit faster than read.table with colClasses specified.
With either scan or read.table one could write a "chunked" version that used the skip and nrows (read.table) or n (scan) parameters to read bits of the file at a time, then paste them together at the end. I don't know how much this would slow the process down, but it would allow calls to txtProgressBar between chunks ...
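A minimal sketch of that chunked idea, untested against the timings above. The names read_with_progress, total_rows and chunk_size are hypothetical, and the row count is assumed to be known (or at least estimated) in advance, since it only sets the scale of the progress bar. Instead of skip, this version keeps a file connection open so that each read.table call continues where the previous one stopped; that is a substitution for the skip-based approach described above, chosen to avoid rescanning the file from the start for every chunk.

```r
# Hypothetical helper: read a tab-separated file chunk by chunk and
# update a text progress bar after each chunk.
read_with_progress <- function(path, total_rows, chunk_size = 1e5) {
  con <- file(path, open = "r")   # open once; reads continue from the current position
  on.exit(close(con))
  pb <- txtProgressBar(min = 0, max = total_rows, style = 3)
  chunks <- list()
  rows_read <- 0
  repeat {
    # Simplification: any error is treated as "no more input".
    chunk <- tryCatch(
      read.table(con, sep = "\t", nrows = chunk_size,
                 col.names = c("time", "id", "x", "y"),
                 colClasses = c("integer", "NULL", "NULL", "NULL")),
      error = function(e) NULL)
    if (is.null(chunk) || nrow(chunk) == 0L) break
    chunks[[length(chunks) + 1L]] <- chunk
    rows_read <- rows_read + nrow(chunk)
    # Cap at total_rows so an underestimated total does not stall the bar.
    setTxtProgressBar(pb, min(rows_read, total_rows))
    if (nrow(chunk) < chunk_size) break   # short chunk: end of file reached
  }
  close(pb)
  do.call(rbind, chunks)   # paste the chunks together at the end
}
```

With the temporary file built earlier this would be called as read_with_progress(fn, total_rows = n). A smaller chunk_size gives a smoother bar but more read.table calls, so the overhead is worth benchmarking before relying on it.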