Improving RODBC-Postgres Write Performance
I've recently begun using RODBC to connect to PostgreSQL, as I couldn't get RPostgreSQL to compile and run on Windows x64. I've found that read performance is similar between the two packages, but write performance is not. For example, using RODBC (where z is a ~6.1M row dataframe):
library(RODBC)
con <- odbcConnect("PostgreSQL84")
#autoCommit=FALSE seems to speed things up
odbcSetAutoCommit(con, autoCommit = FALSE)
system.time(sqlSave(con, z, "ERASE111", fast = TRUE))
   user  system elapsed
 275.34  369.86 1979.59
odbcEndTran(con, commit = TRUE)
odbcCloseAll()
Whereas for the same ~6.1M row dataframe using RPostgreSQL (under 32-bit):
library(RPostgreSQL)
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname="gisdb", user="postgres", password="...")
system.time(dbWriteTable(con, "ERASE222", z))
   user  system elapsed
 467.57   56.62  668.29
dbDisconnect(con)
So, in this test, RPostgreSQL is about 3X as fast as RODBC at writing tables. This performance ratio seems to stay more-or-less constant regardless of the number of rows in the dataframe (the number of columns has far less effect). I do notice that RPostgreSQL uses something like COPY <table> FROM STDIN, while RODBC issues a long series of INSERT INTO <table> (columns...) VALUES (...) queries. I also notice that RODBC seems to choose int8 for integers, while RPostgreSQL chooses int4 where appropriate; the sketch below shows one way to control that.
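As an aside on the int8/int4 point: sqlSave() takes a varTypes argument (a named character vector of DBMS type names) that lets you pin column types yourself. A minimal sketch, assuming z has an integer column named id (the column name and table name here are just illustrations, not from my actual data); this addresses the type difference, not the speed gap:

library(RODBC)
con <- odbcConnect("PostgreSQL84")
# Map the hypothetical integer column "id" to int4 instead of RODBC's
# int8 default; columns not listed in varTypes keep the default types.
sqlSave(con, z, "ERASE333", fast = TRUE, varTypes = c(id = "int4"))
odbcClose(con)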
I need to do this kind of dataframe copy often, so I would very sincerely appreciate any advice on speeding up RODBC. For example, is this just inherent in ODBC, or am I not calling it properly?
It seems there is no immediate answer to this, so I'll post a kludgy workaround in case it is helpful for anyone.
Sharpie is correct: COPY FROM is by far the fastest way to get data into Postgres. Based on his suggestion, I've hacked together a function that gives a significant performance boost over RODBC::sqlSave(). For example, writing a 1.1 million row (by 24 column) dataframe took 960 seconds (elapsed) via sqlSave vs 69 seconds using the function below. I wouldn't have expected such a large improvement, since the data are written once to disk and then again to the database.
library(RODBC)
con <- odbcConnect("PostgreSQL90")

# Create the table and bulk-load the dataframe into it.
# (Uses the connection `con` from the calling environment.)
createTab <- function(dat, datname) {
  # Make an empty table with matching column types, saving the trouble
  # of creating it by hand; then clear out the single seed row.
  res <- sqlSave(con, dat[1, ], datname)
  res <- sqlQuery(con, paste("TRUNCATE TABLE", datname))

  # Write the dataframe to a csv in the working directory. Both
  # sqlSave() and write.csv() include row names by default, so the
  # columns line up for COPY.
  outfile <- paste(datname, ".csv", sep = "")
  write.csv(dat, outfile)
  gc()  # don't know why, but memory is not released
        # after writing a large csv?

  # Now copy the data into the table. If this doesn't work, be sure
  # that postgres has read permissions for the path.
  sqlQuery(con,
           paste("COPY ", datname, " FROM '",
                 getwd(), "/", datname,
                 ".csv' WITH NULL AS 'NA' DELIMITER ',' CSV HEADER;",
                 sep = ""))
  unlink(outfile)
}
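# Example usage: z is the dataframe from above; the table name
# "erase333" is just an illustration, not from my actual run.
createTab(z, "erase333")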
odbcClose(con)
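For what it's worth, the gain is consistent with the COPY-vs-INSERT observation above: COPY loads the whole file in a single server-side statement, whereas sqlSave pushes the rows through ODBC one INSERT at a time, so the extra round trip through disk turns out to be cheap by comparison.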