Append column to an SQLite table in R using RSQLite
I have two large data frames that I would like to outer join with merge()
, but the joined table is too large for RAM. My worked around is to use the RSQLite
package to to the outer join and store the joined table back to the database.
I would like to use an R function on the columns in this joined table, but I can't figure out how to append a column to the joined table. I know how to do it with dbWriteTable()
(shown below), but that's not an option as the joined table is larger than RAM.
library(RSQLite)
left <- data.frame(let = letters[rep(1:4, each = 5)], num = 1:20)
right <- data.frame(let = letters[rep(1:4, each = 5)], num = 21:40)
con <- dbConnect(dbDriver("SQLite"), dbname = tempfile())
dbWriteTable(con, "left_table", left, row.names = F)
dbWriteTable(con, "right_table", right, row.names = F)
dbGetQuery(con, "CREATE TABLE merged_table (letters TEXT, left_num INTEGER, right_num INTEGER)")
dbGetQuery(con, "INSERT INTO merged_table SELECT * FROM left_table LEFT OUTER JOIN right_table USING (let)")
fun <- function(x) rowSums(x)
temp <- dbReadTable(con, "merged_table")
dbWriteTable(con, "merged_table_new", cbind(temp, fun(temp[, 2:3])))
dbDisconnect(con)
I have heard that data bases work on rows, so I suspect the correct solution may just开发者_运维知识库 cycle through the rows, appending an entry to each row, but I'm not sure how to implement. Thanks!
(And there's nothing sacred about SQLite here, I just thought that it would be better for this ad hoc analysis.)
Edit: I learned about the bind.data
option in dbGetPreparedQuery()
and realized that I need a read and a write connection to the database, but I am still having some problems (i.e., the data doesn't INSERT
to the database). The script runs without error, but also without the desired result.
library(RSQLite)
left <- data.frame(let = letters[rep(1:4, each = 5)], num = 1:20)
right <- data.frame(let = letters[rep(1:4, each = 5)], num = 21:40)
my.tempfile <- tempfile()
con.write <- dbConnect(dbDriver("SQLite"), dbname = my.tempfile)
con.read <- dbConnect(dbDriver("SQLite"), dbname = my.tempfile)
dbWriteTable(con.write, "left_table", left, row.names = F)
dbWriteTable(con.write, "right_table", right, row.names = F)
dbGetQuery(con.write, "CREATE TABLE merged_table (letters TEXT, left_num INTEGER, right_num INTEGER)")
dbGetQuery(con.write, "INSERT INTO merged_table SELECT * FROM left_table LEFT OUTER JOIN right_table USING (let)")
dbGetQuery(con.write, "ALTER TABLE merged_table ADD COLUMN sum INTEGER")
dbGetQuery(con.write, "ALTER TABLE merged_table ADD COLUMN mean INTEGER")
res <- dbSendQuery(con.read, "SELECT left_num, right_num FROM merged_table")
while (!dbHasCompleted(res)) {
data.1 <- fetch(res)
data.2 <- data.frame(rowSums(data.1), rowMeans(data.1))
dbGetPreparedQuery(con.write, "INSERT INTO merged_table (sum, mean) VALUES (?, ?)", bind.data = data.2)
}
dbClearResult(res)
dbGetQuery(con.read, "SELECT * FROM merged_table LIMIT 5")
gives
letters left_num right_num sum mean
1 a 1 21 NA NA
2 a 1 22 NA NA
3 a 1 23 NA NA
4 a 1 24 NA NA
5 a 1 25 NA NA
but I expected
left_num right_num sum mean
1 1 21 22 11.0
2 1 22 23 11.5
3 1 23 24 12.0
4 1 24 25 12.5
5 1 25 26 13.0
An SQLite expert may be able to improve on this solution, but you can accomplish this using a single query by running this right after you create merged_table
:
dbGetQuery(con, "INSERT INTO merged_table SELECT
letters,left_num,right_num,left_num+right_num row_sum FROM
(SELECT let letters,left_table.num left_num, right_table.num right_num FROM
left_table LEFT OUTER JOIN right_table USING (let))")
That's kind of ugly SQL, I suppose, but it seems to work. If you have more than two columns to add, you can build up the column addition part of the query using paste
in R, if need be.
Other things to investigate might be adding the additional column using ALTER TABLE
and then doing bulk updates in batches. I played with that for a minute or two but couldn't make it work, but that certainly doesn't mean it isn't possible.
EDIT
The following code creates the output you desire. I'm sort of in a hurry at the moment (on my way out the door) so the while
loop throws an error because the loop reaches the end of the data before the exit condition has been reached, so the last time through you have an empty data.1
data frame which causes an error in bind.data
. But if you run the last query, you'll see all the data have been inserted.
library(RSQLite)
left <- data.frame(let = letters[rep(1:4, each = 5)], num = 1:20)
right <- data.frame(let = letters[rep(1:4, each = 5)], num = 21:40)
conn <- dbConnect(dbDriver("SQLite"), dbname = "sotemp.db")
conn.copy <- dbConnect(dbDriver("SQLite"), dbname = "sotempCopy.db")
dbWriteTable(conn, "left_table", left, row.names = F)
dbWriteTable(conn, "right_table", right, row.names = F)
dbGetQuery(conn, "CREATE TABLE merged_table1 (letters TEXT, left_num INTEGER, right_num INTEGER)")
dbGetQuery(conn.copy, "CREATE TABLE merged_table2 (letters TEXT, left_num INTEGER, right_num INTEGER, rowSum INTEGER,
rowMean REAL)")
dbGetQuery(conn, "INSERT INTO merged_table1 SELECT * FROM left_table LEFT OUTER JOIN right_table USING (let)")
res <- dbSendQuery(conn, "SELECT letters, left_num, right_num FROM merged_table1")
while (!dbHasCompleted(res)) {
data.1 <- fetch(res,n=5)
data.1 <- cbind(data.1,rowSums(data.1[,2:3]),rowMeans(data.1[,2:3]))
colnames(data.1)[4:5] <- c("rowSum","rowMean")
dbGetPreparedQuery(conn.copy, "INSERT INTO merged_table2 (letters, left_num, right_num,rowSum, rowMean) VALUES
(?, ?, ?, ?, ?)", bind.data = data.1)
}
dbClearResult(res)
dbGetQuery(conn.copy, "SELECT * FROM merged_table2")
This is by no means perfect. I hope others will swing by and edit/add to this. A few notes about why your solution wasn't working.
First, INSERT
adds new rows to a table. You seemed to be expecting it to alter values in a column, which is a job usually done with UPDATE
, which can get cumbersome.
Second, I'm not sure the separate read/write connections buy you anything. SQLite
doesn't support complete read/write concurrency, even using separate connections. So until you clear the results of a SELECT
you're going to get locking errors when trying to write.
Third, the strategy I used here for batch processing was to create two separate db's, loop through the results of a SELECT
in the first, create the new columns in R and then INSERT
the resulting data into the second db.
精彩评论