开发者

Append column to an SQLite table in R using RSQLite

I have two large data frames that I would like to outer join with merge(), but the joined table is too large for RAM. My worked around is to use the RSQLite package to to the outer join and store the joined table back to the database.

I would like to use an R function on the columns in this joined table, but I can't figure out how to append a column to the joined table. I know how to do it with dbWriteTable() (shown below), but that's not an option as the joined table is larger than RAM.

library(RSQLite)
left <- data.frame(let = letters[rep(1:4, each = 5)], num = 1:20)
right <- data.frame(let = letters[rep(1:4, each = 5)], num = 21:40)
con <- dbConnect(dbDriver("SQLite"), dbname = tempfile())
dbWriteTable(con, "left_table", left, row.names = F)
dbWriteTable(con, "right_table", right, row.names = F)
dbGetQuery(con, "CREATE TABLE merged_table (letters TEXT, left_num INTEGER, right_num INTEGER)")
dbGetQuery(con, "INSERT INTO merged_table SELECT * FROM left_table LEFT OUTER JOIN right_table USING (let)")
fun <- function(x) rowSums(x)
temp <- dbReadTable(con, "merged_table")
dbWriteTable(con, "merged_table_new", cbind(temp, fun(temp[, 2:3])))
dbDisconnect(con)

I have heard that data bases work on rows, so I suspect the correct solution may just开发者_运维知识库 cycle through the rows, appending an entry to each row, but I'm not sure how to implement. Thanks!

(And there's nothing sacred about SQLite here, I just thought that it would be better for this ad hoc analysis.)


Edit: I learned about the bind.data option in dbGetPreparedQuery() and realized that I need a read and a write connection to the database, but I am still having some problems (i.e., the data doesn't INSERT to the database). The script runs without error, but also without the desired result.

library(RSQLite)
left <- data.frame(let = letters[rep(1:4, each = 5)], num = 1:20)
right <- data.frame(let = letters[rep(1:4, each = 5)], num = 21:40)
my.tempfile <- tempfile()
con.write <- dbConnect(dbDriver("SQLite"), dbname = my.tempfile)
con.read <- dbConnect(dbDriver("SQLite"), dbname = my.tempfile)
dbWriteTable(con.write, "left_table", left, row.names = F)
dbWriteTable(con.write, "right_table", right, row.names = F)
dbGetQuery(con.write, "CREATE TABLE merged_table (letters TEXT, left_num INTEGER, right_num INTEGER)")
dbGetQuery(con.write, "INSERT INTO merged_table SELECT * FROM left_table LEFT OUTER JOIN right_table USING (let)")
dbGetQuery(con.write, "ALTER TABLE merged_table ADD COLUMN sum INTEGER")
dbGetQuery(con.write, "ALTER TABLE merged_table ADD COLUMN mean INTEGER")

res <- dbSendQuery(con.read, "SELECT left_num, right_num FROM merged_table")
while (!dbHasCompleted(res)) {
    data.1 <- fetch(res)
    data.2 <- data.frame(rowSums(data.1), rowMeans(data.1))
    dbGetPreparedQuery(con.write, "INSERT INTO merged_table (sum, mean) VALUES (?, ?)", bind.data = data.2)
}
dbClearResult(res)

dbGetQuery(con.read, "SELECT * FROM merged_table LIMIT 5")

gives

  letters left_num right_num sum mean
1       a        1        21  NA   NA
2       a        1        22  NA   NA
3       a        1        23  NA   NA
4       a        1        24  NA   NA
5       a        1        25  NA   NA

but I expected

  left_num right_num sum mean
1        1        21  22 11.0
2        1        22  23 11.5
3        1        23  24 12.0
4        1        24  25 12.5
5        1        25  26 13.0


An SQLite expert may be able to improve on this solution, but you can accomplish this using a single query by running this right after you create merged_table:

dbGetQuery(con, "INSERT INTO merged_table SELECT
                  letters,left_num,right_num,left_num+right_num row_sum FROM 
                    (SELECT let letters,left_table.num left_num, right_table.num right_num FROM 
                        left_table LEFT OUTER JOIN right_table USING (let))")

That's kind of ugly SQL, I suppose, but it seems to work. If you have more than two columns to add, you can build up the column addition part of the query using paste in R, if need be.

Other things to investigate might be adding the additional column using ALTER TABLE and then doing bulk updates in batches. I played with that for a minute or two but couldn't make it work, but that certainly doesn't mean it isn't possible.

EDIT

The following code creates the output you desire. I'm sort of in a hurry at the moment (on my way out the door) so the while loop throws an error because the loop reaches the end of the data before the exit condition has been reached, so the last time through you have an empty data.1 data frame which causes an error in bind.data. But if you run the last query, you'll see all the data have been inserted.

library(RSQLite)
left <- data.frame(let = letters[rep(1:4, each = 5)], num = 1:20)
right <- data.frame(let = letters[rep(1:4, each = 5)], num = 21:40)
conn <- dbConnect(dbDriver("SQLite"), dbname = "sotemp.db")
conn.copy <- dbConnect(dbDriver("SQLite"), dbname = "sotempCopy.db")
dbWriteTable(conn, "left_table", left, row.names = F)
dbWriteTable(conn, "right_table", right, row.names = F)
dbGetQuery(conn, "CREATE TABLE merged_table1 (letters TEXT, left_num INTEGER, right_num INTEGER)")
dbGetQuery(conn.copy, "CREATE TABLE merged_table2 (letters TEXT, left_num INTEGER, right_num INTEGER, rowSum INTEGER,
                            rowMean REAL)")

dbGetQuery(conn, "INSERT INTO merged_table1 SELECT * FROM left_table LEFT OUTER JOIN right_table USING (let)")

res <- dbSendQuery(conn, "SELECT letters, left_num, right_num FROM merged_table1")
while (!dbHasCompleted(res)) {
    data.1 <- fetch(res,n=5)
    data.1 <- cbind(data.1,rowSums(data.1[,2:3]),rowMeans(data.1[,2:3]))
    colnames(data.1)[4:5] <- c("rowSum","rowMean")
    dbGetPreparedQuery(conn.copy, "INSERT INTO merged_table2 (letters, left_num, right_num,rowSum, rowMean) VALUES 
                        (?, ?, ?, ?, ?)", bind.data = data.1)
}
dbClearResult(res)

dbGetQuery(conn.copy, "SELECT * FROM merged_table2")

This is by no means perfect. I hope others will swing by and edit/add to this. A few notes about why your solution wasn't working.

First, INSERT adds new rows to a table. You seemed to be expecting it to alter values in a column, which is a job usually done with UPDATE, which can get cumbersome.

Second, I'm not sure the separate read/write connections buy you anything. SQLite doesn't support complete read/write concurrency, even using separate connections. So until you clear the results of a SELECT you're going to get locking errors when trying to write.

Third, the strategy I used here for batch processing was to create two separate db's, loop through the results of a SELECT in the first, create the new columns in R and then INSERT the resulting data into the second db.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜