In R how do I read a CSV file line by line and have the contents recognised as the correct data type?

I want to read in a CSV file whose first line is the variable names and whose subsequent lines are the values of those variables. Some of the variables are numeric, some are text, and some are even empty.

file <- "path/file.csv"
f <- file(file, "r")
varnames <- strsplit(readLines(f, 1), ",")[[1]]   # first line: variable names
data <- strsplit(readLines(f, 1), ",")[[1]]       # next line: one row of values, still all character

Now that data contains all the values for a row, how do I get the data types recognised automatically, just as they would be if I used read.csv?

I need to read the data line by line (or n lines at a time) as the whole dataset is too big to be read into R.
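For context, I assume base R's utils::type.convert() could be used to coerce the split values; a rough sketch of what I mean (reusing f and varnames from above):

vals <- strsplit(readLines(f, 1), ",")[[1]]                         # one line, still all character
one_row <- as.data.frame(lapply(vals, type.convert, as.is = TRUE))  # guess numeric/logical/character per value
names(one_row) <- varnames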


Based on DWin's comment, you can try something like this:

read.clump <- function(file, lines, clump){
    if(clump > 1){
        # re-read the header row so the chunk gets proper column names
        header <- read.csv(file, nrows = 1, header = FALSE)
        p <- read.csv(file, skip = lines * (clump - 1),
        # p <- read.csv(file, skip = lines * (clump - 1) + 1,  # if not a textConnection
                      nrows = lines, header = FALSE)
        names(p) <- unlist(header)
    } else {
        p <- read.csv(file, skip = lines * (clump - 1), nrows = lines)
    }
    return(p)
}

You should probably add some error handling/checking to the function, too.

Then with

x = "letter1, letter2
a, b
c, d
e, f
g, h
i, j
k, l"


> read.clump(textConnection(x), lines = 2, clump = 1)
  letter1 letter2
1       a       b
2       c       d

> read.clump(textConnection(x), lines = 2, clump = 2)
  letter1  letter2
1       e        f
2       g        h

> read.clump(textConnection(x), lines = 3, clump = 1)
  letter1 letter2
1       a       b
2       c       d
3       e       f


> read.clump(textConnection(x), lines = 3, clump = 2)
  letter1  letter2
1       g        h
2       i        j
3       k        l

Now you just have to *apply over the clumps, for example:
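A rough sketch (the file name, clump size and number of clumps are made up; remember the skip adjustment noted in the comments above when reading from a real file rather than a textConnection):

# Read the file in 4 clumps of 1000 lines each and keep the results in a list
clumps <- lapply(1:4, function(i) {
    read.clump("path/file.csv", lines = 1000, clump = i)
})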


An alternate strategy that has been discussed here before to deal with very big (say, > 1e7ish cells) CSV files is:

  1. Read the CSV file into an SQLite database.
  2. Import the data from the database with read.csv.sql from the sqldf package.

The main advantages of this are that it is usually quicker and you can easily filter the contents to only include the columns or rows that you need.

See how to import CSV into sqlite using RSqlite? for more info.
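A minimal sketch of the sqldf route (the file path and column names here are assumptions):

library(sqldf)

# read.csv.sql() loads the CSV into a temporary SQLite database and returns
# only what the SQL statement selects; the table is referred to as "file"
subset_df <- read.csv.sql("path/file.csv",
                          sql = "select col1, col2 from file where col1 > 10")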


Just for fun (I'm waiting on a long-running computation here :-) ), here is a version that lets you use any of the read.* family of functions, and that also fixes a tiny error in Greg's code:

read.clump <- function(file, lines, clump, readFunc = read.csv,
    skip = (lines * (clump - 1)) +
        ifelse((header) & (clump > 1) & (!inherits(file, "connection")), 1, 0),
    nrows = lines, header = TRUE, ...){
    if(clump > 1){
        colnms <- NULL
        if(header){
            # read the header row separately so it can be re-attached to the chunk
            colnms <- unlist(readFunc(file, nrows = 1, header = FALSE))
            print(colnms)
        }
        p <- readFunc(file, skip = skip,
                      nrows = nrows, header = FALSE, ...)
        if(!is.null(colnms)){
            colnames(p) <- colnms
        }
    } else {
        p <- readFunc(file, skip = skip, nrows = nrows, header = header)
    }
    return(p)
}

Now you can pass the relevant function as the readFunc parameter, and pass extra parameters through ... as well. Metaprogramming is fun.
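For example, reusing the textConnection sample from above (the colClasses choice is just illustrative):

# readFunc defaults to read.csv; colClasses is passed through ... to the main read
read.clump(textConnection(x), lines = 2, clump = 2,
           readFunc = read.csv, colClasses = "character")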


On a side note: if you really have data that huge, there are (besides the SQLite solution) several packages that will help you handle it without having to resort to the tricks described in these answers.

There are the ff and bigmemory packages, with friends biganalytics, bigtabulate, biglm and so on. For an overview, see e.g.:

  • Best practices for storing and using data frames too large for memory?
  • For an example about plotting: Plotting of very large data sets in R
  • http://cran.r-project.org/web/views/HighPerformanceComputing.html : see under "Large memory and out-of-memory data"


I would try the LaF package:

Methods for fast access to large ASCII files... It is assumed that the files are too large to fit into memory... Methods are provided to access and process files blockwise. Furthermore, an opened file can be accessed as one would an ordinary data.frame...

I was able to get it working with the sample code below, and it seemed to have the performance you'd expect from a streaming implementation. However, I'd recommend running your own timing tests too.

library('LaF')

model <- detect_dm_csv('data.csv', header = TRUE, nrows = 600)  # read only 600 rows for type detection

mylaf <- laf_open(model)

print(mylaf[1000])  # print 1000th row
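LaF can also step through the file block by block; a rough sketch, assuming begin() and next_block() behave as documented (the block size is arbitrary):

begin(mylaf)                                  # reset the file pointer to the start
repeat {
    block <- next_block(mylaf, nrows = 5000)  # next 5000 rows as an ordinary data.frame
    if (nrow(block) == 0) break               # no rows left
    # ... process 'block' here ...
}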


I think using disk.frame's csv_to_disk.frame and setting in_chunk_size would be great for this use case, e.g.:

library(disk.frame)
csv_to_disk.frame("/path/to/file.csv", in_chunk_size = 1e7)
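A rough sketch of what working with the result might look like (the column name col1 is an assumption):

library(dplyr)

setup_disk.frame()   # set up the background workers that disk.frame uses

df <- csv_to_disk.frame("/path/to/file.csv", in_chunk_size = 1e7)

# dplyr verbs run chunk by chunk; collect() pulls the filtered result into memory
df %>% filter(col1 > 10) %>% collect()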


You can use chunked or disk.frame if you don't mind a little tinkering to write out your data.

Both have options that let you read the data chunk by chunk, for example:
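A sketch with chunked (the file path, chunk size and filter column are assumptions):

library(chunked)
library(dplyr)

# read in chunks of 100,000 rows, filter each chunk, and write the result
# back out without ever holding the whole file in memory
read_csv_chunkwise("path/file.csv", chunk_size = 1e5) %>%
    filter(col1 > 10) %>%
    write_csv_chunkwise("filtered.csv")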
