In R how do I read a CSV file line by line and have the contents recognised as the correct data type?

I want to read in a CSV file whose first line is the variable names and whose subsequent lines are the values of those variables. Some of the variables are numeric, some are text, and some are even empty.

file <- "path/file.csv"
f <- file(file, "r")
varnames <- strsplit(readLines(f, 1), ",")[[1]]   # first line: variable names
data <- strsplit(readLines(f, 1), ",")[[1]]       # next line: one row of values, still all character

Now that data contains all the values for a row, how do I get the data types recognised automatically, just as they would be if I used read.csv?

I need to read the data line by line (or n lines at a time) as the whole dataset is too big to be read into R.
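For context, I assume base R's utils::type.convert() could be used to coerce the split values; a rough sketch of what I mean (reusing f and varnames from above):

vals <- strsplit(readLines(f, 1), ",")[[1]]                         # one line, still all character
one_row <- as.data.frame(lapply(vals, type.convert, as.is = TRUE))  # guess numeric/logical/character per value
names(one_row) <- varnames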


Based on DWin's comment, you can try something like this:

read.clump <- function(file, lines, clump){
    if(clump > 1){
        # re-read the header row so the chunk gets proper column names
        header <- read.csv(file, nrows = 1, header = FALSE)
        p <- read.csv(file, skip = lines * (clump - 1),
        # p <- read.csv(file, skip = lines * (clump - 1) + 1,  # if not a textConnection
                      nrows = lines, header = FALSE)
        names(p) <- unlist(header)
    } else {
        p <- read.csv(file, skip = lines * (clump - 1), nrows = lines)
    }
    return(p)
}

You should probably add some error handling/checking to the function, too.

Then with

x = "letter1, letter2
a, b
c, d
e, f
g, h
i, j
k, l"


> read.clump(textConnection(x), lines = 2, clump = 1)
  letter1 letter2
1       a       b
2       c       d

> read.clump(textConnection(x), lines = 2, clump = 2)
  letter1  letter2
1       e        f
2       g        h

> read.clump(textConnection(x), lines = 3, clump = 1)
  letter1 letter2
1       a       b
2       c       d
3       e       f


> read.clump(textConnection(x), lines = 3, clump = 2)
  letter1  letter2
1       g        h
2       i        j
3       k        l

Now you just have to *apply over the clumps, for example:
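A rough sketch (the file name, clump size and number of clumps are made up; remember the skip adjustment noted in the comments above when reading from a real file rather than a textConnection):

# Read the file in 4 clumps of 1000 lines each and keep the results in a list
clumps <- lapply(1:4, function(i) {
    read.clump("path/file.csv", lines = 1000, clump = i)
})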


An alternate strategy that has been discussed here before to deal with very big (say, > 1e7ish cells) CSV files is:

  1. Read the CSV file into an SQLite database.
  2. Import the data from the database with read.csv.sql from the sqldf package.

The main advantages of this are that it is usually quicker and you can easily filter the contents to only include the columns or rows that you need.

See how to import CSV into sqlite using RSqlite? for more info.
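A minimal sketch of the sqldf route (the file path and column names here are assumptions):

library(sqldf)

# read.csv.sql() loads the CSV into a temporary SQLite database and returns
# only what the SQL statement selects; the table is referred to as "file"
subset_df <- read.csv.sql("path/file.csv",
                          sql = "select col1, col2 from file where col1 > 10")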


Just for fun (I'm waiting on a long-running computation here :-) ), here is a version that lets you use any of the read.* family of functions, and that also fixes a tiny error in Greg's code:

read.clump <- function(file, lines, clump, readFunc = read.csv,
    skip = (lines * (clump - 1)) +
        ifelse((header) & (clump > 1) & (!inherits(file, "connection")), 1, 0),
    nrows = lines, header = TRUE, ...){
    if(clump > 1){
        colnms <- NULL
        if(header){
            # read the header row separately so it can be re-attached to the chunk
            colnms <- unlist(readFunc(file, nrows = 1, header = FALSE))
            print(colnms)
        }
        p <- readFunc(file, skip = skip,
                      nrows = nrows, header = FALSE, ...)
        if(!is.null(colnms)){
            colnames(p) <- colnms
        }
    } else {
        p <- readFunc(file, skip = skip, nrows = nrows, header = header)
    }
    return(p)
}

Now you can pass the relevant function as the readFunc parameter, and pass extra parameters through ... as well. Metaprogramming is fun.
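For example, reusing the textConnection sample from above (the colClasses choice is just illustrative):

# readFunc defaults to read.csv; colClasses is passed through ... to the main read
read.clump(textConnection(x), lines = 2, clump = 2,
           readFunc = read.csv, colClasses = "character")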


On a side note: if you really have data that huge, there are (besides the SQLite solution) several packages that will help you handle it without having to resort to the tricks described in these answers.

There are the ff and bigmemory packages, with friends biganalytics, bigtabulate, biglm and so on. For an overview, see e.g.:

  • Best practices for storing and using data frames too large for memory?
  • For an example about plotting: Plotting of very large data sets in R
  • http://cran.r-project.org/web/views/HighPerformanceComputing.html : see under "Large memory and out-of-memory data"


I would try the LaF package:

Methods for fast access to large ASCII files... It is assumed that the files are too large to fit into memory... Methods are provided to access and process files blockwise. Furthermore, an opened file can be accessed as one would an ordinary data.frame...

I was able to get it working with the sample code below, and it seemed to have the performance you'd expect from a streaming implementation. However, I'd recommend running your own timing tests too.

library('LaF')

model <- detect_dm_csv('data.csv', header = TRUE, nrows = 600)  # read only 600 rows for type detection

mylaf <- laf_open(model)

print(mylaf[1000])  # print 1000th row
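LaF can also step through the file block by block; a rough sketch, assuming begin() and next_block() behave as documented (the block size is arbitrary):

begin(mylaf)                                  # reset the file pointer to the start
repeat {
    block <- next_block(mylaf, nrows = 5000)  # next 5000 rows as an ordinary data.frame
    if (nrow(block) == 0) break               # no rows left
    # ... process 'block' here ...
}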


I think using disk.frame's csv_to_disk.frame and setting in_chunk_size would be great for this use case, e.g.:

library(disk.frame)
csv_to_disk.frame("/path/to/file.csv", in_chunk_size = 1e7)
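A rough sketch of what working with the result might look like (the column name col1 is an assumption):

library(dplyr)

setup_disk.frame()   # set up the background workers that disk.frame uses

df <- csv_to_disk.frame("/path/to/file.csv", in_chunk_size = 1e7)

# dplyr verbs run chunk by chunk; collect() pulls the filtered result into memory
df %>% filter(col1 > 10) %>% collect()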


You can use chunked or disk.frame if you don't mind a little tinkering to write out your data.

Both have options that let you read the data chunk by chunk, for example:
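A sketch with chunked (the file path, chunk size and filter column are assumptions):

library(chunked)
library(dplyr)

# read in chunks of 100,000 rows, filter each chunk, and write the result
# back out without ever holding the whole file in memory
read_csv_chunkwise("path/file.csv", chunk_size = 1e5) %>%
    filter(col1 > 10) %>%
    write_csv_chunkwise("filtered.csv")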
