开发者

Validating data integrity after import from xls to R using RODBC

I am using RODBC for importing a xls file with ~10000 row and ~250 columns, using the script below:

channel <- odbcConnectExcel(xls.file="s:/demo.xls")
demo <- sqlFetch(channel,"Sheet_1")
odbcClose(channel)

During the import process, I have open and closed that xls file.

As a precaution, I checked the number of rows from R and from the xls file itself, just to make sure the import is not corrupted, and yup, the number of rows remain the same, so I assume the import process is OK.

Problem arise when I start tabulating the data, I realize that even the row numbers are the same, some rows are actually being corrupted, I have an UID inside, and realize that at the later part of the dataframe, the UID are messed up (e.g. same row being duplicated twice).

I re-import the file again, dare not open or close the xls file, and this time, things are going well.

I have learned my lesson of never open or close the xls file when it is being read by RODBC, but how can I be sure that the data will be not messed up, or to ensure that I have really imported the dataset without error, after it is imported?

Thanks.

P.S. I am thinking of exporting the开发者_运维技巧 dataframe to csv, and the xls to csv also, then to compare the checksum, as an extra measure, but is there any other ways?


I use the read.xls function from the gdata package (install using `install.packages(gdata)').

This allows you to read from a non-opened Excel file if I'm correct. It is a wrapper for some Perl modules, so it might depend on whether you have those installed if it works for you.

Quoting the R manual for Data Export/Import:

Perl users have contributed a module OLE::SpreadSheet::ParseExcel and a program xls2csv.pl to convert Excel 95–2003 spreadsheets to CSV files. Package gdata provides a basic wrapper in its read.xls function. With suitable Perl modules installed this function can also read Excel 2007 spreadsheets.

Generally, I find the safest way to get data from Excel into R is via a csv export.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜