Reading correctly (alpha)numeric fields into R

2022-12-28 08:49

A tab-delimited text file, which is actually an export (using bcp) of a database table, has the following form (first few columns shown):

102   1  01  e113c  3224.96    12
102   1  01  e185   101127.25  12
102   2  01  e185   176417.90  12
102A  3  01  e185   26261.03   12

I tried to import it into R with a command like

data <- read.delim("C:\\test.txt", header = FALSE, sep = "\t")

The problem is that the 3rd column, which is actually a varchar (alphanumeric) field, is mistakenly read as integer because no value in that column contains a letter, so the leading zeros disappear. The same thing happened when I imported the data directly from the database using odbcConnect: again that column was read as integer.

str(data)
$ code: int  1 1 1 1 1 1 6 1 1 8 ...

How can I import such a dataset in R correctly, so as to be able to safely populate that db table again, after doing some data manipulations?

EDIT

I solved it by adding the following argument to read.delim:

colClasses = c("factor","integer","factor","factor","numeric","character","factor","factor","factor","factor","integer","character","factor")

Would you suggest "character" or "factor" for varchar fields?
Is it ok to use "character" for datetime ones?
What should I do in order to be able to read a numeric field like this 540912.68999999994 exactly as is and not as 540912.69?

I would like the colClasses vector to be created as automatically as possible, based on the data types defined in the relevant table's schema.

Would you suggest "character" or "factor" for varchar fields?

As John mentioned, this depends upon usage. It is simple to switch between the two, so don't worry too much about it. If the column represents a categorical variable, it should eventually be considered as a factor. If you intend on mining the text (e.g. comments fields), then character makes more sense.
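To illustrate how cheap the switch is, here is a minimal sketch. The vector code_chr is a made-up stand-in for the varchar column, not data from the question.

```r
# Hypothetical varchar column, already read in as character
code_chr <- c("01", "01", "06", "08")

# character -> factor, for when the column is treated as categorical
code_fct <- factor(code_chr)
levels(code_fct)   # the distinct codes, leading zeros intact

# factor -> character; as.character() returns the labels
back <- as.character(code_fct)
identical(back, code_chr)

# Caution: as.integer() on a factor returns the internal level indices,
# not the original codes
as.integer(code_fct)
```

Either way the leading zeros survive, because the values are never parsed as numbers.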

Is it ok to use "character" for datetime ones?

It's fine for storing the dates in a data frame, but if you want them to be treated correctly for analysis purposes, you'll have to convert them to Date or POSIXct/POSIXlt form.
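A minimal sketch of that conversion follows. The timestamp strings and their format are assumptions, so adjust the format string to whatever your export actually contains.

```r
# Hypothetical datetime strings as they might arrive in a character column
stamp_chr <- c("2011-03-01 14:05:00", "2011-03-02 09:30:00")

# Convert to POSIXct for analysis; the time zone is fixed to UTC here
stamp <- as.POSIXct(stamp_chr, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Date-only version
day <- as.Date(stamp, tz = "UTC")

# Arithmetic works once the values are real date-times
gap <- difftime(stamp[2], stamp[1], units = "hours")

# Format back to text before writing to the database
out_chr <- format(stamp, "%Y-%m-%d %H:%M:%S")
```

Keeping the column as character on import and converting afterwards means the original text is still available if a value fails to parse.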

What should I do in order to be able to read a numeric field like this 540912.68999999994 exactly as is and not as 540912.69?

Values are read in at the usual double precision (about 15 to 17 significant digits). In this particular example, 540912.68999999994 and 540912.69 are stored as the same double, so 540912.69 is the best accuracy you can achieve. Compare

print(540912.68999999994)             # 540912.7
print(540912.68999999994, digits=22)  # 540912.69
print(540912.6899999994)              # 540912.7
print(540912.6899999994, digits=22)   # 540912.6899999994

EDIT: If you need more precision for your numbers, use the Rmpfr package.
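If the goal is only to write the value back to the database unchanged, one option (a sketch, not the only approach) is to read that field as character so that R never parses it. The two-column sample line below is made up for illustration.

```r
x <- 540912.68999999994
y <- 540912.69

# Both literals parse to the same double
identical(x, y)

# 17 significant digits are enough to round-trip any double
sprintf("%.17g", y)

# Reading the field as text keeps the exported string exactly as written
raw <- "102\t540912.68999999994"   # hypothetical two-column line
d <- read.delim(text = raw, header = FALSE,
                colClasses = c("character", "character"))
d$V2

# Convert only when you need to compute with it
as.numeric(d$V2) == y
```

The trade-off is that a character column has to be converted with as.numeric() before any arithmetic.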

I would like an -as automatic as possible- creation of that colClasses vector, depending on the datatypes defined in the relevant table's schema.

The default for colClasses (when you don't specify it) does a pretty good job of guessing what the columns should be. If you need something like 01 to be kept as a character string, though, there is no way around specifying it explicitly.
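As a rough sketch of building colClasses from a schema: the lookup table sql_to_r and the schema vector below are hypothetical, since the real column types would have to come from your database's metadata. The four rows are the sample from the question, joined with tabs.

```r
# Sample rows from the question, tab-separated
rows <- c("102\t1\t01\te113c\t3224.96\t12",
          "102\t1\t01\te185\t101127.25\t12",
          "102\t2\t01\te185\t176417.90\t12",
          "102A\t3\t01\te185\t26261.03\t12")

# Default guessing: the third column becomes integer and loses its zeros
guessed <- read.delim(text = rows, header = FALSE)
class(guessed$V3)

# Hypothetical mapping from SQL type names to R classes
sql_to_r <- c(varchar = "character", char = "character",
              int = "integer", float = "numeric",
              datetime = "character")

# Hypothetical schema for the six columns shown
schema <- c("varchar", "int", "varchar", "varchar", "float", "varchar")

classes <- unname(sql_to_r[schema])
typed <- read.delim(text = rows, header = FALSE, colClasses = classes)
str(typed)
```

With the explicit classes, the third column stays "01" and can be written back to the table as is.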

The character versus factor question is something only you can answer. It depends on whether you need to use the columns later as factors or as characters, and on whether you need to clean them up afterwards. For example, if you plan to apply a number of ifelse() modifications to a factor afterwards, you might as well read it in as character now and turn it into a factor later. Likewise, if you want to code the factor in some specific way, you will likely be better off reading it in as character.
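A small sketch of that workflow, using made-up messy codes rather than data from the question:

```r
# Hypothetical codes read in as character, with inconsistent padding
code <- c("01", "1", "06", "6 ", "08")

# Clean up while the column is still character
code <- trimws(code)
code <- ifelse(nchar(code) == 1, paste0("0", code), code)

# Convert to a factor only once the values are final,
# with the levels coded explicitly
code_f <- factor(code, levels = c("01", "06", "08"))
table(code_f)
```

Doing the ifelse() steps on a factor instead would require adding the new levels first, which is why character is the more convenient intermediate type.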

As an aside, the reason to use read.delim rather than read.table is its default settings, so there is no need to set sep to the value it already has by default.
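For reference, here is a sketch of what those defaults amount to. The two-line sample with a header row is made up. The arguments spelled out in the read.table call are the documented defaults of read.delim.

```r
# Hypothetical tab-separated sample with a header row
rows <- c("id\tcode", "102\t01", "102A\t06")

# read.delim with its defaults
a <- read.delim(text = rows, colClasses = "character")

# The same call written out with read.table
b <- read.table(text = rows, header = TRUE, sep = "\t", quote = "\"",
                dec = ".", fill = TRUE, comment.char = "",
                colClasses = "character")

identical(a, b)
```

Note that header defaults to TRUE in read.delim, so the header = FALSE in the question's call is still needed for a file without a header row.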
