开发者

Reading correctly (alpha)numeric fields into R

A tab-delimited text file, which is actually an export (using bcp) of a database table, is of that form (first 5 columns):

102 1   01  e113c   3224.96     12  
102 1   01  e185    101127.25   12
102 2   01  e185    176417.90   12
102A   3    01  e185    26261.03    12

I tried to import it in R with a command开发者_开发问答 like

data <- read.delim("C:\\test.txt", header = FALSE, sep = "\t")

The problem is that the 3rd column which is actually a varchar field (alphanumeric) is mistakenly read as integer (as there are no letters in the entire column) and the leading zeros disappeared. The same thing happened when I imported the data directly from the database, using odbcConnect. Again that column was read as integer.

str(data)
$ code: int  1 1 1 1 1 1 6 1 1 8 ...

How can I import such a dataset in R correctly, so as to be able to safely populate that db table again, after doing some data manipulations?

EDIT

I did it adding the following parameter in read.delim

 colClasses = c("factor","integer","factor","factor","numeric","character","factor","factor","factor","factor","integer","character","factor")
  • Would you suggest "character" or "factor" for varchar fields?

  • Is it ok to use "character" for datetime ones?

  • What should I do in order to be able to read a numeric field like this 540912.68999999994 exactly as is and not as 540912.69?

I would like an -as automatic as possible- creation of that colClasses vector, depending on the datatypes defined in the relevant table's schema.


Would you suggest "character" or "factor" for varchar fields?

As John mentioned, this depends upon usage. It is simple to switch between the two, so don't worry too much about it. If the column represents a categorical variable, it should eventually be considered as a factor. If you intend on mining the text (e.g. comments fields), then character makes more sense.

Is it ok to use "character" for datetime ones?

It's fine for storing the dates in a data frame, but if you want them to be treated correctly for analysis purposes, you'll have to convert it to Date or POSIXct/POSIXlt form.

What should I do in order to be able to read a numeric field like this 540912.68999999994 exactly as is and not as 540912.69?

Values are read in to usual double accuracy (about 15 sig figs); in this particular example, 540912.69 is the best accuracy you can achieve. Compare

print(540912.68999999994)             # 540912.7
print(540912.68999999994, digits=22)  # 540912.69
print(540912.6899999994)              # 540912.7
print(540912.6899999994, digits=22)   # 540912.6899999994

EDIT: If you need more precision for your numbers, use the Rmpfr package.


I would like an -as automatic as possible- creation of that colClasses vector, depending on the datatypes defined in the relevant table's schema.

The default for colClasses (when you don't specify it) does a pretty good job of guessing what columns should be. If you are doing things like using 01 as a character, then there's no way round explicitly specifying it.


the character and factor question is something only you can answer. It depends if you need to use them later as factors or characters. It also depends whether you need to clean them up at all afterwards. For example, if you plan to apply a number of ifelse() modifications to a factor afterwards you might as well just read it in as a character now and turn it into a factor later. Or, if you want to specifically code the factor in some way you will likely be better off reading it in as character.

As an aside, the reason you use read.delim over read.table is because of the default settings therefore don't bother setting the sep to the same as the default.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜