Importing CSV data containing commas, thousand separators and trailing minus sign

R 2.13.1 on Mac OS X. I'm trying to import a data file that uses a point as the thousands separator and a comma as the decimal point, as well as a trailing minus for negative values.

Basically, I'm trying to convert from:

"A|324,80|1.324,80|35,80-"

to

  V1    V2     V3    V4
1  A 324.80 1324.8 -35.80

Now, interactively, both of the following work:

gsub("\\.","","1.324,80")
[1] "1324,80"

gsub("(.+)-$","-\\1", "35,80-")
[1] "-35,80"

and also combining them:

gsub("\\.", "", gsub("(.+)-$","-\\1","1.324,80-"))
[1] "-1324,80"

However, I'm not able to remove the thousands separator in read.table:

setClass("num.with.commas")

setAs("character", "num.with.commas", function(from) as.numeric(gsub("\\.", "", sub("(.+)-$","-\\1",from))) )
mydata <- "A|324,80|1.324,80|35,80-"

mytable <- read.table(textConnection(mydata), header=FALSE, quote="", comment.char="", sep="|", dec=",", skip=0, fill=FALSE,strip.white=TRUE, colClasses=c("character","num.with.commas", "num.with.commas", "num.with.commas"))

Warning messages:
1: In asMethod(object) : NAs introduced by coercion
2: In asMethod(object) : NAs introduced by coercion
3: In asMethod(object) : NAs introduced by coercion

mytable
  V1 V2 V3 V4
1  A NA NA NA

Note that if I change from "\\." to "," in the function, things look a bit different:

setAs("character", "num.with.commas", function(from) as.numeric(gsub(",", "", sub("(.+)-$","-\\1",from))) )

mytable <- read.table(textConnection(mydata), header=FALSE, quote="", comment.char="", sep="|", dec=",", skip=0, fill=FALSE,strip.white=TRUE, colClasses=c("character","num.with.commas", "num.with.commas", "num.with.commas"))

mytable
  V1    V2     V3    V4
1  A 32480 1.3248 -3580

I think the problem is that read.table with dec="," converts the incoming "," to "." BEFORE calling as(from, "num.with.commas"), so that the input string can be e.g. "1.324.80".

I want as("1.123,80-","num.with.commas") to return -1123.80 and as("1.100.123,80", "num.with.commas") to return 1100123.80.

How can I make my num.with.commas replace all except the last decimal point in the input string?

Update: First, I added negative lookahead and got as() working in the console:

setAs("character", "num.with.commas", function(from) as.numeric(gsub("(?!\\.\\d\\d$)\\.", "", gsub("(.+)-$","-\\1",from), perl=TRUE)) )
as("1.210.123.80-","num.with.commas")
[1] -1210124
as("10.123.80-","num.with.commas")
[1] -10123.8
as("10.123.80","num.with.commas")
[1] 10123.8

However, read.table still had the same problem. Adding some print()s to my function showed that num.with.commas in fact got the comma and not the point.
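
A minimal sketch of that check, assuming a hypothetical pass-through class named show.input (not part of the original code) whose converter returns its input unchanged, so the resulting columns show exactly what read.table hands to as():

```r
library(methods)

# Hypothetical class used only for inspection; the converter is a pass-through.
setClass("show.input")
setAs("character", "show.input", function(from) from)

mydata <- "A|324,80|1.324,80|35,80-"
con <- textConnection(mydata)
raw <- read.table(con, header = FALSE, sep = "|", quote = "",
                  comment.char = "", dec = ",", strip.white = TRUE,
                  colClasses = c("character", rep("show.input", 3)))
close(con)

# The columns still hold the original strings, commas included.
print(raw)
```

With dec="," set, the fields still arrive as the raw strings from the file, which matches what the print() calls showed.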

So my current solution is to also replace "," with "." inside num.with.commas.

setAs("character", "num.with.commas", function(from) as.numeric(gsub(",","\\.",gsub("(?!\\.\\d\\d$)\\.", "", gsub("(.+)-$","-\\1",from), perl=TRUE))) )
mytable <- read.table(textConnection(mydata), header=FALSE, quote="", comment.char="", sep="|", dec=",", skip=0, fill=FALSE,strip.white=TRUE, colClasses=c("character","num.with.commas", "num.with.commas", "num.with.commas"))
mytable
  V1    V2      V3    V4
1  A 324.8 1101325 -35.8


You should remove all the periods first and then change the commas to decimal points before coercing with as.numeric(). You can later control how decimal points are printed with options(OutDec=","). I do not think R uses commas as decimal separators internally, even in locales where they are conventional.
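
As a small illustration of the output side, here is a sketch showing that OutDec only affects how numbers are formatted, not how strings are parsed. The vector x is made up from the values in the question.

```r
x <- c(324.8, 1324.8, -35.8)

# Temporarily switch the output decimal mark, then restore the old setting.
old <- options(OutDec = ",")
shown <- format(x)
# Parsing is unaffected: a decimal comma is still not understood on input.
parsed <- suppressWarnings(as.numeric("324,8"))
options(old)

print(trimws(shown))
print(parsed)
```

The formatted strings use a comma, while as.numeric() still returns NA for "324,8", so the substitution has to happen before coercion.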

> tst <- c("A","324,80","1.324,80","35,80-")
> 
> as.numeric( sub("\\,", ".", sub("(.+)-$","-\\1", gsub("\\.", "", tst)) ) )
[1]     NA  324.8 1324.8  -35.8
Warning message:
NAs introduced by coercion 
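
The same order of substitutions can be plugged into the setAs() converter from the question. This is a sketch under the assumption that read.table passes the raw field strings to the converter (as the question's update reports); dec="," is left out because it plays no part for these columns.

```r
library(methods)

setClass("num.with.commas")
setAs("character", "num.with.commas", function(from) {
  x <- sub("^(.+)-$", "-\\1", from)    # trailing minus -> leading minus
  x <- gsub(".", "", x, fixed = TRUE)  # drop thousands separators
  x <- sub(",", ".", x, fixed = TRUE)  # decimal comma -> decimal point
  as.numeric(x)
})

mydata <- "A|324,80|1.324,80|35,80-"
con <- textConnection(mydata)
mytable <- read.table(con, header = FALSE, sep = "|", quote = "",
                      comment.char = "", strip.white = TRUE,
                      colClasses = c("character", rep("num.with.commas", 3)))
close(con)
print(mytable)

# The two cases asked for in the question
a <- as("1.123,80-", "num.with.commas")
b <- as("1.100.123,80", "num.with.commas")
```

Because every period is removed before the comma is converted, no lookahead is needed to find the "last" decimal point.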


Here's a solution with regular expressions and substitutions

mydata <- "A|324,80|1.324,80|35,80-"
# Split data
mydata2 <- strsplit(mydata,"|",fixed=TRUE)[[1]]
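# Remove the thousands separators (periods)
mydata3 <- gsub(".","",mydata2,fixed=TRUE)
# Move trailing minus signs to the front of the string
mydata4 <- gsub("^(.+)-$","-\\1",mydata3)
# Change decimal commas to decimal points
mydata5 <- sub(",",".",mydata4,fixed=TRUE)
# Convert to numeric; a list keeps the label as character and the values as numbers
mydata.cleaned <- c(as.list(mydata5[1]),as.list(as.numeric(mydata5[2:4])))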
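
For more than one record, the split-and-substitute approach can be sketched as below. The helper name parse_euro and the second input line are made up for illustration (the extra values are the ones mentioned in the question); this is one possible arrangement, not the only one.

```r
# Hypothetical helper: "1.324,80-" style strings -> numeric
parse_euro <- function(x) {
  x <- sub("^(.+)-$", "-\\1", x)       # trailing minus -> leading minus
  x <- gsub(".", "", x, fixed = TRUE)  # drop thousands separators
  as.numeric(sub(",", ".", x, fixed = TRUE))
}

lines <- c("A|324,80|1.324,80|35,80-",
           "B|1.100.123,80|0,50|1.123,80-")  # second line is invented

# One row per input line, one column per field
parts <- do.call(rbind, strsplit(lines, "|", fixed = TRUE))

result <- data.frame(V1 = parts[, 1],
                     V2 = parse_euro(parts[, 2]),
                     V3 = parse_euro(parts[, 3]),
                     V4 = parse_euro(parts[, 4]),
                     stringsAsFactors = FALSE)
print(result)
```

Building a data frame keeps the label column as character and the remaining columns as numeric, which a plain c() would not.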