Creating an xts object results in altered timestamps

Suppose I have:

R> str(data)
'data.frame':   4 obs. of  2 variables:
 $ datetime: Factor w/ 4 levels "2011-01-05 09:30:00.001",..: 1 2 3 4
 $ price   : num  18.3 18.3 18.3 18.3

R> data
                 datetime price
1 2011-01-05 09:30:00.001 18.31
2 2011-01-05 09:30:00.321 18.33
3 2011-01-05 09:30:01.511 18.33
4 2011-01-05 09:30:02.192 18.34

When I try to load this into an xts object the timestamps are subtly altered:

R> x <- xts(data[-1], as.POSIXct(strptime(data$datetime, '%Y-%m-%d %H:%M:%OS')))
R> str(x)
An ‘xts’ object from 2011-01-05 09:30:00.000 to 2011-01-05 09:30:02.191 containing:
  Data: num [1:4, 1] 18.3 18.3 18.3 18.3
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr "price"
  Indexed by objects of class: [POSIXct,POSIXt] TZ: 
  xts Attributes:  
 NULL

 R> x
                         price
 2011-01-05 09:30:00.000 18.31
 2011-01-05 09:30:00.321 18.33
 2011-01-05 09:30:01.510 18.33
 2011-01-05 09:30:02.191 18.34

You'll notice that the timestamps have been altered. The first entry now occurs at 09:30:00.000 instead of what the original data said, 09:30:00.001. The third and fourth rows are also incorrect.

What's causing this? Am I doing something fundamentally wrong? I've tried various incantations to get the data into an xts object and they all seem to exhibit this behavior.

EDIT: Add sessionInfo()

R> sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=C               LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] xts_0.8-2 zoo_1.7-4

loaded via a namespace (and not attached):
[1] grid_2.13.1     lattice_0.19-30 tools_2.13.1   

EDIT 2: If I modify my source data to be microsecond precision as follows:

datetime,price
2011-01-05 09:30:00.001000,18.31
2011-01-05 09:30:00.321000,18.33
2011-01-05 09:30:01.511000,18.33
2011-01-05 09:30:02.192000,18.34

And then load it so I have:

R> test
                    datetime price
1 2011-01-05 09:30:00.001000 18.31
2 2011-01-05 09:30:00.321000 18.33
3 2011-01-05 09:30:01.511000 18.33
4 2011-01-05 09:30:02.192000 18.34

And then, finally, convert it into an xts object and set the index format:

R> x <- xts(test[,-1], as.POSIXct(strptime(test$datetime, '%Y-%m-%d %H:%M:%OS')))
R> indexFormat(x) <- '%Y-%m-%d %H:%M:%OS6'
R> x
                            [,1]
2011-01-05 09:30:00.000999 18.31
2011-01-05 09:30:00.321000 18.33
2011-01-05 09:30:01.510999 18.33
2011-01-05 09:30:02.191999 18.34

You can see the effect as well. I was hoping that adding the extra precision would help, but unfortunately it does not.

EDIT 3: Please see @DWin's answer for an end-to-end test case that reproduces this behavior.

EDIT 4: The behavior does not appear to be millisecond oriented. The following shows the same altered result of a microsecond resolution timestamp. If I change my input data to:

R> data
                    datetime price
1 2011-01-05 09:30:00.001001 18.31
2 2011-01-05 09:30:00.321001 18.33
3 2011-01-05 09:30:01.511001 18.33
4 2011-01-05 09:30:02.192005 18.34

And then create an xts object:

R> x <- xts(data[-1], 
            as.POSIXct(strptime(as.character(data$datetime), '%Y-%m-%d %H:%M:%OS')))
R> indexFormat(x) <- '%Y-%m-%d %H:%M:%OS6'
R> x
                           price
2011-01-05 09:30:00.001000 18.31
2011-01-05 09:30:00.321001 18.33
2011-01-05 09:30:01.511001 18.33
2011-01-05 09:30:02.192004 18.34

EDIT 5: It would appear to be a floating point precision issue. Observe:

R> t <- as.POSIXct("2011-01-05 09:30:00.001001")
R> t
[1] "2011-01-05 09:30:00.001 CST"
R> as.numeric(t)
[1] 1294241400.0010008812

This exhibits the error behavior, and is consistent with the example in EDIT 4. However, using an example that didn't show the error:

R> t <- as.POSIXct("2011-01-05 09:30:01.511001")
R> t
[1] "2011-01-05 09:30:01.511001 CST"
R> as.numeric(t)
[1] 1294241401.5110011101

It seems as if xts or some underlying component is rounding down rather than to the nearest?


You have your times in a factor:

R> str(data)
'data.frame':   4 obs. of  2 variables:
 $ datetime: Factor w/ 4 levels "2011-01-05 09:30:00.001",..: 1 2 3 4
 [...]

That is not the best place to start. You need to convert to character. Hence instead of

x <- xts(data[-1], as.POSIXct(strptime(data$datetime, '%Y-%m-%d %H:%M:%OS')))

I would suggest

x <- xts(data[-1], 
         order.by=as.POSIXct(strptime(as.character(data$datetime), 
                                      '%Y-%m-%d %H:%M:%OS')))   

In my experience, the as.character() around a factor is critical. Factors are powerful for modeling, but they are a bit of a nuisance when you get them accidentally from reading data. Use stringsAsFactors=FALSE to your advantage and avoid them on data import.
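As a rough sketch of that advice (the inline CSV text and the variable names are made up for illustration), reading the data with stringsAsFactors=FALSE keeps the datetime column as character from the start, so no as.character() call is needed later:

```r
# Hypothetical inline copy of the question's data.
csv <- "datetime,price
2011-01-05 09:30:00.001,18.31
2011-01-05 09:30:00.321,18.33
2011-01-05 09:30:01.511,18.33
2011-01-05 09:30:02.192,18.34"

# stringsAsFactors = FALSE keeps text columns as character vectors.
dat <- read.csv(text = csv, stringsAsFactors = FALSE)
str(dat$datetime)
```

Note that this only removes the factor from the picture; it does not by itself change how fractional seconds are stored or printed.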

Edit: So this appears to point to the strptime/strftime implementations. To make matters more interesting, R takes some of these from the operating system and reimplements some in src/main/datetime.c.

Also, pay attention to how small an epsilon you can add to a time variable before R sees the two values as equal. On my 64-bit Linux system, this happens at 10^-7:

R> now <- Sys.time()
R> sapply(seq(1, 8), FUN=function(x) identical(now, now+1/10^x)) 
[1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
R> 
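A minimal sketch of where that threshold comes from, assuming the time is stored as an IEEE double holding seconds since the epoch (t0 below is the numeric value from the question, used purely for illustration). Doubles carry 52 fraction bits, so the spacing between neighbouring representable values near t0 can be computed directly:

```r
# Seconds since the epoch for 2011-01-05 09:30:00 CST (value from the question).
t0 <- 1294241400

# Spacing between adjacent doubles near t0: 2^(exponent - 52).
ulp <- 2^(floor(log2(t0)) - 52)
ulp

# An increment below half the spacing is lost; a larger one is kept.
(t0 + 1e-7) == t0
(t0 + 1e-6) == t0
```

With timestamps of this magnitude the spacing is a bit over 2e-7 seconds, which is why microsecond values cannot all be stored exactly.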


It seems the problem is only in printing. Using the OP's original data:

ind <- as.POSIXct(strptime(data$datetime, '%Y-%m-%d %H:%M:%OS'))
as.numeric(ind)*1e6  # as expected
# [1] 1294241400001000 1294241400321000 1294241401511000 1294241402192000
ind  # wrong
# [1] "2011-01-05 09:30:00.000 CST" "2011-01-05 09:30:00.321 CST"
# [3] "2011-01-05 09:30:01.510 CST" "2011-01-05 09:30:02.191 CST"
x <- xts(data[-1], ind)
x  # wrong
#                         price
# 2011-01-05 09:30:00.000 18.31
# 2011-01-05 09:30:00.321 18.33
# 2011-01-05 09:30:01.510 18.33
# 2011-01-05 09:30:02.191 18.34
as.numeric(index(x))*1e6  # but the underlying index values are as expected
# [1] 1294241400001000 1294241400321000 1294241401511000 1294241402192000
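One way to see the printing effect in isolation is the sketch below. It assumes the fractional seconds are truncated rather than rounded when formatted with %OSn, and it uses a numeric timestamp in UTC chosen only for illustration. Adding half of the displayed unit before formatting is a common workaround, not something taken from the xts documentation:

```r
# 2011-01-05 09:30:00.001 UTC as seconds since the epoch (illustrative value).
secs <- 1294219800.001
ind  <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

# The stored fraction is slightly below 0.001 because of binary rounding.
sprintf("%.10f", secs - floor(secs))

# Formatted as stored, then after nudging by half a millisecond.
format(ind, "%Y-%m-%d %H:%M:%OS3")
format(ind + 0.0005, "%Y-%m-%d %H:%M:%OS3")
```

The nudge only changes what is displayed; for the xts index itself the underlying values are best left untouched.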


I post this just so people who want to explore it can have a reproducible example which shows that it happens on more than just the OP's system. Applying as.character to the factor does not keep it from occurring.

dat <- read.table(textConnection("     datetime\tprice
 1\t2011-01-05 09:30:00.001\t18.31
 2\t2011-01-05 09:30:00.321\t18.33
 3\t2011-01-05 09:30:01.511\t18.33
 4\t2011-01-05 09:30:02.192\t18.34"), header =TRUE, sep="\t")
 as.character(dat$datetime)
#[1] "2011-01-05 09:30:00.001" "2011-01-05 09:30:00.321" "2011-01-05 09:30:01.511"
#[4] "2011-01-05 09:30:02.192"
  strptime(as.character(dat$datetime),         '%Y-%m-%d %H:%M:%OS')
#[1] "2011-01-05 09:30:00" "2011-01-05 09:30:00" "2011-01-05 09:30:01"
#[4] "2011-01-05 09:30:02"
 as.POSIXct(strptime(as.character(dat$datetime), 
                                       '%Y-%m-%d %H:%M:%OS'))
#[1] "2011-01-05 09:30:00 EST" "2011-01-05 09:30:00 EST" "2011-01-05 09:30:01 EST"
#[4] "2011-01-05 09:30:02 EST"
 x <- xts(dat[-1], 
          order.by=as.POSIXct(strptime(as.character(dat$datetime), 
                                       '%Y-%m-%d %H:%M:%OS')))
 x
####                price
2011-01-05 09:30:00 18.31
2011-01-05 09:30:00 18.33
2011-01-05 09:30:01 18.33
2011-01-05 09:30:02 18.34
indexFormat(x) <- '%Y-%m-%d %H:%M:%OS6'
x
                           price
2011-01-05 09:30:00.000999 18.31
2011-01-05 09:30:00.321000 18.33
2011-01-05 09:30:01.510999 18.33
2011-01-05 09:30:02.191999 18.34

sessionInfo()
R version 2.13.1 RC (2011-07-03 r56263)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grid      splines   stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] xts_0.8-2       zoo_1.7-4       sculpt3d_0.2-2  RGtk2_2.20.12  
 [5] rgl_0.92.798    survey_3.24     hexbin_1.26.0   spam_0.23-0    
 [9] xtable_1.5-6    polspline_1.1.5 Ryacas_0.2-10   XML_3.4-0      
[13] rms_3.3-1       Hmisc_3.8-3     survival_2.36-9 sos_1.3-0      
[17] brew_1.0-6      lattice_0.19-30

loaded via a namespace (and not attached):
[1] cluster_1.14.0 tools_2.13.1  