Using a hash to determine whether two data frames are identical (part 2)

This follows up on the question I asked yesterday.

Since I realized that the difference between the two data frames is caused by the ordering of the rows, I added the following:

ddd.old <- ddd.old[order(ddd.old[,"adm_route"]),]
ddd.old <- ddd.old[order(ddd.old[,"ddd"]),]
ddd.old <- ddd.old[order(ddd.old[,"atc_code"]),]
ddd.old <- data.frame(ddd.old,stringsAsFactors=FALSE)

ddd.new <- ddd.new[order(ddd.new[,"adm_route"]),]
ddd.new <- ddd.new[order(ddd.new[,"ddd"]),]
ddd.new <- ddd.new[order(ddd.new[,"atc_code"]),]
ddd.new <- data.frame(ddd.new,stringsAsFactors=FALSE)

Which gives me something like this:

> digest(ddd.old)
[1] "e76d3d519f3a8c066597654ae312d68d"
> digest(ddd.new)
[1] "813a68bde6840e9798db771272584e7c"
> all.equal(ddd.old, ddd.new,check.attributes=TRUE)
[1] "Attributes: < Component 2: Mean relative difference: 0.006306306 >"

Two questions:

  • Why does digest still fail?
  • What does the output of all.equal mean?


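all.equal tells you that the attributes differ. My guess is that it is the row names.

Compare attributes(ddd.old)[[2]] with attributes(ddd.new)[[2]] (or, more directly, rownames(ddd.old) with rownames(ddd.new)). Sorting doesn't change row names: each row keeps the name it had before, so the two data frames end up with their row names in different orders.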
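To see the effect on a small scale, here is a minimal sketch with two made-up data frames (toy.old and toy.new are hypothetical, not the asker's data) that hold the same rows entered in a different order:

```r
# Hypothetical toy data: same rows, entered in a different order.
toy.old <- data.frame(atc_code = c("A01", "B02", "C03"),
                      ddd = c(1.5, 2.0, 0.5),
                      stringsAsFactors = FALSE)
toy.new <- data.frame(atc_code = c("C03", "A01", "B02"),
                      ddd = c(0.5, 1.5, 2.0),
                      stringsAsFactors = FALSE)

# Sort both by the same key.
toy.old <- toy.old[order(toy.old$atc_code), ]
toy.new <- toy.new[order(toy.new$atc_code), ]

# Each row keeps the name it had before sorting.
rownames(toy.old)  # "1" "2" "3"
rownames(toy.new)  # "2" "3" "1"

# The columns agree, but the objects as a whole do not.
cols.equal <- isTRUE(all.equal(toy.old, toy.new, check.attributes = FALSE))
objs.equal <- identical(toy.old, toy.new)
```

Here cols.equal is TRUE and objs.equal is FALSE; the only difference is the row.names attribute, which is also part of what digest hashes.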

You could wipe them out with:

rownames(ddd.old) <- NULL
rownames(ddd.new) <- NULL

Or do it a step earlier by adding an argument to data.frame:

ddd.old <- data.frame(ddd.old, stringsAsFactors=FALSE, row.names=NULL)

After that the hashes should be equal too.
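As a rough check, here is a small toy example (hypothetical data, and assuming the digest package from the question is installed) that resets the row names after sorting. It also passes all sort keys to a single order() call, which sorts by the first key and breaks ties with the later ones:

```r
# Hypothetical toy data: same rows, entered in a different order.
toy.old <- data.frame(atc_code = c("A01", "A01", "C03"),
                      ddd = c(2.0, 1.5, 0.5),
                      stringsAsFactors = FALSE)
toy.new <- data.frame(atc_code = c("C03", "A01", "A01"),
                      ddd = c(0.5, 1.5, 2.0),
                      stringsAsFactors = FALSE)

# Sort by atc_code, then by ddd within equal atc_code values.
toy.old <- toy.old[order(toy.old$atc_code, toy.old$ddd), ]
toy.new <- toy.new[order(toy.new$atc_code, toy.new$ddd), ]

# Drop the leftover row names so both get the automatic 1..n names.
rownames(toy.old) <- NULL
rownames(toy.new) <- NULL

objs.equal <- identical(toy.old, toy.new)

# Compare hashes only if digest is available.
if (requireNamespace("digest", quietly = TRUE)) {
  hashes.equal <- digest::digest(toy.old) == digest::digest(toy.new)
}
```

With the row names reset, objs.equal is TRUE, and the two digests would be expected to match as well.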

Alternatively, use arrange from the plyr package, which removes the row names:

library(plyr)  # provides arrange()

ddd.new <- read.table("ddd.table.new.txt",header=TRUE,stringsAsFactors=FALSE)
ddd.old <- read.table("ddd.table.old.txt",header=TRUE,stringsAsFactors=FALSE)

# Sort each data frame by the same keys; arrange() does not keep row names.
ddd.new <- arrange(ddd.new, atc_code, ddd, adm_route)
ddd.old <- arrange(ddd.old, atc_code, ddd, adm_route)
all.equal(ddd.new, ddd.old)
# TRUE