Difficult randomization, based on frequency ranking
I have a dataframe like this:
x = data.frame(A=c("D1", "D1", "D1", "D1", "D1", "D2", "D3", "D3", "D4", "D4", "D4", "D5", "D5"), B=c("A1","A3","A4","A5","A6","A5","A5","A6","A6","A1","A2","A5","A6"))
A B
D1 A1
D1 A3
D1 A4
D1 A5
D1 A6
D2 A5
D3 A5
D3 A6
D4 A6
D4 A1
D4 A2
D5 A5
D5 A6
To sort by column B, the entities in column B has different Frequencies.
A B freq(B)
D1 A1 2
D4 A1 2
D4 A2 1
D1 A3 1
D1 A4 1
D1 A5 4
D2 A5 4
D3 A5 4
D5 A5 4
D1 A6 4
D3 A6 4
D4 A6 4
D5 A6 4
I want to generate a random dataframe on the B column of dataframe x, but the randomization can only be taken place where the frequency of the entries are the same or similar (+/- one rank). Let'said. now, A2, A3, A4 has frequency of 1, so that A2, A3 and A4 can be replacing by each other freely, but not to A5 and A6 nor A1. Similarly, as A5 and A6 has freq开发者_开发问答uency=4, they can be randomized among themself. For A1, which is the only entry has frequency=2 (rank 2nd based on freq(B)), as there is no replacement can be taken place, a special conditions was given to A1. A1 can be randomly replaced by A2,A3,A4 (which rank one class (1, rank 1st based on freq(B)) lower than A1) or A5/A6 (which rank one class (4, rank 2nd, rank 3rd based on freq(B)) higher than A1).
Is it possible to be done easily by R?
The first part is easily handled by functions in my permute
package (only on R-forge at the moment)
require(permute) ## install from R-forge if not available
x <- data.frame(A = c("D1","D1","D1","D1","D1","D2","D3","D3",
"D4","D4","D4","D5","D5"),
B = c("A1","A3","A4","A5","A6","A5","A5","A6",
"A6","A1","A2","A5","A6"))
x <- x[order(x$B), ]
x <- transform(x, freq = rep((lens <- sapply(with(x, split(B, B)),
length)), lens))
set.seed(529)
ind <- permuted.index(NROW(x), control = permControl(strata = factor(x$freq)))
Which gives:
R> x[ind, ]
A B freq
10 D4 A1 2
1 D1 A1 2
11 D4 A2 1
2 D1 A3 1
3 D1 A4 1
12 D5 A5 4
4 D1 A5 4
9 D4 A6 4
13 D5 A6 4
5 D1 A6 4
6 D2 A5 4
8 D3 A6 4
7 D3 A5 4
R> ind
[1] 2 1 3 4 5 9 6 12 13 10 7 11 8
We can wrap this is a statement to generate n permutations
ctrl <- permControl(strata = factor(x$freq))
n <- 10
set.seed(83)
IND <- replicate(n, permuted.index(NROW(x), control = ctrl))
Which gives:
> IND
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 2 2 1 2 1 2 1 2 1 1
[2,] 1 1 2 1 2 1 2 1 2 2
[3,] 3 5 4 3 5 5 4 5 5 5
[4,] 5 3 5 5 3 4 5 4 4 4
[5,] 4 4 3 4 4 3 3 3 3 3
[6,] 9 12 11 12 6 10 13 10 8 13
[7,] 10 11 6 11 13 7 7 12 7 9
[8,] 8 9 9 10 8 6 11 13 12 10
[9,] 12 10 8 6 9 13 9 6 9 11
[10,] 13 6 12 9 7 9 8 8 13 8
[11,] 6 7 10 13 12 11 6 11 10 7
[12,] 11 8 13 7 11 8 10 7 6 12
[13,] 7 13 7 8 10 12 12 9 11 6
Now you also need to do some special sampling. If I understand correctly, what you want is to identify which frequency level consists of only a single level of B. Then possibly, at random, replace the B's in that frequency level with B's selected at random from the B's in adjacent frequency classes. If this is so, then it is a bit more complex to get the right rows to replace, but I think the function below does it:
randSampleSpecial <- function(x, replace = TRUE) {
## have we got access to permute?
stopifnot(require(permute))
## generate a random permutation within the levels of freq
ind <- permuted.index(NROW(x),
control = permControl(strata = factor(x$freq)))
## split freq into freq classes
ranks <- with(x, split(freq, freq))
## rank the freq classes
Ranked <- rank(as.numeric(names(ranks)))
## split the Bs on basis of freq classes
Bs <- with(x, split(B, freq))
## number of unique Bs in freq class
uniq <- sapply(Bs, function(x) length(unique(x)))
## which contain only a single type of B?
repl <- which(uniq == 1)
## if there are no freq classes with only one level of B, return
if(!(length(repl) > 0))
return(ind)
## if not, continue
## which of the freq classes are adjacent to unique class?
other <- which(Ranked %in% (repl + c(1,-1)))
## generate uniform random numbers to decide if we replace
Rand <- runif(length(ranks[[repl]]))
## Which are the rows in `x` that we want to change?
candidates <- with(x, which(freq == as.numeric(names(uniq[repl]))))
## which are the adjacent values we can replace with
replacements <- with(x, which(freq %in% as.numeric(names(uniq[other]))))
## which candidates to replace? Decision is random
change <- sample(candidates, sum(Rand > 0.5))
## if we are changing a candidate, sample from the replacements and
## assign
if(length(change) > 0)
ind[candidates][change] <- sample(ind[replacements], length(change),
replace = replace)
## return
ind
}
To use this, we do:
R> set.seed(35)
R> randSampleSpecial(x)
[1] 2 1 5 3 4 6 9 12 10 11 7 8 13
We can wrap this in a replicate()
call to produce many such replacements:
R> IND <- replicate(10, randSampleSpecial(x))
R> IND
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 11 3 6 4 2 1 1 2 10 3
[2,] 1 11 1 12 11 11 2 1 1 13
[3,] 4 5 4 3 4 3 4 5 5 4
[4,] 5 4 5 5 5 4 5 3 3 3
[5,] 3 3 3 4 3 5 3 4 4 5
[6,] 11 7 11 12 9 6 7 8 9 9
[7,] 13 12 12 7 11 7 9 10 8 10
[8,] 10 8 9 8 12 12 8 6 13 8
[9,] 7 9 13 10 8 10 13 9 12 11
[10,] 6 11 10 11 10 13 12 13 10 13
[11,] 12 10 6 6 6 9 11 12 7 12
[12,] 9 6 7 9 7 8 10 7 6 7
[13,] 8 13 8 13 13 11 6 11 11 6
For this data set, we know that it is rows 1 and 2 in the sorted x
that we might want to replace with values from the other freq classes. If we had done no replacements, the first two rows of IND
would have values 1
or 2
only in them (see IND
from earlier). In the new IND
, where a value in the first two rows is not a 1
or 2
, we have replaced it with a B from one of the adjacent frequency classes.
My function assumes you want to:
- Only replace elements in the homogeneous frequency class with one of the adjacent class at random! If you want to always replace then we change change the function to suit.
- That if we are doing a replacement, that replacement can be any of the replacements and if we need more than 1 replacement the same replacement can be selected more than once. Set
replace = FALSE
in the call to do the sampling without replacement if that is what you want. - The function assumes you only have a single monospecific frequency class. If should be easy to modify using a loop over the two or more monospecific classes, but that does complicated the function and as your description of the problem was not overly clear, I kept things simple.
@Gavin gives you a nice approach, and asked if somebody could come up with something simpler. The next function does the same, based on only base functions. It uses count
to process the frequencies, and takes into account that for the minimal en maximal frequency there is only one neighbouring rank. The function of Gavin gives an error in that case.
Permdf <- function(x,v){
# some code to allow Permdf(df,var)
mc <- match.call()
v <- as.quoted(mc$v)
y <- unlist(eval.quoted(v,x))
# make bins with values in v per frequency
freqs <- count(x,v)
bins <- split(freqs[[1]],freqs[[2]])
nbins <- length(bins)
# define the output
dfid <- 1:nrow(x)
for (i in 1:nbins){
# which id's to change
id <- which(y %in% bins[[i]])
if(length(bins[[i]]) > 1){
# in case there's more than one value for that frequency
dfid[id] <- sample(dfid[id])
} else {
bid <- c(i-1,i,i+1)
# control wether id in range
bid <- bid[bid > 0 & bid <=nbins]
# id values to choose from
vid <- which(y %in% unlist(bins[bid]))
# random selection
dfid[id] <- sample(vid,length(id),replace=TRUE)
}
}
#return
dfid
}
This can be used as
Permdf(x,B)
The bottom half of your question regarding randomization is a bit unclear, but here's a start. When you update your question - I'll update the answer accordingly. The code below adds the count information for the column B and then samples the rows based on the value of the frequency column we added. I think all that's needed from here is to modify the availability of which columns can be used for sampling, but please confirm what you want.
require(plyr)
x <- merge(x,count(x, "B"))
ddply(x, "freq", function(x) sample(x))
精彩评论