开发者

Difficult randomization, based on frequency ranking

I have a dataframe like this:

x = data.frame(A=c("D1", "D1", "D1", "D1", "D1", "D2", "D3", "D3", "D4", "D4", "D4", "D5", "D5"), B=c("A1","A3","A4","A5","A6","A5","A5","A6","A6","A1","A2","A5","A6"))

A        B  
D1  A1  
D1  A3  
D1  A4  
D1  A5  
D1  A6  
D2  A5  
D3  A5  
D3  A6  
D4  A6  
D4  A1  
D4  A2  
D5  A5  
D5  A6 

To sort by column B, the entities in column B has different Frequencies.

A   B   freq(B)  
D1  A1  2  
D4  A1  2  
D4  A2  1  
D1  A3  1  
D1  A4  1  
D1  A5  4  
D2  A5  4  
D3  A5  4  
D5  A5  4  
D1  A6  4  
D3  A6  4  
D4  A6  4  
D5  A6  4  

I want to generate a random dataframe on the B column of dataframe x, but the randomization can only be taken place where the frequency of the entries are the same or similar (+/- one rank). Let'said. now, A2, A3, A4 has frequency of 1, so that A2, A3 and A4 can be replacing by each other freely, but not to A5 and A6 nor A1. Similarly, as A5 and A6 has freq开发者_开发问答uency=4, they can be randomized among themself. For A1, which is the only entry has frequency=2 (rank 2nd based on freq(B)), as there is no replacement can be taken place, a special conditions was given to A1. A1 can be randomly replaced by A2,A3,A4 (which rank one class (1, rank 1st based on freq(B)) lower than A1) or A5/A6 (which rank one class (4, rank 2nd, rank 3rd based on freq(B)) higher than A1).

Is it possible to be done easily by R?


The first part is easily handled by functions in my permute package (only on R-forge at the moment)

require(permute) ## install from R-forge if not available
x <- data.frame(A = c("D1","D1","D1","D1","D1","D2","D3","D3",
                      "D4","D4","D4","D5","D5"),
                B = c("A1","A3","A4","A5","A6","A5","A5","A6",
                      "A6","A1","A2","A5","A6"))
x <- x[order(x$B), ]
x <- transform(x, freq = rep((lens <- sapply(with(x, split(B, B)), 
                             length)), lens))
set.seed(529)
ind <- permuted.index(NROW(x), control = permControl(strata = factor(x$freq)))

Which gives:

R> x[ind, ]
    A  B freq
10 D4 A1    2
1  D1 A1    2
11 D4 A2    1
2  D1 A3    1
3  D1 A4    1
12 D5 A5    4
4  D1 A5    4
9  D4 A6    4
13 D5 A6    4
5  D1 A6    4
6  D2 A5    4
8  D3 A6    4
7  D3 A5    4
R> ind
 [1]  2  1  3  4  5  9  6 12 13 10  7 11  8

We can wrap this is a statement to generate n permutations

ctrl <- permControl(strata = factor(x$freq))
n <- 10
set.seed(83)
IND <- replicate(n, permuted.index(NROW(x), control = ctrl))

Which gives:

> IND
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    2    2    1    2    1    2    1    2    1     1
 [2,]    1    1    2    1    2    1    2    1    2     2
 [3,]    3    5    4    3    5    5    4    5    5     5
 [4,]    5    3    5    5    3    4    5    4    4     4
 [5,]    4    4    3    4    4    3    3    3    3     3
 [6,]    9   12   11   12    6   10   13   10    8    13
 [7,]   10   11    6   11   13    7    7   12    7     9
 [8,]    8    9    9   10    8    6   11   13   12    10
 [9,]   12   10    8    6    9   13    9    6    9    11
[10,]   13    6   12    9    7    9    8    8   13     8
[11,]    6    7   10   13   12   11    6   11   10     7
[12,]   11    8   13    7   11    8   10    7    6    12
[13,]    7   13    7    8   10   12   12    9   11     6

Now you also need to do some special sampling. If I understand correctly, what you want is to identify which frequency level consists of only a single level of B. Then possibly, at random, replace the B's in that frequency level with B's selected at random from the B's in adjacent frequency classes. If this is so, then it is a bit more complex to get the right rows to replace, but I think the function below does it:

randSampleSpecial <- function(x, replace = TRUE) {
    ## have we got access to permute?
    stopifnot(require(permute))
    ## generate a random permutation within the levels of freq
    ind <- permuted.index(NROW(x), 
                          control = permControl(strata = factor(x$freq)))
    ## split freq into freq classes
    ranks <- with(x, split(freq, freq))
    ## rank the freq classes
    Ranked <- rank(as.numeric(names(ranks)))
    ## split the Bs on basis of freq classes
    Bs <- with(x, split(B, freq))
    ## number of unique Bs in freq class
    uniq <- sapply(Bs, function(x) length(unique(x)))
    ## which contain only a single type of B?
    repl <- which(uniq == 1)
    ## if there are no freq classes with only one level of B, return
    if(!(length(repl) > 0))
        return(ind) 
    ## if not, continue
    ## which of the freq classes are adjacent to unique class?
    other <- which(Ranked %in% (repl + c(1,-1)))
    ## generate uniform random numbers to decide if we replace
    Rand <- runif(length(ranks[[repl]]))
    ## Which are the rows in `x` that we want to change?
    candidates <- with(x, which(freq == as.numeric(names(uniq[repl]))))
    ## which are the adjacent values we can replace with
    replacements <- with(x, which(freq %in% as.numeric(names(uniq[other]))))
    ## which candidates to replace? Decision is random
    change <- sample(candidates, sum(Rand > 0.5))
    ## if we are changing a candidate, sample from the replacements and
    ## assign
    if(length(change) > 0)
        ind[candidates][change] <- sample(ind[replacements], length(change), 
                                          replace = replace)
    ## return
    ind
}

To use this, we do:

R> set.seed(35)
R> randSampleSpecial(x)
 [1]  2  1  5  3  4  6  9 12 10 11  7  8 13

We can wrap this in a replicate() call to produce many such replacements:

R> IND <- replicate(10, randSampleSpecial(x))
R> IND
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]   11    3    6    4    2    1    1    2   10     3
 [2,]    1   11    1   12   11   11    2    1    1    13
 [3,]    4    5    4    3    4    3    4    5    5     4
 [4,]    5    4    5    5    5    4    5    3    3     3
 [5,]    3    3    3    4    3    5    3    4    4     5
 [6,]   11    7   11   12    9    6    7    8    9     9
 [7,]   13   12   12    7   11    7    9   10    8    10
 [8,]   10    8    9    8   12   12    8    6   13     8
 [9,]    7    9   13   10    8   10   13    9   12    11
[10,]    6   11   10   11   10   13   12   13   10    13
[11,]   12   10    6    6    6    9   11   12    7    12
[12,]    9    6    7    9    7    8   10    7    6     7
[13,]    8   13    8   13   13   11    6   11   11     6

For this data set, we know that it is rows 1 and 2 in the sorted x that we might want to replace with values from the other freq classes. If we had done no replacements, the first two rows of IND would have values 1 or 2 only in them (see IND from earlier). In the new IND, where a value in the first two rows is not a 1 or 2, we have replaced it with a B from one of the adjacent frequency classes.

My function assumes you want to:

  1. Only replace elements in the homogeneous frequency class with one of the adjacent class at random! If you want to always replace then we change change the function to suit.
  2. That if we are doing a replacement, that replacement can be any of the replacements and if we need more than 1 replacement the same replacement can be selected more than once. Set replace = FALSE in the call to do the sampling without replacement if that is what you want.
  3. The function assumes you only have a single monospecific frequency class. If should be easy to modify using a loop over the two or more monospecific classes, but that does complicated the function and as your description of the problem was not overly clear, I kept things simple.


@Gavin gives you a nice approach, and asked if somebody could come up with something simpler. The next function does the same, based on only base functions. It uses count to process the frequencies, and takes into account that for the minimal en maximal frequency there is only one neighbouring rank. The function of Gavin gives an error in that case.

Permdf <- function(x,v){
  # some code to allow Permdf(df,var)
  mc <- match.call()
  v <- as.quoted(mc$v)
  y <- unlist(eval.quoted(v,x))
  # make bins with values in v per frequency
  freqs <- count(x,v)
  bins <- split(freqs[[1]],freqs[[2]])
  nbins <- length(bins)
  # define the output
  dfid <- 1:nrow(x)

  for (i in 1:nbins){
    # which id's to change
    id <- which(y %in% bins[[i]])

    if(length(bins[[i]]) > 1){
      # in case there's more than one value for that frequency
      dfid[id] <- sample(dfid[id])
    } else {
      bid <- c(i-1,i,i+1)
      # control wether id in range
      bid <- bid[bid > 0 & bid <=nbins]
      # id values to choose from
      vid <- which(y %in% unlist(bins[bid]))
      # random selection
      dfid[id] <- sample(vid,length(id),replace=TRUE)
    }
  }
  #return
  dfid
}

This can be used as

Permdf(x,B)


The bottom half of your question regarding randomization is a bit unclear, but here's a start. When you update your question - I'll update the answer accordingly. The code below adds the count information for the column B and then samples the rows based on the value of the frequency column we added. I think all that's needed from here is to modify the availability of which columns can be used for sampling, but please confirm what you want.

require(plyr)
x <- merge(x,count(x, "B"))
ddply(x, "freq", function(x) sample(x))
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜