R: Adding zeroes after old zeroes in a vector?
Imagine I have a vector of ones and zeroes.
Written compactly, it looks like this:
1111111100001111111111110000000001111111111100101
I need to get a new vector in which the N ones following each run of zeroes are replaced with zeroes.
For example, for N = 3:
1111111100001111111111110000000001111111111100101 becomes 1111111100000001111111110000000000001111111100000
I can do it with a for loop, but I've read that this is not good practice. How can I do it instead?
cheers
My vector is actually a zoo series, but I guess that doesn't make any difference. If I wanted zeroes all the way to the end, I would use cumprod.
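For reference, the cumprod trick mentioned above works because a running product stays at 0 once the first 0 has been seen. A small illustration on a made-up vector (x here is only toy data, not the real series):

```r
# Toy vector (hypothetical data, only for illustration)
x <- c(1, 1, 1, 0, 1, 1, 0, 1)

# The running product drops to 0 at the first zero and never recovers,
# so everything from the first zero onwards becomes 0.
filled <- cumprod(x)
print(filled)
# [1] 1 1 1 0 0 0 0 0
```

This fills to the end of the vector, which is why it does not directly solve the "only the next N ones" case asked about here.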
You can also do this with rle. All you need to do is add n to the lengths of the runs where the value is 0 and subtract n from the lengths of the runs where the value is 1 (being a little careful when there are fewer than n ones in a row). (Using Greg's method to construct the sample.)
rr <- rle(tmp)
## Pad so that it always begins with 1 and ends with 1
if (rr$values[1] == 0) {
  rr$values <- c(1, rr$values)
  rr$lengths <- c(0, rr$lengths)
}
if (rr$values[length(rr$values)] == 0) {
  rr$values <- c(rr$values, 1)
  rr$lengths <- c(rr$lengths, 0)
}
zero.indices <- seq(from=2, to=length(rr$values), by=2)
one.indices <- seq(from=3, to=length(rr$values), by=2)
rr$lengths[zero.indices] <- rr$lengths[zero.indices] + pmin(rr$lengths[one.indices], n)
rr$lengths[one.indices] <- pmax(0, rr$lengths[one.indices] - n)
inverse.rle(rr)
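As a rough, self-contained sketch, the steps above could be wrapped in a function and checked against the example from the question. The name addZerosRle and the two early returns (for empty input and for input without any zeroes) are additions made for this illustration, not part of the answer's code:

```r
# Sketch of the rle approach as a function (addZerosRle is a hypothetical name).
addZerosRle <- function(x, n) {
  if (length(x) == 0) return(x)          # added guard: nothing to do
  rr <- rle(x)
  # Pad with zero-length runs of ones so the runs alternate 1, 0, 1, ..., 1
  if (rr$values[1] == 0) {
    rr$values <- c(1, rr$values)
    rr$lengths <- c(0, rr$lengths)
  }
  if (rr$values[length(rr$values)] == 0) {
    rr$values <- c(rr$values, 1)
    rr$lengths <- c(rr$lengths, 0)
  }
  # Added guard: with no run of zeroes there is nothing to extend
  if (length(rr$values) < 3) return(x)
  zero.indices <- seq(from = 2, to = length(rr$values), by = 2)
  one.indices <- seq(from = 3, to = length(rr$values), by = 2)
  # Each run of zeroes grows by up to n, taken from the run of ones after it
  taken <- pmin(rr$lengths[one.indices], n)
  rr$lengths[zero.indices] <- rr$lengths[zero.indices] + taken
  rr$lengths[one.indices] <- rr$lengths[one.indices] - taken
  inverse.rle(rr)
}

tmp <- as.numeric(unlist(strsplit(
  "1111111100001111111111110000000001111111111100101", "")))
res <- paste(addZerosRle(tmp, 3), collapse = "")
print(res)
# [1] "1111111100000001111111110000000000001111111100000"
```

Because only the run lengths are touched, the amount of work depends on the number of runs rather than on n.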
How about just looping through the (assuming few) N instances:
addZeros <- function(x, N = 3) {
  xx <- x
  z <- x - 1
  for (i in 1:N) {
    xx <- xx + c(rep(0, i), z[-c((NROW(x) - i + 1):NROW(x))])
  }
  xx[xx<0] <- 0
  xx
}
This simply turns every zero into -1, so that adding the copies shifted by 1 to N positions subtracts from the N succeeding values; anything that ends up negative is then reset to 0.
> x <- c(1,1,1,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,0,0,1,0,1)
> x
[1] 1 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1
[39] 1 1 1 1 1 1 0 0 1 0 1
> addZeros(x)
[1] 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1
[39] 1 1 1 1 1 1 0 0 0 0 0
EDIT:
After reading your description of the data in the R-help mailing list, this clearly is not a case of small N. Hence, you might want to consider a C function for this.
In the file "addZeros.c":
void addZeros(int *x, int *N, int *n)
{
    int i, j;
    for (i = *n - 1; i > 0; i--)
    {
        if ((x[i - 1] == 0) && (x[i] == 1))
        {
            j = 0;
            while ((j < *N) && (i + j < *n) && (x[i + j] == 1))
            {
                x[i + j] = 0;
                j++;
            }
        }
    }
}
At a command prompt (on Windows, press Win+R and type cmd), run "R CMD SHLIB addZeros.c". If R is not on the path (i.e. you get an "unknown command R" error), you need to give the full path to the executable. On my system:
"c:\Program Files\R\R-2.10.1\bin\R.exe" CMD SHLIB addZeros.c
On Windows this should produce a DLL (a .so file on Linux). If you do not already have Rtools, you should download and install it first (it is a collection of tools, such as Perl and MinGW). Download the newest version from http://www.murdoch-sutherland.com/Rtools/
The R wrapper function for this would be:
addZeros2 <- function(x, N) {
  if (!is.loaded("addZeros"))
    dyn.load(file.path(paste("addZeros", .Platform$dynlib.ext, sep = "")))
  .C("addZeros",
     x = as.integer(x),
     as.integer(N),
     as.integer(NROW(x)))$x
}
Note that the working directory in R should be the directory containing the DLL (on my system, setwd("C:/Users/eyjo/Documents/Forrit/R/addZeros")) before the addZeros2 wrapper is called for the first time (alternatively, just give the full path to the DLL file in dyn.load). It is good practice to keep these files in a sub-directory under the project (e.g. "c"); then just add "c/" in front of "addZeros" in the file path.
To illustrate:
> x <- rbinom(1000000, 1, 0.9)
>
> system.time(addZeros(x, 10))
user system elapsed
0.45 0.14 0.59
> system.time(addZeros(x, 400))
user system elapsed
15.87 3.70 19.64
>
> system.time(addZeros2(x, 10))
user system elapsed
0.01 0.02 0.03
> system.time(addZeros2(x, 400))
user system elapsed
0.03 0.00 0.03
>
Here addZeros is my original suggestion written purely in R, and addZeros2 uses the C function.
x <- c(1,1,1,1,1,1,1,1,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1,0,1)
n <- 3
z <- rle(x)
tmp <- cumsum(z$lengths)
for (i in seq(which.min(z$values),max(which(z$values==1)),2)) {
  if (z$lengths[i+1] < n) x[tmp[i]:(tmp[i] + z$lengths[i+1])] <- 0
  else x[tmp[i]:(tmp[i]+n)] <- 0
}
Here is one way:
> tmp <- strsplit('1111111100001111111111110000000001111111111100101','')
> tmp <- as.numeric(unlist(tmp))
>
> n <- 3
>
> tmp2 <- embed(tmp, n+1)
>
> tmp3 <- tmp
> tmp3[ which( apply( tmp2, 1, function(x) any(x==0) ) ) + n ] <- 0
>
> paste(tmp3, collapse='')
[1] "1111111100000001111111110000000000001111111100000"
Whether this is better than a loop or not is up to you.
Note that this will not change any of the first n elements, even if one of them follows a 0.
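To make that caveat concrete, here is a small sketch on a toy vector that starts with a zero. The left-padding workaround at the end is only one possible fix and is an assumption of this example, not something the approach above includes:

```r
n <- 3
tmp <- c(0, 1, 1, 1, 1, 1)   # toy input that starts with a zero

# Row r of embed(tmp, n + 1) holds tmp[r + n], tmp[r + n - 1], ..., tmp[r]
hasZero <- apply(embed(tmp, n + 1), 1, function(w) any(w == 0))

# Approach as above: elements 2 and 3 are never reached
tmp3 <- tmp
tmp3[which(hasZero) + n] <- 0
print(tmp3)
# [1] 0 1 1 0 1 1

# Possible workaround (assumption): left-pad with n ones so that every
# element of tmp gets its own full window
padded <- c(rep(1, n), tmp)
hasZeroPadded <- apply(embed(padded, n + 1), 1, function(w) any(w == 0))
fixed <- tmp
fixed[hasZeroPadded] <- 0
print(fixed)
# [1] 0 0 0 0 1 1
```

With the padding, row r of the embedded matrix lines up with element r of the original vector, so no index shift is needed.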
Here is another way:
> library(gtools)
>
> tmpfun <- function(x) {
+ if(any(x==0)) {
+ 0
+ } else {
+ x[length(x)]
+ }
+ }
>
> tmp4 <- running( tmp, width=4, fun=tmpfun,
+ allow.fewer=TRUE )
>
> tmp4 <- unlist(tmp4)
> paste(tmp4, collapse='')
[1] "1111111100000001111111110000000000001111111100000"
>
To follow up on my previous comment, if speed is in fact a concern - converting the vector to a string and using regex may well be faster than other solutions. First a function:
replaceZero <- function(x,n){
  x <- gsub(paste("01.{",n-1,"}", sep = "") , paste(rep(0,n+1),collapse = ""), x)
}
Generate data
z <- sample(0:1, 1000000, replace = TRUE)
z <- paste(z, collapse="")
repz <- replaceZero(z,3)
repz <- as.numeric(unlist(strsplit(repz, "")))
System time to collapse, run regex, and split back into vector:
Regex method
user system elapsed
2.39 0.04 2.39
Greg's method
user system elapsed
17.39 0.17 18.30
Jonathon's method
user system elapsed
2.47 0.02 2.31
I really like the idea of using a "regular expression" for this so I gave a vote up for that. (Wish I had gotten an rle answer in too and learned something from the embed and running answers. Neat!) Here's a variation on Chase's answer that I think may address the issues raised:
replaceZero2 <- function(x, n) {
  if (n == 0) {
    return(x)
  }
  xString <- paste(x, collapse="")
  result <- gsub(paste("(?<=",
                       paste("01{", 0:(n - 1), "}", sep="", collapse="|"),
                       ")1", sep=""),
                 "0", xString, perl=TRUE)
  return(as.numeric(unlist(strsplit(result, ""))))
}
This seems to produce identical results to Chang's rle method for n = 1,2,3,4,5 on gd047's example input.
Maybe you could write this more cleanly using \K?
I've found a solution myself. I think it's very easy and not very slow. I guess if someone could compile it in C++ it would be very fast because it has just one loop.
f5 <- function(z, N) {
  x <- z
  count <- 0
  for (i in 1:length(z)) {
    if (z[i]==0) { count <- N }
    else {
      if (count >0) {
        x[i] <- 0
        count <- count-1 }
    }
  }
  x
}
Using a moving minimum function is very fast, simple, and not dependent on the distribution of spans:
x <- rbinom(1000000, 1, 0.9)
system.time(movmin(x, 3, na.rm=T))
# user system elapsed
# 0.11 0.02 0.13
The following simple definition of movmin suffices (the complete function has some functionality that is superfluous here, such as using the van Herk/Gil-Werman algorithm for large N):
movmin = function(x, n, na.rm=F) {
  x = c(rep.int(NA, n - 1), x) # left pad
  do.call(pmin, c(lapply(1:n, function(i) x[i:(length(x) - n + i)]), na.rm=na.rm))
}
Actually you need a window size of 4 because you affect the 3 values following a zero. This matches your f5:
x <- rbinom(1000000, 1, 0.9)
all.equal(f5(x, 3), movmin(x, 4, na.rm=T))
# [1] TRUE
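As a quick check on the example from the question, here is a sketch that repeats the simplified movmin definition so that it runs on its own and applies it with a window of N + 1. The names N, v and out are chosen only for this illustration:

```r
# Simplified movmin as defined above: minimum over a trailing window of width n
movmin <- function(x, n, na.rm = FALSE) {
  x <- c(rep.int(NA, n - 1), x)  # left pad so the output keeps the input's length
  do.call(pmin, c(lapply(1:n, function(i) x[i:(length(x) - n + i)]), na.rm = na.rm))
}

N <- 3
v <- as.numeric(unlist(strsplit(
  "1111111100001111111111110000000001111111111100101", "")))

# Window of N + 1: the element itself plus the N elements before it
out <- paste(movmin(v, N + 1, na.rm = TRUE), collapse = "")
print(out)
# [1] "1111111100000001111111110000000000001111111100000"
```

An element ends up as 0 exactly when it, or one of the N elements before it, is 0, which is the behaviour asked for in the question.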