Split vector of strings and paste subset of resulting elements into a new vector

Define

z <- c("1_xx xx xxx_xxxx_12_sep.xls", "2_xx xx xxx_xxxx_15_aug.xls")

such that

> z
[1] "1_xx xx xxx_xxxx_12_sep.xls" "2_xx xx xxx_xxxx_15_aug.xls"

I want to create a vector w such that

> w
[1] "1_12_sep" "2_15_aug"

That is, split each element of z by _ and then join elements 1,4,5, with the .xls removed from the latter.

I can manage the split part, but I am not sure what function to provide, e.g. something like:

w <- as.character(lapply(strsplit(z,"_"), function(x) ???))


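You can do this using a combination of strsplit, substr and lapply. Note that lapply returns a list, so wrap the call in unlist to get a character vector, and that substr(x[5],1,3) assumes the month abbreviation is always three letters long:

y <- strsplit(z,"_",fixed=TRUE)
unlist(lapply(y,FUN=function(x){paste(x[1],x[4],substr(x[5],1,3),sep="_")}))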
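
As a sketch of a variant that returns a character vector directly and does not depend on the month abbreviation being three letters long, one could use vapply together with tools::file_path_sans_ext. Both are substitutions of mine, not part of the answer above, and the sketch assumes every file name splits into exactly five fields:

```r
z <- c("1_xx xx xxx_xxxx_12_sep.xls", "2_xx xx xxx_xxxx_15_aug.xls")

# Split each file name on the literal underscore
parts <- strsplit(z, "_", fixed = TRUE)

# vapply returns a character vector and errors if any result
# is not a single string, which lapply would silently allow
w <- vapply(
  parts,
  function(x) paste(x[1], x[4], tools::file_path_sans_ext(x[5]), sep = "_"),
  character(1)
)
w
```

tools::file_path_sans_ext ships with base R and strips the extension whatever its length, so the same code would also cope with, say, .xlsx files.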


Using a bit of magic in the stringr package: I separately extract the left and right date fields, combine them, and finally remove the .xls at the end.

library(stringr)
l <- str_extract(z, "\\d+_")
r <- str_extract(z, "\\d+_\\w*\\.xls")
gsub("\\.xls$", "", paste(l, r, sep=""))

[1] "1_12_sep" "2_15_aug"

str_extract is a wrapper around some of the base R functions which I find easier to use.

Edit: Here is a short explanation of what the regex does:

  • \\d+ matches one or more digits. The backslash distinguishes the digit class from a literal d, and it is doubled because the pattern is written as an R string.
  • \\w* matches zero or more word characters (letters, digits and the underscore). Again, the backslash is doubled.
  • \\. matches a literal dot. It needs to be escaped because an unescaped dot matches any single character.

In theory the regex should be quite flexible. It should handle both single-digit and double-digit values in the leading number and in the day.
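
To check that flexibility, here is a minimal sketch that applies the same two patterns to file names with a single-digit day and a two-digit leading number. The file names in z2 are made up for illustration, and base R's regmatches/regexpr stand in for str_extract so the snippet runs without stringr:

```r
# Hypothetical file names, not taken from the question
z2 <- c("3_xx xx xxx_xxxx_7_oct.xls", "10_xx xx xxx_xxxx_21_nov.xls")

# First match of each pattern per element (base R stand-in for str_extract).
# regmatches() drops elements with no match; every element matches here.
l <- regmatches(z2, regexpr("\\d+_", z2, perl = TRUE))
r <- regmatches(z2, regexpr("\\d+_\\w*\\.xls", z2, perl = TRUE))

# Join the two pieces and strip the literal ".xls" at the end
w2 <- sub("\\.xls$", "", paste0(l, r))
w2
```

The second pattern cannot match at the leading number because \\w* stops at the first space, so the first successful match starts at the day field.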


One call to gsub (and some regex magic based on @Andrie's answer) can do this. See ?regexp for details on what I used in the pattern and replacement (back-reference) arguments.

gsub("^(\\d+_).*_(\\d+_\\w*)\\.xls$", "\\1\\2", z)
# [1] "1_12_sep" "2_15_aug"
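
To see what the two back-references pick up, the following sketch pulls the capture groups out with regexec and regmatches before pasting them together. The pattern here escapes the dot and anchors the end of the string, which is a slightly stricter form of my own rather than the exact pattern in the answer:

```r
z <- c("1_xx xx xxx_xxxx_12_sep.xls", "2_xx xx xxx_xxxx_15_aug.xls")

# Group 1: leading number plus underscore; group 2: day, underscore, month
pattern <- "^(\\d+_).*_(\\d+_\\w*)\\.xls$"

# Each list element holds the full match followed by the capture groups
m <- regmatches(z, regexec(pattern, z, perl = TRUE))
m[[1]]

# Pasting groups 1 and 2 reproduces the "\\1\\2" replacement
w <- vapply(m, function(x) paste0(x[2], x[3]), character(1))
w
```

Because the pattern covers the whole string, replacing the match with "\\1\\2" leaves only the two captured pieces.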


An alternative along the same lines as @Joran's answer is this:

foo <- function(x) {
    o <- paste(x[c(1,4,5)], collapse = "_")
    substr(o, 1, nchar(o) - 4) 
}

sapply(strsplit(z, "_"), foo)

The differences are minor - I use collapse = "_" and nchar() but other than that it is similar.

You can write this as a one-liner

sapply(strsplit(z, "_"), 
       function(x) {o <- paste(x[c(1,4,5)], 
                               collapse = "_"); substr(o, 1, nchar(o)-4)})

but writing the custom function to apply is nicer.
