Matching multiple patterns

I want to check whether "001", "100", or "000" occurs in a 4-character string of 0s and 1s. For example, the string could be "1100", "0010", "1001", or "1111". How do I match several patterns against a string with a single command?

I know grep can be used for pattern matching, but with grep I can only check one pattern at a time. I want to know whether multiple patterns can be checked with some other command, or with grep itself.


Yes, you can. In a regular expression, | means "or", so you can test for all three strings at once by using "001|100|000" as your pattern. grep is also vectorised, so the whole vector can be checked in one step:

x <- c("1100", "0010", "1001", "1111")
pattern <- "001|100|000"

grep(pattern, x)
[1] 1 2 3

This returns the indices of the elements of your vector that contain a match (in this case, the first three).

Sometimes it is more convenient to have a logical vector that tells you which elements of your vector matched. For that, use grepl:

grepl(pattern, x)
[1]  TRUE  TRUE  TRUE FALSE

See ?regex for help about regular expressions in R.


Edit: To avoid writing the pattern by hand, we can build it with paste:

myValues <- c("001", "100", "000")
pattern <- paste(myValues, collapse = "|")
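
As a quick check, the pasted pattern can be used exactly like the hand-written one. This is a minimal sketch that reuses the vector x from the first example; note that paste does no escaping, so it assumes the values contain no regex metacharacters.

```r
# Same input vector as in the first example
x <- c("1100", "0010", "1001", "1111")

# Build the alternation pattern from the individual values
myValues <- c("001", "100", "000")
pattern <- paste(myValues, collapse = "|")

pattern
# [1] "001|100|000"

# The generated pattern behaves like the hand-written one
grepl(pattern, x)
# [1]  TRUE  TRUE  TRUE FALSE
```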


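Here is one solution using the stringr package. str_locate returns a matrix with the start and end position of the first match in each string (NA where there is no match):

library(stringr)
mylist <- c("1100", "0010", "1001", "1111")
str_locate(mylist, "000|001|100")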


With command-line grep (outside R), use the -e option to supply additional patterns:

echo '1100' | grep -e '001' -e '110' -e '101'


If you want a logical vector, have a look at the stri_detect functions from the stringi package. Here the pattern is a regular expression, so use stri_detect_regex:

library(stringi)
stri_detect_regex(x, pattern)
## [1]  TRUE  TRUE  TRUE FALSE

And some benchmarks:

library(microbenchmark)
# 100,000 random 4-character strings of 0s and 1s
test <- stri_rand_strings(100000, 4, "[0-1]")
head(test)
## [1] "0001" "1111" "1101" "1101" "1110" "0110"
microbenchmark(stri_detect_regex(test, pattern), grepl(pattern, test))
Unit: milliseconds
                             expr      min       lq     mean   median       uq      max neval
 stri_detect_regex(test, pattern) 29.67405 30.30656 31.61175 30.93748 33.14948 35.90658   100
             grepl(pattern, test) 36.72723 37.71329 40.08595 40.01104 41.57586 48.63421   100


Sorry for posting this as an additional answer, but it is too long for a comment.

I just wanted to point out that the number of items that can be pasted together via paste(..., collapse = "|") and used as a single matching pattern is limited; see below. Maybe somebody can tell where exactly the limit is? Admittedly, the number of items used here might not be realistic, but depending on the task it should not be ruled out entirely.

For a really large number of items, a loop would be required to check each item separately.

set.seed(0)
samplefun <- function(n, x, collapse){
  paste(sample(x, n, replace=TRUE), collapse=collapse)
}

words <- sapply(rpois(10000000, 8) + 1, samplefun, letters, '')
text <- sapply(rpois(1000, 5) + 1, samplefun, words, ' ')

# Since execution takes a while, the following lines are commented out

#result <- grepl(paste(words, collapse = "|"), text)

# Error in grepl(pattern, text) : 
#   invalid regular expression 
# 'wljtpgjqtnw|twiv|jphmer|mcemahvlsjxr|grehqfgldkgfu|
# ...

#result <- stringi::stri_detect_regex(text, paste(words, collapse = "|"))

# Error in stringi::stri_detect_regex(text, paste(words, collapse = "|")) : 
# Pattern exceeds limits on size or complexity. (U_REGEX_PATTERN_TOO_BIG)
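
The loop mentioned above could look roughly like the following. This is only a sketch: match_any is a hypothetical helper name, and it assumes the items are literal strings (hence fixed = TRUE) rather than regular expressions.

```r
# Hypothetical helper: returns TRUE for each element of `text`
# that contains at least one of the strings in `items`.
match_any <- function(items, text) {
  hit <- logical(length(text))  # starts out all FALSE
  for (item in items) {
    # fixed = TRUE treats `item` as a literal string, not a regex
    hit <- hit | grepl(item, text, fixed = TRUE)
  }
  hit
}

x <- c("1100", "0010", "1001", "1111")
result <- match_any(c("001", "100", "000"), x)
result
# [1]  TRUE  TRUE  TRUE FALSE
```

Each item is checked on its own, so no single pattern ever grows large; the price is one pass over the text per item.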


You can also use the %like% operator from the data.table package.

library(data.table)

# input
  x <- c("1100", "0010", "1001", "1111")
  pattern <- "001|100|000"

# check for pattern
  x %like% pattern

[1]  TRUE  TRUE  TRUE FALSE