Position in vector based on approximate matching
I have the sorted vector
m<-c(1.1, 3.2, 3.6, 4, 4.6, 4.6, 5.6, 5.7, 6.2, 8.9)
I want to find 开发者_StackOverflowthe position of a value based on approximate matching. If the value does not exist in the vector i would like the position of the immediately previous value
for exact matching I would use
> match(4,m)
[1] 4
But if I do
> match(6,m)
[1] NA
What i would like to get in this example is 8
(the position of the immidiately previous value of 6 which is the position of 5.7 which is 8)
Thank you in advance
There is a built-in function that does precisely what you want: findInterval
...It's vectorized as well, so you can give it several values to find in one go which is much more efficient.
m <- c(1.1, 3.2, 3.6, 4, 4.6, 4.6, 5.6, 5.7, 6.2, 8.9)
# Find nearest (lower) index for both 4 and 6
findInterval(c(4,6), m)
# [1] 4 8
Use which.max
in combination with vector subsetting, a solution of 17 characters:
which.max(m[m<=6]) # Edited to specify <=
[1] 8
Since your vector is sorted, you can use the even shorter:
sum(m<=6) # Edited to specify <=
[1] 8
This works because the value TRUE
is implicitly converted to 1 in the sum
This should do it for you.
#v is a sorted vector and x is an item for which we want the exact or nearest lower match
lowSideMatch <- function(x, v) max(which(v <= x))
lowSideMatch(6, m)
[1] 8
lowSideMatch(4, m)
[1] 4
you could use the which()
function to get the index of the element with the smallest deviation from your searched value. Also works with unordered vectors.
x <- c(8,4,1,2,3,6,9)
find <- 6
pos <- which(abs(x-find) == min(abs(x-find)))
Using c++ upper_bound
with Rcpp
might be fast.
Rcpp::cppFunction(
"int upper_bound(double v, const Rcpp::NumericVector& x) {
return std::distance(x.begin(), std::upper_bound(x.begin(), x.end(), v));
}")
upper_bound(4, m)
#[1] 4
upper_bound(6, m)
#[1] 8
Benchmark:
set.seed(42)
n <- 1e6
m <- sort(rnorm(n, 0, 2))
bench::mark(which_min = c(which(abs(m-4) == min(abs(m-4))), which(abs(m-6) == min(abs(m-6))))
, which.max = c(which.max(m[m<=4]), which.max(m[m<=6]))
, length = c(length(m[m<=4]), length(m[m<=6]))
, max_which = c(max(which(m <= 4)), max(which(m <= 6)))
, sum = c(sum(m<=4), sum(m<=6))
, findIntervall = findInterval(c(4,6), m)
, CPP_upper_bound = c(upper_bound(4, m), upper_bound(6, m)))
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl>
#1 which_min 16.33ms 18.33ms 51.1 NA 55.0 26 28
#2 which.max 9.38ms 11.46ms 73.1 NA 101. 37 51
#3 length 8.41ms 10.06ms 79.7 NA 112. 40 56
#4 max_which 5.89ms 7.48ms 136. NA 67.8 68 34
#5 sum 2.68ms 3.33ms 274. NA 45.9 137 23
#6 findIntervall 679.33µs 788.1µs 1214. NA 0 607 0
#7 CPP_upper_bound 2.31µs 2.75µs 307024. NA 30.7 10000 1
Something like:
> m<-c(1.1, 3.2, 3.6, 4, 4.6, 4.6, 5.6, 5.7, 6.2, 8.9)
> ( index = length(m[m<=4]) )
[1] 4
> ( m[index] )
[1] 4
> ( index = length(m[m<=6]) )
[1] 8
> ( m[index] )
[1] 5.7
精彩评论