Fastest way to find second (third...) highest/lowest value in vector or column
R 开发者_开发问答offers max and min, but I do not see a really fast way to find another value in the order, apart from sorting the whole vector and then picking a value x from this vector.
Is there a faster way to get the second highest value, for example?
Use the partial
argument of sort()
. For the second highest value:
n <- length(x)
sort(x,partial=n-1)[n-1]
Slightly slower alternative, just for the records:
x <- c(12.45,34,4,0,-234,45.6,4)
max( x[x!=max(x)] )
min( x[x!=min(x)] )
Rfast has a function called nth_element that does exactly what you ask.
Further the methods discussed above that are based on partial sort, don't support finding the k smallest values
Update (28/FEB/21) package kit offers a faster implementation (topn) see https://stackoverflow.com/a/66367996/4729755, https://stackoverflow.com/a/53146559/4729755
Disclaimer: An issue appears to occur when dealing with integers which can by bypassed by using as.numeric (e.g. Rfast::nth(as.numeric(1:10), 2)), and will be addressed in the next update of Rfast.
Rfast::nth(x, 5, descending = T)
Will return the 5th largest element of x, while
Rfast::nth(x, 5, descending = F)
Will return the 5th smallest element of x
Benchmarks below against most popular answers.
For 10 thousand numbers:
N = 10000
x = rnorm(N)
maxN <- function(x, N=2){
len <- length(x)
if(N>len){
warning('N greater than length(x). Setting N=length(x)')
N <- length(x)
}
sort(x,partial=len-N+1)[len-N+1]
}
microbenchmark::microbenchmark(
Rfast = Rfast::nth(x,5,descending = T),
maxn = maxN(x,5),
order = x[order(x, decreasing = T)[5]])
Unit: microseconds
expr min lq mean median uq max neval
Rfast 160.364 179.607 202.8024 194.575 210.1830 351.517 100
maxN 396.419 423.360 559.2707 446.452 487.0775 4949.452 100
order 1288.466 1343.417 1746.7627 1433.221 1500.7865 13768.148 100
For 1 million numbers:
N = 1e6
x = rnorm(N)
microbenchmark::microbenchmark(
Rfast = Rfast::nth(x,5,descending = T),
maxN = maxN(x,5),
order = x[order(x, decreasing = T)[5]])
Unit: milliseconds
expr min lq mean median uq max neval
Rfast 89.7722 93.63674 114.9893 104.6325 120.5767 204.8839 100
maxN 150.2822 207.03922 235.3037 241.7604 259.7476 336.7051 100
order 930.8924 968.54785 1005.5487 991.7995 1031.0290 1164.9129 100
I wrapped Rob's answer up into a slightly more general function, which can be used to find the 2nd, 3rd, 4th (etc.) max:
maxN <- function(x, N=2){
len <- length(x)
if(N>len){
warning('N greater than length(x). Setting N=length(x)')
N <- length(x)
}
sort(x,partial=len-N+1)[len-N+1]
}
maxN(1:10)
Here is an easy way to find the indices of N smallest/largest values in a vector(Example for N = 3):
N <- 3
N Smallest:
ndx <- order(x)[1:N]
N Largest:
ndx <- order(x, decreasing = T)[1:N]
So you can extract the values as:
x[ndx]
For nth highest value,
sort(x, TRUE)[n]
Here you go... kit is the obvious winner!
N = 1e6
x = rnorm(N)
maxN <- function(x, N=2){
len <- length(x)
if(N>len){
warning('N greater than length(x). Setting N=length(x)')
N <- length(x)
}
sort(x,partial=len-N+1)[len-N+1]
}
microbenchmark::microbenchmark(
Rfast = Rfast::nth(x,5,descending = T),
maxN = maxN(x,5),
order = x[order(x, decreasing = T)[5]],
kit = x[kit::topn(x, 5L,decreasing = T)[5L]]
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# Rfast 12.311168 12.473771 16.36982 12.702134 16.110779 102.749873 100
# maxN 12.922118 13.124358 17.49628 18.977537 20.053139 28.928694 100
# order 50.443100 50.926975 52.54067 51.270163 52.323116 66.561606 100
# kit 1.177202 1.216371 1.29542 1.240228 1.297286 2.771715 100
Edit: I forgot that kit::topn
has hasna
option...let's do another run.
microbenchmark::microbenchmark(
Rfast = Rfast::nth(x,5,descending = T),
maxN = maxN(x,5),
order = x[order(x, decreasing = T)[5]],
kit = x[kit::topn(x, 5L,decreasing = T)[5L]],
kit2 = x[kit::topn(x, 5L,decreasing = T,hasna = F)[5L]],
unit = "ms"
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# Rfast 13.194314 13.358787 14.7227116 13.4560340 14.551194 24.524105 100
# maxN 7.378960 7.527661 10.0747803 7.7119715 12.217756 67.409526 100
# order 50.088927 50.488832 52.4714347 50.7415680 52.267003 70.062662 100
# kit 1.180698 1.217237 1.2975441 1.2429790 1.278243 3.263202 100
# kit2 0.842354 0.876329 0.9398055 0.9109095 0.944407 2.135903 100
Here is the simplest way I found,
num <- c(5665,1615,5154,65564,69895646)
num <- sort(num, decreasing = F)
tail(num, 1) # Highest number
head(tail(num, 2),1) # Second Highest number
head(tail(num, 3),1) # Third Highest number
head(tail(num, n),1) # Generl equation for finding nth Highest number
I found that removing the max element first and then do another max runs in comparable speed:
system.time({a=runif(1000000);m=max(a);i=which.max(a);b=a[-i];max(b)})
user system elapsed
0.092 0.000 0.659
system.time({a=runif(1000000);n=length(a);sort(a,partial=n-1)[n-1]})
user system elapsed
0.096 0.000 0.653
dplyr has the function nth, where the first argument is the vector and the second is which place you want. This goes for repeating elements as well. For example:
x = c(1,2, 8, 16, 17, 20, 1, 20)
Finding the second largest value:
nth(unique(x),length(unique(x))-1)
[1] 17
When I was recently looking for an R function returning indexes of top N max/min numbers in a given vector, I was surprised there is no such a function.
And this is something very similar.
The brute force solution using base::order function seems to be the easiest one.
topMaxUsingFullSort <- function(x, N) {
sort(x, decreasing = TRUE)[1:min(N, length(x))]
}
But it is not the fastest one in case your N value is relatively small compared to length of the vector x.
On the other side if the N is really small, you can use base::whichMax function iteratively and in each iteration you can replace found value by -Inf
# the input vector 'x' must not contain -Inf value
topMaxUsingWhichMax <- function(x, N) {
vals <- c()
for(i in 1:min(N, length(x))) {
idx <- which.max(x)
vals <- c(vals, x[idx]) # copy-on-modify (this is not an issue because idxs is relative small vector)
x[idx] <- -Inf # copy-on-modify (this is the issue because data vector could be huge)
}
vals
}
I believe you see the problem - the copy-on-modify nature of R. So this will perform better for very very very small N (1,2,3) but it will rapidly slow down for larger N values. And you are iterating over all elements in vector x N times.
I think the best solution in clean R is to use partial base::sort.
topMaxUsingPartialSort <- function(x, N) {
N <- min(N, length(x))
x[x >= -sort(-x, partial=N)[N]][1:N]
}
Then you can select the last (Nth) item from the result of functions defiend above.
Note: functions defined above are just examples - if you want to use them, you have to check/sanity inputs (eg. N > length(x)).
I wrote a small article about something very similar (get indexes of top N max/min values of a vector) at http://palusga.cz/?p=18 - you can find here some benchmarks of similar functions I defined above.
head(sort(x),..)
or tail(sort(x),...)
should work
This will find the index of the N'th smallest or largest value in the input numeric vector x. Set bottom=TRUE in the arguments if you want the N'th from the bottom, or bottom=FALSE if you want the N'th from the top. N=1 and bottom=TRUE is equivalent to which.min, N=1 and bottom=FALSE is equivalent to which.max.
FindIndicesBottomTopN <- function(x=c(4,-2,5,-77,99),N=1,bottom=FALSE)
{
k1 <- rank(x)
if(bottom==TRUE){
Nindex <- which(k1==N)
Nindex <- Nindex[1]
}
if(bottom==FALSE){
Nindex <- which(k1==(length(x)+1-N))
Nindex <- Nindex[1]
}
return(Nindex)
}
topn = function(vector, n){
maxs=c()
ind=c()
for (i in 1:n){
biggest=match(max(vector), vector)
ind[i]=biggest
maxs[i]=max(vector)
vector=vector[-biggest]
}
mat=cbind(maxs, ind)
return(mat)
}
this function will return a matrix with the top n values and their indices. hope it helps VDevi-Chou
You can identify the next higher value with cummax()
. If you want the location of the each new higher value for example you can pass your vector of cummax()
values to the diff()
function to identify locations at which the cummax()
value changed. say we have the vector
v <- c(4,6,3,2,-5,6,8,12,16)
cummax(v) will give us the vector
4 6 6 6 6 6 8 12 16
Now, if you want to find the location of a change in cummax()
you have many options I tend to use sign(diff(cummax(v)))
. You have to adjust for the lost first element because of diff()
. The complete code for vector v
would be:
which(sign(diff(cummax(v)))==1)+1
You can use the sort
keyword like this:
sort(unique(c))[1:N]
Example:
c <- c(4,2,44,2,1,45,34,2,4,22,244)
sort(unique(c), decreasing = TRUE)[1:5]
will give the first 5 max numbers.
精彩评论