How to find top n% of records in a column of a dataframe using R
I have a dataset showing the exchange rate of the Australian Dollar versus the US dollar once a day over a period of about 20 years. I have the data in a data frame, with the first column being the date, and the second column being the exchange rate. Here's a sample from the data:
>data
V1 V2
1 12/12/1983 0.9175
2 13/12/1983 0.9010
3 14/12/1983 0.9000
4 15/12/1983 0.8978
5 16/12/1983 0.8928
6 19/12/1983 0.8770
7 20/12/1983 0.8795
8 21/12/1983 0.8905
9 22/12/1983 0.9005
10 23/12/1983 0.9005
How would I go about displaying the top n% of these records? E.g. say I want to see the days and exchange rates for those days where the exchange rate falls in the top 5% of all exchange rates in the dataset?
For the top 5%:
n <- 5
# keep the rows whose rate exceeds the (100 - n)th percentile of V2
data[data$V2 > quantile(data$V2, probs = 1 - n/100), ]
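To see how the quantile filter behaves, here is a small self-contained sketch built from the ten sample rows in the question. With only ten rows a 5% cut selects nothing, so n = 20 is used purely for illustration. The as.Date conversion is an extra step that is not in the question, and it assumes the dates are in day/month/year format.

```r
# Rebuild the ten sample rows from the question
data <- data.frame(
  V1 = c("12/12/1983", "13/12/1983", "14/12/1983", "15/12/1983", "16/12/1983",
         "19/12/1983", "20/12/1983", "21/12/1983", "22/12/1983", "23/12/1983"),
  V2 = c(0.9175, 0.9010, 0.9000, 0.8978, 0.8928,
         0.8770, 0.8795, 0.8905, 0.9005, 0.9005),
  stringsAsFactors = FALSE
)

# Assumption: dates are day/month/year, as the sample suggests
data$V1 <- as.Date(data$V1, format = "%d/%m/%Y")

n <- 20                                          # illustrative; the question asks for 5
cutoff <- quantile(data$V2, probs = 1 - n/100)   # the (100 - n)th percentile of V2
top <- data[data$V2 > cutoff, ]                  # rows strictly above the cutoff
print(top)
```

Because the comparison is a strict >, rows exactly equal to the cutoff are left out; use >= if you want them included.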
Alternatively, for the top 5%, sort by rate in descending order and take the first 5% of the rows:
head(data[order(data$V2, decreasing = TRUE), ], floor(0.05 * nrow(data)))
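If you need this for several columns or percentages, the sort-and-take approach can be wrapped in a small function. This is only a sketch: top_percent is a made-up name, and rounding the row count up with ceiling is my own choice (so that a small dataset still returns at least one row), not something the question specifies.

```r
# Hypothetical helper: return the rows of df with the highest values in column col,
# keeping the top n percent of rows (row count rounded up).
top_percent <- function(df, col, n) {
  k <- ceiling(n * nrow(df) / 100)                        # number of rows to keep
  ranked <- df[order(df[[col]], decreasing = TRUE), , drop = FALSE]
  head(ranked, k)
}

# The ten sample rows from the question
data <- data.frame(
  V1 = c("12/12/1983", "13/12/1983", "14/12/1983", "15/12/1983", "16/12/1983",
         "19/12/1983", "20/12/1983", "21/12/1983", "22/12/1983", "23/12/1983"),
  V2 = c(0.9175, 0.9010, 0.9000, 0.8978, 0.8928,
         0.8770, 0.8795, 0.8905, 0.9005, 0.9005),
  stringsAsFactors = FALSE
)

res <- top_percent(data, "V2", 20)   # 20% of 10 rows = 2 rows
print(res)
```

Unlike the quantile filter, this always returns a fixed number of rows, so when several rows tie at the boundary only some of them are kept.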
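Another solution is to use sqldf. Order the rows by V2 in descending order so that the highest rates come first, then limit the result to 5% of the row count. Here df stands for the name of your data frame; the CAST keeps the limit an integer, which SQLite requires:
library(sqldf)
sqldf('SELECT * FROM df
ORDER BY V2 DESC
LIMIT (SELECT CAST(0.05 * COUNT(*) AS INTEGER) FROM df)
')
You can change the rate from 0.05 (5%) to any required rate.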
A dplyr solution could look like this:
library(dplyr)
obs <- nrow(data)
data %>% filter(row_number() <= obs * 0.05)
This only works if the data is already sorted in descending order of the variable you're interested in. The sample data is ordered by date rather than by rate, so arrange it by V2 before filtering:
data <- data %>% arrange(desc(V2))
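As a possible shortcut, newer versions of dplyr can do the sorting and the cut in one step with slice_max(). The sketch below assumes dplyr 1.0.0 or later, and uses prop = 0.2 on the ten sample rows only because 5% of ten rows would select nothing.

```r
library(dplyr)

# The ten sample rows from the question
data <- data.frame(
  V1 = c("12/12/1983", "13/12/1983", "14/12/1983", "15/12/1983", "16/12/1983",
         "19/12/1983", "20/12/1983", "21/12/1983", "22/12/1983", "23/12/1983"),
  V2 = c(0.9175, 0.9010, 0.9000, 0.8978, 0.8928,
         0.8770, 0.8795, 0.8905, 0.9005, 0.9005),
  stringsAsFactors = FALSE
)

# Keep the 20% of rows with the largest V2, highest first
top <- data %>% slice_max(V2, prop = 0.2)
print(top)
```

By default slice_max() keeps ties, so it can return more rows than the requested proportion when several rows share the boundary value; pass with_ties = FALSE if you want a fixed count.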