Splitting a continuous variable into equal sized groups
I need to split/divide up a continuous variable into 3 equal sized groups.
Example data frame:
das <- data.frame(anim = 1:15,
wt = c(181,179,180.5,201,201.5,245,246.4,
189.3,301,354,369,205,199,394,231.3))
After being cut up (according to the value of wt
), I would need to have the 3 classes under the new variable wt2
like this:
> das
anim wt wt2
1 1 181.0 1
2 2 179.0 1
3 开发者_如何学编程3 180.5 1
4 4 201.0 2
5 5 201.5 2
6 6 245.0 2
7 7 246.4 3
8 8 189.3 1
9 9 301.0 3
10 10 354.0 3
11 11 369.0 3
12 12 205.0 2
13 13 199.0 1
14 14 394.0 3
15 15 231.3 2
This would be applied to a large data set.
Or see cut_number
from the ggplot2
package, e.g.
das$wt_2 <- as.numeric(cut_number(das$wt,3))
Note that cut(...,3)
divides the range of the original data into three ranges of equal lengths; it doesn't necessarily result in the same number of observations per group if the data are unevenly distributed (you can replicate what cut_number
does by using quantile
appropriately, but it's a nice convenience function). On the other hand, Hmisc::cut2()
using the g=
argument does split by quantiles, so is more or less equivalent to ggplot2::cut_number
. I might have thought that something like cut_number
would have made its way into dplyr
by so far, but as far as I can tell it hasn't.
try this:
split(das, cut(das$anim, 3))
if you want to split based on the value of wt
, then
library(Hmisc) # cut2
split(das, cut2(das$wt, g=3))
anyway, you can do that by combining cut
, cut2
and split
.
UPDATED
if you want a group index as an additional column, then
das$group <- cut(das$anim, 3)
if the column should be index like 1, 2, ..., then
das$group <- as.numeric(cut(das$anim, 3))
UPDATED AGAIN
try this:
> das$wt2 <- as.numeric(cut2(das$wt, g=3))
> das
anim wt wt2
1 1 181.0 1
2 2 179.0 1
3 3 180.5 1
4 4 201.0 2
5 5 201.5 2
6 6 245.0 2
7 7 246.4 3
8 8 189.3 1
9 9 301.0 3
10 10 354.0 3
11 11 369.0 3
12 12 205.0 2
13 13 199.0 1
14 14 394.0 3
15 15 231.3 2
Here's another solution using the bin_data()
function from the mltools package.
library(mltools)
# Resulting bins have an equal number of observations in each group
das[, "wt2"] <- bin_data(das$wt, bins=3, binType = "quantile")
# Resulting bins are equally spaced from min to max
das[, "wt3"] <- bin_data(das$wt, bins=3, binType = "explicit")
# Or if you'd rather define the bins yourself
das[, "wt4"] <- bin_data(das$wt, bins=c(-Inf, 250, 322, Inf), binType = "explicit")
das
anim wt wt2 wt3 wt4
1 1 181.0 [179, 200.333333333333) [179, 250.666666666667) [-Inf, 250)
2 2 179.0 [179, 200.333333333333) [179, 250.666666666667) [-Inf, 250)
3 3 180.5 [179, 200.333333333333) [179, 250.666666666667) [-Inf, 250)
4 4 201.0 [200.333333333333, 245.466666666667) [179, 250.666666666667) [-Inf, 250)
5 5 201.5 [200.333333333333, 245.466666666667) [179, 250.666666666667) [-Inf, 250)
6 6 245.0 [200.333333333333, 245.466666666667) [179, 250.666666666667) [-Inf, 250)
7 7 246.4 [245.466666666667, 394] [179, 250.666666666667) [-Inf, 250)
8 8 189.3 [179, 200.333333333333) [179, 250.666666666667) [-Inf, 250)
9 9 301.0 [245.466666666667, 394] [250.666666666667, 322.333333333333) [250, 322)
10 10 354.0 [245.466666666667, 394] [322.333333333333, 394] [322, Inf]
11 11 369.0 [245.466666666667, 394] [322.333333333333, 394] [322, Inf]
12 12 205.0 [200.333333333333, 245.466666666667) [179, 250.666666666667) [-Inf, 250)
13 13 199.0 [179, 200.333333333333) [179, 250.666666666667) [-Inf, 250)
14 14 394.0 [245.466666666667, 394] [322.333333333333, 394] [322, Inf]
15 15 231.3 [200.333333333333, 245.466666666667) [179, 250.666666666667) [-Inf, 250)
If you want to split into 3 equally distributed groups, the answer is the same as Ben Bolker's answer above - use ggplot2::cut_number()
. For sake of completion here are the 3 methods of converting continuous to categorical (binning).
cut_number()
: Makes n groups with (approximately) equal numbers of observationcut_interval()
: Makes n groups with equal rangecut_width()
: Makes groups of width
My go-to is cut_number()
because this uses evenly spaced quantiles for binning observations. Here's an example with skewed data.
library(tidyverse)
skewed_tbl <- tibble(
counts = c(1:100, 1:50, 1:20, rep(1:10, 3),
rep(1:5, 5), rep(1:2, 10), rep(1, 20))
) %>%
mutate(
counts_cut_number = cut_number(counts, n = 4),
counts_cut_interval = cut_interval(counts, n = 4),
counts_cut_width = cut_width(counts, width = 25)
)
# Data
skewed_tbl
#> # A tibble: 265 x 4
#> counts counts_cut_number counts_cut_interval counts_cut_width
#> <dbl> <fct> <fct> <fct>
#> 1 1 [1,3] [1,25.8] [-12.5,12.5]
#> 2 2 [1,3] [1,25.8] [-12.5,12.5]
#> 3 3 [1,3] [1,25.8] [-12.5,12.5]
#> 4 4 (3,13] [1,25.8] [-12.5,12.5]
#> 5 5 (3,13] [1,25.8] [-12.5,12.5]
#> 6 6 (3,13] [1,25.8] [-12.5,12.5]
#> 7 7 (3,13] [1,25.8] [-12.5,12.5]
#> 8 8 (3,13] [1,25.8] [-12.5,12.5]
#> 9 9 (3,13] [1,25.8] [-12.5,12.5]
#> 10 10 (3,13] [1,25.8] [-12.5,12.5]
#> # ... with 255 more rows
summary(skewed_tbl$counts)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1.00 3.00 13.00 25.75 42.00 100.00
# Histogram showing skew
skewed_tbl %>%
ggplot(aes(counts)) +
geom_histogram(bins = 30)
# cut_number() evenly distributes observations into bins by quantile
skewed_tbl %>%
ggplot(aes(counts_cut_number)) +
geom_bar()
# cut_interval() evenly splits the interval across the range
skewed_tbl %>%
ggplot(aes(counts_cut_interval)) +
geom_bar()
# cut_width() uses the width = 25 to create bins that are 25 in width
skewed_tbl %>%
ggplot(aes(counts_cut_width)) +
geom_bar()
Created on 2018-11-01 by the reprex package (v0.2.1)
Alternative without using cut2.
das$wt2 <- as.factor( as.numeric( cut(das$wt,3)))
or
das$wt2 <- as.factor( cut(das$wt,3, labels=F))
As pointed out by @ben-bolker this splits into equal-widths rather occupancy.
I think that using quantiles
one can approximate equal-occupancy
x = rnorm(10)
x
[1] -0.1074316 0.6690681 -1.7168853 0.5144931 1.6460280 0.7014368
[7] 1.1170587 -0.8503069 0.4462932 -0.1089427
bin = 3 #for 1/3 rd, 4 for 1/4, 100 for 1/100th etc
xx = cut(x, quantile(x, breaks=1/bin*c(1:bin)), labels=F, include.lowest=T)
table(xx)
1 2 3 4
3 2 2 3
ntile
from dplyr
now does this but behaves weirdly with NA
's.
I've used similar code in the following function that works in base R and does the equivalent of the cut2
solution above:
ntile_ <- function(x, n) {
b <- x[!is.na(x)]
q <- floor((n * (rank(b, ties.method = "first") - 1)/length(b)) + 1)
d <- rep(NA, length(x))
d[!is.na(x)] <- q
return(d)
}
cut
, when not given explicit break points divides values into bins of same width, they won't contain an equal number of items in general:
x <- c(1:4,10)
lengths(split(x, cut(x, 2)))
# (0.991,5.5] (5.5,10]
# 4 1
Hmisc::cut2
and ggplot2::cut_number
use quantiles, which will usually create groups of same size (in term of number of elements) if the data is well spread and of decent size, it's not always the case however. mltools::bin_data
can give different results but is also based on quantiles.
These functions don't always give neat results when the data contains a small number of distinct values :
x <- rep(c(1:20),c(15, 7, 10, 3, 9, 3, 4, 9, 3, 2,
23, 2, 4, 1, 1, 7, 18, 37, 6, 2))
table(x)
# x
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# 15 7 10 3 9 3 4 9 3 2 23 2 4 1 1 7 18 37 6 2
table(Hmisc::cut2(x, g=4))
# [ 1, 6) [ 6,12) [12,19) [19,20]
# 44 44 70 8
table(ggplot2::cut_number(x, 4))
# [1,5] (5,11] (11,18] (18,20]
# 44 44 70 8
table(mltools::bin_data(x, bins=4, binType = "quantile"))
# [1, 5) [5, 11) [11, 18) [18, 20]
# 35 30 56 45
This is not clear if the optimal solution has been found here.
What is the best binning approach is a subjective matter, but one reasonable way to approach it is to look for the bins that minimize the variance around the expected bin size.
The function smart_cut
from (my) package cutr
proposes such feature. It's computationally heavy though and should be reserved to cases where cut points and unique values are few (which happen to be usually the case where it matters).
# devtools::install_github("moodymudskipper/cutr")
table(cutr::smart_cut(x, list(4, "balanced"), "g"))
# [1,6) [6,12) [12,18) [18,20]
# 44 44 33 45
We see the groups are much better balanced.
"balanced"
in the call can in fact be replaced by a custom function to optimize or restrict the bins as desired if the method based on variance isn't enough.
equal_freq
from funModeling
takes a vector and the number of bins (based on equal frequency):
das <- data.frame(anim=1:15,
wt=c(181,179,180.5,201,201.5,245,246.4,
189.3,301,354,369,205,199,394,231.3))
das$wt_bin=funModeling::equal_freq(das$wt, 3)
table(das$wt_bin)
#[179,201) [201,246) [246,394]
# 5 5 5
You can also use the bin
function with method = "content"
from the OneR
package for that:
library(OneR)
das$wt_2 <- as.numeric(bin(das$wt, nbins = 3, method = "content"))
das
## anim wt wt_2
## 1 1 181.0 1
## 2 2 179.0 1
## 3 3 180.5 1
## 4 4 201.0 2
## 5 5 201.5 2
## 6 6 245.0 2
## 7 7 246.4 3
## 8 8 189.3 1
## 9 9 301.0 3
## 10 10 354.0 3
## 11 11 369.0 3
## 12 12 205.0 2
## 13 13 199.0 1
## 14 14 394.0 3
## 15 15 231.3 2
Without any extra package, 3 being the number of groups:
> findInterval(das$wt, unique(quantile(das$wt, seq(0, 1, length.out = 3 + 1))), rightmost.closed = TRUE)
[1] 1 1 1 2 2 2 3 1 3 3 3 2 1 3 2
You can speed up the quantile computation by using a representative sample of the values of interest. Double check the documentation of the FindInterval
function.
So, interestingly if you would like to cut the variable "wt", into sections with equal 3 sub-sections (i.e., 179-181, 181-183, etc); you could do this:
x<-table(as.matrix(cut(das$wt,breaks = ((max(das$wt)-min(das$wt))/3)),as.numeric(cut(das$wt,breaks = ((max(das$wt)-min(das$wt))/3)))))
Which gives the result, based on the dataset "das":
x
(179,182] (188,191] (197,200] (200,203] (203,206] (230,234] (243,246]
3 1 1 2 1 1 1
(246,249] (300,303] (352,355] (367,370] (391,394]
1 1 1 1 1
(that digit 3 in codes is an arbitrary item which can be changed by your interest.)
精彩评论