Calculating the Mode for Nominal as well as Continuous variables in [R]
Can anyone help me with this?
If I run:
> mode(iris$Species)
[1] "numeric"
> mode(iris$Sepal.Width)
[1] "numeric"
Then I get "numeric" as the answer.
Cheers
M
The function mode() is used to find out the storage mode of an object; in this case both objects are stored as mode "numeric" (a factor such as iris$Species is stored as integer codes, which is why it also reports "numeric"). This function is not used to find the most frequently observed value in a data set, i.e. it is not used to find the statistical mode. See ?mode for more on what this function does in R and why it isn't useful for your problem.
For discrete data, the mode is the most frequent observed value among the set:
> set.seed(1) ## reproducible example
> dat <- sample(1:5, 100, replace = TRUE) ## dummy data
> (tab <- table(dat)) ## tabulate the frequencies
dat
1 2 3 4 5
13 25 19 26 17
> which.max(tab) ## which is the mode?
4
4
> tab[which.max(tab)] ## what is the frequency of the mode?
4
26
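The tabulation steps above can be wrapped in a small function. This is only a sketch: stat_mode is a hypothetical name, not a base R function, and it returns every value tied for the highest count rather than only the first one.

```r
# Hypothetical helper (not part of base R): statistical mode of discrete data.
# Returns all values tied for the highest frequency, as character strings.
stat_mode <- function(x) {
  tab <- table(x)               # tabulate frequencies (NAs are dropped by default)
  names(tab)[tab == max(tab)]   # keep every value whose count equals the maximum
}

stat_mode(c("a", "b", "b", "c"))  # "b"
stat_mode(iris$Species)           # all three species: each occurs 50 times
```

Note that for iris$Species there is no single mode, since the three species are equally frequent.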
For continuous data, the mode is the value of the data at which the probability density function (PDF) reaches a maximum. As your data are generally a sample from some continuous probability distribution, we don't know the PDF but we can estimate it through a histogram or better through a kernel density estimate.
Returning to the iris data, here is an example of determining the mode from continuous data:
> sepalwd <- with(iris, density(Sepal.Width)) ## kernel density estimate
> plot(sepalwd)
> str(sepalwd)
List of 7
$ x : num [1:512] 1.63 1.64 1.64 1.65 1.65 ...
$ y : num [1:512] 0.000244 0.000283 0.000329 0.000379 0.000436 ...
$ bw : num 0.123
$ n : int 150
$ call : language density.default(x = Sepal.Width)
$ data.name: chr "Sepal.Width"
$ has.na : logi FALSE
- attr(*, "class")= chr "density"
> with(sepalwd, which.max(y)) ## which value has maximal density?
[1] 224
> with(sepalwd, x[which.max(y)]) ## use the above to find the mode
[1] 3.000314
See ?density for more info. By default, density() evaluates the kernel density estimate at n = 512 equally spaced locations. If this is too crude for you, increase the number of locations evaluated and returned:
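> sepalwd2 <- with(iris, density(Sepal.Width, n = 2048))
> with(sepalwd2, x[which.max(y)]) ## note: query sepalwd2, not sepalwd
For these data the finer grid should move the estimate only marginally from the value of about 3.0 found with the default settings.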
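The kernel density approach can be packaged the same way. The function below is a minimal sketch, with density_mode being a hypothetical name; it simply returns the grid location where the estimated density is highest, so the answer depends on the bandwidth and grid that density() uses.

```r
# Hypothetical helper: location of the highest peak of a kernel density estimate.
density_mode <- function(x, ...) {
  d <- density(x, ...)     # extra arguments (e.g. n, bw) are passed on to density()
  d$x[which.max(d$y)]      # grid point at which the estimated density is maximal
}

density_mode(iris$Sepal.Width)            # about 3.0 with the default settings
density_mode(iris$Sepal.Width, n = 2048)  # same estimate on a finer grid
```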
See ?mode: mode() is giving you the storage mode. If you want the value with the maximum count, then use table().
> Sample <- sample(letters[1:5], 50, replace = TRUE)
> tmp <- table(Sample)
> tmp
Sample
a b c d e
9 12 9 7 13
> tmp[which(tmp==max(tmp))]
e
13
Please, read the help files if a function is not doing what you think it should.
Some extra explanation:
max(tmp) is the maximum of tmp.
tmp == max(tmp) gives a logical vector with the same length as tmp, indicating whether each value is equal to max(tmp).
which(tmp == max(tmp)) returns the indices of the elements of that vector that are TRUE. You use these indices to select the value (or values, if there are ties) in tmp that equal the maximum.
See the help files ?which, ?max and the introductory manuals for R.
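To see each step in isolation, and how it differs from which.max() when counts are tied, here is a small illustration on made-up data (the vector below is an assumption chosen so that two values share the top count):

```r
tmp <- table(c("a", "a", "b", "b", "c"))  # "a" and "b" both occur twice

max(tmp)                     # 2, the highest count
tmp == max(tmp)              # TRUE for "a" and "b", FALSE for "c"
which(tmp == max(tmp))       # positions 1 and 2: every tied maximum
which.max(tmp)               # position 1 only: the first maximum
tmp[which(tmp == max(tmp))]  # the tied entries together with their counts
```

If ties matter for your data, the which(tmp == max(tmp)) form is the safer of the two.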
See ?mode : mode is giving you the storage mode.
If you want to know the mode of a continuous random variable, I recently released the package ModEstM. In addition to the method proposed by Gavin Simpson, it addresses the case of multimodal variables. For example, suppose you study the sample:
> x2 <- c(rbeta(1000, 23, 4), rbeta(1000, 4, 16))
This sample is clearly bimodal, and you get the answer:
> ModEstM::ModEstM(x2)
[[1]]
[1] 0.8634313 0.1752347
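For readers who want to see the idea in base R, the following is a rough sketch that looks for local maxima of the kernel density estimate. It is not the ModEstM implementation; density_peaks is a hypothetical name, and the result is sensitive to the bandwidth, so small spurious bumps can show up as extra peaks.

```r
# Hypothetical helper, not the ModEstM implementation: locations of the local
# maxima of a kernel density estimate, highest peak first.
density_peaks <- function(x, ...) {
  d <- density(x, ...)
  # a local maximum is where the slope changes from positive to negative
  idx <- which(diff(sign(diff(d$y))) == -2) + 1
  idx <- idx[order(d$y[idx], decreasing = TRUE)]  # sort peaks by density height
  d$x[idx]
}

set.seed(1)                                       # reproducible example
x2 <- c(rbeta(1000, 23, 4), rbeta(1000, 4, 16))   # same construction as above
head(density_peaks(x2), 2)                        # locations of the two highest peaks
```

Taking only the highest peaks with head() is a simple way to ignore minor wiggles in the tails of the estimate.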