Calculating the Mode for Nominal as well as Continuous variables in [R]

Can anyone help me with this?

If I run:

> mode(iris$Species)
[1] "numeric"
> mode(iris$Sepal.Width)
[1] "numeric"

Then I get "numeric" as the answer in both cases.

Cheers

M


The function mode() reports the storage mode of an object; here both columns are stored as mode "numeric" (a factor such as Species is held as integer codes). It does not find the most frequently observed value in a data set, i.e. it is not the statistical mode. See ?mode for more on what this function does in R and why it isn't useful for your problem.
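To make the distinction concrete, here is a small sketch comparing what mode(), class() and typeof() report for the two iris columns from the question. It uses only base R and the built-in iris data; the variable names sp and sw are just for illustration.

```r
# Compare how R describes the two iris columns
sp <- iris$Species      # nominal variable, stored as a factor
sw <- iris$Sepal.Width  # continuous variable

mode(sp)    # "numeric": factors are stored as integer codes
class(sp)   # "factor"
typeof(sp)  # "integer"

mode(sw)    # "numeric"
class(sw)   # "numeric"
typeof(sw)  # "double"
```

So mode() describes how the values are stored, which is why both columns give the same answer.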

For discrete data, the mode is the most frequent observed value among the set:

> set.seed(1) ## reproducible example
> dat <- sample(1:5, 100, replace = TRUE) ## dummy data
> (tab <- table(dat)) ## tabulate the frequencies
dat
 1  2  3  4  5 
13 25 19 26 17 
> which.max(tab) ## which is the mode?
4 
4 
> tab[which.max(tab)] ## what is the frequency of the mode?
 4 
26
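One caveat: which.max() returns only the first maximum, so ties are silently dropped. That matters for the nominal variable in the question, because each level of iris$Species occurs 50 times. A minimal sketch of a helper that returns every tied value follows; stat_mode() is a hypothetical name, not a base R function, and it returns the labels as character strings.

```r
# stat_mode() is a hypothetical helper name, not a base R function
stat_mode <- function(x) {
  tab <- table(x)               # frequency of each distinct value
  names(tab)[tab == max(tab)]   # labels of all values tied for the maximum
}

stat_mode(c("a", "b", "b", "c"))  # "b"
stat_mode(iris$Species)           # all three species, since each occurs 50 times
```

If you need the original type back (for example numeric values), convert the result yourself, e.g. with as.numeric().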

For continuous data, the mode is the value at which the probability density function (PDF) reaches its maximum. Because your data are generally a sample from some continuous probability distribution, the PDF is unknown, but it can be estimated with a histogram or, better, with a kernel density estimate.

Returning to the iris data, here is an example of determining the mode from continuous data:

> sepalwd <- with(iris, density(Sepal.Width)) ## kernel density estimate
> plot(sepalwd)
> str(sepalwd)
List of 7
 $ x        : num [1:512] 1.63 1.64 1.64 1.65 1.65 ...
 $ y        : num [1:512] 0.000244 0.000283 0.000329 0.000379 0.000436 ...
 $ bw       : num 0.123
 $ n        : int 150
 $ call     : language density.default(x = Sepal.Width)
 $ data.name: chr "Sepal.Width"
 $ has.na   : logi FALSE
 - attr(*, "class")= chr "density"
> with(sepalwd, which.max(y)) ## which value has maximal density?
[1] 224
> with(sepalwd, x[which.max(y)]) ## use the above to find the mode
[1] 3.000314

See ?density for more info. By default, density() evaluates the kernel density estimate at n = 512 equally spaced locations. If that grid is too coarse for you, increase the number of locations evaluated and returned:

> sepalwd2 <- with(iris, density(Sepal.Width, n = 2048))
> with(sepalwd2, x[which.max(y)])

In this case the finer grid should change the estimate only slightly: the peak stays close to 3.0, because its location is governed mainly by the bandwidth and not by the number of evaluation points.
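The steps above can be wrapped in a small function. This is only a sketch: density_mode() is a hypothetical name, and it assumes the data have a single dominant peak and that the default bandwidth chosen by density() is acceptable. Extra arguments (such as bw or na.rm) are passed through to density().

```r
# density_mode() is a hypothetical helper name, not part of base R
density_mode <- function(x, n = 512, ...) {
  d <- density(x, n = n, ...)   # kernel density estimate on a grid of n points
  d$x[which.max(d$y)]           # grid location where the estimated density peaks
}

m512  <- density_mode(iris$Sepal.Width)            # default grid
m2048 <- density_mode(iris$Sepal.Width, n = 2048)  # finer grid
c(m512, m2048)                                     # compare the two estimates
```

The returned value is always one of the grid points, so its precision is limited by the grid spacing, and the estimate itself depends on the bandwidth.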


See ?mode: mode() gives you the storage mode. If you want the value with the maximum count, use table().

> Sample <- sample(letters[1:5], 50, replace = TRUE)
> tmp <- table(Sample)
> tmp
Sample
 a  b  c  d  e 
 9 12  9  7 13 
> tmp[which(tmp==max(tmp))]
 e 
13 

Please read the help files if a function is not doing what you think it should.

Some extra explanation:

max(tmp) is the maximum of tmp.

tmp == max(tmp) gives a logical vector of the same length as tmp, indicating whether each value is equal to max(tmp).

which(tmp == max(tmp)) returns the indices of the elements that are TRUE. You use these indices to select the entries of tmp that have the maximum count; if several values are tied, all of them are returned.

See the help files ?which, ?max and the introductory manuals for R.


See ?mode: mode() gives you the storage mode.

If you want to know the mode of a continuous random variable, I recently released the package ModEstM. In addition to the method proposed by Gavin Simpson, it addresses the case of multimodal variables. For example, take the sample:

> x2 <- c(rbeta(1000, 23, 4), rbeta(1000, 4, 16))

This sample is clearly bimodal, and you get the answer:

> ModEstM::ModEstM(x2)
[[1]]
[1] 0.8634313 0.1752347
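For readers who want to see the idea without installing a package, here is a rough base R sketch that looks for local maxima of the kernel density estimate. It is not the ModEstM implementation: density_modes() and its min_rel_height argument are hypothetical names, the 10% height cut-off is an arbitrary assumption to suppress minor bumps, and flat plateaus are not detected.

```r
# density_modes() is a hypothetical helper, not the ModEstM implementation
density_modes <- function(x, min_rel_height = 0.1, ...) {
  d <- density(x, ...)
  # A local maximum is where the slope changes from positive to negative
  peaks <- which(diff(sign(diff(d$y))) == -2) + 1
  # Drop minor bumps whose height is below a fraction of the highest peak
  peaks <- peaks[d$y[peaks] >= min_rel_height * max(d$y)]
  # Return peak locations, highest density first
  d$x[peaks[order(d$y[peaks], decreasing = TRUE)]]
}

set.seed(1)  # reproducible example
x2 <- c(rbeta(1000, 23, 4), rbeta(1000, 4, 16))
density_modes(x2)  # expect one peak near each component's mode
```

As with the single-mode case, the result depends on the bandwidth: a larger bandwidth can merge nearby peaks, and a smaller one can create spurious ones.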