Calculate Mean of a column in R having non numeric values

I have a column which contains numeric as well as non-numeric values. I want to find the mean of the numeric values so that I can use it to replace the non-numeric values. How can this be done in R?


Say your data frame is named df and the column you want to "fix" is called df$x. You could do the following.

If the column is a factor, you first have to convert it to character and then to numeric. This will give you NAs for all the strings that cannot be coerced to numbers.

nums <- as.numeric(as.character(df$x))

As Richie Cotton pointed out, there is a "more efficient, but harder to remember" way to convert factors to numeric

nums <- as.numeric(levels(df$x))[as.integer(df$x)]
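To see why the detour through character (or through the levels) is needed, here is a small sketch on a made-up factor f. Calling as.numeric() directly on a factor returns its internal integer codes, not the values, while the two conversions shown above agree with each other.

```r
# Hypothetical factor mixing numeric-looking strings and text.
f <- factor(c("10", "20", "abc", "10"))

# Direct conversion returns the internal level codes, not 10, 20, ...
codes <- as.numeric(f)

# Both conversions from the answer; the coercion warning is expected here.
a <- suppressWarnings(as.numeric(as.character(f)))
b <- suppressWarnings(as.numeric(levels(f))[as.integer(f)])

print(codes)
print(a)
print(identical(a, b))
```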

To get the mean, use mean() and pass na.rm = TRUE (spelled out, since T is an ordinary variable that can be reassigned):

m <- mean(nums, na.rm = TRUE)

Assign the mean to all the NA values.

nums[is.na(nums)] <- m

You could then replace the old data, but I don't recommend it. Instead just add a new column

df$new.x <- nums
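Putting the steps above together, here is a minimal end-to-end sketch. The data frame and its values are made up for illustration; only the names df, x and new.x come from the answer. stringsAsFactors = TRUE is set explicitly because R 4.0.0 and later no longer create factors by default.

```r
# Hypothetical example data: a column mixing numbers and text.
df <- data.frame(x = c("1", "2", "abc", "4", "n/a"), stringsAsFactors = TRUE)

# Convert factor -> character -> numeric; non-numeric strings become NA.
# suppressWarnings() hides the expected "NAs introduced by coercion" warning.
nums <- suppressWarnings(as.numeric(as.character(df$x)))

# Mean of the values that converted successfully.
m <- mean(nums, na.rm = TRUE)

# Replace the NAs with that mean and store the result in a new column,
# leaving the original column untouched.
nums[is.na(nums)] <- m
df$new.x <- nums

print(df)
```

Note that any value that was already NA in the original column is filled with the mean as well, since the replacement step cannot tell the two cases apart.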


This is a function I wrote yesterday to combat the non-numeric types. I have a data.frame with unpredictable type for each column. I want to calculate the means for numeric, and leave everything else untouched.

colMeans2 <- function(x) {
    # This function tries to guess column type. Since all columns come as
    # characters, it first tries to see if x == "TRUE" or "FALSE". If
    # not so, it tries to coerce vector into integer. If that doesn't 
    # work it tries to see if there's a ' \" ' in the vector (meaning a
    # column with character), it uses that as a result. Finally if nothing
    # else passes, it means the column type is numeric, and it calculates
    # the mean of that. The end.

    # try if logical; compare the values themselves, because apply() passes
    # character vectors, for which levels(x) is NULL
    if (any(x %in% c("TRUE", "FALSE"))) return(NA)

    # try if integer
    try.int <- strtoi(x)
    if (all(!is.na(try.int)))  return(try.int[1])

    # try if character
    if (any(grepl("\\\"", x))) return(x[1])

    # what's left is numeric
    mean(as.numeric(as.character(x)), na.rm = TRUE)
    # a possible warning about coerced NAs probably originates in the above line
}

You would use it like so:

apply(X = your.dataframe, MARGIN = 2, FUN = colMeans2)
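The remark that all columns arrive as characters follows from how apply() works: it first converts the data frame to a matrix, and a matrix built from a data frame with any text column is a character matrix. A small sketch, using a made-up data frame named demo:

```r
# Hypothetical data frame with one numeric, one text and one logical column.
demo <- data.frame(num = c(1.5, 2.5), txt = c("x", "y"), flag = c(TRUE, FALSE))

# Classes as stored in the data frame.
print(sapply(demo, class))

# Classes as seen by a function called through apply(): all character.
seen <- apply(demo, 2, class)
print(seen)
```

This is why the function has to guess the type from the text of each column rather than rely on is.numeric().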


It sort of depends on what your data looks like.

Does it look like this?

data = list(1, 2, 'new jersey')

Then you could

data.numbers = sapply(data, as.numeric)

and get

c(1, 2, NA)

And you can find the mean with

mean(data.numbers, na.rm = TRUE)
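For reference, the same steps as one runnable sketch. The suppressWarnings() call is an addition, not part of the answer; it only hides the coercion warning that as.numeric() raises for 'new jersey'.

```r
data <- list(1, 2, "new jersey")

# Each element is converted on its own; the text element becomes NA.
data.numbers <- suppressWarnings(sapply(data, as.numeric))
print(data.numbers)

# Mean of the two values that converted successfully.
avg <- mean(data.numbers, na.rm = TRUE)
print(avg)  # 1.5
```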


A compact conversion:

  vec <- c(0:10,"a","z")
  vec2 <- (as.numeric(vec))
  vec2[is.na(vec2)] <- mean(vec2[!is.na(vec2)])

as.numeric() emits a warning like the one below and converts the non-numeric values to NA.

Warning message:
In mean(as.numeric(vec)) : NAs introduced by coercion
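Before overwriting anything, it can help to check which entries actually failed to convert. A minimal sketch on the same vec; the helper name bad is made up for illustration:

```r
vec <- c(0:10, "a", "z")

# The coercion warning is expected, so it is silenced here.
vec2 <- suppressWarnings(as.numeric(vec))

# Entries that were not NA originally but became NA during conversion.
bad <- is.na(vec2) & !is.na(vec)
print(vec[bad])  # "a" "z"

# Replace only those entries with the mean of the converted values.
vec2[bad] <- mean(vec2, na.rm = TRUE)
print(vec2)
```

Unlike a plain is.na() test, this leaves values that were already missing in the input as NA.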