Recoding variables with R

Recoding variables in R seems to be my biggest headache. Which functions, packages, and processes do you use to ensure the best result?

I've found very few useful examples on the Internet that give a one-size-fits-all solution to recoding, and I'm interested to see what you guys and gals are using.

Note: This may be a community wiki topic.


Recoding can mean a lot of things, and is fundamentally complicated.

Changing the levels of a factor can be done using the levels function:

> #change the levels of a factor (veteran is a data set in the survival package)
> levels(veteran$celltype) <- c("s","sc","a","l")

Transforming a continuous variable simply involves the application of a vectorized function:

> mtcars$mpg.log <- log(mtcars$mpg) 

For binning continuous data, look at cut and cut2 (in the Hmisc package). For example:

> #make 4 groups with equal sample sizes
> mtcars[['mpg.tr']] <- cut2(mtcars[['mpg']], g=4)
> #make 4 groups with equal bin width
> mtcars[['mpg.tr2']] <- cut(mtcars[['mpg']],4, include.lowest=TRUE)
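
When the cut points are known in advance, cut also accepts explicit breaks and labels. A minimal sketch in base R; the break points 14 and 24 and the name mpg.grp are chosen here purely for illustration:

```r
# Bin mpg at illustrative cut points (14 and 24 are arbitrary choices)
mpg.grp <- cut(mtcars$mpg,
               breaks = c(-Inf, 14, 24, Inf),      # intervals are (lo, hi] by default
               labels = c("low", "mid", "high"))
table(mpg.grp)
```

Using -Inf and Inf as the outer breaks means no value can fall outside the bins and turn into NA.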

For recoding continuous or factor variables into a categorical variable, there is recode in the car package and recode.variables in the Deducer package:

> mtcars[c("mpg.tr2")] <- recode.variables(mtcars[c("mpg")] , "Lo:14 -> 'low';14:24 -> 'mid';else -> 'high';")

If you are looking for a GUI, Deducer implements recoding with the Transform and Recode dialogs:

http://www.deducer.org/pmwiki/pmwiki.php?n=Main.TransformVariables

http://www.deducer.org/pmwiki/pmwiki.php?n=Main.RecodeVariables


I have found mapvalues from the plyr package very handy. The package also contains the function revalue, which is similar to car::recode.

The following example recodes the letters r, o, m, a and n to upper case and leaves everything else unchanged:

> mapvalues(letters, from = c("r", "o", "m", "a", "n"), to = c("R", "O", "M", "A", "N"))
 [1] "A" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "M" "N" "O" "p" "q" "R" "s" "t" "u" "v" "w" "x" "y" "z"


I find this very convenient when several values should be transformed (it's like doing recodes in Stata):

# load package and gen some data
require(car)
x <- 1:10

# do the recoding
x
## [1]   1   2   3   4   5   6   7   8   9  10

recode(x,"10=1; 9=2; 1:4=-99")
## [1] -99 -99 -99 -99   5   6   7   8   2   1


I've found that it can sometimes be easier to convert non-numeric factors to character before attempting to change them, for example:

df <- data.frame(example=letters[1:26]) 
example <- as.character(df$example)
example[example %in% letters[1:20]] <- "a"
example[example %in% letters[21:26]] <- "b"
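
If the result should remain a factor, base R can also merge levels directly by assigning a named list to levels. This is only a sketch of that alternative; the variable name f is made up for the example:

```r
f <- factor(letters[1:26])
# Each list name is a new level; its value lists the old levels merged into it
levels(f) <- list(a = letters[1:20], b = letters[21:26])
table(f)
```

Any old level not mentioned in the list becomes NA, so every level should appear somewhere in it.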

Also, when importing data, it can be useful to ensure that numbers are actually numeric before attempting to convert:

df <- data.frame(example=1:100)
example <- as.numeric(df$example)
example[example < 20] <- 1
example[example >= 20 & example < 80] <- 2
example[example >= 80] <- 3
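
The same three-way split can be written in one step with findInterval from base R. This is a sketch of an alternative, not a replacement for the approach above; grp is a name chosen for illustration:

```r
example <- 1:100
# findInterval gives 0 below 20, 1 for [20, 80) and 2 from 80 upward;
# adding 1 yields the codes 1, 2 and 3 used above
grp <- findInterval(example, c(20, 80)) + 1
table(grp)
```

Because the result goes into a new vector, later conditions are never evaluated against values that an earlier assignment has already recoded.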


When you want to recode levels of a factor, forcats might come in handy. You can read a chapter of R for Data Science for an extensive tutorial, but here is the gist of it.

library(tidyverse)
library(forcats)
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
                           "Republican, strong"    = "Strong republican",
                           "Republican, weak"      = "Not str republican",
                           "Independent, near rep" = "Ind,near rep",
                           "Independent, near dem" = "Ind,near dem",
                           "Democrat, weak"        = "Not str democrat",
                           "Democrat, strong"      = "Strong democrat",
                           "Other"                 = "No answer",
                           "Other"                 = "Don't know",
                           "Other"                 = "Other party"
  )) %>%
  count(partyid)
#> # A tibble: 8 × 2
#>                 partyid     n
#>                  <fctr> <int>
#> 1                 Other   548
#> 2    Republican, strong  2314
#> 3      Republican, weak  3032
#> 4 Independent, near rep  1791
#> 5           Independent  4119
#> 6 Independent, near dem  2499
#> # ... with 2 more rows

You can even let R decide what categories (factor levels) to merge together.

Sometimes you just want to lump together all the small groups to make a plot or table simpler. That’s the job of fct_lump(). [...] The default behaviour is to progressively lump together the smallest groups, ensuring that the aggregate is still the smallest group.

gss_cat %>%
  mutate(relig = fct_lump(relig)) %>%
  count(relig)
#> # A tibble: 2 × 2
#>        relig     n
#>       <fctr> <int>
#> 1 Protestant 10846
#> 2      Other 10637


Consider this sample data.

df <- data.frame(a = 1:5, b = 5:1)
df
#  a b
#1 1 5
#2 2 4
#3 3 3
#4 4 2
#5 5 1

Here are two options -

1. case_when :

For single column -

library(dplyr)

df %>%
  mutate(a = case_when(a == 1 ~ 'a', 
                       a == 2 ~ 'b', 
                       a == 3 ~ 'c', 
                       a == 4 ~ 'd', 
                       a == 5 ~ 'e'))

#  a b
#1 a 5
#2 b 4
#3 c 3
#4 d 2
#5 e 1

For multiple columns -

df %>%
  mutate(across(c(a, b), ~case_when(. == 1 ~ 'a', 
                                    . == 2 ~ 'b', 
                                    . == 3 ~ 'c', 
                                    . == 4 ~ 'd', 
                                    . == 5 ~ 'e')))

#  a b
#1 a e
#2 b d
#3 c c
#4 d b
#5 e a

2. dplyr::recode :

For single column -

df %>%
  mutate(a = recode(a, '1' = 'a', '2' = 'b', '3' = 'c', '4' = 'd', '5' = 'e'))

For multiple columns -

df %>%
  mutate(across(c(a, b), 
         ~recode(., '1' = 'a', '2' = 'b', '3' = 'c', '4' = 'd', '5' = 'e')))


Create a lookup vector using setNames, then match on name:

# iris as an example data
table(iris$Species)
# setosa versicolor  virginica 
#     50         50         50

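x <- setNames(c("x","y","z"), c("setosa","versicolor","virginica"))
# index by character so the lookup matches on name; a factor would index by its integer codes
iris$Species <- x[as.character(iris$Species)]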

table(iris$Species)
#  x  y  z 
# 50 50 50 
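
A lookup vector returns NA for any value it does not contain. If only some values should be recoded, the originals can be filled back in. A minimal sketch on a fresh copy of iris; the names lookup, species and recoded are chosen here for illustration:

```r
# Only two of the three species are in the lookup (virginica is left out on purpose)
lookup <- setNames(c("x", "y"), c("setosa", "versicolor"))
species <- as.character(iris$Species)
recoded <- unname(lookup[species])          # names missing from the lookup give NA
missing <- is.na(recoded)
recoded[missing] <- species[missing]        # keep the original value where nothing matched
table(recoded)
```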