Recoding variables with R

2023-02-17 20:02

Recoding variables in R seems to be my biggest headache. What functions, packages, and processes do you use to ensure the best result?

I've found very few useful examples on the Internet that give a one-size-fits-all solution to recoding and I'm interested to see what you guys and gals are using.

Note: This may be a community wiki topic.

Recoding can mean a lot of things, and is fundamentally complicated.

Changing the levels of a factor can be done using the levels function:

> #change the levels of a factor
> levels(veteran$celltype) <- c("s","sc","a","l")
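Assigning to levels() is positional: the new names are matched to the existing levels in the order that levels() reports them, which is not necessarily the order in which the values appear in the data. A minimal sketch of renaming by name instead, using a made-up factor (celltype and new_names here are illustrative, not the actual veteran data):

```r
# Hypothetical factor standing in for a real column
celltype <- factor(c("squamous", "smallcell", "adeno", "large", "adeno"))

# factor() sorts levels alphabetically by default
levels(celltype)

# Map old level names to new ones by name, so the order does not matter
new_names <- c(squamous = "s", smallcell = "sc", adeno = "a", large = "l")
levels(celltype) <- unname(new_names[levels(celltype)])

as.character(celltype)
```

Checking levels() before a positional assignment, or mapping by name as above, avoids silently attaching the wrong label to a level.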

Transforming a continuous variable simply involves the application of a vectorized function:

> mtcars$mpg.log <- log(mtcars$mpg)

For binning continuous data, look at cut (base R) and cut2 (in the Hmisc package). For example:

> library(Hmisc)
> #make 4 groups with equal sample sizes
> mtcars[['mpg.tr']] <- cut2(mtcars[['mpg']], g=4)
> #make 4 groups with equal bin width
> mtcars[['mpg.tr2']] <- cut(mtcars[['mpg']],4, include.lowest=TRUE)
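When the cut points are known in advance, cut also accepts explicit breaks and labels. A small sketch with made-up values (the thresholds 14 and 24 and the label names are assumptions chosen for illustration):

```r
# Made-up mpg values that sit on and around the thresholds
mpg <- c(10.4, 14, 14.1, 24, 24.1, 33.9)

# Intervals are right-closed by default: (-Inf,14], (14,24], (24,Inf]
grp <- cut(mpg,
           breaks = c(-Inf, 14, 24, Inf),
           labels = c("low", "mid", "high"))

as.character(grp)
```

Because the intervals are right-closed by default, a value exactly equal to 14 falls in "low"; pass right = FALSE if the lower bound should be the inclusive one.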

For recoding continuous or factor variables into a categorical variable, there is recode in the car package and recode.variables in the Deducer package:

> library(Deducer)
> mtcars[c("mpg.tr2")] <- recode.variables(mtcars[c("mpg")] , "Lo:14 -> 'low';14:24 -> 'mid';else -> 'high';")

If you are looking for a GUI, Deducer implements recoding with the Transform and Recode dialogs:

http://www.deducer.org/pmwiki/pmwiki.php?n=Main.TransformVariables

http://www.deducer.org/pmwiki/pmwiki.php?n=Main.RecodeVariables

I found mapvalues from the plyr package very handy. The package also contains the function revalue, which is similar to car::recode.

The following example "recodes" the letters r, o, m, a and n to upper case:

> mapvalues(letters, from = c("r", "o", "m", "a", "n"), to = c("R", "O", "M", "A", "N"))
 [1] "A" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "M" "N" "O" "p" "q" "R" "s" "t" "u" "v" "w" "x" "y" "z"

I find car's recode very convenient when several values should be transformed (it's like doing recodes in Stata):

# load package and generate some data
library(car)
x <- 1:10

# do the recoding
x
## [1]   1   2   3   4   5   6   7   8   9  10

recode(x,"10=1; 9=2; 1:4=-99")
## [1] -99 -99 -99 -99   5   6   7   8   2   1

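I've found that it can sometimes be easier to convert non-numeric factors to character before attempting to change them, for example:

df <- data.frame(example=letters[1:26], stringsAsFactors=TRUE)
# convert the factor to character before editing values
example <- as.character(df$example)
example[example %in% letters[1:20]] <- "a"
example[example %in% letters[21:26]] <- "b"
# convert back to a factor once the values are final
df$example <- factor(example)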

Also, when importing data, it can be useful to ensure that numbers are actually numeric before attempting to convert:

df <- data.frame(example=1:100)
example <- as.numeric(df$example)
example[example < 20] <- 1
example[example >= 20 & example < 80] <- 2
example[example >= 80] <- 3
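The step-by-step assignments above can also be written as a single call to findInterval from base R. This is a sketch of an alternative rather than a replacement; the names by_steps and by_interval are made up for the comparison:

```r
example <- 1:100

# Step-by-step version, testing against the untouched original vector
by_steps <- example
by_steps[example < 20] <- 1
by_steps[example >= 20 & example < 80] <- 2
by_steps[example >= 80] <- 3

# findInterval() counts how many break points each value has reached:
# 0 below 20, 1 from 20 up to (not including) 80, 2 from 80 upward
by_interval <- findInterval(example, c(20, 80)) + 1

identical(by_steps, by_interval)
table(by_interval)
```

Testing the conditions against the original vector, instead of the vector being overwritten, also guards against a recoded value accidentally satisfying a later condition.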

When you want to recode levels of a factor, forcats might come in handy. You can read a chapter of R for Data Science for an extensive tutorial, but here is the gist of it.

library(tidyverse)
library(forcats)
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
                           "Republican, strong"    = "Strong republican",
                           "Republican, weak"      = "Not str republican",
                           "Independent, near rep" = "Ind,near rep",
                           "Independent, near dem" = "Ind,near dem",
                           "Democrat, weak"        = "Not str democrat",
                           "Democrat, strong"      = "Strong democrat",
                           "Other"                 = "No answer",
                           "Other"                 = "Don't know",
                           "Other"                 = "Other party"
  )) %>%
  count(partyid)
#> # A tibble: 8 × 2
#>                 partyid     n
#>                  <fctr> <int>
#> 1                 Other   548
#> 2    Republican, strong  2314
#> 3      Republican, weak  3032
#> 4 Independent, near rep  1791
#> 5           Independent  4119
#> 6 Independent, near dem  2499
#> # ... with 2 more rows

You can even let R decide what categories (factor levels) to merge together.

Sometimes you just want to lump together all the small groups to make a plot or table simpler. That’s the job of fct_lump(). [...] The default behaviour is to progressively lump together the smallest groups, ensuring that the aggregate is still the smallest group.

gss_cat %>%
  mutate(relig = fct_lump(relig)) %>%
  count(relig)
#> # A tibble: 2 × 2
#>        relig     n
#>       <fctr> <int>
#> 1 Protestant 10846
#> 2      Other 10637

Consider this sample data.

df <- data.frame(a = 1:5, b = 5:1)
df
#  a b
#1 1 5
#2 2 4
#3 3 3
#4 4 2
#5 5 1

Here are two options -

1. case_when :

For single column -

library(dplyr)

df %>%
  mutate(a = case_when(a == 1 ~ 'a', 
                       a == 2 ~ 'b', 
                       a == 3 ~ 'c', 
                       a == 4 ~ 'd', 
                       a == 5 ~ 'e'))

#  a b
#1 a 5
#2 b 4
#3 c 3
#4 d 2
#5 e 1

For multiple columns -

df %>%
  mutate(across(c(a, b), ~case_when(. == 1 ~ 'a', 
                                    . == 2 ~ 'b', 
                                    . == 3 ~ 'c', 
                                    . == 4 ~ 'd', 
                                    . == 5 ~ 'e')))

#  a b
#1 a e
#2 b d
#3 c c
#4 d b
#5 e a

2. dplyr::recode :

For single column -

df %>%
  mutate(a = recode(a, '1' = 'a', '2' = 'b', '3' = 'c', '4' = 'd', '5' = 'e'))

For multiple columns -

df %>%
  mutate(across(c(a, b), 
         ~recode(., '1' = 'a', '2' = 'b', '3' = 'c', '4' = 'd', '5' = 'e')))

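Create a lookup vector using setNames, then match on name. Convert the factor to character first, because indexing with a factor uses its integer codes rather than its labels:

# iris as example data
table(iris$Species)
# setosa versicolor  virginica 
#     50         50         50

x <- setNames(c("x","y","z"), c("setosa","versicolor","virginica"))
iris$Species <- unname(x[ as.character(iris$Species) ])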

table(iris$Species)
#  x  y  z 
# 50 50 50
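One thing to watch with lookup vectors: subsetting with a factor uses the factor's integer codes, not its labels, so the result only matches on name when the lookup happens to be in the same order as the levels. A small sketch with made-up data (species and lookup are illustrative names) where the lookup is deliberately in a different order:

```r
species <- factor(c("setosa", "virginica", "versicolor", "setosa"))

# Lookup deliberately NOT in the same order as levels(species)
lookup <- c(virginica = "z", setosa = "x", versicolor = "y")

# Indexing with the factor uses its integer codes (1, 3, 2, 1)
by_code <- unname(lookup[species])

# Indexing with the labels matches on name, which is what we want
by_name <- unname(lookup[as.character(species)])

by_code
by_name
```

Only by_name gives the intended mapping here, so wrapping the factor in as.character() is the safer habit.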
