Why do column names get concatenated into the row output of a linear model summary?

2023-03-27 18:22

I've never noticed this behavior before, but I'm surprised at the output naming conventions for linear model summaries. My question, essentially, is why row names in a linear model summary always seem to carry the name of the column they came from.

An example

Suppose you had some data for 300 movie audience members from three different cities:

Chicago
Milwaukee
Dayton

And suppose all of them were subjected to the stinking pile of confusing, contaminated waste that was Spider-Man 3. After enduring the entirety of that cinematic abomination, they were asked to rate the movie on a 100-point scale.

Because all of the audience members were reasonable human beings, the ratings were all below zero. (Naturally. Anyone who's seen the movie would agree.)

Here's what that might look like in R:

> score <- rnorm(n = 300, mean = -50, sd = 10)
> city  <- rep(c("Chicago", "Milwaukee", "Dayton"), times = 100)
> spider.man.3.sucked <- data.frame(score, city)
> head(spider.man.3.sucked)
      score      city
1 -64.57515   Chicago
2 -50.51050 Milwaukee
3 -56.51409    Dayton
4 -45.55133   Chicago
5 -47.88686 Milwaukee
6 -51.22812    Dayton

Great. So let's run a quick linear model, assign it to lm1, and get its summary output:

> lm1 <- lm(score ~ city, data = spider.man.3.sucked)
> summary(lm1)

Call:
lm(formula = score ~ city, data = spider.man.3.sucked)

Residuals:
     Min       1Q   Median       3Q      Max 
-29.8515  -6.1090  -0.4745   6.0340  26.2616 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -51.3621     0.9630 -53.337   <2e-16 ***
cityDayton      1.1892     1.3619   0.873    0.383    
cityMilwaukee   0.8288     1.3619   0.609    0.543    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 9.63 on 297 degrees of freedom
Multiple R-squared: 0.002693,   Adjusted R-squared: -0.004023 
F-statistic: 0.4009 on 2 and 297 DF,  p-value: 0.6701

What's bugging me

The part I want to highlight is this:

cityDayton      1.1892     1.3619   0.873    0.383    
cityMilwaukee   0.8288     1.3619   0.609    0.543

It looks like R sensibly concatenated the column name (city, if you remember from above) with the distinct value (in this case either Dayton or Milwaukee). If I don't want R to output in that format, is there any way to override it? For example, in my case all I'd need is simply:

Dayton      1.1892     1.3619   0.873    0.383    
Milwaukee   0.8288     1.3619   0.609    0.543

Two questions in one

So,

1) What's controlling the format of the output for linear model summary rows, and
2) Can/should I change it?

The extractor function for that component of a summary object is coef. Does this give you acceptable control over the output?

summ <- summary(lm1)
csumm <- coef(summ)
rownames(csumm) <- sub("^city", "", rownames(csumm))
print(csumm[-1,], digits=4)
#           Estimate Std. Error t value Pr(>|t|)
# Dayton      0.8133      1.485  0.5478   0.5842
# Milwaukee   0.3891      1.485  0.2621   0.7934

(No random seed was set, so I cannot match your values.)

For 1) it appears to happen inside model.matrix.default() and inside internal R compiled code for that matter.
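
To see where the names come from, here is a minimal sketch comparing the column names of the design matrix with the coefficient names of the fitted model. The seed and the object names df, X, and fit are my own choices for illustration, not from the question.

```r
# Arbitrary seed, only so the sketch is reproducible
set.seed(1)
df <- data.frame(
  score = rnorm(n = 300, mean = -50, sd = 10),
  city  = rep(c("Chicago", "Milwaukee", "Dayton"), times = 100)
)

# The design matrix is where "city" and the level name get pasted together
X <- model.matrix(score ~ city, data = df)
colnames(X)
# "(Intercept)"   "cityDayton"    "cityMilwaukee"

# lm() simply inherits those column names for its coefficients
fit <- lm(score ~ city, data = df)
identical(names(coef(fit)), colnames(X))
# TRUE
```

Chicago does not appear because it is the reference level (the first level alphabetically) under the default treatment contrasts.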

It might be difficult to change this easily. The obvious way would be to write your own wrapper around model.matrix.default() that calls the original and updates the names afterwards. But I haven't tested or tried this.
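
A lighter-weight variation on the same idea, without touching R's internals, is to build the design matrix yourself, rename its columns, and fit on the renamed dummy columns. This is only a sketch under my own assumptions: the seed and the names df, X, d, fit_renamed, and fit_default are hypothetical, and it assumes the level names are already valid column names (no spaces or special characters).

```r
# Arbitrary seed, only so the sketch is reproducible
set.seed(1)
df <- data.frame(
  score = rnorm(n = 300, mean = -50, sd = 10),
  city  = rep(c("Chicago", "Milwaukee", "Dayton"), times = 100)
)

# Build the dummy columns, then strip the "city" prefix from their names
X <- model.matrix(~ city, data = df)
colnames(X) <- sub("^city", "", colnames(X))

# Drop the intercept column (lm adds its own) and fit on the renamed dummies
d <- data.frame(score = df$score, X[, -1, drop = FALSE])
fit_renamed <- lm(score ~ ., data = d)
names(coef(fit_renamed))
# "(Intercept)" "Dayton"      "Milwaukee"

# Same estimates as the ordinary factor-based fit, only the labels differ
fit_default <- lm(score ~ city, data = df)
all.equal(unname(coef(fit_renamed)), unname(coef(fit_default)))
# TRUE
```

The trade-off is that the refitted model no longer knows that city is a factor, so anything that relies on the factor structure (for example predict() on new data that has a city column) needs the same manual dummy coding.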

Here is a hack

# RUN REGRESSION
# The tips data set ships with reshape2 (older ggplot2 versions pulled it in)
data(tips, package = "reshape2")
lm1 = lm(tip ~ total_bill + sex + day, data = tips)

# FUNCTION TO REMOVE FACTOR NAMES FROM MODEL SUMMARY
remove_factors = function(mod){
   mydf = mod$model    
   # PREPARE VECTOR OF VARIABLES WITH REPETITIONS = UNIQUE FACTOR LEVELS - 1
   vars  = names(mod$model)[-1]
   eachlen = sapply(mydf[, vars, drop = FALSE], function(x) 
     ifelse(is.numeric(x), 1, length(unique(x)) - 1))        
   vars = rep(vars, eachlen)

   # STRIP THE VARIABLE NAME FROM EACH COEF NAME; KEEP NUMERIC NAMES AS THEY ARE
   # (use the model passed in, not the global lm1)
   coefs = names(mod$coefficients)[-1]
   coefs2 = stringr::str_replace(coefs, vars, "")
   names(mod$coefficients)[-1] = ifelse(coefs2 == "", coefs, coefs2)

   return(mod)
}

summary(remove_factors(lm1))

This gives

              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.95588    0.27579    3.47  0.00063 ***
total_bill   0.10489    0.00758   13.84  < 2e-16 ***
Male        -0.03844    0.14215   -0.27  0.78706    
Sat         -0.08088    0.26226   -0.31  0.75806    
Sun          0.08282    0.26741    0.31  0.75706    
Thur        -0.02063    0.26975   -0.08  0.93910

However, this is not always advisable, as you can see from running the same hack on a different regression. In the output below, it is not clear what Yes in the last row stands for. R by default writes it as smokerYes to make its meaning clear. So use with caution.

lm2 = lm(tip ~ total_bill + sex + day + smoker, data = tips)
summary(remove_factors(lm2))

              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.05182    0.29315    3.59  0.00040 ***
total_bill   0.10569    0.00763   13.86  < 2e-16 ***
Male        -0.03769    0.14217   -0.27  0.79114    
Sat         -0.12636    0.26648   -0.47  0.63582    
Sun          0.00407    0.27959    0.01  0.98841    
Thur        -0.09283    0.27994   -0.33  0.74048    
Yes         -0.13935    0.14422   -0.97  0.33489
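
If the bare level names are too ambiguous, a middle ground is to keep the factor name but separate it from the level. The sketch below is my own assumption rather than anything built into R: relabel_levels() is a hypothetical helper, it uses the built-in warpbreaks data instead of tips so that it runs without extra packages, and it only relabels main-effect terms under the default treatment contrasts (interaction terms are left untouched).

```r
# Built-in data: wool has levels A, B; tension has levels L, M, H
fit <- lm(breaks ~ wool + tension, data = warpbreaks)

# Hypothetical helper: turn "woolB" into "wool: B" in the coefficient table
relabel_levels <- function(mod, sep = ": ") {
  tab <- coef(summary(mod))
  xl  <- mod$xlevels  # factor levels recorded by lm()
  old <- unlist(lapply(names(xl), function(f) paste0(f, xl[[f]])))
  new <- unlist(lapply(names(xl), function(f) paste0(f, sep, xl[[f]])))
  idx <- match(rownames(tab), old)
  hit <- !is.na(idx)
  # Rename only the rows that match a factor/level pair exactly
  rownames(tab)[hit] <- new[idx[hit]]
  tab
}

rownames(relabel_levels(fit))
# "(Intercept)" "wool: B"     "tension: M"  "tension: H"

# Print the relabelled table in the usual summary style
printCoefmat(relabel_levels(fit))
```

This keeps the information that R's default naming provides while making the boundary between variable and level visible.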
