Why do column names get concatenated into the row output of a linear model summary?

I've never noticed this behavior before, but I'm surprised at the output naming conventions for linear model summaries. My question, essentially, is why row names in a linear model summary always seem to carry the name of the column they came from.

An example

Suppose you had some data for 300 movie audience members from three different cities:

  • Chicago
  • Milwaukee
  • Dayton

And suppose all of them were subjected to the stinking pile of confusing, contaminated waste that was Spider-Man 3. After enduring the entirety of that cinematic abomination, they were asked to rate the movie on a 100-point scale.

Because all of the audience members were reasonable human beings, the ratings were all below zero. (Naturally. Anyone who's seen the movie would agree.)

Here's what that might look like in R:

> score <- rnorm(n = 300, mean = -50, sd = 10)
> city  <- rep(c("Chicago", "Milwaukee", "Dayton"), times = 100)
> spider.man.3.sucked <- data.frame(score, city)
> head(spider.man.3.sucked)
      score      city
1 -64.57515   Chicago
2 -50.51050 Milwaukee
3 -56.51409    Dayton
4 -45.55133   Chicago
5 -47.88686 Milwaukee
6 -51.22812    Dayton

Great. So let's run a quick linear model, assign it to lm1, and get its summary output:

> lm1 <- lm(score ~ city, data = spider.man.3.sucked)
> summary(lm1)

Call:
lm(formula = score ~ city, data = spider.man.3.sucked)

Residuals:
     Min       1Q   Median       3Q      Max 
-29.8515  -6.1090  -0.4745   6.0340  26.2616 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -51.3621     0.9630 -53.337   <2e-16 ***
cityDayton      1.1892     1.3619   0.873    0.383    
cityMilwaukee   0.8288     1.3619   0.609    0.543    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 9.63 on 297 degrees of freedom
Multiple R-squared: 0.002693,   Adjusted R-squared: -0.004023 
F-statistic: 0.4009 on 2 and 297 DF,  p-value: 0.6701

What's bugging me

The part I want to highlight is this:

cityDayton      1.1892     1.3619   0.873    0.383    
cityMilwaukee   0.8288     1.3619   0.609    0.543    

It looks like R sensibly concatenated the column name (city, if you remember from above) with the distinct value (in this case either Dayton or Milwaukee). If I don't want R to output in that format, is there any way to override it? For example, in my case all I'd need is simply:

Dayton      1.1892     1.3619   0.873    0.383    
Milwaukee   0.8288     1.3619   0.609    0.543    

Two questions in one

So,

  1. What's controlling the format of the output for linear model summary rows, and
  2. Can/should I change it?


The extractor function for the coefficient table of a summary object is coef. Does this give you acceptable control over the output?

summ <- summary(lm1)
csumm <- coef(summ)
rownames(csumm) <- sub("^city", "", rownames(csumm))
print(csumm[-1,], digits=4)
#           Estimate Std. Error t value Pr(>|t|)
# Dayton      0.8133      1.485  0.5478   0.5842
# Milwaukee   0.3891      1.485  0.2621   0.7934

(No random seed was set in the question, so I cannot reproduce your values.)
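If you also want the significance stars back, one option is to pass the renamed matrix to printCoefmat(), which is the function summary's print method uses for this table. The following is only a sketch: the seed is an arbitrary assumption added for reproducibility, so the numbers will differ from those in the question.

```r
# Arbitrary seed (an assumption, not from the question) for reproducibility
set.seed(42)
score <- rnorm(n = 300, mean = -50, sd = 10)
city  <- rep(c("Chicago", "Milwaukee", "Dayton"), times = 100)
lm1   <- lm(score ~ city, data = data.frame(score, city))

# Extract the coefficient matrix and strip the leading "city" from row names
csumm <- coef(summary(lm1))
rownames(csumm) <- sub("^city", "", rownames(csumm))

# printCoefmat() formats the matrix with p-value formatting and stars
printCoefmat(csumm, digits = 4)
```

Note that this only changes what is printed; the names stored in lm1 itself are untouched.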


For 1), the naming appears to happen inside model.matrix.default(), and within R's internal compiled code at that.
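You can see the names being formed without fitting anything. In this small sketch (the six-element factor is made up for illustration), the design-matrix column names are the variable name pasted onto the column names of the factor's contrast matrix:

```r
# A toy factor with the same three levels as in the question
city <- factor(rep(c("Chicago", "Milwaukee", "Dayton"), times = 2))

# Default treatment contrasts: one column per non-reference level
colnames(contrasts(city))
# "Dayton" "Milwaukee"

# model.matrix() prefixes each contrast column with the variable name
mm <- model.matrix(~ city)
colnames(mm)
# "(Intercept)" "cityDayton" "cityMilwaukee"
```

lm() takes its coefficient names from these columns, which is why they show up in the summary.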

It might be difficult to change this easily. The obvious way would be to write your own wrapper around model.matrix.default() that calls the original and updates the names afterwards, but I haven't tried or tested this.
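A lighter variant of the same idea, which does not override anything, is to build the design matrix yourself, rename its columns, and hand it to lm.fit(). This is a sketch under assumptions: the seed and the data frame name d are made up, and lm.fit() returns a plain list rather than an "lm" object, so summary() will not give the usual table.

```r
# Simulated data in the shape of the question (seed is an assumption)
set.seed(1)
d <- data.frame(
  score = rnorm(300, mean = -50, sd = 10),
  city  = rep(c("Chicago", "Milwaukee", "Dayton"), times = 100)
)

# Build the design matrix, then drop the "city" prefix from its columns
X <- model.matrix(score ~ city, data = d)
colnames(X) <- sub("^city", "", colnames(X))

# lm.fit() names the coefficients after the columns of X
fit <- lm.fit(X, d$score)
names(fit$coefficients)
# "(Intercept)" "Dayton" "Milwaukee"
```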


Here is a hack

# RUN REGRESSION
data(tips, package = "reshape2")  # the tips data set ships with the reshape2 package
lm1 = lm(tip ~ total_bill + sex + day, data = tips)

# FUNCTION TO REMOVE FACTOR NAMES FROM MODEL SUMMARY
remove_factors = function(mod){
   mydf = mod$model    
   # PREPARE VECTOR OF VARIABLES WITH REPETITIONS = UNIQUE FACTOR LEVELS
   vars  = names(mod$model)[-1]
   eachlen = sapply(mydf[, vars, drop = FALSE], function(x) 
     if (is.numeric(x)) 1 else length(unique(x)) - 1)
   vars = rep(vars, eachlen)

   # REPLACE COEF NAMES WITH VARIABLE NAME WHEN APPROPRIATE
   coefs = names(mod$coefficients)[-1]   # use the model passed in, not the global lm1
   coefs2 = stringr::str_replace(coefs, vars, "")
   names(mod$coefficients)[-1] = ifelse(coefs2 == "", coefs, coefs2)

   return(mod)
}

summary(remove_factors(lm1))

This gives

              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.95588    0.27579    3.47  0.00063 ***
total_bill   0.10489    0.00758   13.84  < 2e-16 ***
Male        -0.03844    0.14215   -0.27  0.78706    
Sat         -0.08088    0.26226   -0.31  0.75806    
Sun          0.08282    0.26741    0.31  0.75706    
Thur        -0.02063    0.26975   -0.08  0.93910 

However, this is not always advisable, as you can see from running the same hack on a different regression. In the output below it is not clear what the Yes in the last row stands for. By default R writes it as smokerYes to make its meaning clear. So use with caution.

lm2 = lm(tip ~ total_bill + sex + day + smoker, data = tips)
summary(remove_factors(lm2))

              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.05182    0.29315    3.59  0.00040 ***
total_bill   0.10569    0.00763   13.86  < 2e-16 ***
Male        -0.03769    0.14217   -0.27  0.79114    
Sat         -0.12636    0.26648   -0.47  0.63582    
Sun          0.00407    0.27959    0.01  0.98841    
Thur        -0.09283    0.27994   -0.33  0.74048    
Yes         -0.13935    0.14422   -0.97  0.33489
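One way to soften that ambiguity is to strip the prefix only for factors you choose. The sketch below is an assumption-laden alternative, not the hack above: the function name strip_factor_prefix and its keep argument are hypothetical, and it relies on the xlevels component that lm() stores for factor predictors. It uses the built-in warpbreaks data so it runs without extra packages.

```r
# Hypothetical helper: drop the variable-name prefix from factor coefficients,
# except for the factors listed in `keep`
strip_factor_prefix <- function(mod, keep = character(0)) {
  xl   <- mod$xlevels                     # named list: factor -> its levels
  xl   <- xl[setdiff(names(xl), keep)]
  orig <- names(mod$coefficients)
  new  <- orig
  for (v in names(xl)) {
    full <- paste0(v, xl[[v]])            # e.g. "tensionM", "tensionH"
    hit  <- match(orig, full)             # exact matches against original names
    new[!is.na(hit)] <- xl[[v]][hit[!is.na(hit)]]
  }
  names(mod$coefficients) <- new
  mod
}

fit <- lm(breaks ~ wool + tension, data = warpbreaks)

# Keep the prefix on wool, strip it from tension
rownames(coef(summary(strip_factor_prefix(fit, keep = "wool"))))
# "(Intercept)" "woolB" "M" "H"
```

Matching on the full variable-plus-level string avoids treating variable names as regular expressions, and interaction terms are left as they are because they never match exactly.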