Why do column names get concatenated into the row output of a linear model summary?
I've never noticed this behavior before, but I'm surprised at the output naming conventions for linear model summaries. My question, essentially, is why row names in a linear model summary always seem to carry the name of the column they came from.
An example
Suppose you h开发者_Go百科ad some data for 300 movie audience members from three different cities:
- Chicago
- Milwaukee
- Dayton
And suppose all of them were subjected to the stinking pile of confusing, contaminated waste that was Spider-Man 3. After enduring the entirety of that cinematic abomination, they were asked to rate the movie on a 100-point scale.
Because all of the audience members were reasonable human beings, the ratings were all below zero. (Naturally. Anyone who's seen the movie would agree.)
Here's what that might look like in R:
> score <- rnorm(n = 300, mean = -50, sd = 10)
> city <- rep(c("Chicago", "Milwaukee", "Dayton"), times = 100)
> spider.man.3.sucked <- data.frame(score, city)
> head(spider.man.3.sucked)
score city
1 -64.57515 Chicago
2 -50.51050 Milwaukee
3 -56.51409 Dayton
4 -45.55133 Chicago
5 -47.88686 Milwaukee
6 -51.22812 Dayton
Great. So let's run a quick linear model, assign it to lm1
, and get its summary output:
> lm1 <- lm(score ~ city, data = spider.man.3.sucked)
> summary(lm1)
Call:
lm(formula = score ~ city, data = spider.man.3.sucked)
Residuals:
Min 1Q Median 3Q Max
-29.8515 -6.1090 -0.4745 6.0340 26.2616
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -51.3621 0.9630 -53.337 <2e-16 ***
cityDayton 1.1892 1.3619 0.873 0.383
cityMilwaukee 0.8288 1.3619 0.609 0.543
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.63 on 297 degrees of freedom
Multiple R-squared: 0.002693, Adjusted R-squared: -0.004023
F-statistic: 0.4009 on 2 and 297 DF, p-value: 0.6701
What's bugging me
The part I want to highlight is this:
cityDayton 1.1892 1.3619 0.873 0.383
cityMilwaukee 0.8288 1.3619 0.609 0.543
It looks like R sensibly concatenated the column name (city
, if you remember from above) with the distinct value (in this case either Dayton
or Milwaukee
). If I don't want R to output in that format, is there any way to override it? For example, in my case all I'd need is simply:
Dayton 1.1892 1.3619 0.873 0.383
Milwaukee 0.8288 1.3619 0.609 0.543
Two questions in one
So,
- What's controlling the format of the output for linear model summary rows, and
- Can/should I change it?
The extractor function for that component of a summary object is coef
. Does this provide the means to control your output acceptably:
summ <- summary(lm1)
csumm <- coef(summ)
rownames(csumm) <- sub("^city", "", rownames(csumm))
print(csumm[-1,], digits=4)
# Estimate Std. Error t value Pr(>|t|)
# Dayton 0.8133 1.485 0.5478 0.5842
# Milwaukee 0.3891 1.485 0.2621 0.7934
(No random seed was set so cannot match your values.)
For 1) it appears to happen inside model.matrix.default()
and inside internal R compiled code for that matter.
It might be difficult to change this easily - the obvious way would be to write your own model.matrix.default()
that calls model.matrix.default()
and updates the names afterwards. But this isn't tested or tried.
Here is a hack
# RUN REGRESSION
require(ggplot2)
lm1 = lm(tip ~ total_bill + sex + day, data = tips)
# FUNCTION TO REMOVE FACTOR NAMES FROM MODEL SUMMARY
remove_factors = function(mod){
mydf = mod$model
# PREPARE VECTOR OF VARIABLES WITH REPETITIONS = UNIQUE FACTOR LEVELS
vars = names(mod$model)[-1]
eachlen = sapply(mydf[,vars,drop=F], function(x)
ifelse(is.numeric(x), 1, length(unique(x)) - 1))
vars = rep(vars, eachlen)
# REPLACE COEF NAMES WITH VARIABLE NAME WHEN APPROPRIATE
coefs = names(lm1$coefficients)[-1]
coefs2 = stringr::str_replace(coefs, vars, "")
names(mod$coefficients)[-1] = ifelse(coefs2 == "", coefs, coefs2)
return(mod)
}
summary(remove_factors(lm1))
This gives
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.95588 0.27579 3.47 0.00063 ***
total_bill 0.10489 0.00758 13.84 < 2e-16 ***
Male -0.03844 0.14215 -0.27 0.78706
Sat -0.08088 0.26226 -0.31 0.75806
Sun 0.08282 0.26741 0.31 0.75706
Thur -0.02063 0.26975 -0.08 0.93910
However, it is not always advisable to do this, as you can see from running the same hack for a different regression. It is not clear what the Yes
variable in the last name stands for. R by default writes it as smokerYes
to make its meaning clear. So use with caution.
lm2 = lm(tip ~ total_bill + sex + day + smoker, data = tips)
summary(remove_factors(lm2))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.05182 0.29315 3.59 0.00040 ***
total_bill 0.10569 0.00763 13.86 < 2e-16 ***
Male -0.03769 0.14217 -0.27 0.79114
Sat -0.12636 0.26648 -0.47 0.63582
Sun 0.00407 0.27959 0.01 0.98841
Thur -0.09283 0.27994 -0.33 0.74048
Yes -0.13935 0.14422 -0.97 0.33489
精彩评论