Fix unreadable PostScript tree output in R
I have a relatively complicated classification tree that I'm trying to output. The resulting postscript output looks very jumbled.
> fit = rpart(virility ~ friend_count + recip_count + twitter_handles + has_email +
has_bio + has_foursquare + has_linkedin + auto_tweet +
interaction_visibility + site_own_cnt + site_rec_cnt + has_url +
has_linkedin_url + lb_cnt, + mob_own_cnt + mob_rec_cnt +
twt_own_cnt + twt_rec_cnt, method="class", data=vir)
> fit
n= 9704
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 9704 3742 virile (0.39970092 0.60029908)
2) recip_count< 15.5 9610 3159 mule (0.52005469 0.47994531)
4) site_own_cnt< 0.5 7201 1372 mule (0.65423387 0.34576613)
8) friend_count< 2.5 6763 948 mule (0.69566613 0.30433387)
16) has_bio>=0.5 4030 601 mule (0.73743993 0.26256007) *
17) has_bio< 0.5 2733 347 mule (0.57990315 0.42009685)
34) recip_count< 0.5 2496 88 mule (0.78000000 0.22000000) *
35) recip_count>=0.5 237 167 virile (0.39201878 0.60798122) *
9) friend_count>=2.5 438 424 mule (0.50293083 0.49706917)
18) lb_cnt< 2.5 427 344 mule (0.55208333 0.44791667)
36) has_foursquare< 0.5 401 257 mule (0.61353383 0.38646617)
72) twitter_handles>=0.5 382 210 mule (0.65742251 0.34257749) *
73) twitter_handles< 0.5 19 5 virile (0.09615385 0.90384615) *
37) has_foursquare>=0.5 26 16 virile (0.15533981 0.84466019) *
19) lb_cnt>=2.5 11 5 virile (0.05882353 0.94117647) *
5) site_own_cnt>=0.5 2409 827 virile (0.31637337 0.68362663)
10) recip_count< 0.5 1344 274 mule (0.62102351 0.37897649)
20) friend_count< 0.5 955 75 mule (0.81155779 0.18844221) *
21) friend_count>=0.5 389 126 virile (0.38769231 0.61230769)
42) twitter_handles< 0.5 62 3 mule (0.93181818 0.06818182) *
43) twitter_handles>=0.5 327 85 virile (0.30249110 0.69750890) *
11) recip_count>=0.5 1065 378 virile (0.19989424 0.80010576) *
3) recip_count>=15.5 94 319 virile (0.11474820 0.88525180)
6) friend_count< 2.5 40 265 virile (0.32435741 0.67564259)
12) site_rec_cnt>=1.5 24 175 mule (0.59112150 0.40887850)
24) site_rec_cnt< 4 13 46 mule (0.80257511 0.19742489) *
25) site_rec_cnt>=4 11 66 virile (0.33846154 0.66153846) *
13) site_rec_cnt< 1.5 16 12 virile (0.03084833 0.96915167) *
7) friend_count>=2.5 54 54 virile (0.02750891 0.97249109) *
> post(fit, file = "/tmp/blah.ps", title = "virility model")
This results in:
The nodes of the tree are all written half on top of each other. Is there any way to make this output look reasonably readable?
The post method for rpart in fact calls first the plot method and then the text method for rpart. This means you can study the help for ?plot.rpart and ?text.rpart to find ways of improving your plot output.
?text.rpart offers some very good pointers. I suggest you try the following parameters:
fancy=FALSE will remove the ellipses and boxes. Your plot is clearly too busy and large to have this. Removing it will increase legibility.
cex=0.8 will reduce the font size to 0.8 of the normal size. Slightly smaller fonts may increase spacing between elements on the plot.
Here is an example of the difference this can make, using a model fitted to the diamonds data in ggplot2:
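library(ggplot2)  # provides the diamonds data set
library(rpart)
fit <- rpart(price ~ ., data = diamonds)
par(mfrow = c(1, 2))  # draw the two trees side by side
# Left panel: fancy=TRUE draws ellipses and boxes around the node labels
plot(fit, main = "Default settings")
text(fit, fancy = TRUE)
# Right panel: plain labels, uniform branch spacing, smaller font
plot(fit, uniform = TRUE, main = "fancy=FALSE")
text(fit, fancy = FALSE, pretty = NULL, cex = 0.8)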
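If you still need a PostScript file rather than an on-screen plot, one option is to open a postscript() device yourself and then call plot and text with the settings above, instead of going through post. The following is only a rough sketch: the kyphosis model is a stand-in (that data set ships with rpart), and the output path, page size, and margin value are assumptions you would adjust for your own tree.

```r
library(rpart)

# Stand-in model for illustration; kyphosis is bundled with rpart.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# Assumed output location and page size; a wider page gives the nodes more room.
out_file <- file.path(tempdir(), "tree.ps")
postscript(out_file, width = 14, height = 8, horizontal = FALSE, paper = "special")

# Draw the tree skeleton, then add plain (non-fancy) labels in a smaller font.
plot(fit, uniform = TRUE, margin = 0.1, main = "kyphosis model")
text(fit, fancy = FALSE, use.n = TRUE, cex = 0.8)

# Close the device so the file is written to disk.
dev.off()
```

For a large tree like the one in the question, increasing width and height and lowering cex a little further are the first things to try if labels still overlap.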