How do I sort a dataframe by the average of subsets of one of the rows?

I'm fairly new to R, but I'm making good progress. I've been able to bend ggplot2 to my will with the exception of one thing: the order that the categorical labels are plotted along the x axis in my boxplot. I think this is just a hole in my knowledge of how to address ranges of a dataframe in formulas, but here's the fake data, as a dataframe called df:

Index    Label    Value
index1   A        1
index2   A        2
index3   A        3
index4   B        12
index5   B        11
index6   B        10
index7   C        8
index8   C        7
index9   C        9
...
index76  Z        15
index77  Z        17
index78  Z        16

My plot code looks like

qplot(df$Label, df$Value, data=df) + scale_x_discrete("Label") + opts(axis.text.x = theme_text(angle = 90, hjust = 0, size=7)) + geom_boxplot()

and gives me exactly what I want, which is a boxplot showing one box & whiskers for label A, one for B, and one for C. However, the axis goes in the order of the labels (the boxplot of 1,2,3 being closest to the origin, 10,11,12 in the middle, 7,8,9 on the right of the graph). What I want is for the boxplot data to start with the subset that has the highest within-label average and proceed in decreasing order. I can average within each label by mean(df$Value[1:3]) and mean(df$Value[4:6]) etc., but I can't figure out how to get the graph to display such that the plots for the labels go not in the order they appear in factor(df$Label) (i.e. A, B, C along the x with boxes at 2, 11, 8) but in order of highest within-label average to lowest (i.e. B, C, A along the x and the boxes then at 11, 8, 2).

I'm thinking I would create a vector consisting of each within-label average and somehow pass that to ggplot to specify the axis order, but I can't figure out how to create the vector to start with.

What I need to know is:

What's the best way to get a vector consisting of the averages of each label, in order from highest to lowest?

How do I pass that vector to ggplot so that it orders the x-axis by those values, while still labeling the x axis with factor(df$Label)?

I'm open to suggestions for other ways to display the data as well, but I think I'm pretty close to what I want & the mean & spread of the values within a given label is important.


Here is one way to do it:

# create a dummy data frame
set.seed(1234)
df = data.frame(
       label = rep(letters[1:3], each = 3),
       value = sample(100, 9))

# boxplot without sorting
qplot(label, value, data = df, geom = 'boxplot')

# boxplot with label sorted by median of value
qplot(reorder(label, value, median), value, data = df, geom = 'boxplot')
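
The call above orders the labels by increasing median, while the question asks for the mean, highest first. A minimal sketch of that variant, assuming the A/B/C values from the question (the column name label_by_mean is made up for illustration): reorder() sorts levels by increasing summary value, so negating the values before taking the mean puts the highest average first. The plotting step uses ggplot() rather than qplot() and only runs if ggplot2 is installed.

```r
# First nine rows of the question's example data
df <- data.frame(
  Label = rep(c("A", "B", "C"), each = 3),
  Value = c(1, 2, 3, 12, 11, 10, 8, 7, 9)
)

# reorder() orders levels by INCREASING summary value, so take the mean
# of the negated values to get the highest within-label mean first.
# label_by_mean is a hypothetical column name used only in this sketch.
df$label_by_mean <- reorder(df$Label, -df$Value, FUN = mean)

levels(df$label_by_mean)
# "B" "C" "A"

# Draw the boxplot only when ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x = label_by_mean, y = Value)) +
    geom_boxplot() +
    xlab("Label")
  print(p)
}
```

The axis still shows the original label text; only the order of the factor levels changes.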


Label is a factor. Try as.numeric(df$Label) to see what number each level of the factor corresponds to. It is likely that ggplot2 plots the levels in their numerical order. You can set the order of a factor's levels by passing a levels argument to factor. For example, if you had the labels in a vector in the order that you want them, ordered.levels=c("B","C","A",...), then you can "reorder" the labels by converting to character and back, with an explicit levels argument: df$Label <- factor(as.character(df$Label), levels=ordered.levels).

All of this assumes that ggplot2 uses the numerical values of the levels to order the plots.
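
One possible way to build that ordered.levels vector from the data, sketched in base R with the A/B/C values from the question (label_means is a name introduced here for illustration): tapply() gives the mean of Value within each label, and order() with decreasing = TRUE sorts those means from highest to lowest.

```r
# First nine rows of the question's example data
df <- data.frame(
  Label = rep(c("A", "B", "C"), each = 3),
  Value = c(1, 2, 3, 12, 11, 10, 8, 7, 9)
)

# Mean of Value within each label (named by label)
label_means <- tapply(df$Value, df$Label, mean)

# Label names sorted from highest mean to lowest
ordered.levels <- names(label_means)[order(label_means, decreasing = TRUE)]
ordered.levels
# "B" "C" "A"

# Rebuild the factor with the levels in that explicit order
df$Label <- factor(as.character(df$Label), levels = ordered.levels)

# The integer codes now follow the new level order (B = 1, C = 2, A = 3)
as.integer(df$Label)
# 3 3 3 1 1 1 2 2 2
```

Plotting df$Label after this step should then place B, C, A along the x axis, under the same assumption that ggplot2 follows the level order.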
