R: summarise data frame with repeating rows into boxplots

I am an R neophyte with a data frame of database function runtimes containing the following data:

> head(data2)
              dbfunc runtime
1 fn_slot03_byperson  38.083
2 fn_slot03_byperson  32.396
3 fn_slot03_byperson  41.246
4 fn_slot03_byperson  92.904
5 fn_slot03_byperson 130.512
6 fn_slot03_byperson 113.853

The data frame contains runtimes for 127 discrete functions, comprising some 1940170 rows.

I would like to:

  1. Summarise the data to only include database functions with a mean runtime of over 100 ms
  2. Produce boxplots of the 25 slowest database functions showing the distribution of runtimes, sorted by slowest first.

I'm particularly stumped by the summary step.

Note: I've also asked this question at stats.stackexchange.com.


Here's one approach using ggplot and plyr. The steps you outlined could be combined to be slightly more efficient, but for learning purposes I'll show you the steps as you asked them.

#Load ggplot and make some fake data
library(ggplot2)
library(plyr)
dat <- data.frame(dbfunc = rep(letters[1:10], each = 100)
                  , runtime = runif(1000, max = 300))

#Use plyr to calculate a new variable for the mean runtime by dbfunc and add as 
#a new column
dat <- ddply(dat, "dbfunc", transform, meanRunTime = mean(runtime))

#Subset only those dbfunc with mean run times greater than 100. Is this step necessary?
dat.long <- subset(dat, meanRunTime > 100)


#Reorder the levels of the dbfunc variable by mean runtime. Note that reorder
#accepts a function like mean, so if the subset step above isn't necessary, then we can
#simply use that instead.
dat.long$dbfunc <- reorder(dat.long$dbfunc, -dat.long$meanRunTime)

#Subset one more time to get the top *n* dbfunctions based on mean runtime. I chose three here...
dat.plot <- subset(dat.long, dbfunc %in% levels(dbfunc)[1:3])

#Now you have your top three dbfuncs, but a bunch of unused levels hanging out so let's drop them
dat.plot$dbfunc <- droplevels(dat.plot$dbfunc)

#Plotting time!
ggplot(dat.plot, aes(dbfunc, runtime)) + 
  geom_boxplot()

Like I said, I feel a few of those steps could be combined and made more efficient, but I wanted to show you the steps as you outlined them.
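
For what it's worth, here is one way those steps might be consolidated (a rough sketch, not tested against your real data, and it assumes the same fake dat data frame built above): compute the per-function means once with summarise, filter and order on that summary, and only then subset and reorder the full data for plotting.

#Consolidated version of the steps above: summarise, filter, order, then subset and plot
library(ggplot2)
library(plyr)

means <- ddply(dat, "dbfunc", summarise, meanRunTime = mean(runtime))
keep <- means[means$meanRunTime > 100, ]
keep <- keep[order(-keep$meanRunTime), ]
top3 <- head(keep$dbfunc, 3)

dat.plot <- droplevels(subset(dat, dbfunc %in% top3))
dat.plot$dbfunc <- reorder(dat.plot$dbfunc, -dat.plot$runtime, FUN = mean)

ggplot(dat.plot, aes(dbfunc, runtime)) +
  geom_boxplot()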


The summary step is easy:

attach(data2)
func_mean = tapply(runtime, dbfunc, mean)

Regarding question 1:

func_mean[func_mean > 100]

Regarding question 2:

slowest25 = head(sort(func_mean, decreasing = TRUE), n=25)
sl25_data = merge(data.frame(dbfunc = names(slowest25)), data2, sort = F)
plot(sl25_data$runtime ~ sl25_data$dbfunc)

Hope this helps. Note, however, that the boxplots are not sorted in the plot.
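
One possible way to get them sorted (an untested sketch reusing the sl25_data object above) is to reorder the dbfunc factor by mean runtime before plotting:

# make dbfunc a factor limited to the 25 functions present, then order its levels
# by decreasing mean runtime so the slowest boxplot comes first
sl25_data$dbfunc = factor(sl25_data$dbfunc)
sl25_data$dbfunc = reorder(sl25_data$dbfunc, -sl25_data$runtime, FUN = mean)
plot(sl25_data$runtime ~ sl25_data$dbfunc)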


I'm posting this as the 'answer', although Tomas's and Chase's answers are in fact more complete. In Chase's case I couldn't get ggplot to work, and time was short. In Tomas's case I got stuck at the sl25_data step.

We ended up using the following, which works with one remaining problem:

# load data frame (the file has no header row, so name the columns to match the data above)
dbruntimes <- read.csv("db_runtimes.csv", header = FALSE, col.names = c("dbfunc", "runtime"))
# calc means
meanruns <- aggregate(dbruntimes["runtime"],dbruntimes["dbfunc"],mean)
# filter
topmeanruns <- meanruns[meanruns$runtime>100,]
# order by means
meanruns <- meanruns[rev(order(meanruns$runtime)),]
# get top 25 results
drawfuncs <- meanruns[1:25,"dbfunc"]
# subset for plot
forboxplot <- subset(dbruntimes, dbfunc %in% drawfuncs)
# plot
boxplot(forboxplot$runtime~forboxplot$dbfunc)

This gives us the result we are looking for, but all the functions are still shown on the plot x-axis, rather than just the top 25.
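
A likely fix for that last issue (a sketch along the lines of Chase's droplevels step, not tested on our full data) is to re-factor dbfunc so the unused levels are dropped, and optionally reorder by mean runtime, before calling boxplot again:

# refactor dbfunc so only the 25 selected functions remain as levels,
# then order the levels by decreasing mean runtime
forboxplot$dbfunc <- factor(forboxplot$dbfunc)
forboxplot$dbfunc <- reorder(forboxplot$dbfunc, -forboxplot$runtime, FUN = mean)
# plot again -- the x axis should now show only the top 25 functions, slowest first
boxplot(forboxplot$runtime ~ forboxplot$dbfunc)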
