开发者

Digging into R profiling information

I am trying to optimize a bit of code, and am puzzled about information from summaryRprof(). In particular, it looks like a number of calls are made to external C programs, but I'm not able to pin down which C program, from which R function. I am planning to resolve this through a bunch of slicing and dicing of the code, but wondered if I am overlooking some better way to interpret the profiling data.

The highest-consuming function is .Call, which is apparently a generic description for calls to C code; the next leading functions appear to be assignment operations:

$by.self
                             self.time self.pct total.time total.pct
".Call"                        2281.0    54.40     2312.0     55.14
"[.data.frame"                  145.0     3.46      218.5      5.21
"initialize"                    123.5     2.95      217.5      5.19
"$<-.data.frame"                121.5     2.90      121.5      2.90
"as.vector"                     110.5     2.64      416.0      9.92

I decided to focus on the .Call to see how this arises. I looked through the profiling file to find those entries with .Call in the call stack, and the following are the top entries in the call stack (by count of # of appearances):

开发者_如何转开发13640 "eval"
11252 "["
7044 "standardGeneric"
4691 "<Anonymous>"
4658 "tryCatch"
4654 "tryCatchList"
4652 "tryCatchOne"
4648 "doTryCatch"

This list is as clear as mud: I have <Anonymous> and standardGeneric in there.

I believe this is due to calls to functions in the Matrix package, but that's because I'm looking at the code and that package appears to be the only possible source of C code. However, a lot of different functions from Matrix are called in this package, and it seems very difficult to determine which function is consuming this time.

So, my question is pretty basic: is there some way of deciphering and attributing these calls (e.g. .Call, <Anonymous>, etc.) in some other way? The plot of the call graph for this code is rather tricky to render, given the # of functions involved.

The fallback tactics I see are to either (1) comment out bits of code (and hack around to make the code work with this) to see where the time consumption occurs, or to (2) wrap certain operations inside of other functions and see when those functions appear on the call stack. The latter is inelegant, but it seems like it's the best way to add a tag to the call stack. The former is unpleasant because it takes quite some time to run the code, and iteratively uncommenting code and rerunning is unpleasant.


May I suggest you use the profr package. This is another bit of Hadley magic. It's a wrapper around Rprof and gives a visulation of the call stack and timings.

I find profr very easy to use and interpret. For example, here is a profile of a bit of ddply example code and the resulting profr plot:

library(profr)
p <- profr(
    ddply(baseball, .(year), "nrow"),
    0.01
)
plot(p)

Digging into R profiling information

You can immediately see the following:

  • How ddply calls ldply, llply and loop_apply.
  • Inside loop_apply there is a .Call function.

You can confirm this by reading the source code for loop_apply:

> plyr:::loop_apply
function (n, f, env = parent.frame()) 
{
    .Call("loop_apply", as.integer(n), f, env)
}
<environment: namespace:plyr>

Edit. There is something very odd about the ggplot.profr method. I have proposed the following fix to Hadley. (You may wish to try this on your example.)

ggplot.profr <- function (data, ..., minlabel = 0.1, angle = 0){
  if (!require("ggplot2", quiet = TRUE)) 
    stop("Please install ggplot2 to use this plotting method")
  data$range <- diff(range(data$time))
  ggplot(as.data.frame(data), aes(y=level)) + 
      geom_rect(
          #aes(xmin=(level), xmax=factor(level)+1, ymin=start, ymax=end),  
          aes(ymin=level-0.5, ymax=level+0.5, xmin=start, xmax=end),  
          #position = "identity", stat = "identity", width = 1, 
          fill = "grey95", 
          colour = "black", size = 0.5) + 
      geom_text(aes(label = f, x = start + range/60), 
          data = subset(data, time > max(time) * minlabel), size = 4, angle = angle, vjust=0.5, hjust = 0) + 
      scale_x_continuous("time") + 
      scale_y_continuous("level")
}


It seems that the short answer is "No" and the long answer is "Yes, but you're not going to enjoy this." Even answering this question is going to take some time (so stick around, I may be updating it).

There are several basic things to get one's head around when working with profiling in R:

First, there are many different ways to think about profiling. It is quite typical to think in terms of a call stack. At any given instant, this is the sequence of function calls that are active, essentially nested within each other (subroutines, if you will). This is quite useful for understanding the state of evaluations, where functions will return, and lots of other things that are important for seeing things as the computer / interpreter / OS may see them. Rprof does call stack profiling.

Second, a different perspective is that I've got a bunch of code and a particular call is taking a long time: which line in my code caused that call to be made? This is line profiling. R doesn't have line profiling, as far as I can tell. This is in contrast with Python and Matlab, which both have line profilers.

Third, the map from from lines to calls is surjective, but it is not bijective: given a particular call stack we cannot guarantee that we can map it back to the code. In fact, call stack analyses often summarize the calls completely out of context of the whole stack (i.e. cumulative times are reported no matter where that call was on all of the different stacks in which it occurred).

Fourth, even though we have these constraints, we can put on our statistical hats and analyze the call stack data carefully and see what we can make of it. The call stack information is data and we like data, don't we? :)

Just a quick intro to a call stack. Let's just assume that our call stack looked like this:

"C" "B" "A"

This means that function A called B which then calls C (the order is reversed), and the call stack is 3 levels deep. In my code, the call stack gets to as many as 41 levels deep. Since the stacks can be so deep and are presented in reverse order, this is more interpretable by software than by a human. Naturally, we begin cleaning and transforming this data. :)

Now, our data really comes along looking like:

".Call" "subCsp_cols" "[" "standardGeneric" "[" "eval" "eval" "callGeneric"
"[" "standardGeneric" "[" "myFunc2" "myFunc1" "eval" "eval" "doTryCatch"
"tryCatchOne" "tryCatchList" "tryCatch" "FUN" "lapply" "mclapply"
"<Anonymous>" "%dopar%"

Miserable, isn't it? It even has duplicates of things like eval, some guy called <Anonymous> - probably some darn hacker. (Anonymous is legion, by the way. :-))

The first step in transforming this into something useful was to split each line of Rprof() output and reverse the entries (via strsplit and rev). The first 12 entries (last 12 if you look at the raw call stack, rather than the post-rev version) were the same for every line (of which there were about 12000, the sampling interval was 0.5 seconds - so about 100 minutes of profiling), and these can be discarded.

Remember, we're still interested in knowing which line(s) led to .Call, which took so much time. Before we get to that question, we put on our statistical caps: the profiling reports, e.g. from summaryRprof, profr, ggplot, etc., only reflect the cumulative time spent for a given call or for calls beneath a given call. What does this cumulative information not tell us? Bingo: whether that call was made many times, or a few, and whether the time spent was constant over all invocations of that call or whether there are some outliers. A particular function might be executed 100 times or 100K times, but all of the cost may come from a single invocation (it shouldn't, but we don't know until we look at the data).

This only begins to describe the fun. The A->B->C example doesn't reflect the way things may really appear, such as A->B->C->D->B->E. Now, "B" may be counted a couple of times. What's more, suppose that a lot of time is spent in the C level, but we never sample at precisely that level, only seeing its child calls in the stack. We may see a sizable time for "total.time", but not for "self.time". If there are lots of different child calls under C, we may lose sight of what to optimize - should we take out C altogether or tweak the children, B, D, and E?

Just to account for the time spent, I took the sequences and ran them through digest, storing counts for the digested values, via hash. I also split up the sequences, storing {(A),(A,B), (A,B,C), etc.}. This doesn't seem so interesting, but removing singletons from the counts helps a lot in cleaning up the data. We can also store the time spent in each call by using rle(). This is useful for analyzing the distribution of time spent for a given call.

Still we're nowhere closer to finding the actual time spent per line of code. We'll never get lines of code from the call stack. A simpler way to do this is to store a list of times throughout the code, which stores the output of proc.time(), for a given invocation. Taking the difference of these times reveals which lines or sections of code are taking a long time. (Hint: that's what we're really looking for, not the actual calls.)

However, we have this call stack and we might as well do something useful. Going up the stack is somewhat interesting, but if we rewind the profile information to a little earlier, we can find which calls tend to precede the longer running calls. This allows us to look for landmarks in the call stack - positions where we can tie a call to a particular line of code. This makes it a bit easier to map more calls back to code, if all we have is the call stack, rather than instrumented code. (As I keep mentioning: out of context, there isn't a 1:1 mapping, but at a fine enough granularity, especially in repeatedly hit calls that are distinctive, you may be able to find landmarks in the calls that map to code.)

Altogether, I was able to find which calls were taking a lot of time, whether that was based on 1 long interval or many small ones, what the distribution of time spent was like, and, with some effort, I was able to map the most important & time consuming calls back to the code and discover which parts of the code could benefit the most from rewriting or from a change in algorithms.

Statistical analyses of the call stack is loads of fun, but investigating a particular call based on cumulative time consumption is not a very good way to go. The cumulative time consumed by a call is informative on a relative basis, but it doesn't enlighten us as whether one or many calls consumed this time, nor the depth of the call in the stack, nor the section of code responsible for the invocations. The first two things can be addressed via a bit more R code, while the latter is best pursued through instrumented code.

As R doesn't yet have line profilers like Python and Matlab, the simplest way to handle this is to just instrument one's code.


A line in a profile file might look like

"strsplit" ".parseTabix" ".readVcf" "readVcf" "standardGeneric" "readVcf" "system.time" 

which says, reading right to left, that the outermost function was system.time, which invoked readVcf, which was an S4 generic that dispatched to a readVcf method, invoking a function .readVcf, which invoked .parseTabix, which finally called strsplit.

Here we read in the profile file, sort the lines, tally them up (using rle -- run length encoding), then select the six most common paths in the profile file

r = rle(sort(readLines("readVcf.Rprof"))
o = order(r$lengths, decreasing=TRUE)
r$values[head(o)]

This

r$lengths[head(o)]

tells us how many times each of those call stacks were sampled.

There are some common patterns that can help interpret this. Here's an S4 generic being dispatched to its method

"readVcf" "standardGeneric" "readVcf"

an lapply iterating over its function

"FUN" "lapply"

and a tryCatch surrounding a .Call

".Call" "doTryCatch" "tryCatchOne" "tryCatchList" "tryCatch"

Usually one tries to profile relatively small chunks of code, rather than a whole script, with the small chunk identified by, e.g., stepping through the code interactively or making some educated guesses about what parts are likely to be slow. The fact that .Call is the most commonly sampled function is not encouraging -- it suggests most of the time is already being spent in C. Likely your best bet will involve coming up with a better overall algorithm, rather than say a brute force approach.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜