Dataset array indexing is very slow with Statistics Toolbox
Why is indexing into a dataset array so slow? A peak into the dataset.subsref function shows that all the columns of the dataset are stored in a cell array. However, cell indexing is much, much faster than dataset indexing, which is just indexing into a cell array under the hood. My guess is that this has to do with some overhead with MATLAB OOP. Any ideas on how to speed this up?
%% Using R2011a, PCWIN64
feature accel off; % turn off JIT
dat = (1:1e6)';
dat2 = repmat({'abc'}, 1e6, 1);
celldat = {dat dat2};
ds = dataset(dat, dat2);
N = 1e2;
tic;
for j 开发者_运维百科= 1:N
tmp = celldat{2};
end
toc;
tic;
for j = 1:N
tmp2 = ds.dat2; % 2.778sec spent on line 262 of dataset.subsref
end
toc;
feature accel on; % turn JIT back on
Elapsed time is 0.000165 seconds.
Elapsed time is 2.778995 seconds.
EDIT: I've updated the example to be more like the problem I'm seeing. A huge amount of time is spent on line 262 of dataset.subsref - "b = a.data{varIndex};". It's very strange to me since it is a simple cell dereference. I'm wondering if there is a OOP trick that will allow me to index into "a.data" without the strange overhead.
EDIT2: As per Andrew's suggestion, I've submitted this as a bug to MatWorks. Will update if I hear anything from them.
EDIT3: Matlab responded and said they are aware of the problem now and will fix it in a future release. They noted that the problem is specific to cell arrays, and to try to avoid them if possible.
Yes, you are most likely seeing the overhead of Matlab OOP method calls. They are expensive compared to cell indexing, or method calls in some other languages. Your .513872 seconds / 1e4 ~= 51 microseconds per call, which is the approximate cost of a few MCOS method calls; they're ~5-15 microsececonds each on machines I've seen. So that looks like method overhead of the subsref() call itself and other methods and property accesses it's calling in turn.
For some details and discussion, see: Is MATLAB OOP slow or am I doing something wrong?
I don't know of a way to make this faster, aside from structuring your code to minimize calls to "ds.dat" or other methods. If possible, when working with the data set, call "ds.dat" once, keep it in a local variable and work with it there, and then push it back in to the ds object.
Caveat: I don't know what "feature accel" does or how it could affect these timings.
Edit: I threw it in the profiler like Richie suggested. On my R2009b, looks like about half the time is method call overhead, and the rest in find(), strcmp(), and other operations inside subsref; subsref doesn't call any other methods in turn.
Edit 2: The revised example is showing much higher timings. Method call overhead doesn't account for all that.
精彩评论