Dataset array indexing is very slow with Statistics Toolbox

2023-03-20 04:00 Q&A author:

Why is indexing into a dataset array so slow? A peek into the dataset.subsref function shows that all the columns of the dataset are stored in a cell array. However, plain cell indexing is much, much faster than dataset indexing, even though dataset indexing is just indexing into a cell array under the hood. My guess is that this has to do with MATLAB OOP overhead. Any ideas on how to speed this up?

%% Using R2011a, PCWIN64
feature accel off;  % turn off JIT

dat = (1:1e6)';
dat2 = repmat({'abc'}, 1e6, 1);
celldat = {dat dat2};
ds = dataset(dat, dat2);
N = 1e2;

tic;
for j = 1:N
    tmp = celldat{2};
end
toc;

tic;
for j = 1:N
    tmp2 = ds.dat2; % 2.778sec spent on line 262 of dataset.subsref
end
toc;

feature accel on;  % turn JIT back on

Elapsed time is 0.000165 seconds.
Elapsed time is 2.778995 seconds.

EDIT: I've updated the example to be more like the problem I'm seeing. A huge amount of time is spent on line 262 of dataset.subsref - "b = a.data{varIndex};". This is very strange to me, since it is a simple cell dereference. I'm wondering if there is an OOP trick that will allow me to index into "a.data" without the strange overhead.

EDIT2: As per Andrew's suggestion, I've submitted this as a bug to MathWorks. Will update if I hear anything from them.

EDIT3: MathWorks responded and said they are now aware of the problem and will fix it in a future release. They noted that the problem is specific to cell arrays, and suggested avoiding them where possible.

Yes, you are most likely seeing the overhead of MATLAB OOP method calls. They are expensive compared to cell indexing, or to method calls in some other languages. Your .513872 seconds / 1e4 ~= 51 microseconds per call, which is the approximate cost of a few MCOS method calls; they're ~5-15 microseconds each on machines I've seen. So that looks like method overhead of the subsref() call itself, plus other methods and property accesses it's calling in turn.

For some details and discussion, see: Is MATLAB OOP slow or am I doing something wrong?

I don't know of a way to make this faster, aside from structuring your code to minimize calls to "ds.dat" or other methods. If possible, when working with the dataset, call "ds.dat" once, keep the result in a local variable and work with it there, and then push it back into the ds object.
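A minimal sketch of that pattern, assuming the Statistics Toolbox dataset class and the same variables as in the question. The local name col, the 100-iteration loop, and the upper() call are only illustrative stand-ins for whatever work you actually do on the column:

```matlab
% Build the same dataset as in the question (requires Statistics Toolbox).
dat  = (1:1e6)';
dat2 = repmat({'abc'}, 1e6, 1);
ds   = dataset(dat, dat2);

% Pull the column out once: dataset.subsref is invoked a single time.
col = ds.dat2;

% Work on the local copy; this is plain cell indexing, no method calls.
for k = 1:100
    col{k} = upper(col{k});   % illustrative work, not from the original post
end

% Push the modified column back with a single assignment into the dataset.
ds.dat2 = col;
```

The point is only that the expensive dataset indexing happens twice in total (one read, one write) rather than once per loop iteration.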

Caveat: I don't know what "feature accel" does or how it could affect these timings.

Edit: I threw it in the profiler like Richie suggested. On my R2009b, looks like about half the time is method call overhead, and the rest in find(), strcmp(), and other operations inside subsref; subsref doesn't call any other methods in turn.
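For anyone who wants to repeat that measurement, a rough sketch of profiling the loop from the question might look like the following. It assumes the Statistics Toolbox dataset class is available; the loop count is arbitrary, and the breakdown you see will depend on your MATLAB release:

```matlab
% Rebuild the dataset from the question (requires Statistics Toolbox).
dat  = (1:1e6)';
dat2 = repmat({'abc'}, 1e6, 1);
ds   = dataset(dat, dat2);

profile on                 % start collecting timing data
for j = 1:100
    tmp2 = ds.dat2;        % the slow dataset indexing under test
end
profile off                % stop collecting

p = profile('info');       % struct with a FunctionTable of per-function timings
profile viewer             % open the report; look for dataset.subsref
```

In the report, comparing the self time of dataset.subsref with the time attributed to its individual lines shows how much is call overhead versus work inside the method.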

Edit 2: The revised example is showing much higher timings. Method call overhead doesn't account for all that.
