Finding islands of zeros in a sequence

2023-01-07 23:42 问答作者：

Imagine you have a very long sequence. What is the most efficient way of finding the intervals where the sequence is all zeros (or more precisely the sequence drops to near-zero values abs(X)<eps):

For simplicity, lets assume the following sequence:

sig = [1 1 0 0 0 0 1 1 1 1 1 0 1 0 0 0 1 1 1 1 1 1 1 1 0 0 1 1 1 0];

I'm trying to get the following information:

startIndex   EndIndex    Duration
3            6           4
12           12          1
14           16          3
25           26          2
30           30          1

then using this information, we find the intervals with duration >= to some specified value (say 3), and returning the indices of the values in all these intervals combined:

indices = [3 4 5 6 14 15 16];

That last part is related to a previous question:

MATLAB: vectorized array creation from a list of start/end indices

This is what I have so far:

sig = [1 1 0 0 0开发者_如何学编程 0 1 1 1 1 1 0 1 0 0 0 1 1 1 1 1 1 1 1 0 0 1 1 1 0];
len = length(sig);
thresh = 3;

%# align the signal with itself successively shifted by one
%# v will thus contain 1 in the starting locations of the zero interval
v = true(1,len-thresh+1);
for i=1:thresh
    v = v & ( sig(i:len-thresh+i) == 0 );
end

%# extend the 1's till the end of the intervals
for i=1:thresh-1
    v(find(v)+1) = true;
end

%# get the final indices
v = find(v);

I'm looking to vectorize/optimize the code, but I'm open to other solutions. I have to stress that space and time efficiencies are very important, since I'm processing a large number of long bio-signals.

These are the steps I would take to solve your problem in a vectorized way, starting with a given vector sig:

First, threshold the vector to get a vector tsig of zeros and ones (zeroes where the absolute value of the signal drops close enough to zero, ones elsewhere):
```
tsig = (abs(sig) >= eps);  %# Using eps as the threshold
```
Next, find the starting indices, ending indices, and duration of each string of zeroes using the functions DIFF and FIND:
```
dsig = diff([1 tsig 1]);
startIndex = find(dsig < 0);
endIndex = find(dsig > 0)-1;
duration = endIndex-startIndex+1;
```
Then, find the strings of zeroes with a duration greater than or equal to some value (such as 3, from your example):
```
stringIndex = (duration >= 3);
startIndex = startIndex(stringIndex);
endIndex = endIndex(stringIndex);
```

Finally, use the method from my answer to the linked question to generate your final set of indices:

indices = zeros(1,max(endIndex)+1);
indices(startIndex) = 1;
indices(endIndex+1) = indices(endIndex+1)-1;
indices = find(cumsum(indices));

You can solve this as a string search task, by finding strings of zeros of length thresh (STRFIND function is very fast)

startIndex = strfind(sig, zeros(1,thresh));

Note that longer substrings will get marked in multiple locations but will eventually be joined once we add in-between locations from intervals start at startIndex to end at start+thresh-1.

indices = unique( bsxfun(@plus, startIndex', 0:thresh-1) )';

Note that you can always swap this last step with the CUMSUM/FIND solution by @gnovice from the linked question.

function indice=sigvec(sig,thresh)
    %extend sig head and tail to avoid 0 head and 0 tail

    exsig=[1,sig,1];
    %convolution sig with extend sig
    cvexsig=conv(exsig,ones(1,thresh));
    tempsig=double(cvexsig==0);

    indice=find(conv(tempsig,ones(1,thresh)))-thresh;

the above answer by genovice can be modified to find the indices of non-zero elements in a vector as:

    tsig = (abs(sig) >= eps);
    dsig = diff([0 tsig 0]);
    startIndex = find(dsig > 0);
    endIndex = find(dsig < 0)-1;
    duration = endIndex-startIndex+1;

I think the most MATLAB/"vectorized" way of doing it is by computing a convolution of your signal with a filter like [-1 1]. You should look at the documentation of the function conv. Then on the output of conv use find to get the relevant indexes.

As gnovice showed, we'll do a threshold test to make "near zero" really zero:

logcl = abs(sig(:)) >= zero_tolerance;

Then find regions where the cumulative sum isn't increasing:

cs = cumsum(logcl);
islands = cs(1+thresh:end) == cs(1:end-thresh);

Remembering gnovice's great method for filling in ranges of indexes

v = zeros(1,max(endInd)+1);   %# An array of zeroes
v(startInd) = 1;              %# Place 1 at the starts of the intervals
v(endInd+1) = v(endInd+1)-1;  %# Add -1 one index after the ends of the intervals
indices = find(cumsum(v));  %# Perform a cumulative sum and find the nonzero entries

We note that our islands vector already has ones in the startInd locations, and for our purposes endInd always comes thresh spots later (longer runs have runs of ones in islands)

endcap = zeros(thresh,1);
indices = find(cumsum([islands ; endcap] - [endcap ; islands]))

Test

sig = [1 1 0 0 0 0 1 1 1 1 1 0 1 0 0 0 1 1 1 1 1 1 1 1 0 0 1 1 1 0];
logcl = abs(sig(:)) >= .1;
cs = cumsum(logcl);
islands = cs(1+thresh:end) == cs(1:end-thresh);
endcap = zeros(thresh,1);
indices = find(cumsum([islands ; endcap] - [endcap ; islands]))

继续阅读：vectorization

Finding islands of zeros in a sequence

Test

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

Test

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？