开发者

Find sequence in IEnumerable<T> using Linq

What is the most efficient way to find a seq开发者_开发百科uence within a IEnumerable<T> using LINQ

I want to be able to create an extension method which allows the following call:

int startIndex = largeSequence.FindSequence(subSequence)

The match must be adjacent and in order.


Here's an implementation of an algorithm that finds a subsequence in a sequence. I called the method IndexOfSequence, because it makes the intent more explicit and is similar to the existing IndexOf method:

public static class ExtensionMethods
{
    public static int IndexOfSequence<T>(this IEnumerable<T> source, IEnumerable<T> sequence)
    {
        return source.IndexOfSequence(sequence, EqualityComparer<T>.Default);
    }

    public static int IndexOfSequence<T>(this IEnumerable<T> source, IEnumerable<T> sequence, IEqualityComparer<T> comparer)
    {
        var seq = sequence.ToArray();

        int p = 0; // current position in source sequence
        int i = 0; // current position in searched sequence
        var prospects = new List<int>(); // list of prospective matches
        foreach (var item in source)
        {
            // Remove bad prospective matches
            prospects.RemoveAll(k => !comparer.Equals(item, seq[p - k]));

            // Is it the start of a prospective match ?
            if (comparer.Equals(item, seq[0]))
            {
                prospects.Add(p);
            }

            // Does current character continues partial match ?
            if (comparer.Equals(item, seq[i]))
            {
                i++;
                // Do we have a complete match ?
                if (i == seq.Length)
                {
                    // Bingo !
                    return p - seq.Length + 1;
                }
            }
            else // Mismatch
            {
                // Do we have prospective matches to fall back to ?
                if (prospects.Count > 0)
                {
                    // Yes, use the first one
                    int k = prospects[0];
                    i = p - k + 1;
                }
                else
                {
                    // No, start from beginning of searched sequence
                    i = 0;
                }
            }
            p++;
        }
        // No match
        return -1;
    }
}

I didn't fully test it, so it might still contain bugs. I just did a few tests on well-known corner cases to make sure I wasn't falling into obvious traps. Seems to work fine so far...

I think the complexity is close to O(n), but I'm not an expert of Big O notation so I could be wrong... at least it only enumerates the source sequence once, whithout ever going back, so it should be reasonably efficient.


The code you say you want to be able to use isn't LINQ, so I don't see why it need be implemented with LINQ.

This is essentially the same problem as substring searching (indeed, an enumeration where order is significant is a generalisation of "string").

Since computer science has considered this problem frequently for a long time, so you get to stand on the shoulders of giants.

Some reasonable starting points are:

http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm

http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm

http://en.wikipedia.org/wiki/Rabin-karp

Even just the pseudocode in the wikipedia articles is enough to port to C# quite easily. Look at the descriptions of performance in different cases and decide which cases are most likely to be encountered by your code.


I understand this is an old question, but I needed this exact method and I wrote it up like so:

public static int ContainsSubsequence<T>(this IEnumerable<T> elements, IEnumerable<T> subSequence) where T: IEquatable<T>
{
    return ContainsSubsequence(elements, 0, subSequence);
}

private static int ContainsSubsequence<T>(IEnumerable<T> elements, int index, IEnumerable<T> subSequence) where T: IEquatable<T>
{
    // Do we have any elements left?
    bool elementsLeft = elements.Any();

    // Do we have any of the sub-sequence left?
    bool sequenceLeft = subSequence.Any();

    // No elements but sub-sequence not fully matched
    if (!elementsLeft && sequenceLeft)
        return -1; // Nope, didn't match

    // No elements of sub-sequence, which means even if there are
    // more elements, we matched the sub-sequence fully
    if (!sequenceLeft)
        return index - subSequence.Count(); // Matched!

    // If we didn't reach a terminal condition,
    // check the first element of the sub-sequence against the first element
    if (subSequence.First().Equals(e.First()))
        // Yes, it matched - move onto the next. Consume (skip) one element in each
        return ContainsSubsequence(elements.Skip(1), index + 1 subSequence.Skip(1));
    else
        // No, it didn't match. Try the next element, without consuming an element
        // from the sub-sequence
        return ContainsSubsequence(elements.Skip(1), index + 1, subSequence);
}

Updated to not just return if the sub-sequence matched, but where it started in the original sequence.

This is an extension method on IEnumerable, fully lazy, terminates early and is far more linq-ified than the currently up-voted answer. Bewarned, however (as @wai-ha-lee points out) it is recursive and creates a lot of enumerators. Use it where applicable (performance/memory). This was fine for my needs, but YMMV.


You can use this library called Sequences to do that (disclaimer: I'm the author).

It has a IndexOfSlice method that does exactly what you need - it's an implementation of the Knuth-Morris-Pratt algorithm.

int startIndex = largeSequence.AsSequence().IndexOfSlice(subSequence);


UPDATE: Given the clarification of the question my response below isn't as applicable. Leaving it for historical purposes.

You probably want to use mySequence.Where(). Then the key is to optimize the predicate to work well in your environment. This can vary quite a bit depending on your requirements and typical usage patterns.

It is quite possible that what works well for small collections doesn't scale well for much larger collections depending on what type T is.

Of course, if the 90% use is for small collections then optimizing for the outlier large collection seems a bit YAGNI.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜