Algorithm to select groups of similar items in 2d array
There is a 2d array of items (in my case they are called Intersections
).
A certain item is given as a start. The task is to find all items directly or indirectly connected to this item that satisfy a certain function.
So the basic algorithm is like this:
Add the start to the result list. Repeat until no modification: Add each item in the array that satisfies the function and touches any item in the result list to the result list.
My current implementation looks like this:
private IList<Intersection> SelectGroup (
Intersection start,
Func<Intersection, Intersection, bool> select)
{
List<Intersection> result = new List<Intersection> ();
Queue<Intersection> source = new Queue<Intersection> ();
source.Enqueue (start);
while (source.Any ()) {
var s = source.Dequeue ();
result.Add (s);
foreach (var neighbour in Neighbours (s)) {
if (select (start, neighbour)
&& !result.Contains (neighbour)
&& !source.Contains (neighbour)) {
source.Enqueue (neighbour);
}
}
}
Debug.Assert (result.Distinct ().Count () == result.Count ());
Debug.Assert (result.All (x => select (x, result.First ())));
return result;
}
private List<Intersection> Neighbours (IIntersection intersection)
{
int x = intersection.X;
int y = intersection.Y;
List<Intersection> list = new List<Intersection> ();
if (x > 1) {
list.Add (GetIntersection (x - 1, y));
}
if (y > 1) {
list.Add (GetIntersection (x, y - 1));
}
if (x < Size) {
list.Add (GetIntersection (x + 1, y));
}
if (y < Size) {
list.Add (GetIntersection (x, y + 1));
}
return list;
}
(The select
function takes a start item and returns true iff the second item satisfies.)
This does its job and turned out to be reasonable fast for the usual array sizes (about 20*20). However, I'm interested in further improvements. Any ideas?
Example (X
satisfies in relation to other X
s, .
does never satisfy):
.... XX.. .XX. X...
In thi开发者_C百科s case, there are 2 groups: a central group of 4 items and a group of a single item in the lower left. Selecting the group (for instance by starting item [3, 3]) returns the former, while the latter can be selected using the starting item and sole return value [1, 4].
Example 2:
.A.. ..BB A.AA
This time there are 4 groups. The 3 A groups are not connected, so they are returned as separate groups. The bigger A and B groups are connected, but A does not related to B so they are returned as separate groups.
Step 1: trivial change for huge benefit
Simple, immediate improvement: Both your result.Contains
and source.Contains
memberships are on list types, so they'll be O(n) in the size of those lists, not very efficient. Since you really don't care about any particular ordering, I'd change both of those into HashSet
for constant-time lookups.
Note that your current implementation would be O(n^2) in the worst case, which happens when the entire array is valid (by the time you're inserting the last few elements, you'd be checking each one against the entire rest of the grid).
Step 2: further optimization
Better structural change: Keep a boolean visited
array of the same size as your Intersection array, and every time you look at an Intersection
, mark it as visited. This way, you don't have to check if something's in result
or source
every time, and better yet, you don't have to re-evaluate your select
predicate. Otherwise, given something like this:
XXX
X.X
XXX
you'd be evaluating select
on the center dot four times, which could be bad if it's expensive. This way, your
if (select (start, neighbour)
&& !result.Contains (neighbour)
&& !source.Contains (neighbour))
condition turns into: if (!visited(neighbour) && select(start, neighbour)
, which will only evaluate select
at most once on any given intersection.
Also, if you do this, you don't even need to make result
and contains
hashes any more, since you won't be doing containment-checks on them.
I am not really good at C# but reading your algorithm description, I can have a few suggestions for you:
1/ Do you know Dynamic programming or Bellman Algorithm? The idea is that for each iteration, you only get the newest result and continue searching. Dynamic Programming
For example :
1 st iteration : x1, x2
2 nd iteration : x3, x4
the third iteration, you only search and compare back with x3 and x4. It will save you a lot of CPU calculation and iteration
2/ Use HashSet to have a better searching and getting data for contains
Based on tzaman's answer, I used a HashSet. Since the queue does not support hashing, I found an improvement that only requires lookup in the result list instead of both result and source list:
HashSet<Intersection> result = new HashSet<Intersection> ();
result.Add (start); // need to add first item in advance
Queue<Intersection> source = new Queue<Intersection> ();
source.Enqueue (start);
while (source.Any ()) {
var s = source.Dequeue ();
foreach (var neighbour in Neighbours (s)) {
if (select (start, neighbour) && !result.Contains (neighbour)) {
result.Add (neighbour); // add immediately to hashset
source.Enqueue (neighbour);
}
}
}
精彩评论