Efficient Algorithm for comparing only the lists that have been updated
Even describing this problem is hard, But I'll give it a go. I've been struggling with this for a few days and decided to ask here.
OK, so I'm trying to model "concepts", or "things" as I call them. Just concepts in general. It's to do with processing logic.
So, each "thing" is defined by it's relationship to other things. I store this as a set of 5 bits per relationship. A 'thing' could be like this:
class Thing {
char* Name;
HashTable<Thing*, int> Relationships;
}
So, I model "Things" like that. 5 bits per relationship. Each bit stands for one possible relationship. Like this: 1 equals, 2 inside, 3 outside, 4 contains, 5 overlaps. Having all 5 bits on means we totally don't know what the relationship is. Having 2 bits means we think the relationship could be one of two possibilities. Relationships start off as "unknown" (all 5 bits are true) and get more specific as time goes on.
So this is how I model increasing knowledge over time. Things start off in a fully unknown state, and can pass through partially known states, and can reach fully known states.
A little more background:
I try to add extra definition to my modelling of "concepts" (things), by using extra classes. Like this:
class ArrayDefinition {
Array<Thing> Items;
}
And my Thing class becomes like this:
class Thing {
char* Name;
HashTable<Thing*, int> Relationships;
ArrayDefinition* ArrayDef;
}
This "ArrayDef" doesn't HAVE to be used. It's just there to be used, if needed. Some "things" don't have arrays, some do. But all "things" have relationships.
I can process this "ArrayDefinition" to figure out the relationship between two things! For example, if X = [ A B C D E ]
and Y = [ C D E ]
, my code can process these two arrays, and figure out that "Y inside X
".
OK, so that's enough background. I've explained the core problem, avoiding my real code which has all sorts of distracting details.
Here's the problem:
The problem is making this not run ridiculously slow.
Imagine, there are 2000 "things". Let's say 1000 of these have array definitions. Now, that makes 500,000(ish) possible "array-pairs" that we need to compare against each other.
I hope I'm starting to finally make sense now. How to avoid processing them all against each other? I've already realised that if two "things" have a fully known relationship, there is no point in comparing their "array definitions", because that's just used to figure out extra detail on the relationship, but we have the exact relationship, so there's no point.
So... let's say only 500 of these "things with arrays" have unknown or partially known relationships. That still makes 250000(ish) possible "array-pairs" to compare!
Now... to me, the most obvious place to start, is realising that unless a relationship used to define two arrays changes (Becomes more specific), then there is no point processing this "array-pair".
For example, let's say I have these two arrays:
X = [ A B C D E ]
Y = [ Q W R T ]
now, if I say that T=R
, that's very nice. But this does not affect the relationship between X and Y. So just because T's relationship to R has become known as "equal", whereas before it might have been fully unknown, this does not mean I need to compare X and Y again.
On the other hand, if I say "T outside E
", this is a relationship between things used to define the two arrays. So saying that "T outside E
" means I need to process X's array against Y's array.
I really don't want to have to compare 500,000 "array-pairs" just to process logic on 1000 arrays when almost nothing has changed between them!
So... my first attempt at simplifying this, was to keep a list of all the arrays that a thing is used to define.
Let's say I have 3 arrays:
A = [ X Y Z ]
B = [ X X X X ]
C = [ X Z X F ]
Well, X is used in 3 arrays. So, X could keep a list of all the arrays it is used inside of.
So, if I said "X inside Y"
, this could bring up a list of all the arrays that Y is used to define, and all the arrays X is used to define. Let's say X is used in 3 arrays, and Y is used in 1 array. From this, we can figure out that there are 2 "array-pairs" we need to compare (A vs B, and A vs C).
We can futher trim this list by checking if any of the array pairs already have fully known relationships.
The problem I have with this, is it STILL seems excessive.
Let's say X is a really common "thing". It's used in 10,000 arrays. And Y is a really common thing, used in 10,000 arrays.
I still end up with 100,000,000 array-pairs to compare. OK, so let's say I don't need to compare them all, actually, only 50 of them were partially known or totally unknown.
But... I still had to run over a list of 100,000,000 array-pairs, to figure out which of these was partially known. So it's still inefficient.
I'm really wondering if there is no efficient method of doing this. And开发者_如何学运维 if really all I can do is make a few effective "heuristicish" strategies. I haven't had too much luck coming up with good strategies yet.
I realise that this problem is highly specialised. And I realise that reading this long post may take too long. I'm just not sure how to shrink the post length or describe this in terms of more common problems.
If it helps... My best attempt to express this in common terms, is "how to compare only the lists that have been updated".
Anyone got any ideas? That would be great. If not... perhaps just me writing this out may help my thinking process.
The thing is, I just can't help but feel that there is some algorithm or approach that can make this problem run fast and efficient. I just don't know what that algorithm is.
Thanks all
In general, you won't be able to come up with a structure that is as-fast-as-possible for every operation. There are tradeoffs to be made.
This problem looks very similar to that of executing queries on a relational database - SELECT * WHERE ...
. You might consider looking there for inspiration.
I'm not sure I understand what you are doing completely (the purpose of ArrayDefinition is particularly hazy), but I think you should consider separating the modeling of the objects from their relationships. In other words, create a separate mapping from object to object for each relationship. If objects are represented by their integer index, you need only find an efficient way to represent integer to integer mappings.
I had a sleep and when I woke up, I had a new idea. It might work...
If each "thing" keeps a list of all the "array definitions" it is used to define.
class Thing {
char* Name;
HashTable<Thing*, int> Relationships;
ArrayDefinition* ArrayDef;
Set<ArrayDefinition*> UsedInTheseDefs;
}
class ArrayDefinition {
Array<Thing> Items;
Set<int> RelationModifiedTag;
}
And I keep a global list of all the "comparable array pairs".
And I also construct a global list, of all the "arrays that can be compared" (not in pairs, just one by one).
Then, everytime a relationship is changed, I can go over the list of "arrays definitions" I'm inside of, and add a little "tag" to it :)
So I can do something like this:
static CurrRel = 0;
CurrRel++; // the actual number doesn't matter, it's just used for matching
foreach(Arr in this->UsedInTheseDefs) {
Arr->RelationModifiedTag.Add( CurrRel );
}
foreach(Arr in other->UsedInTheseDefs) {
Arr->RelationModifiedTag.Add( CurrRel );
}
I altered both sides of the relationship. So if I did this: "A outside B"
, then I've added a "modifiedtag" to all the arrays A is used to define, and all the arrays B is used to define.
So, then I loop over my list of "comparable array-pairs". Each pair of course is two arrays, each one will have a "RelationModifiedTag" set.
So I check both RelationModifiedTag sets against each other, to see if they have any matching numbers. If they DO, then this means this array pair has a relationship that's just been altered! So... I can do my array comparison then.
It should work :)
It does require a bit of overhead, but the main thing is I think it scales well to larger data sets. For smaller datasets say only 10 arrays, a simpler more brute force approach could be used, just compare all array-pairs that don't have fully known relationship, and don't bother to keep track of what relationships have been altered.
There's further optimisations possible. But I won't go into those here, because it just distracts from the main algorithm, and they are kind of obvious. For example if I have two sets to compare, I should loop over the smaller set and check against the bigger set.
Apologies for having to read all this long text. And thanks for all the attempts to help.
Well, first of all, some vocabulary.
Design Pattern: Observer
It's a design pattern that allow objects to register themselves into others, and ask for notifications on events.
For example, each ThingWithArray
could register itself in the Thing
they managed, so that if the Thing
is updated the ThingWithArray
will get notified back.
Now, there is usually an unsubscribe
method, meaning that as soon as the ThingWithArray
no longer depends on some Thing
because all the relations that use them have been used, then they could unsubscribe
themselves, so as not to be notified of the changes any longer.
This way you only notify those which actually care about the update.
There is one point to take into account though: if you have recursive relationships, it might get hairy, and you'll need to come up with a way to avoid this.
Also, follow ergosys advise, and model relationships outside of the objects. Having 1 BIG class is usually the start of troubles... if you have difficulty cutting it into logical parts, it's that the problem is not clear for you, and you should ask help on how to model it... Once you've got a clear model, things usually fall into place a bit more easily.
From your own answer I deduce that the unknown relations are greatly over-numbered by known relationships. You could then keep track of the unknown relationships of each thing in a separate hash table/set. As a further optimization, instead of keeping track of all definitions that a thing is used in, keep track of which of these definitions have unknown relationships - which relationships can be affected. Then given a newly defined relationship between X and Y, take the affected definitions of one of them, and find the intersection of each of the unknown relations with the affected definitions of the other one. To keep the acceleration datastructure up to date, when a relationships becomes known, remove it from the unknown relationships and if no unknown relationships remain go over the array def and remove the thing from can-affect sets.
The datastructure would then look something like this:
class Thing {
char* Name;
HashTable<Thing*, int> Relationships;
Set<Thing*> UnknownRelationships;
ArrayDefinition* ArrayDef;
Set<Thing*> CanAffect; // Thing where this in ArrayDefinition and UnknownRelationships not empty
}
class ArrayDefinition {
Array<Thing> Items;
}
精彩评论