Finding the least-used permutation
I need to distribute a set of data evenly over time, based on historical data, such that each digit appears an equal (or close to equal) number of times in each position. The problem is, given a list of orderings used in the past, which look like this (but could have any number of elements):
1,2,5,3,4
4,1,5,2,3
1,3,5,2,4
4,1,2,3,5
2,4,1,3,5
5,1,4,3,2
1,5,3,2,4
5,1,3,2,4
3,2,5,4,1
4,3,1,5,2
how can I find an ordering of the values that is the least used and will lead to a "more balanced" set of orderings? The obvious answer is to group the orderings, count them, and pick the least-used one, but the least-used permutation may never have been used at all. Here, for example, the ordering "1,2,3,4,5" is a candidate for least used because it doesn't appear in the list.
The simple answer seems to be to identify the position in which "1" appears least frequently, set that position to "1", and so on for each digit. I suspect that works, but I feel like there's a more elegant solution that I haven't considered, potentially with cross joins so that all possible combinations are included.
Any ideas?
What you have here is a histogram leveling problem.
Consider the problem from this perspective: you have a set of N histograms, each representing the frequency of occurrence of N values over a discrete range {1..N}. What you want to do is add a new set of values to your population of data that shifts all of the histograms closer to being level. Given the nature of your problem, we know that each value will, overall, appear the same number of times as every other value.
One way to do so is to find which of the N values has the lowest frequency of occurrence in any position, and assign it that position. Next, in the remaining histograms, find the next value with the lowest frequency of occurrence in any remaining position, and assign that value to that position. Repeat this process until every value has been assigned a unique position. This gives you your next set of values. You can then repeat the whole operation to keep generating new value sets, each of which attempts to re-balance the distribution of values.
If you maintain the histograms as you distribute values, this becomes a relatively efficient operation (you don't have to constantly re-scan the data set).
Keep in mind, however, that for any sufficiently small population of values, you will always be "out of balance" to some degree. There's no way around this.
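The greedy assignment described above could be sketched in Python roughly as follows. This is only an illustration under some assumptions: the function names `position_counts` and `next_ordering` are made up for this example, values are assumed to be the integers 1..N, and ties are broken by lowest value and then lowest position, which the description above does not specify.

```python
def position_counts(history, n):
    """Build counts[v][p]: how often value v+1 appeared in position p."""
    counts = [[0] * n for _ in range(n)]
    for ordering in history:
        for pos, value in enumerate(ordering):
            counts[value - 1][pos] += 1
    return counts


def next_ordering(counts):
    """Greedily assign each value to a free position, rarest pairing first.

    Updates counts in place, so the histograms never need to be rebuilt
    from the full history.
    """
    n = len(counts)
    result = [None] * n
    free_values = set(range(n))
    free_positions = set(range(n))
    while free_values:
        # Lowest count wins; ties go to the lowest (value, position) pair.
        # The tie-break rule is an assumption made for this sketch.
        v, p = min(
            ((v, p) for v in free_values for p in free_positions),
            key=lambda vp: (counts[vp[0]][vp[1]], vp),
        )
        result[p] = v + 1
        counts[v][p] += 1
        free_values.remove(v)
        free_positions.remove(p)
    return result


# The orderings listed in the question.
history = [
    [1, 2, 5, 3, 4], [4, 1, 5, 2, 3], [1, 3, 5, 2, 4], [4, 1, 2, 3, 5],
    [2, 4, 1, 3, 5], [5, 1, 4, 3, 2], [1, 5, 3, 2, 4], [5, 1, 3, 2, 4],
    [3, 2, 5, 4, 1], [4, 3, 1, 5, 2],
]
counts = position_counts(history, 5)
print(next_ordering(counts))
```

Because `next_ordering` updates `counts` as it goes, calling it repeatedly yields a sequence of orderings without re-scanning the history. Being greedy, it is not guaranteed to pick the ordering that minimizes overall imbalance at each step.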
I presume that you have a way to generate a random permutation (see, e.g., "Most efficient way to randomly 'sort' (Shuffle) a list of integers in C#"). Given that, one suggestion to generate a single new ordering is as follows:
1) Generate two random permutations.
2) Keep whichever one of them would even out the imbalance the most.
One measure of balance would be to think of the list of all of the counts of digit frequencies at each position as a vector, which, in the case of perfect balance, would have each element the same. The imbalance would then be the length of the vector you get by subtracting off that perfect vector. By choosing between two random permutations you will pick a permutation from a distribution whose mean vector points in a direction opposite to the current imbalance, so you should tend to correct it while still producing a random-ish sequence of permutations.
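A minimal Python sketch of this "best of two" idea, assuming the counts are kept in an N x N matrix where `counts[v][p]` is how often value `v+1` has appeared in position `p`. The names `imbalance`, `with_ordering`, and `pick_better_of_two` are hypothetical, and the Euclidean length is used here as one reasonable reading of "the length of the vector".

```python
import random


def imbalance(counts):
    """Euclidean distance between counts and a perfectly level matrix."""
    n = len(counts)
    ideal = sum(map(sum, counts)) / (n * n)
    return sum((c - ideal) ** 2 for row in counts for c in row) ** 0.5


def with_ordering(counts, ordering):
    """Return a copy of counts with one more ordering recorded."""
    new = [row[:] for row in counts]
    for pos, value in enumerate(ordering):
        new[value - 1][pos] += 1
    return new


def pick_better_of_two(counts, rng=random):
    """Draw two random permutations of 1..N and keep the more balancing one."""
    n = len(counts)
    candidates = [rng.sample(range(1, n + 1), n) for _ in range(2)]
    return min(candidates, key=lambda o: imbalance(with_ordering(counts, o)))


# Example: value 1 has only ever been first, value 2 only ever second.
counts = [[5, 0], [0, 5]]
choice = pick_better_of_two(counts)
counts = with_ordering(counts, choice)
print(choice, imbalance(counts))
```

Drawing more than two candidates would correct the imbalance faster at the cost of making the sequence less random; two is simply the number suggested above.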
If the total number of combinations is small enough there's an approach I used on a similar problem long ago:
Maintain a pool of choices that is periodically replenished.
In your example you have 120 possible permutations. Make an array of 120 elements and give each an initial value of, say, 5. When you need a random value you pick from this pool, with the number in each bin being the weight given to that bin. (At the start the bins sum to 600. Pick a random number from 1 to 600, then subtract bin values from it one at a time until the result is <= 0. The bin you just subtracted is your result.) When an entry is picked, decrement its bin by one. Once you've made 120 draws from the pool, add 1 to every bin.
Obviously this becomes impractical if the total number of possibilities is too high.
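As a rough Python sketch of the pool described above (the class name `PermutationPool` and its parameters are invented for this example, and the pool is assumed to start empty of history rather than being seeded from past orderings):

```python
import itertools
import random


class PermutationPool:
    """Weighted pool of all permutations, replenished every full cycle."""

    def __init__(self, values, initial_weight=5, rng=None):
        self.perms = list(itertools.permutations(values))
        self.weights = [initial_weight] * len(self.perms)
        self.draws = 0
        self.rng = rng or random.Random()

    def draw(self):
        # Pick a number in 1..sum(weights) and subtract bins until <= 0.
        target = self.rng.randint(1, sum(self.weights))
        for i, weight in enumerate(self.weights):
            target -= weight
            if target <= 0:
                break
        self.weights[i] -= 1  # a recently used permutation becomes less likely
        self.draws += 1
        # After as many draws as there are bins, add 1 to every bin.
        if self.draws % len(self.perms) == 0:
            self.weights = [w + 1 for w in self.weights]
        return self.perms[i]


pool = PermutationPool(range(1, 6))
print(len(pool.perms), sum(pool.weights))
print(pool.draw())
```

The list of permutations grows as N!, so this sketch is only workable for small N, in line with the caveat above.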