开发者

Sorting and Balancing Across Multiple Columns

Problem

I have a Hash of data that looks something like this.

{ "GROUP_A" => [22, 440],
"GROUP_B" => [14, 70],
"GROUP_C" => [60, 620],
"GROUP_D" => [174, 40],
"GROUP_E" => [4, 12]
# ...few hundred more
}

GROUP_A has 22 accounts and they are using 440GB of data...and so on. There are a couple hundred of these groups. Some have a lot of accounts but use very little storage and some have only a few users and use A LOT of storage, some are just average.

I have X number of buckets (servers) that I want to put these groups of accounts into, and I want there to be approximately the same number of accounts per bucket and have each bucket also contain approximately the same amount of data. Number of groups is not important, so if a bucket had 1 group of 1000 accounts using 500GB of data and the next bucket had 10 groups of 97 accounts (970 total) using 450GB of data...I'd call it good.

So far I've not come up with an algorithm that will do this. In my mind I'm thinking of something like this perhaps?

PASS 1
  Bucket 1:  Group with largest data, 60 users.
  Bucket 2:  Next largest data group, 37 users.
  Bucket 3:  Next largest data group, 72 users.
  Bucket 4:  etc....

PASS 2
  Bucket 1:  Add a group with small amount of data, but more users than average.
  # There's probably a ratio I can calculate to figure this out...divide users/datavmaybe?
  Bucket 2:  Find a "small data" group where sum of users in Bucket 1 ~= sum of users in Bucket 2
  # But then there's no guarantee that the data usages will be close enough
  Bucket 3:  etc...

PASS 3
  Bucket 1:  Now what?  Back to next largest data group?  

I still think there's a better way to figure this out but it's not coming to me. If anyone has any thoughts I'm open to suggestions.

Matt

Solution 1.1 - Brute Force Update

Well....here's an update to the first attempt. This is still not a "knapsack-problem" solution. Just brute forcing the data so the accounts balance across buckets. This time I added some logic so that if a bucket has a higher full percentage of accounts vs. data...it will find the largest group (by data) that fits best based on number of accounts. I get a lot better distribution of data now vs. my first attempt (see the edit history if you want to look at the first attempt).

Right now I load each bucket in sequence, filling bucket one, then bucket two, etc... I think if I was to modify the code so that I filled them simultaneously (or nearly so) I'd get a better data balance.

e.g. 1st department into bucket 1, 2nd department into bucket 2, etc...until all buckets have one department... Then start back with bucket 1 again.

dept_arr_sorted_by_acct =  dept_hsh.sort_by {|key, value| value[0]}
ap "MAX ACCTS: #{max_accts}    AVG ACCTS: #{avg_accts}"
ap "MAX SIZE:  #{max_size}     AVG SIZE:  #{avg_data}"

# puts dept_arr_sorted_by_acct
# exit


bucket_arr = Array.new
used_hsh = Hash.new

server_names.each do |s|
  bucket_hsh = Hash.new
  this_accts=0
  this_data=0
  my_key=""
  my_val=[]
  accts=0
  data=0
  accts_space_pct_used = 0
  data_space_pct_used = 0
  while this_accts < avg_accts

    if accts_space_pct_used <= data_space_pct_used
    # This loop runs if the % used of accts is less than % used of data
      dept_arr_sorted_by_acct.each do |val|
        # Sorted by num accts - ascending.  Loop until we find the last entry in the array that has <= accts than what we need
        next if used_hsh.has_key?(val[0])
          #do nothing
        if val[1][0] <= avg_accts-this_accts
          my_key = val[0]
          my_val = val[1]
          开发者_高级运维accts = val[1][0]
          data = val[1][1]
        end
      end
    else
    # This loop runs if the % used of data is less than % used of accts
      dept_arr_sorted_by_data = dept_arr_sorted_by_acct.sort { |a,b| b[1][1] <=> a[1][1] }
      dept_arr_sorted_by_data.each do |val|
        # Sorted by size - descending.  Find the first (largest data) entry where accts <= what we need
        next if used_hsh.has_key?(val[0])
          # do nothing
        if val[1][0] <= avg_accts-this_accts
          my_key = val[0]
          my_val = val[1]
          accts = val[1][0]
          data = val[1][1]
          break
        end
      end
    end

    used_hsh[my_key] = my_val
    bucket_hsh[my_key] = my_val
    this_accts = this_accts + accts
    this_data = this_data + data
    accts_space_pct_used = this_accts.to_f / avg_accts * 100
    data_space_pct_used = this_data.to_f / avg_data * 100
  end
  bucket_arr << [this_accts, this_data, bucket_hsh]
end

x=0
while x < bucket_arr.size do
  th = bucket_arr[x][2]
  list_of_depts = []
  th.each_key do |key|
    list_of_depts << key
  end
  ap "Bucket #{x}:  #{bucket_arr[x][0]} accounts :: #{bucket_arr[x][1]} data :: #{list_of_depts.size} departments"
  #ap list_of_depts
  x = x+1
end

...and the results...

"MAX ACCTS: 2279    AVG ACCTS: 379"
"MAX SIZE:  1693315     AVG SIZE:  282219"
"Bucket 0:  379 accounts :: 251670 data :: 7 departments"
"Bucket 1:  379 accounts :: 286747 data :: 10 departments"
"Bucket 2:  379 accounts :: 278226 data :: 14 departments"
"Bucket 3:  379 accounts :: 281292 data :: 19 departments"
"Bucket 4:  379 accounts :: 293777 data :: 28 departments"
"Bucket 5:  379 accounts :: 298675 data :: 78 departments"

(379 * 6 <> 2279) I still need to figure out how to account for when the MAX_ACCTS are not evenly divisible by the number of buckets. I tried adding a 1% pad to the AVG_ACCTS value, which in this case means the average would be 383 I think, but then all the buckets say they have 383 accounts in them...which can't be true because then there are more accounts in the buckets than MAX_ACCTS. I've got a mistake in the code somewhere that I haven't found yet.


This is an example of the knapsack problem. There are a few solutions, but it's a really tricky problem and it's better to research a good solution than to try and make your own.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜