
How can I speed up my Ruby/Rake task, which counts occurrences of dates among 300K date strings?

I have an array of 300K strings which represent dates:

date_array = [
  "2007-03-25 14:24:29",
  "2007-03-25 14:27:00",
  ...
]

I need to count occurrences of each date in this array (e.g., all date strings for "2011-03-25"). The exact time doesn't matter -- just the date. I know the range of dates within the file. So I have:

Date.parse('2007-03-23').upto Date.parse('2011-10-06') do |date_to_count|
  count = 0
  date_array.each do |date_string|
    if Date.parse(date_string) >= date_to_count && 
       Date.parse(date_string) <= date_to_count
      count += 1
    end
  end
  puts "#{date_to_count} occurred #{count} times."
end

Counting occurrences of just one date takes longer than 60 seconds on my machine. In what ways can I optimize the performance of this task?

Possibly useful notes: I'm using Ruby 1.9.2. This script is running in a Rake task with rake 0.9.2. The date_array is loaded from a CSV file. On each iteration, the count is saved as a record in my Rails project database.


Yes, you don't need to parse the dates at all if they are formatted the same. Knowing your data is one of the most powerful tools you can have.

If the datetime strings are all in the same format (yyyy-mm-dd HH:MM:SS) then you could do something like

date_array.group_by { |datetime| datetime[0..9] }

This will give you a hash with the date strings as keys and arrays of the matching datetime strings as values:

{
  "2007-05-06" => [...],
  "2007-05-07" => [...],
  ...
}

So you'd have to get the length of each array:

date_array.group_by { |datetime| datetime[0..9] }.each do |date_string, datetimes|
  puts "#{date_string} occurred #{date_array.length} times."
end
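
As a minimal sketch of the grouping approach, here it is run on a few made-up strings. The sample values are illustrative only (the real date_array comes from the CSV file), and the code assumes every string starts with a 10-character yyyy-mm-dd date.

```ruby
# Hypothetical sample data standing in for the 300K strings from the CSV.
date_array = [
  "2007-03-25 14:24:29",
  "2007-03-25 14:27:00",
  "2007-03-26 09:00:00"
]

# Group by the first 10 characters (the yyyy-mm-dd part); nothing is parsed.
grouped = date_array.group_by { |datetime| datetime[0..9] }

# Each value is the array of full datetime strings that share that date.
grouped.each do |date_string, datetimes|
  puts "#{date_string} occurred #{datetimes.length} times."
end
```

Because the key is just a substring, this only works while the format stays fixed; a string with a different layout would silently land under the wrong key.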

Of course, that method wastes memory by building arrays of datetime strings you never use. A more memory-efficient method is to keep only a running count per date:

date_counts = {}
date_array.each do |date_string|
  date = date_string[0..9]
  date_counts[date] ||= 0 # initialize count if necessary
  date_counts[date] += 1
end

You'll end up with a hash with the date strings as the keys and the counts as values

{
  "2007-05-06" => 123,
  "2007-05-07" => 456,
  ...
}
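
A variation on the same idea, offered as a sketch rather than the answer's own code: giving the hash a default value of 0 with Hash.new(0) removes the explicit initialization step. The sample strings below are made up for illustration.

```ruby
# Hypothetical sample data; the real date_array is loaded from the CSV.
date_array = [
  "2007-03-25 14:24:29",
  "2007-03-25 14:27:00",
  "2007-03-26 09:00:00"
]

# Hash.new(0) returns 0 for any key that has not been stored yet,
# so += 1 works on the first occurrence without an explicit ||= 0.
date_counts = Hash.new(0)
date_array.each do |date_string|
  date_counts[date_string[0..9]] += 1
end

puts date_counts.inspect
puts date_counts["2011-01-01"] # a date that never occurs reads as 0
```

Reading a missing key returns the default without storing it, so looking up days that have no entries does not grow the hash.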

Putting everything together

date_counts = {}
date_array.each do |date_string|
  date = date_string[0..9]
  date_counts[date] ||= 0 # initialize count if necessary
  date_counts[date] += 1
end

Date.parse('2007-03-23').upto Date.parse('2011-10-06') do |date_to_count|
  puts "#{date_to_count} occurred #{date_counts[date_to_count.to_s].to_i} times."
end


This is a very inefficient algorithm. You're scanning through the entire list once for each date in the range, and on top of that you're parsing the same string twice per comparison for no apparent reason (the >= and <= checks together amount to a single equality test). That means for N dates in the range and M strings in the list you're doing N*M*2 date parses.
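
To put rough numbers on that, here is a small sketch that works out N*M*2 for the date range and list size given in the question. The variable names are illustrative, and M is taken as exactly 300,000 for the sake of the arithmetic.

```ruby
require 'date'

first_day = Date.parse('2007-03-23')
last_day  = Date.parse('2011-10-06')

# N: number of days in the range, counting both endpoints.
days_in_range = (last_day - first_day).to_i + 1

# M: number of strings in the list (assumed to be exactly 300K here).
strings_in_list = 300_000

# The nested loop parses every string twice for every day in the range.
parses_nested = days_in_range * strings_in_list * 2

# A single pass parses each string at most once.
parses_single_pass = strings_in_list

puts "Days in range:       #{days_in_range}"
puts "Nested-loop parses:  #{parses_nested}"
puts "Single-pass parses:  #{parses_single_pass}"
```

The range covers 1659 days, so the nested loop performs about 995 million parses, against at most 300,000 for a single pass.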

What you really need is to use group_by and do it in one pass:

dates = date_array.group_by do |date_string|
  Date.parse(date_string)
end

Then you can use this as a reference for your counts:

Date.parse('2007-03-23').upto Date.parse('2011-10-06') do |date_to_count|
  puts "#{date_to_count} occurred #{dates[date_to_count] ? dates[date_to_count].length : 0} times."
end
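
For comparison, the following sketch counts the same synthetic data two ways, by slicing the string and by parsing it into a Date, then checks that the results agree and prints how long each pass took. The data set is generated purely for illustration, and timings will vary by machine and Ruby version.

```ruby
require 'date'
require 'benchmark'

# Synthetic data (an assumption for illustration): 10,000 timestamps
# spread evenly over 100 consecutive days.
start = Date.parse('2007-03-23')
date_array = Array.new(10_000) { |i| "#{start + (i % 100)} 12:00:00" }

by_slice = Hash.new(0) # keyed by "yyyy-mm-dd" strings
by_parse = Hash.new(0) # keyed by Date objects

slice_time = Benchmark.realtime do
  date_array.each { |s| by_slice[s[0..9]] += 1 }
end

parse_time = Benchmark.realtime do
  date_array.each { |s| by_parse[Date.parse(s)] += 1 }
end

# Both approaches should report the same count for every day.
same = by_parse.all? { |date, count| by_slice[date.to_s] == count }

puts "Counts agree: #{same}"
puts "Slicing: #{slice_time} s, parsing: #{parse_time} s"
```

Parsing is the safer choice if the input format might vary, while slicing avoids the parsing cost entirely when the format is known to be fixed.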