Making batch calls to a web service and saving the progress
I need to make a lot of calls to a web service to fetch more than 180,000 rows of data. I'm using Ruby 1.9.2.
There's no way to know the total number of results in advance: it might be 150,000 rows one day and 200,000 the next week, so I need to make these calls in batches until a call returns zero rows.
Right now I have something like this (this is not the actual code, just a simplified version for illustration):
limit = 1000
offset = 0
@data = @client.get_data :limit => limit, :offset => offset

until @data.length.zero?
  # save @data to database
  offset += limit
  @data = @client.get_data :limit => limit, :offset => offset
end
What I'd like is to have several threads making the calls, to save the progress so no data is lost when a call times out, and to retry the call with the same parameters when that happens.
The main problem is that I don't know the total number of rows I will get. If I did, I'd use something like Resque and define the necessary jobs to fetch all the data, but that's not the case here: I just need to keep increasing the offset value until I get no results.
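For the save/retry part, this is roughly what I'm picturing (a minimal sketch; save_batch and last_saved_offset are hypothetical persistence helpers, and I'm assuming the client raises Timeout::Error on a timeout):

limit = 1000
offset = last_saved_offset || 0   # resume from the last checkpoint, if any

loop do
  begin
    @data = @client.get_data :limit => limit, :offset => offset
  rescue Timeout::Error
    retry   # retry with the same parameters, same offset
  end
  break if @data.length.zero?
  save_batch(offset, @data)   # persist the rows and the offset together
  offset += limit
end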
Any suggestions?
I'm not sure exactly what you mean, but since you don't know how much total data there is and you still want to use threads, you have no option but to guess blindly at the offsets.
Why not use something like a thread pool:
https://github.com/fizx/thread_pool
Then you just loop through the threads in the pool, and assign each one a new offset from your list of "available" offsets. Something like this:
limit = 1000
next_offset = 0
@available_data = true
@data = {}
# pool = ThreadPool.new(10) or similar, per the thread_pool gem's README

while @available_data
  pool.execute(next_offset) do |offset|
    # Fetch one batch; get_data handles retries internally (see below)
    @data[offset] = @client.get_data :limit => limit, :offset => offset
    # Stop scheduling new batches once one comes back empty
    @available_data = false if @data[offset].length.zero?
  end
  next_offset += limit
end
# Wait for all threads to finish
pool.join

# Consolidate your data, in offset order
@all_data = []
@data.keys.sort.each do |offset|
  # Gather the data from all threads into one place
  @all_data += @data[offset]
end
Then your get_data handles exceptions, failures and retries internally and automatically. That way get_data ensures that the data for a given offset will be returned no matter what.
Of course you need to add some sort of "force majeure" error checking: say, give up after 100 retries. At that point you just exit everything and log the failed attempt, or handle it however you want.
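For example, a rough sketch of such a wrapper (fetch stands in for the raw web-service call, which is assumed, and the retry limit is arbitrary):

MAX_RETRIES = 100

def get_data(options)
  retries = 0
  begin
    fetch(options)   # the underlying web-service call (assumed)
  rescue StandardError => e
    retries += 1
    # Force-majeure cutoff: stop retrying this offset after MAX_RETRIES failures
    raise "Giving up on offset #{options[:offset]}: #{e.message}" if retries >= MAX_RETRIES
    sleep 1   # brief pause before retrying the same offset
    retry
  end
end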