Ruby threads - out of resources

I wrote the crawler below to take a list of URLs from a file and fetch the pages. The problem is that after 2 hours or so the system becomes very slow and almost unusable. The system is a quad-core Linux box with 8 GB of RAM. Can someone tell me how to resolve this issue?

require 'rubygems'
require 'net/http'
require 'uri'

threads = []
to_get = File.readlines(ARGV[0])

dir = ARGV[1]
errorFile = ARGV[2]

error_f = File.open(errorFile, "w")

puts "Need to get #{to_get.length} queries ..!!"
start_time = Time.now

100.times do
  threads << Thread.new do
    while q_word = to_get.pop
      toks = q_word.chop.split("\t")

      entity = toks[0]
      urls = toks[1].chop.split("::")
      count = 1

      urls.each do |url|
        q_final = URI.escape(url)
        q_parsed = URI.parse(q_final)

        filename = dir+"/"+entity+"_"+count.to_s

        if(File.exists? filename)
          count = count + 1
        else
          begin
            res_http = Net::HTTP.get(q_parsed.host, q_parsed.request_uri)
            File.open(filename, 'w') {|f| f.write(res_http) }
          rescue Timeout::Error
            error_f.write("timeout error " + url+"\n")
          rescue 
            error_f.write($!.inspect + " " + filename + " " + url+"\n")
          end
          count = count + 1
        end
      end
    end
  end 
end

puts "waiting here"

threads.each { |x| x.join }
puts "finished in #{Time.now - start_time}"
#puts "#{dup} duplicates found"
puts "writing output ..."
error_f.close()
puts "Done."


In general, you can't modify objects that are shared among threads unless those objects are thread safe. I would replace to_get with an instance of Queue, which is thread safe.

Before creating any threads:

to_get = Queue.new
File.readlines(ARGV[0]).each do |url|
  to_get.push url.chomp
end
number_of_threads.times do
  to_get.push :done
end

And in the thread:

loop do
  url = to_get.pop
  break if url == :done
  ...
end
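
Put together, a minimal sketch of that worker pattern might look like this (fetch_and_save is just a placeholder for the body of your original while loop):

require 'thread'  # Queue lives in the thread library on older Rubies

THREAD_COUNT = 8

to_get = Queue.new
File.readlines(ARGV[0]).each { |line| to_get.push(line.chomp) }
THREAD_COUNT.times { to_get.push(:done) }  # one sentinel per worker

threads = THREAD_COUNT.times.map do
  Thread.new do
    loop do
      line = to_get.pop
      break if line == :done
      fetch_and_save(line)  # your existing per-line logic goes here
    end
  end
end

threads.each(&:join)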


For this type of problem I highly recommend that you look at EventMachine. Check this example on how to fetch URLs in parallel with EventMachine and Ruby.
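
A rough sketch of that approach, assuming the em-http-request gem (treat the filename handling as an illustration only):

require 'eventmachine'
require 'em-http-request'

urls = File.readlines(ARGV[0]).map(&:chomp)

EventMachine.run do
  pending = urls.size

  urls.each do |url|
    http = EventMachine::HttpRequest.new(url).get
    http.callback do
      # naive filename choice, purely illustrative
      File.open(File.basename(url), 'w') { |f| f.write(http.response) }
      pending -= 1
      EventMachine.stop if pending.zero?
    end
    http.errback do
      warn "failed: #{url}"
      pending -= 1
      EventMachine.stop if pending.zero?
    end
  end
end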


The problem is probably with RAM. All downloaded files are kept in memory after you download and save them. (I don't know if they're big files, or how much you can download in 2 hours with your connection.) Try cleaning up memory with GC.start. Something like adding this at the start of the file:

Thread.new do
  while true
    sleep(60*5) # 5 minutes
    GC.start
  end
end

Note that GC.start will freeze all other running threads while it runs. If it is interrupting your downloads, use a shorter interval (there will be less to clean each time).


I don't know much about managing memory or finding out what's using up too much memory in Ruby (I wish I knew more), but you've currently got 100 threads operating at the same time. Maybe you should have only 4 or 8 operating at once?

If that doesn't work, another stab I'd take at the program is to put some of the code into a method. At least that way you'd know when certain variables go out of scope.
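
Something along these lines, for example (a sketch only; fetch_urls_for stands in for the body of the original while loop, and to_get is still the shared list from the question):

THREAD_COUNT = 8  # instead of 100

def fetch_urls_for(line, dir, error_f)
  # move the body of the original while loop here, so locals like
  # toks, urls and res_http go out of scope after every line
end

threads = THREAD_COUNT.times.map do
  Thread.new do
    while line = to_get.pop
      fetch_urls_for(line, dir, error_f)
    end
  end
end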


When I have a bunch of URLs to process I use Typhoeus and Hydra. Hydra makes it easy to process multiple requests at once. Check the times.rb example for a starting point.
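
A rough sketch of that approach (option and method names from recent Typhoeus versions; adjust to your input format as needed):

require 'typhoeus'

hydra = Typhoeus::Hydra.new(max_concurrency: 8)

File.readlines(ARGV[0]).each do |line|
  entity, url_list = line.chomp.split("\t")
  url_list.split("::").each_with_index do |url, i|
    request = Typhoeus::Request.new(url)
    request.on_complete do |response|
      File.open("#{ARGV[1]}/#{entity}_#{i + 1}", 'w') { |f| f.write(response.body) }
    end
    hydra.queue(request)
  end
end

hydra.run  # blocks until every queued request has finished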

Something else to watch out for is a case of diminishing returns as you crank up your concurrent connections. You can hit a point where your throughput doesn't increase when you add more threads, so it's a good exercise to try some low numbers of concurrent connections, then start raising the limit until you see your throughput no longer improve.

I'd also recommend using a database to track your file queue. You're hitting another server to retrieve those files, and having to start at the beginning of a run and retrieve the same files again is a big waste of time and resources for you and whoever is serving them.

At the start of the job, run through the database and look for any files that have not been retrieved, grab them, and set their "downloaded" flag. If you start up and all the files have been downloaded, you know the previous run was successful, so clear them all and run from the start of the list.

You'll need to spend some time figuring out what needs to be in such a database, but if your needs grow, your run times will increase, and you'll encounter times when you've been running for most of a day and have a power outage or system crash. You don't want to have to start at the beginning at that point. There's no speed penalty for using a database in comparison to the slow file transfers across the internet.
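
A minimal sketch of that idea with SQLite (the table and column names are just illustrative, and the per-URL bookkeeping is simplified compared to the tab-separated input format above):

require 'sqlite3'

db = SQLite3::Database.new('crawl_queue.db')
db.execute <<-SQL
  CREATE TABLE IF NOT EXISTS urls (
    url        TEXT PRIMARY KEY,
    downloaded INTEGER DEFAULT 0
  )
SQL

# seed the queue from the input file; already-known URLs are left untouched
File.readlines(ARGV[0]).each do |line|
  db.execute('INSERT OR IGNORE INTO urls (url) VALUES (?)', [line.chomp])
end

# fetch only what is still missing, flagging each file as it lands
db.execute('SELECT url FROM urls WHERE downloaded = 0').each do |(url)|
  # ... fetch and save as before ...
  db.execute('UPDATE urls SET downloaded = 1 WHERE url = ?', [url])
end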
