Ruby threads - out of resources

I wrote the crawler below to take a list of URLs from a file and fetch the pages. The problem is that after 2 hours or so the system becomes very slow and almost unusable. The system is a quad-core Linux box with 8 GB of RAM. Can someone tell me how to resolve this issue?

require 'rubygems'
require 'net/http'
require 'uri'

threads = []
to_get = File.readlines(ARGV[0])

dir = ARGV[1]
errorFile = ARGV[2]

error_f = File.open(errorFile, "w")

puts "Need to get #{to_get.length} queries ..!!"
start_time = Time.now

100.times do
  threads << Thread.new do
    while q_word = to_get.pop
      toks = q_word.chop.split("\t")

      entity = toks[0]
      urls = toks[1].chop.split("::")
      count = 1

      urls.each do |url|
        q_final = URI.escape(url)
        q_parsed = URI.parse(q_final)

        filename = dir+"/"+entity+"_"+count.to_s

        if(File.exists? filename)
          count = count + 1
        else
          begin
            res_http = Net::HTTP.get(q_parsed.host, q_parsed.request_uri)
            File.open(filename, 'w') {|f| f.write(res_http) }
          rescue Timeout::Error
            error_f.write("timeout error " + url+"\n")
          rescue 
            error_f.write($!.inspect + " " + filename + " " + url+"\n")
          end
          count = count + 1
        end
      end
    end
  end 
end

puts "waiting here"

threads.each { |x| x.join }
puts "finished in #{Time.now - start_time}"
#puts "#{dup} duplicates found"
puts "writing output ..."
error_f.close()
puts "Done."


In general, you can't modify objects that are shared among threads unless those objects are thread safe. I would replace to_get with an instance of Queue, which is thread safe.

Before creating any threads:

to_get = Queue.new
File.readlines(ARGV[0]).each do |url|
  to_get.push url.chomp
end
number_of_threads.times do
  to_get.push :done
end

And in the thread:

loop do
  url = to_get.pop
  break if url == :done
  ...
end
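
Put together, a minimal sketch of that worker pattern might look like this (fetch_and_save is just a placeholder for the body of your original while loop):

require 'thread'  # Queue lives in the thread library on older Rubies

THREAD_COUNT = 8

to_get = Queue.new
File.readlines(ARGV[0]).each { |line| to_get.push(line.chomp) }
THREAD_COUNT.times { to_get.push(:done) }  # one sentinel per worker

threads = THREAD_COUNT.times.map do
  Thread.new do
    loop do
      line = to_get.pop
      break if line == :done
      fetch_and_save(line)  # your existing per-line logic goes here
    end
  end
end

threads.each(&:join)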


For this type of problem I highly recommend that you look at EventMachine. Check this example on how to fetch URLs in parallel with EventMachine and Ruby.
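
A rough sketch of that approach, assuming the em-http-request gem (treat the filename handling as an illustration only):

require 'eventmachine'
require 'em-http-request'

urls = File.readlines(ARGV[0]).map(&:chomp)

EventMachine.run do
  pending = urls.size

  urls.each do |url|
    http = EventMachine::HttpRequest.new(url).get
    http.callback do
      # naive filename choice, purely illustrative
      File.open(File.basename(url), 'w') { |f| f.write(http.response) }
      pending -= 1
      EventMachine.stop if pending.zero?
    end
    http.errback do
      warn "failed: #{url}"
      pending -= 1
      EventMachine.stop if pending.zero?
    end
  end
end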


The problem is probably with RAM. All downloaded files are kept in memory after you download and save them. (I don't know if they're big files, or how much you can download in 2 hours with your connection.) Try cleaning up memory with GC.start. Something like adding this at the start of the file:

Thread.new do
  while true
    sleep(60*5) # 5 minutes
    GC.start
  end
end

Note that GC.start will freeze all other running threads while it runs. If it is interrupting your downloads, use a shorter interval (there will be less to clean each time).


I don't know much about managing memory or finding out what's using up too much memory in Ruby (I wish I knew more), but you've currently got 100 threads operating at the same time. Maybe you should have only 4 or 8 operating at once?

If that doesn't work, another stab I'd take at the program is to put some of the code into a method. At least that way you'd know when certain variables go out of scope.
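
Something along these lines, for example (a sketch only; fetch_urls_for stands in for the body of the original while loop, and to_get is still the shared list from the question):

THREAD_COUNT = 8  # instead of 100

def fetch_urls_for(line, dir, error_f)
  # move the body of the original while loop here, so locals like
  # toks, urls and res_http go out of scope after every line
end

threads = THREAD_COUNT.times.map do
  Thread.new do
    while line = to_get.pop
      fetch_urls_for(line, dir, error_f)
    end
  end
end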


When I have a bunch of URLs to process I use Typhoeus and Hydra. Hydra makes it easy to process multiple requests at once. Check the times.rb example for a starting point.
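
A rough sketch of that approach (option and method names from recent Typhoeus versions; adjust to your input format as needed):

require 'typhoeus'

hydra = Typhoeus::Hydra.new(max_concurrency: 8)

File.readlines(ARGV[0]).each do |line|
  entity, url_list = line.chomp.split("\t")
  url_list.split("::").each_with_index do |url, i|
    request = Typhoeus::Request.new(url)
    request.on_complete do |response|
      File.open("#{ARGV[1]}/#{entity}_#{i + 1}", 'w') { |f| f.write(response.body) }
    end
    hydra.queue(request)
  end
end

hydra.run  # blocks until every queued request has finished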

Something else to watch out for is a case of diminishing returns as you crank up your concurrent connections. You can hit a point where your throughput doesn't increase when you add more threads, so it's a good exercise to try some low numbers of concurrent connections, then start raising the limit until you see your throughput no longer improve.

I'd also recommend using a database to track your file queue. You're hitting another server to retrieve those files, and having to start at the beginning of a run and retrieve the same files again is a big waste of time and resources for you and whoever is serving them.

At the start of the job, run through the database and look for any files that have not been retrieved, grab them, and set their "downloaded" flag. If you start up and all the files have been downloaded, you know the previous run was successful, so clear them all and run from the start of the list.

You'll need to spend some time figuring out what needs to be in such a database, but if your needs grow, your run times will increase, and you'll encounter times when you've been running for most of a day and have a power outage or system crash. You don't want to have to start at the beginning at that point. There's no speed penalty for using a database in comparison to the slow file transfers across the internet.
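
A minimal sketch of that idea with SQLite (the table and column names are just illustrative, and the per-URL bookkeeping is simplified compared to the tab-separated input format above):

require 'sqlite3'

db = SQLite3::Database.new('crawl_queue.db')
db.execute <<-SQL
  CREATE TABLE IF NOT EXISTS urls (
    url        TEXT PRIMARY KEY,
    downloaded INTEGER DEFAULT 0
  )
SQL

# seed the queue from the input file; already-known URLs are left untouched
File.readlines(ARGV[0]).each do |line|
  db.execute('INSERT OR IGNORE INTO urls (url) VALUES (?)', [line.chomp])
end

# fetch only what is still missing, flagging each file as it lands
db.execute('SELECT url FROM urls WHERE downloaded = 0').each do |(url)|
  # ... fetch and save as before ...
  db.execute('UPDATE urls SET downloaded = 1 WHERE url = ?', [url])
end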
