
Why is curl in Ruby slower than command-line curl?

I am trying to download more than 1 million pages (URLs ending with a sequence ID). I have implemented a kind of multi-purpose download manager with a configurable number of download threads and one processing thread. The downloader downloads files in batches:

require 'curb'

curl = Curl::Easy.new

batch_urls.each { |url_info|
    curl.url = url_info[:url]
    curl.perform                          # blocking download, one URL at a time
    file = File.new(url_info[:file], "wb")
    file << curl.body_str                 # whole body is buffered in memory, then written out
    file.close
    # ... some other stuff
}

I have tried to download a sample of 8000 pages. When using the code above, I get 1000 pages in 2 minutes. When I write all the URLs into a file and run in a shell:

cat list | xargs curl

I get all 8000 pages in two minutes.

The thing is, I need to have it in Ruby code, because there is other monitoring and processing code.

I have tried:

  • Curl::Multi - it is somewhat faster, but misses 50-90% of the files (it does not download them and gives no reason/error code)
  • multiple threads with Curl::Easy - around the same speed as single-threaded

Why is a reused Curl::Easy slower than successive command-line curl calls, and how can I make it faster? Or what am I doing wrong?

I would prefer to fix my download manager code rather than handle downloading for this case in a different way.

Before this, I was calling command-line wget, providing it with a file containing the list of URLs. However, not all errors were handled, and it was also not possible to specify an output file for each URL separately when using a URL list.

Now it seems to me that the best way would be to use multiple threads with system calls to the 'curl' command. But why do that when I can use Curl directly from Ruby?
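Something like this is what I have in mind (an untested sketch; the thread count and curl flags are just placeholders, and batch_urls is the same array as above):

require 'thread'

queue = Queue.new
batch_urls.each { |url_info| queue << url_info }

threads = (1..8).map do
  Thread.new do
    loop do
      url_info = queue.pop(true) rescue break   # non-blocking pop; stop when the queue is empty
      # Let the external curl binary do the transfer and write the output file itself.
      system('curl', '-s', '-o', url_info[:file], url_info[:url])
    end
  end
end
threads.each(&:join)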

Code for the download manager is here, if it might help: Download Manager (I have played with timeouts, from not setting them at all to various values; it did not seem to help).

Any hints appreciated.


This could be a fitting task for Typhoeus

Something like this (untested):

require 'typhoeus'

def write_file(filename, data)
    file = File.new(filename, "wb")
    file.write(data)
    file.close
    # ... some other stuff
end

# Hydra runs up to :max_concurrency requests in parallel on top of libcurl's multi interface.
hydra = Typhoeus::Hydra.new(:max_concurrency => 20)

batch_urls.each do |url_info|
    req = Typhoeus::Request.new(url_info[:url])
    req.on_complete do |response|
      write_file(url_info[:file], response.body)
    end
    hydra.queue req
end

hydra.run

Come to think of it, you might get memory problems because of the enormous amount of files. One way to prevent that would be to never store the data in a variable but instead stream it to the file directly. You could use em-http-request for that. Something like this (again untested):

require 'em-http-request'

EventMachine.run {
  file = File.open('output.html', 'wb')
  http = EventMachine::HttpRequest.new('http://www.website.com/').get
  http.stream { |chunk| file.write(chunk) }     # write each chunk to disk as it arrives
  http.callback { file.close; EventMachine.stop }
  # ...
}


So, if you don't set an on_body handler, curb will buffer the download. If you're downloading files you should use an on_body handler. If you want to download multiple files using Ruby Curl, try the Curl::Multi.download interface.

require 'rubygems'
require 'curb'

urls_to_download = [
  'http://www.google.com/',
  'http://www.yahoo.com/',
  'http://www.cnn.com/',
  'http://www.espn.com/'
]
path_to_files = [
  'google.com.html',
  'yahoo.com.html',
  'cnn.com.html',
  'espn.com.html'
]

Curl::Multi.download(urls_to_download, {:follow_location => true}, {}, path_to_files) {|c,p|}

If you just want to download a single file:

Curl::Easy.download('http://www.yahoo.com/')

Here is a good resource: http://gist.github.com/405779
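For what it's worth, an on_body handler with a single Curl::Easy could look roughly like this (an untested sketch; the URL and filename are only illustrative):

require 'curb'

File.open('yahoo.com.html', 'wb') do |f|
  c = Curl::Easy.new('http://www.yahoo.com/')
  # Write each chunk straight to disk instead of buffering the whole body in body_str.
  # The block must return the number of bytes it handled.
  c.on_body { |data| f.write(data); data.size }
  c.perform
end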


There have been benchmarks comparing curb with other HTTP libraries such as HTTPClient. The winner, in almost all categories, was HTTPClient. Plus, there are documented scenarios where curb does NOT work when used from multiple threads.

Like you, I've had the same experience. I ran curl as system commands in 20+ concurrent threads and it was 10x faster than running curb in 20+ concurrent threads. No matter what I tried, this was always the case.

I've since switched to HTTPClient, and the difference is huge. Now it runs as fast as 20 concurrent curl system commands, and uses less CPU as well.
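Something along those lines with httpclient and a handful of threads might look like this (an untested sketch; the thread count and the batch_urls structure from the question are assumptions):

require 'httpclient'

client = HTTPClient.new   # httpclient supports being called from multiple threads

threads = batch_urls.each_slice((batch_urls.size / 20.0).ceil).map do |slice|
  Thread.new do
    slice.each do |url_info|
      File.open(url_info[:file], 'wb') do |f|
        f.write(client.get_content(url_info[:url]))
      end
    end
  end
end
threads.each(&:join)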


First let me say that I know almost nothing about Ruby.

What I do know is that Ruby is an interpreted language; it's not surprising that it's slower than heavily optimised code that's been compiled for a specific platform. Every file operation will probably have checks around it that curl doesn't. The "some other stuff" will slow things down even more.

Have you tried profiling your code to see where most of the time is being spent?
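Even a crude timing split between the network part and the file/processing part would tell you a lot. Something along these lines (a rough sketch reusing the loop from the question):

require 'benchmark'

download_time = 0.0
write_time = 0.0

batch_urls.each do |url_info|
  download_time += Benchmark.realtime do
    curl.url = url_info[:url]
    curl.perform
  end
  write_time += Benchmark.realtime do
    File.open(url_info[:file], 'wb') { |f| f << curl.body_str }
  end
end

puts "download: #{download_time}s, write/process: #{write_time}s"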


Stiivi, any chance that Net::HTTP would suffice for simply downloading HTML pages?
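For plain HTML pages it could be as simple as this (a minimal, untested sketch using the batch_urls structure from the question):

require 'net/http'
require 'uri'

batch_urls.each do |url_info|
  body = Net::HTTP.get(URI.parse(url_info[:url]))          # fetch the page body as a String
  File.open(url_info[:file], 'wb') { |f| f.write(body) }
end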


You didn't specify a Ruby version, but threads in 1.8.x are user-space threads, not scheduled by the OS, so the entire Ruby interpreter only ever uses one CPU/core. On top of that there is a Global Interpreter Lock, and probably other locks as well, interfering with concurrency. Since you're trying to maximize network throughput, you're probably underutilizing the CPUs you have.

Spawn as many processes as the machine has memory for, and limit the reliance on threads.
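If you go that way, a rough sketch of the process-per-slice idea might be (untested; assumes a POSIX fork and an arbitrary worker count):

require 'curb'

workers = 8
slices = batch_urls.each_slice((batch_urls.size.to_f / workers).ceil).to_a

slices.each do |slice|
  fork do
    # Each child process has its own interpreter (and its own GIL), so they run truly in parallel.
    slice.each do |url_info|
      curl = Curl::Easy.new(url_info[:url])
      curl.perform
      File.open(url_info[:file], 'wb') { |f| f << curl.body_str }
    end
  end
end
Process.waitall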
