
How to maximize performance in Python when doing many I/O bound operations?

I have a situation where I'm downloading a lot of files. Right now everything runs on one main Python thread, and downloads as many as 3000 files every few minutes. The problem is that the time it takes to do this is too long. I realize Python has no true multi-threading, but is there a better way of doing this? I was thinking of launching multiple threads since the I/O bound operations should not require access to the global interpreter lock, but perhaps I misunderstand that concept.


Multithreading is just fine for the specific purpose of speeding up I/O on the net (although asynchronous programming would give even greater performance). CPython's multithreading is quite "true" (native OS threads) -- what you're probably thinking of is the GIL, the global interpreter lock that stops different threads from simultaneously running Python code. But all the I/O primitives give up the GIL while they're waiting for system calls to complete, so the GIL is not relevant to I/O performance!

For asynchronous programming, the most powerful framework around is Twisted, but it can take a while to get the hang of it if you've never done such programming. It would probably be simpler for you to get extra I/O performance via the use of a pool of threads.
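If you go the thread-pool route, a minimal sketch using the standard library's `ThreadPoolExecutor` might look something like this; the URL list and worker count are placeholders, not tuned values:

```python
# Minimal thread-pool downloader sketch (URLs and worker count are illustrative).
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve

urls = ["http://example.com/file1", "http://example.com/file2"]  # placeholder list

def download(url):
    # The GIL is released while the socket blocks, so threads overlap their I/O.
    filename = url.rsplit("/", 1)[-1]
    urlretrieve(url, filename)
    return filename

with ThreadPoolExecutor(max_workers=50) as pool:
    for name in pool.map(download, urls):
        print("done:", name)
```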


Could always take a look at multiprocessing.
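For instance, a rough sketch of handing the downloads to a process pool; the URL list and the download function are illustrative only:

```python
# Process-pool variant: each download runs in a separate OS process.
from multiprocessing import Pool
from urllib.request import urlretrieve

def download(url):
    filename = url.rsplit("/", 1)[-1]
    urlretrieve(url, filename)
    return filename

if __name__ == "__main__":
    urls = ["http://example.com/file1", "http://example.com/file2"]  # placeholder
    with Pool(processes=8) as pool:
        pool.map(download, urls)
```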


is there a better way of doing this?

Yes

I was thinking of launching multiple threads since the I/O bound operations

Don't.

At the OS level, all the threads in a process are sharing a limited set of I/O resources.

If you want real speed, spawn as many heavyweight OS processes as your platform will tolerate. The OS is really, really good about balancing I/O workloads among processes. Make the OS sort this out.

Folks will say that spawning 3000 processes is bad, and they're right. You probably only want to spawn a few hundred at a time.

What you really want is the following.

  1. A shared message queue in which the 3000 URIs are queued up.

  2. A few hundred workers which are all reading from the queue.

    Each worker gets a URI from the queue and fetches the file.

The workers can stay running. When the queue's empty, they'll just sit there, waiting for work.

"every few minutes" you dump the 3000 URI's into the queue to make the workers start working.

This will tie up every resource on your processor, and it's quite trivial. Each worker is only a few lines of code. Loading the queue is a special "manager" that's just a few lines of code, also.
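A minimal sketch of that queue-plus-workers design, assuming multiprocessing and plain HTTP downloads; the worker count and URI list are placeholders:

```python
# Queue of URIs plus a fixed pool of long-lived worker processes.
from multiprocessing import Process, JoinableQueue
from urllib.request import urlretrieve

NUM_WORKERS = 200  # "a few hundred" workers

def worker(queue):
    while True:
        uri = queue.get()  # blocks until the manager enqueues more work
        try:
            urlretrieve(uri, uri.rsplit("/", 1)[-1])
        finally:
            queue.task_done()

if __name__ == "__main__":
    queue = JoinableQueue()
    for _ in range(NUM_WORKERS):
        Process(target=worker, args=(queue,), daemon=True).start()

    # The "manager": every few minutes, dump the next batch of URIs.
    uris = ["http://example.com/file%d" % i for i in range(3000)]  # placeholder
    for uri in uris:
        queue.put(uri)
    queue.join()  # wait until every queued URI has been fetched
```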


Gevent is perfect for this.

Gevent's use of greenlets (lightweight coroutines in the same Python process) offers you asynchronous operations without compromising code readability or introducing abstract 'reactor' concepts into your mix.
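As a rough illustration, a gevent version might look like this; monkey-patching the standard library is what makes the blocking urllib calls cooperative, and the URL list and pool size are placeholders:

```python
# gevent sketch: greenlets overlap their downloads within one process.
from gevent import monkey
monkey.patch_all()  # make socket/urllib calls cooperative

import gevent
from gevent.pool import Pool
from urllib.request import urlretrieve

def download(url):
    urlretrieve(url, url.rsplit("/", 1)[-1])

urls = ["http://example.com/file1", "http://example.com/file2"]  # placeholder
pool = Pool(100)  # cap the number of concurrent greenlets
jobs = [pool.spawn(download, u) for u in urls]
gevent.joinall(jobs)
```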
