
How to maximize performance in Python when doing many I/O bound operations?

I have a situation where I'm downloading a lot of files. Right now everything runs on one main Python thread, and downloads as many as 3000 files every few minutes. The problem is that the time it takes to do this is too long. I realize Python has no true multi-threading, but is there a better way of doing this? I was thinking of launching multiple threads since the I/O bound operations should not require access to the global interpreter lock, but perhaps I misunderstand that concept.


Multithreading is just fine for the specific purpose of speeding up I/O on the net (although asynchronous programming would give even greater performance). CPython's multithreading is quite "true" (native OS threads) -- what you're probably thinking of is the GIL, the global interpreter lock that stops different threads from simultaneously running Python code. But all the I/O primitives give up the GIL while they're waiting for system calls to complete, so the GIL is not relevant to I/O performance!

For asynchronous programming, the most powerful framework around is Twisted, but it can take a while to get the hang of it if you've never done such programming. It would probably be simpler for you to get extra I/O performance via the use of a pool of threads.
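If you go the thread-pool route, a minimal sketch using the standard library's `ThreadPoolExecutor` might look something like this; the URL list and worker count are placeholders, not tuned values:

```python
# Minimal thread-pool downloader sketch (URLs and worker count are illustrative).
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve

urls = ["http://example.com/file1", "http://example.com/file2"]  # placeholder list

def download(url):
    # The GIL is released while the socket blocks, so threads overlap their I/O.
    filename = url.rsplit("/", 1)[-1]
    urlretrieve(url, filename)
    return filename

with ThreadPoolExecutor(max_workers=50) as pool:
    for name in pool.map(download, urls):
        print("done:", name)
```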


Could always take a look at multiprocessing.
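For instance, a rough sketch of handing the downloads to a process pool; the URL list and the download function are illustrative only:

```python
# Process-pool variant: each download runs in a separate OS process.
from multiprocessing import Pool
from urllib.request import urlretrieve

def download(url):
    filename = url.rsplit("/", 1)[-1]
    urlretrieve(url, filename)
    return filename

if __name__ == "__main__":
    urls = ["http://example.com/file1", "http://example.com/file2"]  # placeholder
    with Pool(processes=8) as pool:
        pool.map(download, urls)
```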


is there a better way of doing this?

Yes

I was thinking of launching multiple threads since the I/O bound operations

Don't.

At the OS level, all the threads in a process are sharing a limited set of I/O resources.

If you want real speed, spawn as many heavyweight OS processes as your platform will tolerate. The OS is really, really good about balancing I/O workloads among processes. Make the OS sort this out.

Folks will say that spawning 3000 processes is bad, and they're right. You probably only want to spawn a few hundred at a time.

What you really want is the following.

  1. A shared message queue in which the 3000 URIs are queued up.

  2. A few hundred workers which are all reading from the queue.

    Each worker gets a URI from the queue and fetches the file.

The workers can stay running. When the queue's empty, they'll just sit there, waiting for work.

"every few minutes" you dump the 3000 URI's into the queue to make the workers start working.

This will tie up every resource on your processor, and it's quite trivial. Each worker is only a few lines of code. Loading the queue is a special "manager" that's just a few lines of code, also.
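A minimal sketch of that queue-plus-workers design, assuming multiprocessing and plain HTTP downloads; the worker count and URI list are placeholders:

```python
# Queue of URIs plus a fixed pool of long-lived worker processes.
from multiprocessing import Process, JoinableQueue
from urllib.request import urlretrieve

NUM_WORKERS = 200  # "a few hundred" workers

def worker(queue):
    while True:
        uri = queue.get()  # blocks until the manager enqueues more work
        try:
            urlretrieve(uri, uri.rsplit("/", 1)[-1])
        finally:
            queue.task_done()

if __name__ == "__main__":
    queue = JoinableQueue()
    for _ in range(NUM_WORKERS):
        Process(target=worker, args=(queue,), daemon=True).start()

    # The "manager": every few minutes, dump the next batch of URIs.
    uris = ["http://example.com/file%d" % i for i in range(3000)]  # placeholder
    for uri in uris:
        queue.put(uri)
    queue.join()  # wait until every queued URI has been fetched
```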


Gevent is perfect for this.

Gevent's use of greenlets (lightweight coroutines in the same Python process) offers you asynchronous operations without compromising code readability or introducing abstract 'reactor' concepts into your mix.
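As a rough illustration, a gevent version might look like this; monkey-patching the standard library is what makes the blocking urllib calls cooperative, and the URL list and pool size are placeholders:

```python
# gevent sketch: greenlets overlap their downloads within one process.
from gevent import monkey
monkey.patch_all()  # make socket/urllib calls cooperative

import gevent
from gevent.pool import Pool
from urllib.request import urlretrieve

def download(url):
    urlretrieve(url, url.rsplit("/", 1)[-1])

urls = ["http://example.com/file1", "http://example.com/file2"]  # placeholder
pool = Pool(100)  # cap the number of concurrent greenlets
jobs = [pool.spawn(download, u) for u in urls]
gevent.joinall(jobs)
```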
