
Python Queue - Threads bound to only one core

I wrote a Python script that:

1. submits search queries
2. waits for the results
3. parses the returned results (XML)

I used the threading and Queue modules to perform this in parallel (5 workers).

It works great for the querying portion because I can submit multiple search jobs and deal with the results as they come in.

However, it appears that all my threads get bound to the same core. This is apparent when it gets to the part where it processes the XML (CPU-intensive).

Has anyone else encountered this problem? Am I missing something conceptually?

Also, I was pondering the idea of having two separate work queues, one for making the queries and one for parsing the XML. As it is now, one worker does both in serial. I'm not sure what that would buy me, if anything (a rough sketch of that idea follows the code below). Any help is greatly appreciated.

Here is the code: (proprietary data removed)

import sys
import Queue  # "queue" in Python 3
from threading import Thread

work_queue = Queue.Queue()  # shared queue of search terms
thread_list = []            # keeps track of the consumer threads

def addWork(source_list):
    for item in source_list:
        #print "adding: '%s'"%(item)
        work_queue.put(item)

def doWork(thread_id):
    while 1:
        try:
            gw = work_queue.get(block=False)
        except Queue.Empty:
            #print "thread '%d' is terminating..."%(thread_id)
            sys.exit() # no more work in the queue for this thread, die quietly
                       # (in a thread this just raises SystemExit and ends that thread)

        ##Here is where I make the call to the REST API
        ##Here is where I wait for the results
        ##Here is where I parse the XML results and dump the data into a "global" dict

#MAIN
#"sources" is the list of search terms (proprietary data removed)
producer_thread = Thread(target=addWork, args=(sources,))
producer_thread.start() # start the thread (ie call the target/function)
producer_thread.join()  # wait for thread/target function to terminate (block)

#start the consumers
for i in range(5):
    consumer_thread = Thread(target=doWork, args=(i,))
    consumer_thread.start()
    thread_list.append(consumer_thread)

for thread in thread_list:
    thread.join()
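
Here is a rough sketch of the two-queue idea, in case it helps clarify what I mean (illustrative names and stand-in data, not my actual code):

# Rough sketch of the two-queue idea: one pool of threads submits the
# queries, a second pool parses the XML, and sentinels (None) tell each
# pool when to stop.
import Queue  # "queue" in Python 3
from threading import Thread, Lock

query_queue = Queue.Queue()   # search terms waiting to be queried
parse_queue = Queue.Queue()   # raw XML waiting to be parsed
results = {}                  # shared results dict
results_lock = Lock()

def query_worker():
    while True:
        item = query_queue.get()
        if item is None:                        # no more queries
            break
        raw_xml = "<result term='%s'/>" % item  # stand-in for the REST call
        parse_queue.put(raw_xml)

def parse_worker():
    while True:
        raw_xml = parse_queue.get()
        if raw_xml is None:                     # no more XML
            break
        with results_lock:                      # protect the shared dict
            results[raw_xml] = "parsed"         # stand-in for the XML parsing

sources = ["foo", "bar", "baz"]                 # stand-in for the real sources
for item in sources:
    query_queue.put(item)

query_threads = [Thread(target=query_worker) for _ in range(5)]
parse_threads = [Thread(target=parse_worker) for _ in range(5)]
for t in query_threads + parse_threads:
    t.start()

for _ in query_threads:
    query_queue.put(None)                       # stop the query pool
for t in query_threads:
    t.join()
for _ in parse_threads:
    parse_queue.put(None)                       # stop the parse pool
for t in parse_threads:
    t.join()

print(results)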


This is a byproduct of how CPython handles threads. There are endless discussions around the internet (search for GIL), but the solution is to use the multiprocessing module instead of threading. multiprocessing is built with pretty much the same interface (and synchronization structures, so you can still use queues) as threading; it just gives every worker its own process, thus avoiding the GIL and the forced serialization of parallel workloads.
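
To illustrate, here is a minimal sketch of how the same worker-pool pattern could look with multiprocessing; the names (worker, SENTINEL, the placeholder sources list) are mine, not taken from the question's code:

# Minimal sketch of a multiprocessing-based worker pool (illustrative names,
# not the original script): each worker pulls search terms from a shared
# queue, does the query + CPU-heavy XML parsing in its own process, and
# pushes one result per term back to the parent.
import multiprocessing

NUM_WORKERS = 5
SENTINEL = None  # tells a worker there is no more work

def worker(work_queue, result_queue):
    while True:
        item = work_queue.get()            # blocking get
        if item is SENTINEL:
            break
        # ... call the REST API, wait for the reply, parse the XML here ...
        result_queue.put((item, "parsed result placeholder"))

if __name__ == "__main__":
    work_queue = multiprocessing.Queue()
    result_queue = multiprocessing.Queue()

    sources = ["query1", "query2", "query3"]  # stand-in for the real source list
    for item in sources:
        work_queue.put(item)
    for _ in range(NUM_WORKERS):
        work_queue.put(SENTINEL)              # one stop marker per worker

    workers = [multiprocessing.Process(target=worker,
                                       args=(work_queue, result_queue))
               for _ in range(NUM_WORKERS)]
    for p in workers:
        p.start()

    # drain the results before joining, so no worker blocks on a full pipe
    results = dict(result_queue.get() for _ in range(len(sources)))

    for p in workers:
        p.join()
    print(results)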


Using CPython, your threads will never actually run in parallel on two different cores. Look up information on the Global Interpreter Lock (GIL).

Basically, there's a mutual-exclusion lock protecting the part of the interpreter that actually executes bytecode, so no two threads can compute in parallel. Threading for I/O tasks works just fine, because the lock is released while a thread blocks on I/O.

edit: If you want to fully take advantage of multiple cores, you need to use multiple processes. There are a lot of articles about this topic; I'm trying to look up one I remember was great, but I can't find it =/.

As Nathon suggested, you can use the multiprocessing module. There are tools to help you share objects between processes (take a look at POSH, Python Object Sharing).
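
If you'd rather stick to the standard library, multiprocessing.Manager can also share a dict between processes. A rough sketch, with illustrative names rather than the question's actual data:

import multiprocessing

def parse_and_store(item, shared_results):
    # Stand-in for the real XML parsing; writes into the shared dict.
    shared_results[item] = "parsed: %s" % item

if __name__ == "__main__":
    manager = multiprocessing.Manager()
    shared_results = manager.dict()     # dict proxy visible to every process

    items = ["a", "b", "c"]             # stand-in for the real work items
    procs = [multiprocessing.Process(target=parse_and_store,
                                      args=(item, shared_results))
             for item in items]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    print(dict(shared_results))         # plain-dict copy of the shared results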
