python multithreading synchronization
I am having a synchronization problem while threading with cPython. I have two files, I parse them and return the desired result. However, the code below acts strangely and returns three times instead of two plus doesn't return in the order I put them into queue. Here's the code:
import Queue
import threading
from HtmlDoc import Document
OUT_LIST = []
class Threader(threading.Thread):
"""
Start threading
"""
def __init__(self, queue, out_queue):
threading.Thread.__init__(self)
self.queue = queue
self.out_q开发者_JS百科ueue = out_queue
def run(self):
while True:
if self.queue.qsize() == 0: break
path, host = self.queue.get()
f = open(path, "r")
source = f.read()
f.close()
self.out_queue.put((source, host))
self.queue.task_done()
class Processor(threading.Thread):
"""
Process threading
"""
def __init__(self, out_queue):
self.out_queue = out_queue
self.l_first = []
self.f_append = self.l_first.append
self.l_second = []
self.s_append = self.l_second.append
threading.Thread.__init__(self)
def first(self, doc):
# some code to to retrieve the text desired, this works 100% I tested it manually
def second(self, doc):
# some code to to retrieve the text desired, this works 100% I tested it manually
def run(self):
while True:
if self.out_queue.qsize() == 0: break
doc, host = self.out_queue.get()
if host == "first":
self.first(doc)
elif host == "second":
self.second(doc)
OUT_LIST.extend(self.l_first + self.l_second)
self.out_queue.task_done()
def main():
queue = Queue.Queue()
out_queue = Queue.Queue()
queue.put(("...first.html", "first"))
queue.put(("...second.html", "second"))
qsize = queue.qsize()
for i in range(qsize):
t = Threader(queue, out_queue)
t.setDaemon(True)
t.start()
for i in range(qsize):
dt = Processor(out_queue)
dt.setDaemon(True)
dt.start()
queue.join()
out_queue.join()
print '<br />'.join(OUT_LIST)
main()
Now, when I print, I'd like to print the content of the "first" first of all and then the content of the "second". Can anyone help me?
NOTE: I am threading because actually I will have to connect more than 10 places at a time and retrieve its results. I believe that threading is the most appropriate way to accomplish such a task
I am threading because actually I will have to connect more than 10 places at a time and retrieve its results. I believe that threading is the most appropriate way to accomplish such a task
Threading is actually one of the most error-prone ways to manage multiple concurrent connections. A more powerful, more debuggable approach is to use event-driven asynchronous networking, such as implemented by Twisted. If you're interested in using this model, you might want to check out this introduction.
I dont share the same opinion that threading is the best way to do this (IMO some events/select mechanism would be better) but problem with your code could be in variables t and dt. You have the assignements in the cycle and object instances are to stored anywhere - so it may be possible that your new instance of Thread/Processor get deleted at the end of the each cycle.
It would be more clarified if you show us precise output of this code.
1) You cannot control order of job completion. It depends on execution time, so to return results as you want you can create global dictionary with job objects, like job_results : {'first' : None, 'second' : None} and store results here, than you can fetch data on desired order
2) self.first
and self.second
should be cleared after each processed doc, else you will have duplicates in OUT_LIST
3) You may use multi-processing with subprocess module and put all result data to CSV files for example and them sort them as you wish.
精彩评论