Multi threaded web scraper using urlretrieve on a cookie-enabled site

2023-03-08 05:53 问答作者：

I am trying to write my first Python script, and with lots of Googling, I think that I am just about done. However, I will need some help getting myself across the finish line.

I need to write a script that logs onto a cookie-enabled site, scrape a bunch of links, and then 开发者_如何学Pythonspawn a few processes to download the files. I have the program running in single-threaded, so I know that the code works. But, when I tried to create a pool of download workers, I ran into a wall.

#manager.py
import Fetch # the module name where worker lives
from multiprocessing import pool

def FetchReports(links,Username,Password,VendorID):
    pool = multiprocessing.Pool(processes=4, initializer=Fetch._ProcessStart, initargs=(SiteBase,DataPath,Username,Password,VendorID,))
    pool.map(Fetch.DownloadJob,links)
    pool.close()
    pool.join()


#worker.py
import mechanize
import atexit

def _ProcessStart(_SiteBase,_DataPath,User,Password,VendorID):
    Login(User,Password)

    global SiteBase
    SiteBase = _SiteBase

    global DataPath
    DataPath = _DataPath

    atexit.register(Logout)

def DownloadJob(link):
    mechanize.urlretrieve(mechanize.urljoin(SiteBase, link),filename=DataPath+'\\'+filename,data=data)
    return True

In this revision, the code fails because the cookies have not been transferred to the worker for urlretrieve to use. No problem, I was able to use mechanize's .cookiejar class to save the cookies in the manager, and pass them to the worker.

#worker.py
import mechanize
import atexit

from multiprocessing import current_process

def _ProcessStart(_SiteBase,_DataPath,User,Password,VendorID):
    global cookies
    cookies = mechanize.LWPCookieJar()

    opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))

    Login(User,Password,opener)  # note I pass the opener to Login so it can catch the cookies.

    global SiteBase
    SiteBase = _SiteBase

    global DataPath
    DataPath = _DataPath

    cookies.save(DataPath+'\\'+current_process().name+'cookies.txt',True,True)

    atexit.register(Logout)

def DownloadJob(link):
    cj = mechanize.LWPCookieJar()
    cj.revert(filename=DataPath+'\\'+current_process().name+'cookies.txt', ignore_discard=True, ignore_expires=True)
    opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))

    file = open(DataPath+'\\'+filename, "wb")
    file.write(opener.open(mechanize.urljoin(SiteBase, link)).read())
    file.close

But, THAT fails because opener (I think) wants to move the binary file back to the manager for processing, and I get an "unable to pickle object" error message, referring to the webpage it's trying to read to the file.

The obvious solution is to read the cookies in from the cookie jar and manually add them to the header when making the urlretrieve request, but I am trying to avoid that, and that is why I am fishing for suggestions.

Creating a multi-threaded web scraper the right way is hard. I'm sure you could handle it, but why not use something that has already been done?

I really really suggest you to check out Scrapy http://scrapy.org/

It is a very flexible open source web scraper framework that will handle most of the stuff you would need here as well. With Scrapy, running concurrent spiders is a configuration issue, not a programming issue (http://doc.scrapy.org/topics/settings.html#concurrent-requests-per-spider). You will also get support for cookies, proxies, HTTP Authentication and much more.

For me, it took around 4 hours to rewrite my scraper in Scrapy. So please ask yourself: do you really want to solve the threading issue yourself or instead climb to the shoulders of others and focus on the issues of web scraping, not threading?

PS. Are you using mechanize now? Please notice this from mechanize FAQ http://wwwsearch.sourceforge.net/mechanize/faq.html:

"Is it threadsafe?

No. As far as I know, you can use mechanize in threaded code, but it provides no synchronisation: you have to provide that yourself."

If you really want to keep using mechanize, start reading through documentation on how to provide synchronization. (e.g. http://effbot.org/zone/thread-synchronization.htm, http://effbot.org/pyfaq/what-kinds-of-global-value-mutation-are-thread-safe.htm)

After working for most of the day, it turns out that Mechanize was not the problem, it looks more like a coding error. After extensive tweaking and cursing, I have gotten the code to work properly.

For future Googlers like myself, I am providing the updated code below:

#manager.py [unchanged from original]
def FetchReports(links,Username,Password,VendorID):
    import Fetch
    import multiprocessing

    pool = multiprocessing.Pool(processes=4, initializer=Fetch._ProcessStart, initargs=(SiteBase,DataPath,Username,Password,VendorID,))
    pool.map(Fetch.DownloadJob,_SplitLinksArray(links))
    pool.close()
    pool.join()


#worker.py
import mechanize
from multiprocessing import current_process

def _ProcessStart(_SiteBase,_DataPath,User,Password,VendorID):
    global cookies
    cookies = mechanize.LWPCookieJar()
    opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))

    Login(User,Password,opener)

    global SiteBase
    SiteBase = _SiteBase

    global DataPath
    DataPath = _DataPath

    cookies.save(DataPath+'\\'+current_process().name+'cookies.txt',True,True)

def DownloadJob(link):
    cj = mechanize.LWPCookieJar()
    cj.revert(filename=DataPath+'\\'+current_process().name+'cookies.txt',True,True)
    opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))

    mechanize.urlretrieve(url=mechanize.urljoin(SiteBase, link),filename=DataPath+'\\'+filename,data=data)

Because I am just downloading links from a list, the non-threadsafe nature of mechanize doesn't seem to be a problem [full disclosure: I have run this process exactly three times, so a problem may appear under further testing]. The multiprocessing module and it's worker pool does all the heavy lifting. Maintaining cookies in files was important for me because the webserver I am downloading from has to give each thread it's own session ID, but other people implementing this code may not need to use it. I did notice that it seems to "forget" variables between the init call and the run call, so the cookiejar may not make the jump.

In order to enable cookie session in the first code example, add the following code to the function DownloadJob:

cj = mechanize.LWPCookieJar()
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))
mechanize.install_opener(opener)

And then you may retrieve the url as you do:

mechanize.urlretrieve(mechanize.urljoin(SiteBase, link),filename=DataPath+'\\'+filename,data=data)

继续阅读：cookies multiprocessing python urllib urllib2

Multi threaded web scraper using urlretrieve on a cookie-enabled site

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？