Python web-scraping threaded performance
I have a web app that needs both functionality and performance tested, and part of the test suite that we plan on using is already written in Python. When I first wrote this, I used mechanize as my means of web-scraping, but it seems to be too bulky开发者_运维问答 for what I'm trying to do (either that or I'm missing something).
The basic layout of what I'm trying to do is as follows. All are objects.
- User has Comm (used to be the interface between my stuff and mechanize)
- Comm has Browser (holds my CookieJar, urllib2, and BeautifulSoup objects, used to be mechanize)
- Browser has Form(s) (used to be mechanize-handled)
Now, as far as threading goes, I have that down. Adjustment between dealing with the GIL and having separate instances of Python running will be made as needed, but suggestions will be taken.
So what I need to do is thread users hitting the application and doing various things (logging in, filling out forms, submitting forms for processing, etc.) while not making the testing box scream too loudly. My current problem with mechanize seems to be RAM.
Part of what's causing the RAM issue is the need for separate browser instances for each user to keep from overwriting the JSESSIONID
cookie every time I do something with a different user.
Much of this might seem trivial, but I'm trying to run thousands of threads here, so little tweaks can mean a lot. Any input is appreciated.
Threading causes problems with the GIL, more so with more cores. Try using mechanize with eventlet to achieve concurrency (via multiple processes) also check out multi-mechanize
Have you considered Twisted, the asynchronous library, for at least doing interaction with users?
I actually went without using mechanize and used the Threading module. This allowed for fairly quick transactions, and I also made sure not to have too much inside of each thread. Login information, and getting the webapp in the state necessary before I threaded helped the threads to run shorter and therefore more quickly.
精彩评论