302s and losing cookies with urllib2
I am using liburl2 with CookieJar / HTTPCookieProcessor in an attempt to simulate a login to a page to automate an upload.
I've seen some questions and answers on this, but nothing which solves my problem. I am losing my cookie when I simulate the login which ends up at a 302 redirect. The 302 response is where the cookie gets set by the server, but urllib2 HTTPCookieProcessor does not seem to save the cookie during a redirect. I tried creating a HTTPRedirectHandler class to ignore the redirect, but that didn't seem to do the trick. I tried referencing the CookieJar globally to handle the cookies from the HTTPRedirectHandler, but 1. This didn't work (because I was handling the header from the redirector, and the CookieJar funct开发者_如何学Goion that I was using, extract_cookies, needed a full request) and 2. It's an ugly way to handle it.
I probably need some guidance on this as I'm fairly green with Python. I think I'm mostly barking up the right tree here, but maybe focusing on the wrong branch.
cj = cookielib.CookieJar()
cookieprocessor = urllib2.HTTPCookieProcessor(cj)
class MyHTTPRedirectHandler(urllib2.HTTPRedirectHandler):
def http_error_302(self, req, fp, code, msg, headers):
global cj
cookie = headers.get("set-cookie")
if cookie:
# Doesn't work, but you get the idea
cj.extract_cookies(headers, req)
return urllib2.HTTPRedirectHandler.http_error_302(self, req, fp, code, msg, headers)
http_error_301 = http_error_303 = http_error_307 = http_error_302
cookieprocessor = urllib2.HTTPCookieProcessor(cj)
# Oh yeah. I'm using a proxy too, to follow traffic.
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8888'})
opener = urllib2.build_opener(MyHTTPRedirectHandler, cookieprocessor, proxy)
Addition: I had tried using mechanize as well, without success. This is probably a new question, but I'll pose it here since it is the same ultimate goal:
This simple code using mechanize, when used with a 302 emitting url (http://fxfeeds.mozilla.com/firefox/headlines.xml) -- note that the same behavior occurs when not using set_handle_robots(False). I just wanted to ensure that wasn't it:
import urllib2, mechanize
browser = mechanize.Browser()
browser.set_handle_robots(False)
opener = mechanize.build_opener(*(browser.handlers))
r = opener.open("http://fxfeeds.mozilla.com/firefox/headlines.xml")
Output:
Traceback (most recent call last):
File "redirecttester.py", line 6, in <module>
r = opener.open("http://fxfeeds.mozilla.com/firefox/headlines.xml")
File "build/bdist.macosx-10.6-universal/egg/mechanize/_opener.py", line 204, in open
File "build/bdist.macosx-10.6-universal/egg/mechanize/_urllib2_fork.py", line 457, in http_response
File "build/bdist.macosx-10.6-universal/egg/mechanize/_opener.py", line 221, in error
File "build/bdist.macosx-10.6-universal/egg/mechanize/_urllib2_fork.py", line 332, in _call_chain
File "build/bdist.macosx-10.6-universal/egg/mechanize/_urllib2_fork.py", line 571, in http_error_302
File "build/bdist.macosx-10.6-universal/egg/mechanize/_opener.py", line 188, in open
File "build/bdist.macosx-10.6-universal/egg/mechanize/_mechanize.py", line 71, in http_request
AttributeError: OpenerDirector instance has no attribute '_add_referer_header'
Any ideas?
I have been having the exact same problem recently but in the interest of time scrapped it and decided to go with mechanize
. It can be used as a total replacement for urllib2
that behaves exactly as you would expect a browser to behave with regards to Referer headers, redirects, and cookies.
import mechanize
cj = mechanize.CookieJar()
browser = mechanize.Browser()
browser.set_cookiejar(cj)
browser.set_proxies({'http': '127.0.0.1:8888'})
# Use browser's handlers to create a new opener
opener = mechanize.build_opener(*browser.handlers)
The Browser
object can be used as an opener itself (using the .open()
method). It maintains state internally but also returns a response object on every call. So you get a lot of flexibility.
Also, if you don't have a need to inspect the cookiejar
manually or pass it along to something else, you can omit the explicit creation and assignment of that object as well.
I am fully aware this doesn't address what is really going on and why urllib2
can't provide this solution out of the box or at least without a lot of tweaking, but if you're short on time and just want it to work, just use mechanize.
Depends on how the redirect is done. If it's done via a HTTP Refresh, then mechanize has a HTTPRefreshProcessor you can use. Try to create an opener like this:
cj = mechanize.CookieJar()
opener = mechanize.build_opener(
mechanize.HTTPCookieProcessor(cj),
mechanize.HTTPRefererProcessor,
mechanize.HTTPEquivProcessor,
mechanize.HTTPRefreshProcessor)
I've just got a variation of the below working for me, at least when trying to read Atom from http://www.fudzilla.com/home?format=feed&type=atom
I can't verify that the below snippet will run as-is, but might give you a start:
import cookielib
cookie_jar = cookielib.LWPCookieJar()
cookie_handler = urllib2.HTTPCookieProcessor(cookie_jar)
handlers = [cookie_handler] #+others, we have proxy + progress handlers
opener = apply(urllib2.build_opener, tuple(handlers + [_FeedURLHandler()])) #see http://code.google.com/p/feedparser/source/browse/trunk/feedparser/feedparser.py#2848 for implementation of _FeedURLHandler
opener.addheaders = [] #may not be needed but see the comments around the link referred to below
try:
return opener.open(request) #see http://code.google.com/p/feedparser/source/browse/trunk/feedparser/feedparser.py#2954 for implementation of request
finally:
opener.close()
I was also having the same problem where the server would respond to the login POST request with a 302 and the session token in the Set-Cookie header. Using Wireshark it was clearly visible that urllib was following the redirect but not including the session token in the Cookie.
I literally just ripped out urllib and did a direct replacement with requests and it worked perfectly first time without having to change a thing. Big props to those guys.
精彩评论