302s and losing cookies with urllib2

2023-02-22 14:54

I am using urllib2 with CookieJar / HTTPCookieProcessor to simulate a login to a page so that I can automate an upload.

I've seen some questions and answers on this, but nothing that solves my problem. I lose my cookie when I simulate the login, which ends in a 302 redirect. The 302 response is where the server sets the cookie, but urllib2's HTTPCookieProcessor does not seem to save the cookie during a redirect. I tried creating an HTTPRedirectHandler subclass to ignore the redirect, but that didn't do the trick. I also tried referencing the CookieJar globally to handle the cookies from the HTTPRedirectHandler, but (1) this didn't work, because I was handling the headers from the redirector, and the CookieJar function I was using, extract_cookies, needed a full response object rather than bare headers, and (2) it's an ugly way to handle it.

I probably need some guidance on this as I'm fairly green with Python. I think I'm mostly barking up the right tree here, but maybe focusing on the wrong branch.

import cookielib
import urllib2

cj = cookielib.CookieJar()
cookieprocessor = urllib2.HTTPCookieProcessor(cj)


class MyHTTPRedirectHandler(urllib2.HTTPRedirectHandler):
  def http_error_302(self, req, fp, code, msg, headers):
    global cj
    cookie = headers.get("set-cookie")
    if cookie:
      # Doesn't work, but you get the idea
      cj.extract_cookies(headers, req)

    return urllib2.HTTPRedirectHandler.http_error_302(self, req, fp, code, msg, headers)

  http_error_301 = http_error_303 = http_error_307 = http_error_302

# Oh yeah.  I'm using a proxy too, to follow traffic.
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8888'})
opener = urllib2.build_opener(MyHTTPRedirectHandler, cookieprocessor, proxy)
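On the extract_cookies point, here is a minimal sketch, assuming Python 3 (where urllib2 and cookielib were renamed urllib.request and http.cookiejar). extract_cookies takes a response-like object with an info() method plus the request, so inside a redirect handler the fp argument is the thing to pass, not headers. HeadersOnlyResponse, CookieSavingRedirectHandler and cookies_from_headers are hypothetical names used only for illustration, and example.com is a placeholder host.

```python
import email.message
import http.cookiejar
import urllib.request


class HeadersOnlyResponse:
    """Hypothetical adapter: wraps bare headers so a CookieJar can read them."""

    def __init__(self, headers):
        self._headers = headers

    def info(self):
        # CookieJar.extract_cookies() calls response.info() to get the headers.
        return self._headers


class CookieSavingRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Hypothetical handler: stores cookies from the redirect response itself."""

    def __init__(self, jar):
        self.jar = jar

    def http_error_302(self, req, fp, code, msg, headers):
        # fp is the response object, so it can be handed to the jar directly.
        self.jar.extract_cookies(fp, req)
        return super().http_error_302(req, fp, code, msg, headers)

    http_error_301 = http_error_303 = http_error_307 = http_error_302


def cookies_from_headers(jar, headers, request):
    """Feed bare headers to the jar through the adapter; return name -> value."""
    jar.extract_cookies(HeadersOnlyResponse(headers), request)
    return {cookie.name: cookie.value for cookie in jar}


# Offline demonstration with made-up values (no network access involved).
headers = email.message.Message()
headers["Set-Cookie"] = "session=abc123; Path=/"
request = urllib.request.Request("http://example.com/login")
jar = http.cookiejar.CookieJar()
found = cookies_from_headers(jar, headers, request)
```

The handler only fills the jar; sending the cookie on the follow-up request still needs an HTTPCookieProcessor built around the same jar in the opener.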

Addition: I had tried using mechanize as well, without success. This is probably a new question, but I'll pose it here since it is the same ultimate goal:

This simple code using mechanize fails when used with a 302-emitting URL (http://fxfeeds.mozilla.com/firefox/headlines.xml). Note that the same behavior occurs without set_handle_robots(False); I just wanted to rule that out:

import urllib2, mechanize

browser = mechanize.Browser()
browser.set_handle_robots(False)
opener = mechanize.build_opener(*(browser.handlers))
r = opener.open("http://fxfeeds.mozilla.com/firefox/headlines.xml")

Output:

Traceback (most recent call last):
  File "redirecttester.py", line 6, in <module>
    r = opener.open("http://fxfeeds.mozilla.com/firefox/headlines.xml")
  File "build/bdist.macosx-10.6-universal/egg/mechanize/_opener.py", line 204, in open
  File "build/bdist.macosx-10.6-universal/egg/mechanize/_urllib2_fork.py", line 457, in http_response
  File "build/bdist.macosx-10.6-universal/egg/mechanize/_opener.py", line 221, in error
  File "build/bdist.macosx-10.6-universal/egg/mechanize/_urllib2_fork.py", line 332, in _call_chain
  File "build/bdist.macosx-10.6-universal/egg/mechanize/_urllib2_fork.py", line 571, in http_error_302
  File "build/bdist.macosx-10.6-universal/egg/mechanize/_opener.py", line 188, in open
  File "build/bdist.macosx-10.6-universal/egg/mechanize/_mechanize.py", line 71, in http_request
AttributeError: OpenerDirector instance has no attribute '_add_referer_header'

Any ideas?

I had the exact same problem recently, but in the interest of time I scrapped that approach and went with mechanize. It can be used as a total replacement for urllib2, and it behaves the way you would expect a browser to behave with regard to Referer headers, redirects, and cookies.

import mechanize
cj = mechanize.CookieJar()
browser = mechanize.Browser()
browser.set_cookiejar(cj)
browser.set_proxies({'http': '127.0.0.1:8888'})

# Use browser's handlers to create a new opener
opener = mechanize.build_opener(*browser.handlers)

The Browser object can be used as an opener itself (using the .open() method). It maintains state internally but also returns a response object on every call. So you get a lot of flexibility.

Also, if you don't have a need to inspect the cookiejar manually or pass it along to something else, you can omit the explicit creation and assignment of that object as well.

I am fully aware this doesn't address what is really going on and why urllib2 can't provide this solution out of the box or at least without a lot of tweaking, but if you're short on time and just want it to work, just use mechanize.
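For anyone who wants to narrow down what is really going on, here is a minimal reproduction sketch. It assumes Python 3, where urllib2 and cookielib were renamed urllib.request and http.cookiejar. LoginHandler, the /login and /home paths, and the session=abc123 cookie are all made up for illustration. It starts a throwaway local server that answers /login with a 302 plus Set-Cookie, then reports which Cookie header reaches the redirect target and what ended up in the jar.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class LoginHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the real site."""

    def do_GET(self):
        if self.path == "/login":
            # Set the cookie on the redirect response itself.
            self.send_response(302)
            self.send_header("Set-Cookie", "session=abc123; Path=/")
            self.send_header("Location", "/home")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            # Echo back whatever Cookie header the client sent.
            body = (self.headers.get("Cookie") or "no cookie").encode()
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the output quiet


def fetch_through_redirect():
    """Return (Cookie header seen by /home, cookies stored in the jar)."""
    server = HTTPServer(("127.0.0.1", 0), LoginHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        jar = http.cookiejar.CookieJar()
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({}),  # ignore any proxy from the environment
            urllib.request.HTTPCookieProcessor(jar),
        )
        url = "http://127.0.0.1:%d/login" % server.server_port
        with opener.open(url, timeout=5) as response:
            echoed = response.read().decode()
        return echoed, {cookie.name: cookie.value for cookie in jar}
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(fetch_through_redirect())
```

If the cookie shows up here but not against the real site, the cookie's domain, path, or secure attributes may be worth comparing with what the real server sends. That is a guess at where to look, not a diagnosis.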

It depends on how the redirect is done. If it's done via an HTTP Refresh, then mechanize has an HTTPRefreshProcessor you can use. Try creating an opener like this:

cj = mechanize.CookieJar()
opener = mechanize.build_opener(
    mechanize.HTTPCookieProcessor(cj),
    mechanize.HTTPRefererProcessor,
    mechanize.HTTPEquivProcessor,
    mechanize.HTTPRefreshProcessor)
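To tell which kind of redirect you are facing before picking handlers, a rough sketch using only the Python 3 standard library follows. redirect_kind and parse_refresh are hypothetical helpers written for this example. They look only at response headers, so a meta refresh tag inside the HTML body would not be detected.

```python
import email.message

REDIRECT_CODES = (301, 302, 303, 307, 308)


def parse_refresh(value):
    """Split a Refresh value such as '0; url=/home' into (delay, url)."""
    delay, _, rest = value.partition(";")
    rest = rest.strip()
    url = None
    if rest.lower().startswith("url="):
        url = rest[4:].strip().strip("'\"")
    return float(delay.strip() or 0), url


def redirect_kind(status, headers):
    """Classify a response as a Location redirect, a Refresh, or neither."""
    if status in REDIRECT_CODES and headers.get("Location"):
        return "location"
    if headers.get("Refresh"):
        return "refresh"
    return "none"


# Made-up headers for illustration.
headers = email.message.Message()
headers["Refresh"] = "0; url=/home"
kind = redirect_kind(200, headers)
delay, target = parse_refresh(headers["Refresh"])
```

A "location" result is the case the ordinary redirect handlers deal with; a "refresh" result is where something like HTTPRefreshProcessor comes in.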

I've just got a variation of the below working for me, at least when trying to read Atom from http://www.fudzilla.com/home?format=feed&type=atom

I can't verify that the snippet below will run as-is, but it might give you a start:

import cookielib
import urllib2

# The wrapper name is illustrative; the original snippet was the body of a function.
def open_with_cookies(request):
    # For how request is built, see
    # http://code.google.com/p/feedparser/source/browse/trunk/feedparser/feedparser.py#2954
    cookie_jar = cookielib.LWPCookieJar()
    cookie_handler = urllib2.HTTPCookieProcessor(cookie_jar)
    handlers = [cookie_handler]  # + others; we have proxy + progress handlers
    # _FeedURLHandler comes from feedparser; see
    # http://code.google.com/p/feedparser/source/browse/trunk/feedparser/feedparser.py#2848
    opener = urllib2.build_opener(*(handlers + [_FeedURLHandler()]))
    opener.addheaders = []  # may not be needed, but see the comments around the links above
    try:
        return opener.open(request)
    finally:
        opener.close()

I was also having the same problem where the server would respond to the login POST request with a 302 and the session token in the Set-Cookie header. Using Wireshark it was clearly visible that urllib was following the redirect but not including the session token in the Cookie.

I literally just ripped out urllib and did a direct replacement with requests and it worked perfectly first time without having to change a thing. Big props to those guys.
