Python urllib2, how to avoid errors - need help
I am using Python urllib2 to download pages from the web. I am not setting any kind of User-Agent, etc. I am getting the sample errors below. Can someone tell me an easy way to avoid them?
http://www.rottentomatoes.com/m/foxy_brown/
The server couldn't fulfill the request.
Error code: 403
http://www.spiritus-temporis.com/marc-platt-dancer-/
The server couldn't fulfill the request.
Error code: 503
http://www.golf-equipment-guide.com/news/Mark-Nichols-(golfer).html!!
The server couldn't fulfill the request.
Error code: 500
http://www.ehx.com/blog/mike-matthews-in-fuzz-documentary!!
We failed to reach a server.
Reason: timed out
IncompleteRead(5621 bytes read)
Traceback (most recent call last):
File "download.py", line 43, in <module>
localFile.write(response.read())
File "/usr/lib/python2.6/socket.py", line 327, in read
data = self._sock.recv(rbufsize)
File "/usr/lib/python2.6/httplib.py", line 517, in read
return self._read_chunked(amt)
File "/usr/lib/python2.6/httplib.py", line 563, in _read_chunked
raise IncompleteRead(value)
IncompleteRead: IncompleteRead(5621 bytes read)
Thank you
Bala

Many web resources require some kind of cookie or other authentication to access; your 403 status codes are most likely the result of this.
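Sending a browser-like User-Agent and keeping cookies between requests is usually the first thing to try with plain urllib2. A minimal sketch (the User-Agent string and the timeout value are just placeholders):

    import urllib2
    import cookielib

    # Keep cookies between requests and send a browser-like User-Agent;
    # the default "Python-urllib" agent is exactly what many sites block.
    cookie_jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
    opener.addheaders = [('User-Agent',
                          'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36')]

    response = opener.open('http://www.rottentomatoes.com/m/foxy_brown/', timeout=10)
    html = response.read()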
503 errors tend to mean you're rapidly accessing resources from a server in a loop and you need to wait briefly before attempting another access.
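Something along these lines, with placeholder URLs and arbitrary delays, is what I mean:

    import time
    import urllib2

    def fetch(url, retries=3):
        # Try a URL a few times, backing off when the server answers 503.
        for attempt in range(retries):
            try:
                return urllib2.urlopen(url, timeout=10).read()
            except urllib2.HTTPError as e:
                if e.code == 503 and attempt < retries - 1:
                    time.sleep(5 * (attempt + 1))  # wait longer on each retry
                else:
                    raise

    urls = ['http://example.com/page1', 'http://example.com/page2']  # placeholders
    for url in urls:
        page = fetch(url)
        time.sleep(1)  # small pause between downloads so the server isn't hammered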
The 500 example doesn't even appear to exist...
The URL in the timeout example probably shouldn't include the "!!"; I can only load that resource without it.
I recommend you read up on HTTP status codes.
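Whatever you do, don't let one bad URL kill the whole run. A rough sketch of the kind of error handling your download loop could use (not a drop-in for your download.py, just the shape of it):

    import socket
    import httplib
    import urllib2

    def try_download(url):
        # Return the page body, or None if this particular URL failed.
        try:
            return urllib2.urlopen(url, timeout=15).read()
        except urllib2.HTTPError as e:       # server replied with a 4xx/5xx status
            print "%s -> HTTP status %d" % (url, e.code)
        except urllib2.URLError as e:        # DNS failure, refused connection, ...
            print "%s -> could not reach server: %s" % (url, e.reason)
        except httplib.IncompleteRead:       # connection dropped mid-response
            print "%s -> incomplete read" % url
        except socket.timeout:               # read timed out after connecting
            print "%s -> timed out" % url
        return None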
For more complicated tasks, you might want to consider using mechanize, twill, or even Selenium or Windmill, which support more complicated scenarios, including cookies and JavaScript.
For a random website it can be tricky to get by with urllib2 alone (signed cookies, anyone?).
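If you do try mechanize, the basic usage looks roughly like this (it has to be installed separately; the User-Agent string is again a placeholder):

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)   # don't refuse pages because of robots.txt
    br.addheaders = [('User-Agent', 'Mozilla/5.0 (compatible; my-downloader)')]

    response = br.open('http://www.rottentomatoes.com/m/foxy_brown/')
    html = response.read()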