URL redirection problem
I have the following URL:
http://bit.ly/cDdh1c
When you enter the above URL in a browser, it redirects to http://www.kennystopproducts.info/Top/?hop=arnishad
However, when I try to find the base URL (the final URL after following all redirects) for the same URL, http://bit.ly/cDdh1c, with a Python program (code below), I get http://www.cbtrends.com/ as the base URL. Please see the log file below.
Why does the same URL behave differently in a browser and in a Python program? What should I change in the Python program so that it follows the redirects to the proper URL?
Another URL for which I observe similar behaviour:
- http://bit.ly/bEKyOx ====> http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&item=150413977509 (via browser)
- http://bit.ly/bEKyOx ====> http://www.ebay.com (via Python program)
maxattempts = 5
turl = url
while (maxattempts > 0) :
    host,path = urlparse.urlsplit(turl)[1:3]
    if len(host.strip()) == 0 :
        return None
    try:
        connection = httplib.HTTPConnection(host,timeout=10)
        connection.request("HEAD", path)
        resp = connection.getresponse()
    except:
        return None
    maxattempts = maxattempts - 1
    if (resp.status >= 300) and (resp.status <= 399):
        self.logger.debug("The present %s is a redirection one" %turl)
        turl = resp.getheader('location')
    elif (resp.status >= 200) and (resp.status <= 299) :
        self.logger.debug("The present url %s is a proper one" %turl)
        return turl
    else :
        #some problem with this url
        return None
return None
Log file for your reference
2010-02-14 10:29:43,260 - paypallistener.views.MrCrawler - INFO - Bringing down the base URL
2010-02-14 10:29:43,261 - paypallistener.views.MrCrawler - DEBUG - what is the url=http://bit.ly/cDdh1c
2010-02-14 10:29:43,994 - paypallistener.views.MrCrawler - DEBUG - The present http://bit.ly/cDdh1c is a redirection one
2010-02-14 10:29:43,995 - paypallistener.views.MrCrawler - DEBUG - what is the url=http://www.cbtrends.com/get-product.html?productid=reFfJcmpgGt95hoiavbXUAYIMP7OfiQn0qBA8BC7%252BV8%253D&affid=arnishad&tid=arnishad&utm_source=twitterfeed&utm_medium=twitter
2010-02-14 10:29:44,606 - paypallistener.views.MrCrawler - DEBUG - The present http://www.cbtrends.com/get-product.html?productid=reFfJcmpgGt95hoiavbXUAYIMP7OfiQn0qBA8BC7%252BV8%253D&affid=arnishad&tid=arnishad&utm_source=twitterfeed&utm_medium=twitter is a redirection one
2010-02-14 10:29:44,607 - paypallistener.views.MrCrawler - DEBUG - what is the url=http://www.cbtrends.com/
2010-02-14 10:29:45,547 - paypallistener.views.MrCrawler - DEBUG - The present url http://www.cbtrends.com/ is a proper one
http://www.cbtrends.com/
Your problem is that when you call urlsplit, your path variable only contains the path and is missing the query.
So, instead try:
import httplib
import urlparse

def getUrl(url):
    maxattempts = 10
    turl = url
    while maxattempts > 0:
        host, path, query = urlparse.urlsplit(turl)[1:4]
        if len(host.strip()) == 0:
            return None
        # Send the query string too; append '?' only when a query exists
        target = path or '/'
        if query:
            target += '?' + query
        try:
            connection = httplib.HTTPConnection(host, timeout=10)
            connection.request("GET", target)
            resp = connection.getresponse()
        except Exception:
            return None
        maxattempts = maxattempts - 1
        if 300 <= resp.status <= 399:
            # Redirect: follow the Location header on the next iteration
            turl = resp.getheader('location')
        elif 200 <= resp.status <= 299:
            return turl
        else:
            # some problem with this url
            return None
    return None

print getUrl('http://bit.ly/cDdh1c')
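To see what the original slice drops, here is a small sketch that splits a shortened version of the second URL from the log. The productid value is a made-up placeholder, and the try/except import is only there so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit      # Python 2, as used in this thread

# Shortened URL; "productid=abc" is a hypothetical placeholder value
url = "http://www.cbtrends.com/get-product.html?productid=abc&affid=arnishad"

parts = urlsplit(url)        # (scheme, netloc, path, query, fragment)
host, path = parts[1:3]      # the slice used in the question's code
query = parts[3]             # the element that slice leaves out

print(host)    # www.cbtrends.com
print(path)    # /get-product.html
print(query)   # productid=abc&affid=arnishad
```

Requesting only path asks the server for /get-product.html with no parameters, which is a different request from the one the browser sends.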
Your problem comes from this line:
host,path = urlparse.urlsplit(turl)[1:3]
You're leaving out the query string. So in the example log you provided, the second HEAD request goes to http://www.cbtrends.com/get-product.html without the GET parameters. Open that URL in your browser and you'll see it redirects to http://www.cbtrends.com/.
You have to build the request path from both the path and the query returned by urlsplit (the fragment is not sent to the server, so leave it out):
parts = urlparse.urlsplit(turl)
host = parts[1]
path = parts[2]
if parts[3]:
    path = "%s?%s" % (parts[2], parts[3])
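As a rough sketch of the same idea, the two helpers below build the request target from a URL and resolve a Location header against the URL that returned it. The names request_target and next_url are hypothetical, not part of httplib or urlparse. Resolving the header with urljoin is an extra precaution that is not needed for the log above, where the redirects are absolute, but it covers servers that send relative redirects.

```python
try:
    from urllib.parse import urlsplit, urljoin  # Python 3
except ImportError:
    from urlparse import urlsplit, urljoin      # Python 2

def request_target(url):
    """Return (host, target); target is the path plus query, without fragment."""
    parts = urlsplit(url)
    target = parts.path or "/"       # an empty path must be sent as "/"
    if parts.query:
        target += "?" + parts.query  # add "?" only when there is a query
    return parts.netloc, target

def next_url(current_url, location):
    """Resolve a Location header value, which may be relative."""
    return urljoin(current_url, location)

print(request_target("http://www.kennystopproducts.info/Top/?hop=arnishad"))
print(next_url("http://www.cbtrends.com/get-product.html", "/"))
```

Inside the redirect loop, request_target would replace the urlsplit slice, and next_url would replace the direct assignment from resp.getheader('location').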