URL redirection problem

2022-12-20 16:00

I have the URL below:

http://bit.ly/cDdh1c

When you enter this URL in a browser, it redirects to http://www.kennystopproducts.info/Top/?hop=arnishad

But when I try to find the base URL (the final URL after following all redirects) for the same URL, http://bit.ly/cDdh1c, with a Python program (code below), I get http://www.cbtrends.com/ as the base URL. Please see the log file below.

Why does the same URL behave differently in a browser and in a Python program? What should I change in the Python program so that it redirects to the proper URL? I am wondering how this strange behaviour can happen.

Another URL for which I am observing similar behaviour is:

http://bit.ly/bEKyOx ====> http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&item=150413977509 (via browser)

http://www.ebay.com (via Python program)

    maxattempts = 5
    turl = url
    while maxattempts > 0:
        host, path = urlparse.urlsplit(turl)[1:3]
        if len(host.strip()) == 0:
            return None

        try:
            connection = httplib.HTTPConnection(host, timeout=10)
            connection.request("HEAD", path)
            resp = connection.getresponse()
        except:
            return None
        maxattempts = maxattempts - 1
        if (resp.status >= 300) and (resp.status <= 399):
            self.logger.debug("The present %s is a redirection one" % turl)
            turl = resp.getheader('location')
        elif (resp.status >= 200) and (resp.status <= 299):
            self.logger.debug("The present url %s is a proper one" % turl)
            return turl
        else:
            # some problem with this url
            return None
    return None

Log file for your reference

2010-02-14 10:29:43,260 - paypallistener.views.MrCrawler - INFO - Bringing down the base URL
2010-02-14 10:29:43,261 - paypallistener.views.MrCrawler - DEBUG - what is the url=http://bit.ly/cDdh1c
2010-02-14 10:29:43,994 - paypallistener.views.MrCrawler - DEBUG - The present http://bit.ly/cDdh1c is a redirection one
2010-02-14 10:29:43,995 - paypallistener.views.MrCrawler - DEBUG - what is the url=http://www.cbtrends.com/get-product.html?productid=reFfJcmpgGt95hoiavbXUAYIMP7OfiQn0qBA8BC7%252BV8%253D&affid=arnishad&tid=arnishad&utm_source=twitterfeed&utm_medium=twitter
2010-02-14 10:29:44,606 - paypallistener.views.MrCrawler - DEBUG - The present http://www.cbtrends.com/get-product.html?productid=reFfJcmpgGt95hoiavbXUAYIMP7OfiQn0qBA8BC7%252BV8%253D&affid=arnishad&tid=arnishad&utm_source=twitterfeed&utm_medium=twitter is a redirection one
2010-02-14 10:29:44,607 - paypallistener.views.MrCrawler - DEBUG - what is the url=http://www.cbtrends.com/
2010-02-14 10:29:45,547 - paypallistener.views.MrCrawler - DEBUG - The present url http://www.cbtrends.com/ is a proper one
http://www.cbtrends.com/

Your problem is that when you call urlsplit, your path variable only contains the path and is missing the query.

So, instead try:

import httplib
import urlparse

def getUrl(url):
    maxattempts = 10
    turl = url
    while maxattempts > 0:
        host, path, query = urlparse.urlsplit(turl)[1:4]
        if len(host.strip()) == 0:
            return None
        # Send the path together with the query string, if there is one.
        target = path or '/'
        if query:
            target = target + '?' + query
        try:
            connection = httplib.HTTPConnection(host, timeout=10)
            connection.request("GET", target)
            resp = connection.getresponse()
        except Exception:
            return None
        maxattempts = maxattempts - 1
        if (resp.status >= 300) and (resp.status <= 399):
            # Redirect: continue with the URL from the Location header.
            turl = resp.getheader('location')
        elif (resp.status >= 200) and (resp.status <= 299):
            return turl
        else:
            # some problem with this url
            return None
    return None

print getUrl('http://bit.ly/cDdh1c')

Your problem comes from this line:

host,path = urlparse.urlsplit(turl)[1:3]

You're leaving out the query string. So in the example log you provide, the second HEAD request goes to http://www.cbtrends.com/get-product.html without the GET parameters. Open that URL in your browser and you'll see that it redirects to http://www.cbtrends.com/.
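To see what the slice drops, here is a small sketch. It assumes Python 3, where the urlparse module became urllib.parse; the URL is a shortened form of the one in the log, and request_target is a hypothetical helper name, not part of any library.

```python
from urllib.parse import urlsplit

# Shortened version of the redirect target from the log (assumption:
# only two of the original query parameters are kept for readability).
url = ("http://www.cbtrends.com/get-product.html"
       "?affid=arnishad&tid=arnishad")

# The slice [1:3] keeps only netloc and path; the query sits at index 3.
host, path = urlsplit(url)[1:3]
print(host)  # www.cbtrends.com
print(path)  # /get-product.html


def request_target(url):
    """Return the path plus query string to put in the request line."""
    parts = urlsplit(url)
    target = parts.path or "/"  # an empty path must be sent as "/"
    if parts.query:
        target += "?" + parts.query
    return target


print(request_target(url))  # /get-product.html?affid=arnishad&tid=arnishad
```

With the query string included, the server receives the same request a browser would send for that URL.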

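You have to build the request path from both the path and the query elements of the tuple returned by urlsplit. The fragment is never sent to the server, so leave it out.

parts = urlparse.urlsplit(turl)
host = parts[1]
path = parts[2] or "/"
if parts[3]:
    path += "?" + parts[3]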
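Putting the pieces together, a redirect follower might look like the sketch below. It is written for Python 3 (http.client and urllib.parse instead of httplib and urlparse), and the names head, resolve, fetch and max_hops are illustrative choices, not taken from the question. It also resolves a relative Location header against the current URL, which the original loop does not handle.

```python
import http.client
from urllib.parse import urlsplit, urljoin


def head(url, timeout=10):
    """Issue one HEAD request; return (status, Location header or None)."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query  # keep the query string
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        conn.request("HEAD", target)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def resolve(url, fetch=head, max_hops=5):
    """Follow redirects; return the final URL, or None on failure."""
    for _ in range(max_hops):
        if not urlsplit(url).netloc:
            return None
        try:
            status, location = fetch(url)
        except (OSError, http.client.HTTPException):
            return None
        if 300 <= status <= 399 and location:
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, location)
        elif 200 <= status <= 299:
            return url
        else:
            return None
    return None
```

Calling resolve('http://bit.ly/cDdh1c') would perform real network requests. The fetch parameter exists so that the loop can be exercised with a stand-in function instead. Some servers answer HEAD differently from GET, so switching head to a GET request is a reasonable variation if the results still differ from the browser.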
