
Function in Python to clean up and normalize a URL

I am using URLs as keys, so I need them to be consistent and clean. I need a Python function that will take a URL and clean it up so that I can do a GET from the DB. For example, it will take the following:

example.com
example.com/
http://example.com/
http://example.com
http://example.com?
http://example.com/?
http://example.com//

and output a clean, consistent version:

http://example.com/

I looked through the standard library and GitHub and couldn't find anything like this.

Update

I couldn't find a Python library that implements everything discussed here and in the RFC:

http://en.wikipedia.org/wiki/URL_normalization

So I am writing one now. There is a lot more to this than I initially imagined.
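For reference, here is a minimal sketch (Python 3 standard library only) that handles just the example inputs above; as noted, RFC-style normalization covers far more (case folding, percent-encoding, default ports, and so on):

import re
from urllib.parse import urlsplit, urlunsplit

def clean_url(url, default_scheme='http'):
    # add a scheme if missing so urlsplit treats the host as the netloc
    if '://' not in url:
        url = default_scheme + '://' + url
    scheme, netloc, path, query, fragment = urlsplit(url)
    # collapse runs of slashes and turn an empty path into a single '/'
    path = re.sub(r'/{2,}', '/', path) or '/'
    return urlunsplit((scheme, netloc.lower(), path, query, fragment))

# each of the seven example inputs above comes out as 'http://example.com/'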


Take a look at urlparse.urlparse(). I've had good success with it.


note: This answer is from 2011 and is specific to Python 2. In Python 3 the urlparse module has been renamed to urllib.parse. The corresponding Python 3 documentation for urllib.parse can be found here:

https://docs.python.org/3/library/urllib.parse.html
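A quick illustration of how it splits a URL into its components (Python 3 names; on Python 2 use urlparse.urlparse):

from urllib.parse import urlparse

parts = urlparse('http://example.com//path?q=1#frag')
print(parts.scheme)    # http
print(parts.netloc)    # example.com
print(parts.path)      # //path
print(parts.query)     # q=1
print(parts.fragment)  # frag

Note that urlparse only parses; the normalization itself (lowercasing, collapsing slashes, rebuilding with urlunparse) is still up to you.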


It's done in scrapy:

http://nullege.com/codes/search/scrapy.utils.url.canonicalize_url

Canonicalize the given url by applying the following procedures:

  • sort query arguments, first by key, then by value
  • percent-encode paths and query arguments; non-ASCII characters are percent-encoded using UTF-8 (RFC 3986)
  • normalize all spaces in query arguments to '+' (plus symbol)
  • normalize the case of percent encodings (%2f -> %2F)
  • remove query arguments with blank values (unless keep_blank_values is True)
  • remove fragments (unless keep_fragments is True)
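In recent scrapy versions this function lives in w3lib (a scrapy dependency), so it can be used standalone; a quick demonstration, assuming that import path:

# on older versions: from scrapy.utils.url import canonicalize_url
from w3lib.url import canonicalize_url

print(canonicalize_url('http://example.com/do?b=2&a=1#frag'))
# expected: http://example.com/do?a=1&b=2  (args sorted, fragment dropped)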


url-normalize might be what you're looking for.

Depending on your preferences, you may also want to:

  1. remove UTM parameters
  2. remove http(s)://
  3. remove www.
  4. remove trailing /

Here is an example that does all of this:

from w3lib.url import url_query_cleaner
from url_normalize import url_normalize

urls = ['example.com',
        'example.com/',
        'http://example.com/',
        'http://example.com',
        'http://example.com?',
        'http://example.com/?',
        'http://example.com//',
        'http://example.com?utm_source=Google']


def canonical_url(u):
    u = url_normalize(u)
    # strip common UTM tracking parameters
    u = url_query_cleaner(u,
                          parameterlist=['utm_source', 'utm_medium',
                                         'utm_campaign', 'utm_term',
                                         'utm_content'],
                          remove=True)

    # drop the scheme, 'www.' prefix and trailing slash
    if u.startswith("http://"):
        u = u[7:]
    if u.startswith("https://"):
        u = u[8:]
    if u.startswith("www."):
        u = u[4:]
    if u.endswith("/"):
        u = u[:-1]
    return u


list(map(canonical_url, urls))

Which gives this result:

['example.com',
 'example.com',
 'example.com',
 'example.com',
 'example.com',
 'example.com',
 'example.com',
 'example.com']

There are still issues with shortened links and redirects of various sorts, but you'd need to make a request to the URL to sort through those.
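For instance, a minimal sketch (assuming the third-party requests package) that follows a redirect chain to its final destination:

import requests

def resolve_redirects(url, timeout=5):
    # HEAD keeps the transfer small; allow_redirects follows the chain
    # (some servers mishandle HEAD, so fall back to GET if needed)
    response = requests.head(url, allow_redirects=True, timeout=timeout)
    return response.url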


Have you considered using regular expressions? They will help you check for malformed URLs. I have used this pattern in one of my applications:

"^[, .a-zA-Z0-9]*$"
