Python: How to resolve URLs containing '..'

2023-01-27 21:52 问答作者：

I need to uniquely identify and store some U开发者_开发百科RLs. The problem is that sometimes they come containing ".." like http://somedomain.com/foo/bar/../../some/url which basically is http://somedomain.com/some/url if I'm not wrong.

Is there a Python function or a tricky way to resolve this URLs ?

There’s a simple solution using urllib.parse.urljoin:

>>> from urllib.parse import urljoin
>>> urljoin('http://www.example.com/foo/bar/../../baz/bux/', '.')
'http://www.example.com/baz/bux/'

However, if there is no trailing slash (the last component is a file, not a directory), the last component will be removed.

This fix uses the urlparse function to extract the path, then use (the posixpath version of) os.path to normalize the components. Compensate for a mysterious issue with trailing slashes, then join the URL back together. The following is doctestable:

from urllib.parse import urlparse
import posixpath

def resolve_components(url):
    """
    >>> resolve_components('http://www.example.com/foo/bar/../../baz/bux/')
    'http://www.example.com/baz/bux/'
    >>> resolve_components('http://www.example.com/some/path/../file.ext')
    'http://www.example.com/some/file.ext'
    """
    parsed = urlparse(url)
    new_path = posixpath.normpath(parsed.path)
    if parsed.path.endswith('/'):
        # Compensate for issue1707768
        new_path += '/'
    cleaned = parsed._replace(path=new_path)
    return cleaned.geturl()

Those are file paths. Look at os.path.normpath:

>>> import os
>>> os.path.normpath('/foo/bar/../../some/url')
'/some/url'

EDIT:

If this is on Windows, your input path will use backslashes instead of slashes. In this case, you still need os.path.normpath to get rid of the .. patterns (and // and /./ and whatever else is redundant), then convert the backslashes to forward slashes:

def fix_path_for_URL(path):
    result = os.path.normpath(path)
    if os.sep == '\\':
        result = result.replace('\\', '/')
    return result

EDIT 2:

If you want to normalize URLs, do it (before you strip off the method and such) with urlparse module, as shown in the answer to this question.

EDIT 3:

It seems that urljoin doesn't normalize the base path it's given:

>>> import urlparse
>>> urlparse.urljoin('http://somedomain.com/foo/bar/../../some/url', '')
'http://somedomain.com/foo/bar/../../some/url'

normpath by itself doesn't quite cut it either:

>>> import os
>>> os.path.normpath('http://somedomain.com/foo/bar/../../some/url')
'http:/somedomain.com/some/url'

Note the initial double slash got eaten.

So we have to make them join forces:

def fix_URL(urlstring):
    parts = list(urlparse.urlparse(urlstring))
    parts[2] = os.path.normpath(parts[2].replace('/', os.sep)).replace(os.sep, '/')
    return urlparse.urlunparse(parts)

Usage:

>>> fix_URL('http://somedomain.com/foo/bar/../../some/url')
'http://somedomain.com/some/url'

urljoin won't work, as it only resolves dot segments if the second argument isn't absolute(!?) or empty. Not only that, it doesn't handle excessive ..s properly according to RFC 3986 (they should be removed; urljoin doesn't do so). posixpath.normpath can't be used either (much less os.path.normpath), since it resolves multiple slashes in a row to only one (e.g. ///// becomes /), which is incorrect behavior for URLs.

The following short function resolves any URL path string correctly. It shouldn't be used with relative paths, however, since additional decisions about its behavior would then need to be made (Raise an error on excessive ..s? Remove . in the beginning? Leave them both?) - instead, join URLs before resolving if you know you might handle relative paths. Without further ado:

def resolve_url_path(path):
    segments = path.split('/')
    segments = [segment + '/' for segment in segments[:-1]] + [segments[-1]]
    resolved = []
    for segment in segments:
        if segment in ('../', '..'):
            if resolved[1:]:
                resolved.pop()
        elif segment not in ('./', '.'):
            resolved.append(segment)
    return ''.join(resolved)

This handles trailing dot segments (that is, without a trailing slash) and consecutive slashes correctly. To resolve an entire URL, you can then use the following wrapper (or just inline the path resolution function into it).

try:
    # Python 3
    from urllib.parse import urlsplit, urlunsplit
except ImportError:
    # Python 2
    from urlparse import urlsplit, urlunsplit

def resolve_url(url):
    parts = list(urlsplit(url))
    parts[2] = resolve_url_path(parts[2])
    return urlunsplit(parts)

You can then call it like this:

>>> resolve_url('http://example.com/../thing///wrong/../multiple-slashes-yeah/.')
'http://example.com/thing///multiple-slashes-yeah/'

Correct URL resolution has more than a few pitfalls, it turns out!

I wanted to comment on the resolveComponents function in the top response.

Notice that if your path is /, the code will add another one which can be problematic. I therefore changed the IF condition to:

if parsed.path.endswith( '/' ) and parsed.path != '/':

According to RFC 3986 this should happen as part of "relative resolution" process. So answer could be urlparse.urljoin(url, ''). But due to bug urlparse.urljoin does not remove dot segments when second argument is empty url. You can use yurl — alternative url manipulation library. It do this right:

>>> import yurl
>>> print yurl.URL('http://somedomain.com/foo/bar/../../some/url') + yurl.URL()
http://somedomain.com/some/url

import urlparse
import posixpath

parsed = list(urlparse.urlparse(url))
parsed[2] = posixpath.normpath(posixpath.join(parsed[2], rel_path))
proper_url = urlparse.urlunparse(parsed)

First, you need to the base URL, and then you can use urljoin from urllib.parse

example :

from urllib.parse import urljoin
    
def resolve_urls( urls, site):
    for url in urls:
        print(urljoin(site, url))
    return

urls= ["/aboutMytest", "#", "/terms-of-use", "https://www.example.com/Admission"]

resolve_urls(urls,'https://example.com/')

output :

https://example.com/aboutMytest
https://example.com/
https://example.com/terms-of-use
https://www.example.com/Admission

继续阅读：python

Python: How to resolve URLs containing '..'

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？