Which should I be using: urlparse or urlsplit?
Which URL parsing function pair should I be using and why: urlparse and urlunparse, or urlsplit and urlunsplit?
Directly from the docs you linked yourself:
urllib.parse.urlsplit(urlstring, scheme='', allow_fragments=True)
This is similar to urlparse(), but does not split the params from the URL. This should generally be used instead of urlparse() if the more recent URL syntax allowing parameters to be applied to each segment of the path portion of the URL (see RFC 2396) is wanted.
Since the documentation you linked doesn't include an example with a non-empty params value, I was also confused until I found this:
>>> urllib.parse.urlparse("http://example.com/pa/th;param1=foo;param2=bar?name=val#frag")
ParseResult(scheme='http', netloc='example.com', path='/pa/th', params='param1=foo;param2=bar', query='name=val', fragment='frag')
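For comparison, here is a minimal sketch that runs the same URL through both functions, assuming Python 3's urllib.parse. The second URL, with the semicolon in an earlier path segment, is my own illustration and not from the docs.

```python
from urllib.parse import urlparse, urlsplit

url = "http://example.com/pa/th;param1=foo;param2=bar?name=val#frag"

parsed = urlparse(url)
split = urlsplit(url)

# urlparse moves the trailing ";..." text out of path and into params.
print(parsed.path)    # /pa/th
print(parsed.params)  # param1=foo;param2=bar

# urlsplit has no params field, so the ";..." text stays inside path.
print(split.path)     # /pa/th;param1=foo;param2=bar

# urlparse only inspects the last path segment, so a semicolon in an
# earlier segment is left in path and params stays empty.
inner = urlparse("http://example.com/pa;param1=foo/th")
print(inner.path)          # /pa;param1=foo/th
print(repr(inner.params))  # ''
```

So the only difference between the two is whether the last segment's ;params get peeled off the path for you.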
(Some history because I got nerd-sniped.)
I'd never heard of URL "parameters" other than URL component params, e.g. /user/213/settings, or query params, e.g. /user?id=213, and I think this third kind is essentially obsolete.
In the beginning, RFC 1738 defined the HTTP URL to never allow ; in the path:
http://<host>:<port>/<path>?<searchpart>
Within the <path> and <searchpart> components, "/", ";", "?" are reserved.
; was reserved with special meaning in other schemes, like the ftp:// url-path:
<cwd1>/<cwd2>/.../<cwdN>/<name>;type=<typecode>
Apparently in 1995, RFC 1808 defined URL params as a top-level component between path and query:
<scheme>://<net_loc>/<path>;<params>?<query>#<fragment>
Then in 1998, RFC 2396 defined URIs as having adjacent top-level components path and query:
<scheme>://<authority><path>?<query>
where the path is defined as multiple path_segments, each of which could include a param:
path = [ abs_path | opaque_part ]
abs_path = "/" path_segments
path_segments = segment *( "/" segment )
segment = *pchar *( ";" param )
Finally in 2005, RFC 3986 obsoleted RFC 1808 and 2396, defining URI similarly to RFC 2396:
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
hier-part = "//" authority path-abempty
/ path-absolute
/ path-rootless
/ path-empty
And the special syntax of ;params is considered an opaque part of the URI syntax that may be specific to the HTTP(S) scheme or just some specific implementation:
Aside from dot-segments in hierarchical paths, a path segment is considered opaque by the generic syntax. URI producing applications often use the reserved characters allowed in a segment to delimit scheme-specific or dereference-handler-specific subcomponents. For example, the semicolon (";") and equals ("=") reserved characters are often used to delimit parameters and parameter values applicable to that segment. The comma (",") reserved character is often used for similar purposes. For example, one URI producer might use a segment such as "name;v=1.1" to indicate a reference to version 1.1 of "name", whereas another might use a segment such as "name,1.1" to indicate the same. Parameter types may be defined by scheme-specific semantics, but in most cases the syntax of a parameter is specific to the implementation of the URI's dereferencing algorithm.
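Since RFC 3986 leaves segment parameters up to the application, neither urlparse nor urlsplit will pull them out of every segment for you. A rough sketch of doing it by hand is below. Note that split_segment_params is a hypothetical helper of my own, not part of urllib, and the "key=value" convention it assumes is only one of the conventions the RFC mentions.

```python
from urllib.parse import urlsplit


def split_segment_params(path):
    """Split each path segment into a (name, params_dict) pair.

    Assumes the ";key=value" convention; other producers may use
    a different delimiter such as ",".
    """
    result = []
    for segment in path.split("/"):
        # Everything before the first ";" is the segment name.
        name, *raw_params = segment.split(";")
        params = {}
        for raw in raw_params:
            key, _, value = raw.partition("=")
            params[key] = value
        result.append((name, params))
    return result


# urlsplit keeps the ";..." text in path, so every segment is still intact.
path = urlsplit("http://example.com/name;v=1.1/th;param1=foo;param2=bar").path
segments = split_segment_params(path)
print(segments)
```

The leading empty entry in the result comes from the path starting with "/".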
As the documentation says, urlparse.urlparse returns a 6-tuple (with an additional params element), while urlparse.urlsplit returns a 5-tuple. The extra element is:
Attribute | Index | Value | Value if not present
params | 3 | Parameters for last path element | empty string
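A quick way to see the tuple difference, and that each function pairs with its own "un" counterpart, is sketched below. This assumes Python 3, where both functions live in urllib.parse rather than the Python 2 urlparse module, and the example URL is made up.

```python
from urllib.parse import urlparse, urlsplit, urlunparse, urlunsplit

url = "http://example.com/pa/th;param1=foo?name=val#frag"

parsed = urlparse(url)  # 6 fields: scheme, netloc, path, params, query, fragment
split = urlsplit(url)   # 5 fields: scheme, netloc, path, query, fragment

print(len(parsed), len(split))  # 6 5

# Index 3 is params for urlparse, but already the query for urlsplit.
print(parsed[3])  # param1=foo
print(split[3])   # name=val

# Each result round-trips through its matching inverse function.
print(urlunparse(parsed) == url)  # True
print(urlunsplit(split) == url)   # True
```

Because the tuples have different lengths, pass a urlparse result to urlunparse and a urlsplit result to urlunsplit, not the other way around.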
FYI: According to [RFC2396](https://www.rfc-editor.org/rfc/rfc2396.html#appendix-C), regarding the _parameter_ component in the URL specification:

> Extensive testing of current client applications demonstrated that the majority of deployed systems do not use the ";" character to indicate trailing parameter information, and that the presence of a semicolon in a path segment does not affect the relative parsing of that segment. Therefore, parameters have been removed as a separate component and may now appear in any path segment. Their influence has been removed from the algorithm for resolving a relative URI reference.