Sitemap Encoding Woes

I'm having real trouble understanding the specification and guidelines on how to properly escape and encode a URL for submission in a sitemap.

In the sitemaps.org (entity escaping) examples, they have an example URL:

http://www.example.com/ümlat.php&q=name

Which when UTF-8 encoded ends up as (according to them):

http://www.example.com/%C3%BCmlat.php&q=name

However, when I try this (rawurlencode) in PHP, I end up with:

http%3A%2F%2Fwww.example.com%2F%C3%BCmlat.php%26q%3Dname

I've sort of beaten this by using this function found on PHP.net:

$entities = array('%21', '%2A', '%27', '%28', '%29', '%3B', '%3A', '%40',
    '%26', '%3D', '%2B', '%24', '%2C', '%2F', '%3F', '%23', '%5B', '%5D');
    
$replacements = array('!', '*', "'", "(", ")", ";", ":", "@", "&", "=", "+",
    "$", ",", "/", "?", "#", "[", "]");

$string = str_replace($entities, $replacements, rawurlencode($string));

but according to someone I spoke to (Kohana BDFM), this interpretation is wrong. Honestly, I'm so confused I don't even know what's right.

What's the correct way to encode a URL for use in the sitemap?

Relevant RFC 3986


The problem is that http://www.example.com/ümlat.php&q=name is not a valid URL: ü is not among the characters the grammar below allows unescaped.

(Source: RFC 1738. It is obsolete but serves its purpose here. RFC 3986 does allow more characters, but escaping characters that don't need escaping does no harm.)

httpurl        = "http://" hostport [ "/" hpath [ "?" search ]]
hpath          = hsegment *[ "/" hsegment ]
hsegment       = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
uchar          = unreserved | escape
unreserved     = alpha | digit | safe | extra
safe           = "$" | "-" | "_" | "." | "+"
extra          = "!" | "*" | "'" | "(" | ")" | ","
escape         = "%" hex hex
search         = *[ uchar | ";" | ":" | "@" | "&" | "=" ]

So every character must be escaped unless it is one of ;:@&=$-_.+!*'(), or a letter or digit (0-9, a-z, A-Z), or is part of an escape sequence (e.g. %A0 or, equivalently, %a0). The ? character can appear at most once, where it separates the path from the query string. The / character can appear in the path portion, but not in the query string. The convention for encoding any other character is to compute its UTF-8 representation and percent-escape each byte of that sequence.
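As a rough illustration of the rule above, here is a minimal sketch that checks whether a string already satisfies the hsegment/search production. The function name is_valid_segment is made up for this example, and the check works on raw bytes, so it assumes the input is a single path segment or query string rather than a whole URL.

```php
<?php
// Hypothetical helper: true if $s contains only characters the quoted
// grammar allows unescaped, plus well-formed %XX escape sequences.
function is_valid_segment(string $s): bool
{
    // Allowed set: letters, digits, ; : @ & = $ - _ . + ! * ' ( ) ,
    // The D modifier makes $ match only at the very end of the string.
    $pattern = '/^(?:[A-Za-z0-9;:@&=$\-_.+!*\'(),]|%[0-9A-Fa-f]{2})*$/D';
    return preg_match($pattern, $s) === 1;
}

var_dump(is_valid_segment("\xC3\xBCmlat.php&q=name")); // raw UTF-8 bytes of "ü": bool(false)
var_dump(is_valid_segment('%C3%BCmlat.php&q=name'));   // already escaped: bool(true)
var_dump(is_valid_segment('100%'));                    // bare "%" is not an escape: bool(false)
```

Note that "/" and "?" are deliberately absent from the allowed set, since in this grammar they are separators rather than segment content.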

Assuming the host part is not a problem, your algorithm should:

  • extract the path part
  • extract the query string part
  • for each of those, look for invalid characters
  • encode those characters in UTF-8
  • pass the result to rawurlencode
  • replace the character in the URL with the result of rawurlencode
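The steps above could be sketched roughly as follows. This is only one possible reading of the algorithm, not a definitive implementation: escape_part and escape_url are hypothetical names, the input is assumed to be a UTF-8 string with an http or https scheme and no fragment, and existing %XX escapes are assumed to be intentional and left alone.

```php
<?php
// Hypothetical helper: percent-encode every byte the grammar does not allow
// in a path segment or query string, leaving existing %XX escapes untouched.
function escape_part(string $part): string
{
    return preg_replace_callback(
        '/%[0-9A-Fa-f]{2}|[^A-Za-z0-9;:@&=$\-_.+!*\'(),]/',
        function (array $m): string {
            // A 3-byte match is an existing escape; otherwise it is one disallowed byte.
            return strlen($m[0]) === 3 ? $m[0] : rawurlencode($m[0]);
        },
        $part
    );
}

// Hypothetical helper: escape the path and query string of a UTF-8 URL.
function escape_url(string $url): string
{
    // $m[1] = scheme and host, $m[2] = path, $m[3] = query string (if any)
    if (!preg_match('~^(https?://[^/?]*)([^?]*)(?:\?(.*))?$~s', $url, $m)) {
        return $url; // not an http(s) URL; left unchanged in this sketch
    }
    // Keep "/" as the separator between path segments and escape each segment.
    $path = implode('/', array_map('escape_part', explode('/', $m[2])));
    $result = $m[1] . $path;
    if (isset($m[3])) {
        // Only the first "?" separates path and query; later ones get escaped.
        $result .= '?' . escape_part($m[3]);
    }
    return $result;
}

// "\xC3\xBC" is the UTF-8 encoding of "ü".
echo escape_url("http://www.example.com/\xC3\xBCmlat.php&q=name"), "\n";
// prints http://www.example.com/%C3%BCmlat.php&q=name
```

Because the regular expression works byte by byte, a multi-byte UTF-8 character such as ü comes out as one escape per byte. Before the result is written into the sitemap XML it would still need entity escaping (for example turning & into &amp;), which this sketch does not do.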