How should I sanitize URLs so people don't put 漢字 or á or other things in them?
How should I sanitize URLs so people don't put 漢字 or other things in them?
EDIT: I'm using Java. The URL will be generated from a question the user asks on a form. It seems StackOverflow just removes the offending characters, but it also turns an á into an a.
Is there a standard convention for doing this? Or does each developer just write their own version?
The process you're describing is called slugifying. There's no fixed mechanism for doing it; every framework handles it in its own way.
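Since the question mentions Java, here is a minimal sketch of one way to slugify a title. The class name Slugs and the method slugify are made up for illustration; only java.text.Normalizer and the String methods are standard. It assumes you want the behaviour described in the question: accented Latin letters lose their accents (á becomes a), and anything else outside a-z and 0-9, including 漢字, is dropped.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper class, not part of any framework.
class Slugs {

    static String slugify(String title) {
        // Decompose accented letters, e.g. "á" becomes "a" followed by a combining accent.
        String decomposed = Normalizer.normalize(title, Normalizer.Form.NFD);

        // Remove the combining marks, leaving the plain base letters.
        String withoutMarks = decomposed.replaceAll("\\p{M}+", "");

        // Lowercase, then collapse every run of characters outside a-z and 0-9 into one hyphen.
        String hyphenated = withoutMarks.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", "-");

        // Trim hyphens left over at either end.
        return hyphenated.replaceAll("^-+|-+$", "");
    }
}
```

With this sketch, a title made up entirely of non-Latin characters produces an empty slug, so the caller needs some fallback for that case.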
Yes, I would sanitize/remove. Otherwise it will either be inconsistent or look ugly when encoded.
In Java, see the URLEncoder API docs.
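For comparison, this is roughly what the encoded form looks like. It is a small sketch using java.net.URLEncoder; the wrapper class EncodeDemo and its method are hypothetical names. Note that URLEncoder implements HTML form encoding (application/x-www-form-urlencoded), so a space comes out as "+" rather than "%20".

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical wrapper, only here to show what URLEncoder produces.
class EncodeDemo {

    static String encodeUtf8(String text) {
        try {
            // Percent-encodes every byte of the UTF-8 form that is not a safe ASCII character.
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

Here encodeUtf8("á") gives "%C3%A1" and encodeUtf8("漢字") gives "%E6%BC%A2%E5%AD%97", which is the "ugly encoded" look mentioned above.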
Be careful! If you are removing elements such as odd chars, then two distinct inputs could yield the same stripped URL when they don't mean to.
The specification for URLs (RFC 1738, Dec. '94) poses a problem, in that it limits the use of allowed characters in URLs to only a limited subset of the US-ASCII character set
This means it will get encoded. URLs should be readable. Standards tend to be English biased (what's that? Langist? Languagist?).
Not sure what the convention is in other countries, but if I saw tons of encoding in a URL sent to me, I would think it was stupid or suspicious ...
Unless the link is displayed properly, encoded by the browser and decoded at the other end ... but do you want to take that risk?
StackOverflow seems to just remove those chars from the URL altogether :)
StackOverflow can afford to remove the characters because it includes the question ID in the URL. The slug containing the question title is for convenience, and isn't actually used by the site, AFAIK. For example, you can remove the slug and the link will still work fine: the question ID is what matters and is a simple mechanism for making links unique, even if two different question titles generate the same slug. Actually, you can verify this by trying to go to stackoverflow.com/questions/2106942/… and it will just take you back to this page.
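The ID-plus-slug scheme described above could be sketched in Java like this. The class QuestionUrls, its methods, and the /questions/{id}/{slug} layout are assumptions modelled on the URL shape in this thread, not StackOverflow's actual code. The point is that lookup uses only the numeric ID, so two titles that produce the same slug still get distinct URLs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating an "ID decides, slug decorates" URL scheme.
class QuestionUrls {

    // Matches /questions/<digits>, optionally followed by a slash and any slug text.
    private static final Pattern QUESTION_PATH =
            Pattern.compile("^/questions/(\\d+)(?:/.*)?$");

    // Builds the public path; the slug is appended only for readability.
    static String buildPath(long id, String slug) {
        return slug.isEmpty() ? "/questions/" + id : "/questions/" + id + "/" + slug;
    }

    // Extracts the ID and ignores whatever slug follows it.
    static long parseId(String path) {
        Matcher m = QUESTION_PATH.matcher(path);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a question path: " + path);
        }
        return Long.parseLong(m.group(1));
    }
}
```

A handler built on this would typically load the question by parseId and, if the slug in the request is missing or stale, redirect to the path returned by buildPath.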
Thanks Mike Spross
Which language are you talking about? In PHP I think this is the easiest and would take care of everything:
http://us2.php.net/manual/en/function.urlencode.php