Best practices for sanitizing Unicode input
I'm working on a web application at the moment (using Ruby) that I would ultimately like to be usable by people from anywhere in the world. With that in mind, support for non-ASCII characters is essential. However, I don't want the database to be full of "noise" characters in fields such as username etc.
Are there any accepted best practices for dealing with Unicode input under these circumstances without alienating users? Any thoughts on dealing with homographs in usernames to make impersonation harder?
Some of my thoughts so far:
- normalizing text before storing or using it in queries
- filtering non-printable characters
- limiting the number of sequential combining diacritics allowed in input
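For illustration, the three ideas above could be sketched roughly like this in Ruby. The method name `sanitize_input` and the limit of 2 combining marks are hypothetical choices made for this sketch, not accepted values, and treating the general categories Cc (control) and Cf (format) as "non-printable" is an assumption. It relies on `String#unicode_normalize` (Ruby 2.2+) and expects UTF-8 input.

```ruby
# Hypothetical limit on consecutive combining marks; pick what suits your data.
MAX_COMBINING_RUN = 2

# Hypothetical helper sketching the three steps listed above.
def sanitize_input(text, max_combining: MAX_COMBINING_RUN)
  # 1. Drop control (Cc) and format (Cf) characters, e.g. NUL or zero-width space.
  #    Note: this also removes newlines, tabs, and ZWJ/ZWNJ.
  filtered = text.gsub(/[\p{Cc}\p{Cf}]/, "")

  # 2. Normalize to NFC so "e" + U+0301 and the precomposed U+00E9 are stored identically.
  normalized = filtered.unicode_normalize(:nfc)

  # 3. Truncate any run of combining marks (\p{M}) that exceeds the limit.
  normalized.gsub(/\p{M}{#{max_combining + 1},}/) { |run| run[0, max_combining] }
end

puts sanitize_input("a\u0000b\u200Bc")
```

One caveat with this sketch: stripping every Cf character also removes the zero-width joiner and non-joiner, which some scripts and emoji sequences legitimately use, so a real filter would probably need an allow-list rather than a blanket removal.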
Any further thoughts, or am I making unnecessary work for myself?
Thanks.
RFC 3454 (http://www.ietf.org/rfc/rfc3454.txt), which defines stringprep, will tell you what you should be doing, which is to say worrying about normalization and security issues.
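As a rough illustration of the normalization side, here is a minimal Ruby sketch that builds a comparison key for usernames by case folding and then applying NFKC, which is the same order (map, then normalize) that stringprep uses. This is not an implementation of RFC 3454: it skips the prohibited-character and bidirectional checks, and the method name `username_key` is hypothetical. It assumes Ruby 2.4+ for `downcase(:fold)`.

```ruby
# Hypothetical helper: derive a key used only for uniqueness checks,
# while the name the user typed is stored separately for display.
def username_key(name)
  # Case-fold first, then apply compatibility normalization (NFKC) so that
  # full-width or otherwise compatibility-equivalent forms collapse together.
  name.downcase(:fold).unicode_normalize(:nfkc)
end

# Full-width "Ａ" (U+FF21) and ASCII "A" end up with the same key.
puts username_key("\uFF21lice") == username_key("ALICE")
```

Comparing keys rather than raw strings blocks some look-alike registrations, but it does nothing for cross-script homographs such as Cyrillic "а" versus Latin "a"; those need a separate confusables or mixed-script check.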