Non-Latin email address validation
Now that ICANN is allowing non-Latin-character domain names, should I be concerned about e-mail validation? Currently, my sites use PHP functions to ensure that each segment of an email address contains only alphanumeric characters. Will these other character sets, such as Cyrillic, Arabic, and Chinese, pass validation? Are there recommended PHP functions to use for this?
I think the best way would be to use a proper IDN function to convert the incoming string into an ACE string (xn--xyz-blah.com). If that process works, the domain name is valid. If it doesn't, it isn't.
There is a PHP function named idn_to_ascii() that does this, but it needs additional libraries. You'd have to see whether it is available on your system.
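A minimal sketch of that conversion, assuming idn_to_ascii() comes from PHP's intl extension and that the UTS #46 variant is acceptable. The helper name domainToAce() is hypothetical, not something from the original answer:

```php
<?php
// Hypothetical helper: returns the ACE (Punycode) form of a domain name,
// or null when the conversion fails or cannot be attempted.
function domainToAce(string $domain): ?string
{
    // An empty string cannot be a domain; reject it before calling intl.
    if ($domain === '') {
        return null;
    }
    // idn_to_ascii() is only defined when the intl extension is loaded.
    if (!function_exists('idn_to_ascii')) {
        return null;
    }
    $ace = idn_to_ascii($domain, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
    return $ace === false ? null : $ace;
}
```

A null result here means either "not convertible" or "intl is missing", so a caller may want to check function_exists('idn_to_ascii') separately before treating null as a rejection. A successful conversion only shows that the name is well formed, not that the domain is actually registered.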
There also seems to be an external Linux command named idn that does IDN conversions. I don't know anything further about it, though.
If you want to use PHP built-in methods only, delfuego provides a regular expression in this question that looks very good.
I was going to recommend using filter_var() with the FILTER_VALIDATE_EMAIL filter, but after a Google search it turns out that it doesn't support multi-byte characters yet. For now, your best bet looks to be replacing the non-Latin characters with their Latin equivalents and running the usual validations against the result. Note that checkdnsrr will then fail, because the domain is no longer the one the user entered, so if you use it to verify the MX records of the email's domain you will need to disable that check temporarily.
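As a possible alternative to stripping characters, here is a sketch that converts only the domain part to its ACE form and then hands the result to filter_var(). It assumes the intl extension is loaded and PHP 7.1 or later, where the FILTER_FLAG_EMAIL_UNICODE flag is available for the local part. The function name isPlausibleIdnEmail() is hypothetical:

```php
<?php
// Hypothetical helper: checks an address whose domain may be an IDN.
function isPlausibleIdnEmail(string $email): bool
{
    // Split on the last "@"; both sides must be non-empty.
    $at = strrpos($email, '@');
    if ($at === false || $at === 0 || $at === strlen($email) - 1) {
        return false;
    }
    $local  = substr($email, 0, $at);
    $domain = substr($email, $at + 1);

    // Assumption: without intl we cannot convert, so we reject.
    if (!function_exists('idn_to_ascii')) {
        return false;
    }
    $ace = idn_to_ascii($domain, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
    if ($ace === false) {
        return false;
    }

    // Validate with the ASCII domain; the flag relaxes the local part.
    $candidate = $local . '@' . $ace;
    return filter_var($candidate, FILTER_VALIDATE_EMAIL, FILTER_FLAG_EMAIL_UNICODE) !== false;
}
```

Because the ACE form is the name that is actually looked up in DNS, a checkdnsrr() call on the converted domain should keep working, unlike a lookup on a domain whose characters were replaced.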
It is not ICANN that allows non-Latin email addresses. They come from new standards produced by the IETF standards body and its "EAI" (Email Address Internationalization) working group.
So, yes, technically, today, café@café.été is a valid email address: non-ASCII left part, non-ASCII domain, non-ASCII TLD.
But a lot of existing code, and even new code, will fail to accept those cases. Of course, it is a chicken-and-egg problem: people who want to use such addresses and see them refused by many sites will go back to ASCII, which in turn shows little demand for non-ASCII addresses and so leads to little progress.
There is an ICANN initiative about all of this called "Universal Acceptance". It concerns itself not just with IDNs but also with new gTLDs, as there are still places that hardcode the list of TLDs and so do not react to the new TLDs that were opened a few years back, or that use silly regular expressions assuming a TLD must be 2 or 3 characters long, which is wrong.
You can find it at: https://uasg.tech/
It has advice and links for various audiences, starting with developers, including lists of things to do and not to do.
They recently published a new article that shows trends over 3 years for the most-visited sites according to Alexa, and which kinds of email addresses those sites accept or reject: https://www.circleid.com/posts/20210712-acceptance-of-all-domain-names-in-open-source-software/
The report at https://uasg.tech/wp-content/uploads/documents/UASG033-en-digital.pdf goes into more detail about Java and Python libraries and their handling of IDNs.
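To illustrate the TLD-length problem mentioned above, here is a small sketch that contrasts a pattern of the criticised kind with a looser one. Both patterns are made up for illustration and are not taken from the UASG material. The looser one is only a shape check, not a full validator:

```php
<?php
// Pattern of the kind criticised above: assumes a TLD is 2 or 3 ASCII letters.
$tooStrict = '/^[^@\s]+@[^@\s]+\.[a-z]{2,3}$/i';

// Looser, hypothetical alternative: something@label.label, any script,
// with no assumption about TLD length. The /u modifier treats the
// subject as UTF-8.
$looser = '/^[^@\s]+@[^@\s.]+(\.[^@\s.]+)+$/u';

// Returns true when $address matches $pattern.
function matchesPattern(string $pattern, string $address): bool
{
    return preg_match($pattern, $address) === 1;
}
```

With these definitions, user@example.museum is rejected by $tooStrict but accepted by $looser, and café@café.été is accepted by $looser as well.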