How can I convert CGI input to UTF-8 without Perl's Encode module?

2023-01-16 14:03 问答作者：

Through this forum, I have learned that it is not a good idea to use the following for converting CGI input (from either an escape()d Ajax call or a normal HTML form post) to UTF-8:

read (STDIN, $_, $ENV{CONTENT_LENGTH});
s{%([a-fA-F0-9]{2})}{ pack ('C', hex ($1)) }eg;
utf8::decode $_;

A safer way (which for example does not allow bogus characters through) is to do the following:

use Encode qw (decode);
read (STDIN, $_, $ENV{CONTENT_LENGTH});
s{%([a-fA-F0-9]{2})}{ pack ('C', hex ($1)) }eg;
decode ('UTF-8', $_, Encode::FB_CROAK);

I would, however, very much like to avoid using any modules (including XSLoader, Exporter, and whatever else they bring with them). The function is for a high-volume mod_perl driven website and I think both performance and maintainability will be better without modules (especially since the current code does not use any).

I guess one approach would be to examine the Encode module and strip out the functions and constants used for the “decode ('UTF-8', $_, Encode::FB_CROAK)” call. I am not sufficiently familiar with Unicode and Perl modules to do this. Maybe somebody else is capable of doing this or know a similar, safe “native” way o开发者_开发技巧f doing the UTF-8 conversion?

UPDATE:

I prefer keeping things non-modular, because then the only black-box is Perl's own compiler (unless of course you dig down into the module libs).

Sometimes you see large modules being replaced with a few specific lines of code. For example, instead of the CGI.pm module (which people are also in love with), one can use the following for parsing AJAX posts:

my %Input;
if ($ENV{CONTENT_LENGTH}) {
    read (STDIN, $_, $ENV{CONTENT_LENGTH});
    foreach (split (/&/)) {
        tr/+/ /; s/%([a-fA-F0-9]{2})/pack("C", hex($1))/eg;
        if (m{^(\w+)=\s*(.*?)\s*$}s) { $Input{$1} = $2; }
        else { die ("bad input ($_)"); }
    }
}

In a similar way, it would be great if one could extract or replicate Encode's UTF-8 decode function.

Don't pre-optimize. Do it the conventional way first then profile and benchmark later to see where you need to optimize. People usually waste all their time somewhere else, so starting off blindfolded and hadcuffed doesn't give you any benefit.

Don't be afraid of modules. The point of mod_perl is to load up everything as few times as possible so the startup time and module loading time are insignificant.

Don't use escape() to create your posted data. This isn't compatible with URL-encoding, it's a mutant JavaScript oddity which should normally never be used. One of the defects is that it will encode non-ASCII characters to non-standard %uNNNN sequences based on UTF-16 code units, instead of standard URL-encoded UTF-8. Your current code won't be able to handle that.

You should typically use encodeURIComponent() instead.

If you must URL-decode posted input yourself rather than using a form library (and this does mean you won't be able to handle multipart/form-data), you will need to convert + symbols to spaces before replacing %-sequences. This replacement is standard in form submissions (though not elsewhere in URL-encoded data).

To ensure input is valid UTF-8 if you really don't want to use a library, try this regex. It also excludes some control characters (you may want to tweak it to exclude more).

继续阅读：perl unicode utf-8

How can I convert CGI input to UTF-8 without Perl's Encode module?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？