How to make Mason2 UTF-8 clean?
I'm reformulating the question, because
- @optional asked me to,
- it wasn't clear, and it linked to an HTML::Mason based solution, "Four easy steps to make Mason UTF-8 Unicode clean with Apache, mod_perl, and DBI", which caused confusion,
- the original is 4 years old, and in the meantime (in 2012) "Poet" was created.
Comment: This question has already earned the "popular question" badge, so I'm probably not the only hopeless person. :)
Unfortunately, demonstrating the full problem stack leads to a very long question, and it is very Mason specific.
First, the opinions-only part :)
I have been using HTML::Mason for ages, and am now trying to use Mason2. Poet and Mason are the most advanced frameworks on CPAN. I have found nothing comparable that, out of the box, allows writing such clean (but very hackable :)) web apps with many batteries included (logging, caching, config management, native PSGI support, etc.).
Unfortunately, the author doesn't care about the rest of the world: by default it is ASCII only, without any manual, FAQ or advice about how to use it with Unicode.
Now the facts. Demo. Create a Poet app:
poet new my #the "my" directory is the $poet_root
mkdir -p my/comps/xls
cd my/comps/xls
and add the following to dhandler.mc (it demonstrates the two basic problems):
<%class>
has 'dwl';
use Excel::Writer::XLSX;
</%class>
<%init>
my $file = $m->path_info;
$file =~ s/[^\w\.]//g;
my $cell = lc join ' ', "ÅNGSTRÖM", "in the", $file;
if( $.dwl ) {
#create xlsx in the memory
my $excel;
open my $fh, '>', \$excel or die "Failed open scalar: $!";
my $workbook = Excel::Writer::XLSX->new( $fh );
my $worksheet = $workbook->add_worksheet();
$worksheet->write(0, 0, $cell);
$workbook->close();
#poet/mason output
$m->clear_buffer;
$m->res->content_type("application/vnd.ms-excel");
$m->print($excel);
$m->abort();
}
</%init>
<table border=1>
<tr><td><% $cell %></td></tr>
</table>
<a href="?dwl=yes">download <% $file %></a>
and run the app
../bin/run.pl
go to http://0:5000/xls/hello.xlsx and you will get:
+----------------------------+
| ÅngstrÖm in the hello.xlsx |
+----------------------------+
download hello.xlsx
Clicking "download hello.xlsx" puts hello.xlsx in your downloads.
The above demonstrates the first problem: the component's source isn't "under" use utf8; so lc doesn't understand the characters.
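The effect can be reproduced outside Mason. Below is a minimal standalone sketch (plain Perl, no Mason involved; the variable names are only for illustration) that writes out the octets a literal contains when use utf8; is absent, and compares lc on octets with lc on decoded characters:

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Without "use utf8;", the literal "ÅNGSTRÖM" in a UTF-8 encoded source file
# is a string of octets. The octets are written out explicitly here.
my $octets = "\xC3\x85NGSTR\xC3\x96M";

# lc on octets only changes the ASCII letters; the bytes of Å and Ö stay as they are.
my $lc_octets = lc $octets;

# After decoding, the string holds characters and lc follows Unicode rules.
my $lc_chars = lc decode_utf8($octets);

binmode STDOUT;                        # print raw bytes
print $lc_octets, "\n";                # octets of "ÅngstrÖm"
print encode_utf8($lc_chars), "\n";    # octets of "ångström"
```

Note that the sketch deliberately does not enable the unicode_strings feature, so the octet string gets the same treatment as in the unmodified component.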
The second problem is the following: try http://0:5000/xls/hélló.xlsx, or http://0:5000/xls/h%C3%A9ll%C3%B3.xlsx, and you will see:
+--------------------------+
| ÅngstrÖm in the hll.xlsx |
+--------------------------+
download hll.xlsx
#note the wrong filename
Of course, the input (the path_info) isn't decoded; the script works with the UTF-8 encoded octets and not with Perl characters.
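The wrong filename can also be reproduced in isolation. This is a small sketch in plain Perl (no Mason; $raw is an assumed stand-in for what path_info returns after URL unescaping) that runs the same substitution on the octets and on the decoded string:

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Assumed stand-in for the undecoded path_info of /xls/h%C3%A9ll%C3%B3.xlsx
my $raw = "h\xC3\xA9ll\xC3\xB3.xlsx";

# On octets, \w matches only ASCII word characters, so the bytes of é and ó are removed.
(my $from_octets = $raw) =~ s/[^\w\.]//g;

# On decoded characters, é and ó are word characters and survive.
(my $from_chars = decode_utf8($raw)) =~ s/[^\w\.]//g;

binmode STDOUT;                        # print raw bytes
print $from_octets, "\n";              # hll.xlsx
print encode_utf8($from_chars), "\n";  # octets of "hélló.xlsx"
```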
So, telling Perl that "the source is in UTF-8", by adding use utf8; into the <%class> block, results in:
+--------------------------+
| �ngstr�m in the hll.xlsx |
+--------------------------+
download hll.xlsx
Adding use feature 'unicode_strings' (or use 5.014;) is even worse:
+----------------------------+
| �ngstr�m in the h�ll�.xlsx |
+----------------------------+
download h�ll�.xlsx
Of course, the source now contains wide characters, so it needs Encode::encode_utf8 at the output.
One could try to use a filter such as:
<%filter uencode><% Encode::encode_utf8($yield->()) %></%filter>
and filter the whole output:
% $.uencode {{
<table border=1>
<tr><td><% $cell %></td></tr>
</table>
<a href="?dwl=yes">download <% $file %></a>
% }}
but this helps only partially, because one still needs to care about the encoding in the <%init> or <%perl> blocks.
Encoding/decoding inside the Perl code in many places (read: not at the borders) leads to spaghetti code.
The encoding/decoding should clearly be done somewhere at the Poet/Mason borders; of course, Plack operates at the byte level.
Partial solution.
Happily, Poet cleverly allows modifying its (and Mason's) parts, so in $poet_root/lib/My/Mason you could modify Compilation.pm to:
override 'output_class_header' => sub {
return join("\n",
super(), qq(
use 5.014;
use utf8;
use Encode;
)
);
};
which will insert the wanted preamble into every Mason component. (Don't forget to touch every component, or simply remove the compiled objects from $poet_root/data/obj.)
You could also try to handle the requests/responses at the borders, by editing $poet_root/lib/My/Mason/Request.pm to:
#found this code somewhere on the net
use Encode;
override 'run' => sub {
my($self, $path, $args) = @_;
#decode values - but still missing the "keys" decode
foreach my $k (keys %$args) {
$args->set($k, decode_utf8($args->get($k)));
}
my $result = super();
#encode the output - BUT THIS BREAKS the inline XLS
$result->output( encode_utf8($result->output()) );
return $result;
};
Encoding everything is the wrong strategy; it breaks e.g. the XLS.
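To illustrate why blanket encoding damages binary output, here is a standalone sketch in plain Perl. The helper encode_body_if_text is hypothetical (it is not a Mason or Poet API) and only shows the idea of encoding by content type; the sample payload is made up for the demonstration:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper (NOT a Mason or Poet API): encode the body only when
# the content type is textual; binary payloads pass through untouched.
sub encode_body_if_text {
    my ($content_type, $body) = @_;
    return $body unless $content_type =~ m{^text/}i;
    return encode_utf8($body);
}

# Made-up binary payload: a zip signature (an .xlsx file is a zip container)
# followed by two arbitrary high bytes.
my $binary = "PK\x03\x04\xE5\xF6";

# Blanket encoding turns every high byte into two bytes and corrupts the payload.
my $broken = encode_utf8($binary);
printf "binary: %d bytes, after blanket encoding: %d bytes\n",
    length $binary, length $broken;

# Encoding by content type leaves the binary alone and encodes the text.
my $xls  = encode_body_if_text('application/vnd.ms-excel', $binary);
my $html = encode_body_if_text('text/html', "\x{E5}ngstr\x{F6}m");
printf "xls unchanged: %s, html: %d bytes\n",
    ($xls eq $binary ? 'yes' : 'no'), length $html;
```

Where such a check would be hooked into Poet/Mason is exactly the open question; the sketch only shows the decision itself.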
So, 4 years later (I asked the original question in 2011) I still don't know :( how to correctly use Unicode in Mason2 applications, and there is still no documentation or helper for it. :(
The main questions are:
- where (which methods should be modified by Moose's method modifiers) and how to correctly decode the inputs and encode the output in the Poet/Mason app,
- but only the textual ones, e.g. text/plain or text/html and such,
- and how to do the above "surprise free", i.e. in a way that simply works. ;)
Could someone please help with real code: what should I modify in the above?
The Mason2 manual describes how component inheritance works, so I think that putting this common code in your main Base.mp component (from which all the others inherit) might solve your issue.
Creating plugins is described in Mason::Manual::Plugins.
So you can build your own plugin that modifies Mason::Request, and by overriding request_args() you can return the UTF-8 decoded parameters.
Edit:
Regarding the UTF-8 output, you can add an Apache directive to ensure that text/plain and text/html outputs are always interpreted as UTF-8:
AddDefaultCharset utf-8
OK, I've tested this with Firefox. The HTML displays the UTF-8 correctly and leaves the zip alone, so it should work everywhere.
If you start with poet new My, then to apply the patch you need patch -p1 -i...path/to/thisfile.diff .
diff -ruN orig/my/comps/Base.mc new/my/comps/Base.mc
--- orig/my/comps/Base.mc 2015-05-20 21:48:34.515625000 -0700
+++ new/my/comps/Base.mc 2015-05-20 21:57:34.703125000 -0700
@@ -2,9 +2,10 @@
has 'title' => (default => 'My site');
</%class>
-<%augment wrap>
- <html>
+<%augment wrap><!DOCTYPE html>
+ <html lang="en-US">
<head>
+ <meta charset="utf-8">
<link rel="stylesheet" href="/static/css/style.css">
% $.Defer {{
<title><% $.title %></title>
diff -ruN orig/my/comps/xls/dhandler.mc new/my/comps/xls/dhandler.mc
--- orig/my/comps/xls/dhandler.mc 1969-12-31 16:00:00.000000000 -0800
+++ new/my/comps/xls/dhandler.mc 2015-05-20 21:53:42.796875000 -0700
@@ -0,0 +1,30 @@
+<%class>
+ has 'dwl';
+ use Excel::Writer::XLSX;
+</%class>
+<%init>
+ my $file = $m->path_info;
+ $file = decode_utf8( $file );
+ $file =~ s/[^\w\.]//g;
+ my $cell = lc join ' ', "ÅNGSTRÖM", "in the", $file ;
+ if( $.dwl ) {
+ #create xlsx in the memory
+ my $excel;
+ open my $fh, '>', \$excel or die "Failed open scalar: $!";
+ my $workbook = Excel::Writer::XLSX->new( $fh );
+ my $worksheet = $workbook->add_worksheet();
+ $worksheet->write(0, 0, $cell);
+ $workbook->close();
+
+ #poet/mason output
+ $m->clear_buffer;
+ $m->res->content_type("application/vnd.ms-excel");
+ $m->print($excel);
+ $m->abort();
+ }
+</%init>
+<table border=1>
+<tr><td><% $cell %></td></tr>
+</table>
+<p> <a href="%c3%85%4e%47%53%54%52%c3%96%4d%20%68%c3%a9%6c%6c%c3%b3">ÅNGSTRÖM hélló</a>
+<p> <a href="?dwl=yes">download <% $file %></a>
diff -ruN orig/my/lib/My/Mason/Compilation.pm new/my/lib/My/Mason/Compilation.pm
--- orig/my/lib/My/Mason/Compilation.pm 2015-05-20 21:48:34.937500000 -0700
+++ new/my/lib/My/Mason/Compilation.pm 2015-05-20 21:49:54.515625000 -0700
@@ -5,11 +5,13 @@
extends 'Mason::Compilation';
# Add customizations to Mason::Compilation here.
-#
-# e.g. Add Perl code to the top of every compiled component
-#
-# override 'output_class_header' => sub {
-# return join("\n", super(), 'use Foo;', 'use Bar qw(baz);');
-# };
-
+override 'output_class_header' => sub {
+ return join("\n",
+ super(), qq(
+ use 5.014;
+ use utf8;
+ use Encode;
+ )
+ );
+};
1;
\ No newline at end of file
diff -ruN orig/my/lib/My/Mason/Request.pm new/my/lib/My/Mason/Request.pm
--- orig/my/lib/My/Mason/Request.pm 2015-05-20 21:48:34.968750000 -0700
+++ new/my/lib/My/Mason/Request.pm 2015-05-20 21:55:03.093750000 -0700
@@ -4,20 +4,27 @@
extends 'Mason::Request';
-# Add customizations to Mason::Request here.
-#
-# e.g. Perform tasks before and after each Mason request
-#
-# override 'run' => sub {
-# my $self = shift;
-#
-# do_tasks_before_request();
-#
-# my $result = super();
-#
-# do_tasks_after_request();
-#
-# return $result;
-# };
+use Encode qw/ encode_utf8 decode_utf8 /;
-1;
\ No newline at end of file
+override 'run' => sub {
+ my($self, $path, $args) = @_;
+ foreach my $k (keys %$args) {
+ my $v = $args->get($k);
+ $v=decode_utf8($v);
+ $args->set($k, $v);
+ }
+ my $result = super();
+ my( $ctype, $charset ) = $self->res->headers->content_type_charset;
+ if( ! $ctype ){
+ $ctype = 'text/html';
+ $charset = 'UTF-8';
+ $self->res->content_type( "$ctype; charset=$charset");
+ $result->output( encode_utf8(''.( $result->output())) );
+ } elsif( ! $charset and $ctype =~ m{text/(?:plain|html)} ){
+ $charset = 'UTF-8';
+ $self->res->content_type( "$ctype; charset=$charset");
+ $result->output( encode_utf8(''.( $result->output())) );
+ }
+ return $result;
+};
+1;
On the mason-users mailing list there was a question about handling UTF-8 for
- component output in UTF-8
- UTF-8 GET/POST arguments
Here is Jon's answer:
I'd like Mason to handle encoding intelligently, but since I don't regularly work with utf8, you and others will have to help me with the design.
This should probably be in a plugin, e.g. Mason::Plugin::UTF8.
So for the things you particularly mention, something like this might work:
package Mason::Plugin::UTF8;
use Moose;
with 'Mason::Plugin';
1;
package Mason::Plugin::UTF8::Request;
use Mason::PluginRole;
use Encode;
# Encode all output in utf8 - ** only works with Mason 2.13 and beyond **
#
after 'process_output' => sub {
my ($self, $outref) = @_;
$$outref = encode_utf8( $$outref );
};
# Decode all parameters as utf8
#
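around 'run' => sub {
my $orig = shift;
my $self = shift;
# Argument handling is kept as in the original sketch
my %params = @_;
# each() hands out copies, so write the decoded value back into the hash
foreach my $key (keys %params) {
$params{$key} = decode_utf8($params{$key});
}
$self->$orig(%params);
};
1;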
It would probably be best if you or someone else knowledgeable about utf8 issues created this plugin rather than myself. But let me know if there are things needed in the Mason core to make this easier.
IMHO, the following is needed too, to add "use utf8;" to every component.
package Mason::Plugin::UTF8::Compilation;
use Mason::PluginRole;
override 'output_class_header' => sub {
return(super() . 'use utf8;');
};
1;