Perl UTF8 CGI and DBI ... what's the correct workflow?

I am having the pleasure of rebuilding a Perl-based web framework to support UTF-8. I took the following steps:

for the main script:

use open IO => ":utf8",":std";

use utf8;

for the DBI Adapter:

$self->{dbh}->{'mysql_enable_utf8'} = 1;

and in my request parser for POST and GET, based on CGI:

foreach (@val) { $_ = decode("UTF-8",$_); }

This, as far as I can tell, works just fine on my local Ubuntu with Perl 5.10.1, but on the webserver which runs 5.10, decoding POST or GET will mess up the text.
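For reference, here is a minimal sketch of what that decode loop does to a value, using only the core Encode module. The sample values are made up; the point is that the input is a string of bytes and the result is a string of characters, so decoding something that has already been decoded elsewhere would be one way to end up with garbage.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Made-up sample values: raw bytes as they would arrive in a POST or GET
# parameter. "\xC3\xA9" is the UTF-8 encoding of "é".
my @val = ("caf\xC3\xA9", "plain");

# Lengths before decoding are counted in bytes.
my @byte_lengths = map { length } @val;

# Same loop as in the question: turn UTF-8 bytes into Perl character strings.
foreach (@val) { $_ = decode("UTF-8", $_); }

# Lengths after decoding are counted in characters.
my @char_lengths = map { length } @val;
```

Here "café" goes from 5 bytes to 4 characters, while the pure ASCII value is unchanged.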

I must admit, I am very confused by the whole UTF-8 thing. I need to:

Read templates

Get data from MySQL

Process POST and GET, insert into MySQL

Write templates

Is there anything I'm forgetting here? What could cause the inconsistent behaviour? Does every module I use in the main script need its own use utf8, or is it enough if the main script has it?

Thanks for any hints,

thomas


use utf8; is, as several people have said, a no-op as far as your I/O problems are concerned: all it says is 'treat my source code as UTF-8 encoded'.
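A small sketch of that point, assuming the script file itself is saved as UTF-8: the pragma only changes how string literals in the source are read, and it is lexically scoped, so each file (or block) that contains non-ASCII literals needs its own.

```perl
use strict;
use warnings;

my ($with_pragma, $without_pragma);

{
    use utf8;    # literals in this block are read as characters
    $with_pragma = length("é");
}
{
    no utf8;     # literals in this block are read as raw bytes
    $without_pragma = length("é");
}
```

With the file saved as UTF-8, $with_pragma is 1 and $without_pragma is 2. Nothing here touches filehandles, CGI parameters or the database.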

MySQL/DBI approach is bang on the money.

For CGI, update to a recent CGI and set $CGI::PARAM_UTF8=1 and it'll do the decode() for you. (As a general tip, BTW, decode_utf8() is considerably faster!)

As for the other problem, you may want to compare your Apache server configs to see if AddDefaultCharset is set to some non-helpful value.

Also, see my talk at last year's London Perl Workshop for a more detailed look at Perl and Unicode.


The solution here is the ordering.

$dbh->{mysql_enable_utf8} = 1;
$dbh->connect ...
$dbh->do('SET NAMES \'utf8\';') || die;

Enjoy :)


Thomas,

At the risk of extra negative points: I don't know if this is still needed, but in the past I had to make sure my DBI behaved properly with utf8 by doing:

my $dbh = DBI->connect(...);
$dbh->{mysql_enable_utf8} = 1;
$dbh->do("set names 'utf8';");
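A possible alternative, as far as I understand the DBD::mysql documentation, is to pass mysql_enable_utf8 in the attribute hash of connect, in which case the driver should take care of the SET NAMES part itself. This is only a sketch: the helper name, database name and credentials below are made up, and the connect call is left commented out so the snippet runs without a database.

```perl
use strict;
use warnings;

# Hypothetical helper: builds the argument list for DBI->connect with the
# UTF-8 flag set as a connect attribute.
sub utf8_connect_args {
    my ($database, $user, $password) = @_;
    return (
        "DBI:mysql:database=$database",
        $user,
        $password,
        { mysql_enable_utf8 => 1, RaiseError => 1 },
    );
}

# Placeholder values, not real credentials.
my @args = utf8_connect_args('myapp', 'appuser', 'secret');

# With DBI and DBD::mysql installed, the handle would be opened like this:
# my $dbh = DBI->connect(@args);
```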

Maybe it can be of help


First of all, my condolences on your Latin-1 to UTF-8 job. I did that for a large application a few years back, and the wrinkles it gave me still haven't worn off.

What I recommend is to turn everything into UTF-8 and not try to decode and re-encode along the way. That will definitely go wrong somewhere. Storing UTF-8 data in a Latin-1 table is a recipe for disaster. I remember at one point having double and triple encoded UTF-8 strings in my database and no way to tell how to get back the original string.
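To illustrate the double-encoding problem, here is a rough sketch with the core Encode module. It simulates the mistake of treating UTF-8 bytes as Latin-1 text and encoding them again, then undoes exactly one extra layer. The sample string is made up.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $original = "caf\x{E9}";                # "café", 4 characters

# Correct: encode once before storing or sending.
my $once = encode("UTF-8", $original);     # 5 bytes

# The mistake: the UTF-8 bytes are read as Latin-1 characters and then
# encoded to UTF-8 a second time.
my $twice = encode("UTF-8", decode("ISO-8859-1", $once));    # 7 bytes

# Undoing one extra layer only works if you know it happened exactly once.
my $repaired = decode("UTF-8",
    encode("ISO-8859-1", decode("UTF-8", $twice)));
```

The repair works here only because the number of extra layers is known. In a table that mixes singly, doubly and triply encoded rows, that information is gone.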

The steps you should take:

  1. Create a secondary database structure with UTF-8 collated tables instead of Latin-1 ones.
  2. Extract everything from your primary database and insert it into the new database (hoping you haven't stored any UTF-8 strings in there yet).
  3. Make sure the MIME headers your application sends to the browser specify UTF-8 as the encoding. All data you get back from these pages automatically takes the encoding of the page itself.
  4. Cross your fingers and take a vacation...
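For step 3, a minimal sketch of what the header could look like from a plain CGI script. The helper name is made up, and the output lines are commented out so the snippet has no side effects.

```perl
use strict;
use warnings;

# Hypothetical helper: header block for an HTML response declared as UTF-8.
sub html_header {
    return "Content-Type: text/html; charset=utf-8\r\n\r\n";
}

my $header = html_header();

# In a CGI script the header is printed before any body output, with STDOUT
# set to encode character strings as UTF-8:
# binmode(STDOUT, ":encoding(UTF-8)");
# print $header, $body;
```

If the server adds its own charset to the header, it can override this, so it is worth checking what the browser actually receives.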

You shouldn't have to change much in your application since the DBI utf8 handling is fairly good at this time.

Good luck!

Rob


Have a look at this. It is fairly general, but it will get your terminology straight, and though many of the examples are in Python, Perl is covered too. BTW, if you try to stuff Latin-1 (or otherwise) encoded data in without decoding and re-encoding it, disaster will ensue.

For more help, post specifics.

Cheers


You'll find a complete (and tested) guide here.
It misses nothing out; Perl, DBI and MySQL. All utf8'd.
I had similar pain but got it all done in the end.
