Browser codepage detection

I have an ASP.NET page where a user can enter some text in a TEXTAREA and submit it to the server. This text will be stored in a database and later presented in a WinForms application.

How can I make sure that the WinForms application presents the exact characters that the user entered in the TEXTAREA?

That is, do I have a potential problem if, for example, the user enters language-specific letters such as Æ, Ø and Å, which are Danish letters?

Those letters have different codes depending on the codepage, so as far as I can see, I need to know what codepage the TEXTAREA control is showing its input in. Or am I missing something here?

I have tried to find material on this on the net, but it is difficult to find something that addresses this issue. I typically found pages talking about what codepage the server requires the browser to use, in order to display the sent data correctly.

But my question goes the other way, i.e. from client to server.


You could also use HEBCI (HTML Entity-Based Codepage Inference) if you REALLY want to be sure that users sending text with crappy browsers don't corrupt your data.

In essence this is how it works:

Every codepage has its own fingerprint. For instance, the single entity "&ordm;" (º) could be used to distinguish between the Big Three: ISO-8859-1/Windows-1252 (=BA), MacRoman (=BC), and UTF-8 (=C2BA).
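To see that fingerprint concretely, here is a minimal Python sketch (not part of HEBCI itself) that encodes º under each of those encodings and shows the resulting bytes as hex. The codec names are Python's own, and `fingerprint` is a hypothetical helper name chosen for illustration.

```python
def fingerprint(char, codecs):
    """Return {codec name: hex string of the bytes `char` encodes to}."""
    return {codec: char.encode(codec).hex() for codec in codecs}


# U+00BA MASCULINE ORDINAL INDICATOR, the character behind &ordm;
result = fingerprint("\u00ba", ["latin-1", "cp1252", "mac_roman", "utf-8"])
print(result)
# {'latin-1': 'ba', 'cp1252': 'ba', 'mac_roman': 'bc', 'utf-8': 'c2ba'}
```

Note that ISO-8859-1 and Windows-1252 give the same byte here, which is why a single entity cannot tell those two apart.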

In a form you simply add a hidden input containing those fingerprints as entities (like &deg;, &divide;, and &mdash;). When the user submits the form, you check the returned hex values and compare them against your fingerprint table. Only if this does not give a match do you continue with other fall-back solutions.

A slightly larger implementation works great with only five codepoints:

# Entity names placed in the hidden input, in this order
my @fp_ents = qw/deg divide mdash bdquo euro/;

# Hex values returned for those entities, per codepage, in the same order
my %fingerprints = (
  "UTF-8"        => ['c2b0','c3b7','e28094','e2809e','e282ac'],
  "WINDOWS-1252" => ['b0','f7','97','84','80'],
  "MAC"          => ['a1','d6','d1','e3','db'],
  "MS-HEBR"      => ['b0','ba','97','84','80'],
  "MAC-CYRILLIC" => ['a1','d6','d1','d7',''],
  "MS-GREEK"     => ['b0','','97','84','80'],
  "MAC-IS"       => ['a1','d6','d0','e3',''],
  "MS-CYRL"      => ['b0','','97','84','88'],
  "MS932"        => ['818b','8180','815c','',''],
  "WINDOWS-31J"  => ['818b','8180','815c','',''],
  "WINDOWS-936"  => ['a1e3','a1c2','a1aa','',''],
  "MS_KANJI"     => ['818b','8180','','',''],
  "ISO-8859-15"  => ['b0','f7','','','a4'],
  "ISO-8859-1"   => ['b0','f7','','',''],
  "CSIBM864"     => ['80','dd','','',''],
);
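The lookup against such a table might be sketched in Python as follows. This is an illustration under assumptions, not the HEBCI reference code: it copies only a few rows from the table above, it assumes the server already has the raw (undecoded) bytes received for each entity as a separate value, and it assumes an empty table entry corresponds to an empty byte string. `infer_codepages` is a hypothetical name.

```python
# A few rows copied from the table above, in the same entity order:
# deg, divide, mdash, bdquo, euro.
FINGERPRINTS = {
    "UTF-8":        ["c2b0", "c3b7", "e28094", "e2809e", "e282ac"],
    "WINDOWS-1252": ["b0", "f7", "97", "84", "80"],
    "MAC":          ["a1", "d6", "d1", "e3", "db"],
    "MS932":        ["818b", "8180", "815c", "", ""],
    "WINDOWS-31J":  ["818b", "8180", "815c", "", ""],
    "ISO-8859-1":   ["b0", "f7", "", "", ""],
}


def infer_codepages(raw_values):
    """Return every codepage name whose fingerprint equals the observed bytes.

    raw_values: list of bytes objects, one per entity, in table order.
    An empty list means no match, so the caller should fall back.
    """
    observed = [value.hex() for value in raw_values]
    return [name for name, fp in FINGERPRINTS.items() if fp == observed]


# Simulate a browser that submitted the form as Windows-1252.
submitted = [ch.encode("cp1252") for ch in "\u00b0\u00f7\u2014\u201e\u20ac"]
print(infer_codepages(submitted))
```

Returning a list rather than a single name is deliberate: some rows in the table are identical (MS932 and WINDOWS-31J, for example), so a fingerprint can match more than one label.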


You can look at the Content-Type header of the request to find out the encoding.

For more details see this SO answer to a related question.
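As a rough illustration of reading the charset out of a Content-Type value, here is a small Python sketch using the standard library's `email.message.Message` parser. The function name `charset_from_content_type` and the sample header values are assumptions made for the example; in ASP.NET you would read the header from the request object instead.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the lowercased charset parameter, or None if there is none."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


print(charset_from_content_type(
    "application/x-www-form-urlencoded; charset=UTF-8"))
print(charset_from_content_type("application/x-www-form-urlencoded"))
```

The header may not carry a charset parameter at all, in which case the function returns None and some other method of detection is still needed.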