Problem with displaying Russian letters in browser even though UTF-8 encoding is set

I am aware that similar questions have been asked. However, after reading the answers and googling the topic, I am still struggling to display Russian letters in the browser. They are stored in a .csv file (encoded in UTF-8 without BOM). In the PHP file that reads the .csv (also encoded in UTF-8 without BOM), I declared the charset:

 <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

To open and iterate through the .csv file I am using the following code:

  if(($handle = fopen($path, "r")) !== FALSE) {
    while (($data = fgetcsv($handle, 1000, $delimiter)) !== FALSE) {
      ...
    }
  }

and either nothing is displayed or something like this:

 -ам-Зее

instead of

 Целль-ам-Зее

Any ideas what else I can try?

UPDATE:

After setting the browser encoding to UTF-8 I get correct Russian letters. However, some of the text is still not displayed at all. I suspect that I am doing something incorrectly while reading the .csv file; the simplified version is:

     if (($handle = fopen($path, "r")) !== FALSE) {
       while (($data = fgetcsv($handle, 1000, $delimiter)) !== FALSE) {
         echo $data[1];
       }
     }

(I skip the first column and display the content of the second one, which is always filled.)


Check Your Server Config

Do you have Apache configured to honor the <meta> charset override? Many Apache installations set AddDefaultCharset so that ISO-8859-1 is sent as the charset in the HTTP Content-Type header, and browsers give that header precedence over any <meta> override in the pages served.

Solution #1 of 3

For example, you can put this in your .htaccess file for an enclosing directory, and now your web pages will have their <meta> overrides honored:

AddDefaultCharset Off
AddCharset UTF-8 .html

The Apache documentation states:

This directive specifies a default value for the media type charset parameter (the name of a character encoding) to be added to a response if and only if the response's content-type is either text/plain or text/html. This should override any charset specified in the body of the response via a META element, though the exact behavior is often dependent on the user's client configuration. A setting of AddDefaultCharset Off disables this functionality. AddDefaultCharset On enables a default charset of iso-8859-1. Any other value is assumed to be the charset to be used, which should be one of the IANA registered charset values for use in MIME media types. For example:

   AddDefaultCharset utf-8     

AddDefaultCharset should only be used when all of the text resources to which it applies are known to be in that character encoding and it is too inconvenient to label their charset individually. One such example is to add the charset parameter to resources containing generated content, such as legacy CGI scripts, that might be vulnerable to cross‐site scripting attacks due to user‐provided data being included in the output. Note, however, that a better solution is to just fix (or delete) those scripts, since setting a default charset does not protect users that have enabled the “auto‐detect character encoding” feature on their browser.

Until I turned off AddDefaultCharset, I could not get my <meta> tags to work. It was quite mysterious and frustrating. Once I did, though, everything worked smoothly.
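
One way to take the mystery out of this is to look at the Content-Type header the server actually sends, since a charset given there wins over a <meta> declaration. The following is only a sketch: charset_from_content_type is a hypothetical helper name, and the URL in the commented-out usage is a placeholder, not something taken from the question.

```php
<?php
// Hypothetical helper: extract the charset parameter from a Content-Type
// header value. Returns null when no charset parameter is present.
function charset_from_content_type(string $contentType): ?string
{
    if (preg_match('/;\s*charset\s*=\s*"?([^\s;"]+)/i', $contentType, $m)) {
        return strtolower($m[1]);
    }
    return null;
}

var_dump(charset_from_content_type('text/html; charset=ISO-8859-1'));
var_dump(charset_from_content_type('text/html'));

// Against a live server (placeholder URL, adjust to your own page):
// $headers = get_headers('http://localhost/your-page.php', true);
// var_dump($headers['Content-Type'] ?? null);
```

If the value reported for your page is iso-8859-1 while the file itself is UTF-8, the server default is what garbles the output.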

Solution #2 of 3

If you have write access to Apache’s configuration files, then you can change the server itself. However, you have to make sure nothing else relies on the old unoverridable setting. This is another reason to use .htaccess.


When All Else Fails: Solution #3 of 3

If you can neither change the overall server configuration itself nor create a .htaccess whose own settings will be respected for anything underneath it, then your only option is to use numeric entities for all code points over 127. For example, instead of

Целль-ам-Зее

you must instead use

&#1062;&#1077;&#1083;&#1083;&#1100;-&#1072;&#1084;-&#1047;&#1077;&#1077;

or

&#x426;&#x435;&#x43B;&#x43B;&#x44C;-&#x430;&#x43C;-&#x417;&#x435;&#x435;

The advantage of that is that it no longer requires a <meta> override and fiddling with the server or with .htaccess files. The disadvantage is that it takes an extra translation pass, which interferes with being able to directly edit the file with an editor that understands literal UTF‑8.
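
The extra translation pass can be done in PHP itself. The sketch below is one possible way to do it, not the only one: to_numeric_entities is a hypothetical helper name, and it assumes the input is valid UTF-8. It decodes each non-ASCII character by hand so that it does not depend on any optional extension.

```php
<?php
// Hypothetical helper: replace every code point above 127 with a decimal
// numeric character reference. Assumes $utf8 holds valid UTF-8.
function to_numeric_entities(string $utf8): string
{
    $out = preg_replace_callback(
        '/[^\x00-\x7F]/u',
        function (array $m): string {
            // Bytes of one UTF-8 encoded character (2 to 4 bytes).
            $bytes = array_values(unpack('C*', $m[0]));
            $n = count($bytes);
            // Lead byte: drop the length-prefix bits, keep the payload.
            $cp = $bytes[0] & (0xFF >> ($n + 1));
            // Continuation bytes each contribute their low 6 bits.
            for ($i = 1; $i < $n; $i++) {
                $cp = ($cp << 6) | ($bytes[$i] & 0x3F);
            }
            return '&#' . $cp . ';';
        },
        $utf8
    );
    if ($out === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $out;
}

echo to_numeric_entities('Целль-ам-Зее'), "\n";
// prints &#1062;&#1077;&#1083;&#1083;&#1100;-&#1072;&#1084;-&#1047;&#1077;&#1077;
```

If the mbstring extension is available, mb_encode_numericentity covers the same ground with less code.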

Entities Ignore Encodings

This works because all HTML is always in Unicode, so character number 1062 is always CYRILLIC CAPITAL LETTER TSE, etc. Entity numbers always represent Unicode code point numbers; they are never the numbers from the document encoding. Only encoded bytes count as being in the server or page encoding, not unencoded code point numbers, which are always Unicode.

That’s why we can use something like &#233; and it always means LATIN SMALL LETTER E WITH ACUTE, because code point 233 is always that character, even if the web page itself should be in some other encoding (like 142 in MacRoman or 221 in NextStep).

The numbers of characters are always Unicode numbers, and pay no attention to the encoding. That’s because markup languages like HTML, XHTML, and XML always use logical Unicode code point numbers, just as programming languages like Perl and Go do. (PHP is really just bytes with some UTF‑8 APIs on top of it, but as you have yourself learned, one still has issues with it. This is both because of its internal model and because of web servers and even web clients, all of which makes everything more complicated in PHP than in most other languages.)

Even if you had encoded your web page in ISO-8859-5 for Cyrillic, where a literal 0xC6 byte encodes Unicode U+0426, CYRILLIC CAPITAL LETTER TSE, as a character entity you would use &#1062; or &#x426;, and not &#xC6;, which would be wrong since U+00C6 is LATIN CAPITAL LETTER AE.

Similarly, if you were using the MacCyrillic encoding, the literal 0x96 byte would be a CYRILLIC CAPITAL LETTER TSE, but because the numeric entity is always in Unicode, you must use &#1062; or &#x426; — and not &#x96;.
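
A quick way to convince yourself of this is to let PHP decode the numeric entities and compare the results with the Unicode characters named above. This is a small illustrative check, with variable names chosen here for the example:

```php
<?php
// Decode numeric character references to UTF-8 and compare them with the
// characters they are supposed to name.
$flags = ENT_QUOTES | ENT_HTML5;

$tse_decimal = html_entity_decode('&#1062;', $flags, 'UTF-8');
$tse_hex     = html_entity_decode('&#x426;', $flags, 'UTF-8');
$ae          = html_entity_decode('&#xC6;', $flags, 'UTF-8');

var_dump($tse_decimal === "\u{0426}"); // CYRILLIC CAPITAL LETTER TSE
var_dump($tse_hex === $tse_decimal);   // decimal and hex forms agree
var_dump($ae === "\u{00C6}");          // LATIN CAPITAL LETTER AE, not TSE
```

All three comparisons hold regardless of which legacy encoding a page might otherwise use, because the number inside the entity is a code point and never a byte value.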

I prefer using only UTF‑8 for all web pages. Well, for new ones, that is. I do recognize that legacy non‐Unicode pages exist. Those I just leave as is.


You need to set the correct locale on your server.

if(!setlocale(LC_ALL, 'ru_RU.utf8')) 
    setlocale(LC_ALL, 'en_US.utf8');

You can then check whether the server accepted the requested locale:

if(setlocale(LC_ALL, 0) == 'C')
    echo 'Error setting locale';

The problem is that fgetcsv takes the current locale into account, and here the locale is wrong. If you cannot change the locale, you could replace fgetcsv with your own function based on explode.
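
A minimal sketch of such an explode-based replacement follows. The name read_csv_line_simple and the default comma delimiter are assumptions made for illustration, and the function deliberately ignores quoted fields and delimiters embedded in values, so it only suits simple files.

```php
<?php
// Hypothetical stand-in for fgetcsv(): reads one line and splits it on the
// delimiter bytes only, so the locale is never consulted.
// Limitation: no support for quoted fields or embedded delimiters.
function read_csv_line_simple($handle, string $delimiter = ',')
{
    $line = fgets($handle);
    if ($line === false) {
        return false; // end of file or read error
    }
    return explode($delimiter, rtrim($line, "\r\n"));
}

// Usage with an in-memory stream standing in for the real .csv file.
$handle = fopen('php://memory', 'w+');
fwrite($handle, "1,Целль-ам-Зее\n2,Вена\n");
rewind($handle);

while (($data = read_csv_line_simple($handle)) !== false) {
    echo $data[1], "\n"; // second column, as in the question
}
fclose($handle);
```

Because UTF-8 never reuses ASCII byte values inside multi-byte characters, splitting on an ASCII delimiter cannot cut a Cyrillic letter in half.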
