Problem with displaying Russian letters in browser even though UTF-8 encoding is set

2023-03-30 06:53

I am aware that similar questions have been asked. However, after reading the answers and googling the topic, I am still struggling to display Russian letters in the browser. They are stored in a .csv file (encoded as UTF-8 without BOM). In the PHP file that reads the .csv (also encoded as UTF-8 without BOM), I declared the charset:

 <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

To open and iterate through the .csv file I am using the following code:

  if(($handle = fopen($path, "r")) !== FALSE) {
    while (($data = fgetcsv($handle, 1000, $delimiter)) !== FALSE) {
      ...
    }
  }

and either nothing is displayed or something like this:

 -Ð°Ð¼-Ð—ÐµÐµ

instead of

 Целль-ам-Зее

Any ideas what else I can try?

UPDATE:

After setting the browser encoding to UTF-8 I get correct Russian letters. However, some of the text is still not displayed at all. I suspect that I am doing something incorrectly while reading the .csv file. The simplified version is:

     if(($handle = fopen($path, "r")) !== FALSE) {
       while (($data = fgetcsv($handle, 1000, $delimiter)) !== FALSE) {
         echo $data[1];
       }
     }

(I omit the first column and display the content of the second one, which is always filled.)

Check Your Server Config

Do you have Apache configured to honor the <meta> charset override? When AddDefaultCharset is On, Apache adds charset=iso-8859-1 to the Content-Type header of the pages it serves, and that header takes precedence over any <meta> charset declared inside the page.

Solution #1 of 3

For example, you can put this in your .htaccess file for an enclosing directory, and now your web pages will have their <meta> overrides honored:

AddDefaultCharset Off
AddCharset UTF-8 .html

The Apache documentation states:

This directive specifies a default value for the media type charset parameter (the name of a character encoding) to be added to a response if and only if the response's content-type is either text/plain or text/html. This should override any charset specified in the body of the response via a META element, though the exact behavior is often dependent on the user's client configuration. A setting of AddDefaultCharset Off disables this functionality. AddDefaultCharset On enables a default charset of iso-8859-1. Any other value is assumed to be the charset to be used, which should be one of the IANA registered charset values for use in MIME media types. For example:
   AddDefaultCharset utf-8     
AddDefaultCharset should only be used when all of the text resources to which it applies are known to be in that character encoding and it is too inconvenient to label their charset individually. One such example is to add the charset parameter to resources containing generated content, such as legacy CGI scripts, that might be vulnerable to cross‐site scripting attacks due to user‐provided data being included in the output. Note, however, that a better solution is to just fix (or delete) those scripts, since setting a default charset does not protect users that have enabled the “auto‐detect character encoding” feature on their browser.

Until I turned off AddDefaultCharset, I could not get my <meta> tags to work. It was quite mysterious and frustrating. Once I did, though, everything worked smoothly.
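
Since the page in the question is generated by PHP, another option is to send the charset in the HTTP header from the script itself instead of relying on the <meta> element. This is a minimal sketch, not part of the original answer: the helper name utf8_content_type is hypothetical, and whether the server leaves the header untouched depends on its configuration, so it is worth confirming the actual response header in the browser's developer tools.

```php
<?php
// Hypothetical helper: builds the Content-Type header line for a UTF-8 page.
function utf8_content_type(string $mime = 'text/html'): string
{
    return 'Content-Type: ' . $mime . '; charset=utf-8';
}

// header() only has an effect before any output has been sent.
if (!headers_sent()) {
    header(utf8_content_type());
}
```

The call has to come before the first echo or any HTML outside the PHP tags, otherwise the headers have already gone out.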

Solution #2 of 3

If you have write access to Apache’s configuration files, then you can change the server itself. However, you have to make sure nothing else relies on the old unoverridable setting. This is another reason to use .htaccess.

When All Else Fails: Solution #3 of 3

If you can neither change the overall server configuration itself nor create a .htaccess whose own settings will be respected for anything underneath it, then your only option is to use numeric entities for all code points over 127. For example, instead of

Целль-ам-Зее

you must instead use either the decimal form

&#1062;&#1077;&#1083;&#1083;&#1100;-&#1072;&#1084;-&#1047;&#1077;&#1077;

or the hexadecimal form

&#x426;&#x435;&#x43B;&#x43B;&#x44C;-&#x430;&#x43C;-&#x417;&#x435;&#x435;

The advantage is that it no longer requires a <meta> override or any fiddling with the server or with .htaccess files. The disadvantage is that it takes an extra translation pass, which gets in the way of editing the file directly in an editor that understands literal UTF‑8.
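
The extra translation pass can be done in PHP before the text is echoed. The following is a rough sketch under stated assumptions: the function name to_numeric_entities is made up for illustration, and the input is assumed to be valid UTF-8 (invalid input is returned unchanged). It decodes each non-ASCII character by hand so that it needs no extensions.

```php
<?php
// Hypothetical helper: replaces every code point above 127 with a decimal
// numeric entity. Assumes $utf8 is valid UTF-8.
function to_numeric_entities(string $utf8): string
{
    $result = preg_replace_callback(
        '/[^\x00-\x7F]/u',
        static function (array $m): string {
            $bytes = array_values(unpack('C*', $m[0]));
            $count = count($bytes);
            // Strip the length-marker bits from the leading byte.
            $cp = $bytes[0] & (0xFF >> ($count + 1));
            // Each continuation byte contributes its low six bits.
            for ($i = 1; $i < $count; $i++) {
                $cp = ($cp << 6) | ($bytes[$i] & 0x3F);
            }
            return '&#' . $cp . ';';
        },
        $utf8
    );

    // preg_replace_callback() returns null on invalid UTF-8 input.
    return $result ?? $utf8;
}

echo to_numeric_entities('Целль-ам-Зее'), "\n";
```

If the mbstring extension is available, mb_encode_numericentity() should be able to do the same job with less code.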

Entities Ignore Encodings

It works because all HTML is always in Unicode, so character number 1062 is always CYRILLIC CAPITAL LETTER TSE, etc. Entity numbers always represent Unicode code point numbers; they are never the numbers from the document encoding. Only encoded bytes count as being in the server or page encoding, not unencoded code point numbers, which are always Unicode.

That’s why we can use something like &#233; and it always means LATIN SMALL LETTER E WITH ACUTE, because code point 233 is always that character, even if the web page itself is in some other encoding where that letter has a different byte value (like 142 in MacRoman or 221 in NextStep).

Character numbers are always Unicode numbers, and pay no attention to the encoding. That’s because markup languages like HTML, XHTML, and XML always use logical Unicode code point numbers, just as programming languages like Perl and Go do. (PHP is really just bytes with some UTF‑8 APIs on top of it, but as you have yourself learned, one still has issues with it. This is partly because of its internal model and partly because of web servers and even web clients, all of which makes everything more complicated in PHP than in most other languages.)

Even if you had encoded your web page in ISO-8859-5 for Cyrillic, where a literal 0xC6 byte encodes Unicode U+0426, CYRILLIC CAPITAL LETTER TSE, as a character entity you would use &#1062; or &#x426;, and not &#xC6;, which would be wrong since U+00C6 is LATIN CAPITAL LETTER AE.

Similarly, if you were using the MacCyrillic encoding, the literal 0x96 byte would be a CYRILLIC CAPITAL LETTER TSE, but because the numeric entity is always in Unicode, you must use &#1062; or &#x426;, and not &#x96;.
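
The point about entity numbers can be checked from PHP with html_entity_decode(). This is only an illustrative sketch: the variable names are arbitrary, and the output is requested as UTF-8 so the results can be compared against Unicode escapes.

```php
<?php
// Numeric character references name Unicode code points, so decoding them
// does not depend on the encoding a page happens to be written in.
$flags = ENT_QUOTES | ENT_HTML5;

$decimal = html_entity_decode('&#1062;', $flags, 'UTF-8'); // decimal form
$hex     = html_entity_decode('&#x426;', $flags, 'UTF-8'); // hexadecimal form
$wrong   = html_entity_decode('&#xC6;', $flags, 'UTF-8');  // 0xC6 byte value used as an entity

// The first two give U+0426 (TSE); the third gives U+00C6 (AE) instead.
var_dump($decimal === "\u{0426}", $hex === "\u{0426}", $wrong === "\u{00C6}");
```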

I prefer using only UTF‑8 for all web pages. Well, for new ones, that is. I do recognize that legacy non‐Unicode pages exist. Those I just leave as is.

You need to set the correct locale on your server.

if(!setlocale(LC_ALL, 'ru_RU.utf8')) 
    setlocale(LC_ALL, 'en_US.utf8');

You can then check whether your server has accepted the requested locale:

if(setlocale(LC_ALL, 0) == 'C')
    echo 'Error setting locale';

The problem is in the fgetcsv function, which takes the locale settings into account and is using an incorrect locale. If you cannot change the locale, you could replace fgetcsv with your own function based on explode.
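
A replacement along those lines might look like the sketch below. It is deliberately naive and rests on assumptions not stated in the question: the function name read_simple_csv and the default ';' delimiter are made up, and fields are assumed to contain no quoted delimiters or embedded newlines, which explode cannot handle.

```php
<?php
// Hypothetical helper: a naive CSV reader built on fgets() and explode().
// It works on raw bytes, so UTF-8 text passes through untouched.
// Assumes no quoted fields and no newlines inside fields.
function read_simple_csv(string $path, string $delimiter = ';'): array
{
    $rows = [];
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return $rows;
    }

    while (($line = fgets($handle)) !== false) {
        $line = rtrim($line, "\r\n"); // drop the line terminator
        if ($line === '') {
            continue;                 // skip blank lines
        }
        $rows[] = explode($delimiter, $line);
    }

    fclose($handle);
    return $rows;
}
```

With this in place, the loop from the question becomes a foreach over read_simple_csv($path, $delimiter) that echoes $row[1].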
