utf8_encode function purpose

2023-03-20 16:23 Q&A author:

Suppose I'm encoding my files as UTF-8.

Within a PHP script, a string is compared:

$string = "ぁ";
$string = utf8_encode($string); // Do I need this step?
if (preg_match('/ぁ/u', $string)) {
    // Do something if it matches...
}

Is that string really UTF-8 without the utf8_encode() function? If you encode your files as UTF-8, don't you need this function?

If you read the manual entry for utf8_encode, it converts an ISO-8859-1 encoded string to UTF-8. The function name is a horrible misnomer, as it suggests some sort of automagic encoding that is necessary. That is not the case. If your source code is saved as UTF-8 and you assign "あ" to $string, then $string holds the character "あ" encoded in UTF-8. No further action is necessary. In fact, trying to convert the UTF-8 string (incorrectly) from ISO-8859-1 to UTF-8 will garble it.
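The garbling described above can be reproduced with a minimal sketch. It writes "あ" as explicit byte escapes so the result does not depend on how the demo file is saved, and it assumes a PHP version where utf8_encode() is still available (it is deprecated as of PHP 8.2, so the notice is silenced here). The variable names are only illustrative.

```php
<?php
// utf8_encode() is deprecated since PHP 8.2; hide the notice for this demo.
error_reporting(E_ALL & ~E_DEPRECATED);

// "あ" (U+3042) as explicit UTF-8 bytes.
$utf8 = "\xE3\x81\x82";

// utf8_encode() treats each input byte as one ISO-8859-1 character,
// so every byte of the UTF-8 sequence is encoded a second time.
$garbled = utf8_encode($utf8);

echo bin2hex($utf8), "\n";    // e38182
echo bin2hex($garbled), "\n"; // c3a3c281c282
```

The three original bytes become six, which is the familiar "double-encoded" mojibake.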

To elaborate a little more, your source code is read as a byte sequence. PHP interprets the parts that matter to it (all the keywords, operators and so on) as ASCII. UTF-8 is backwards compatible with ASCII: all the "normal" ASCII characters are represented by the same byte in both ASCII and UTF-8. So a " is interpreted as a " by PHP regardless of whether the file is supposed to be saved in ASCII or UTF-8. Anything between the quotes PHP simply takes as a literal byte sequence. So PHP sees your "あ" as "11100011 10000001 10000010". It doesn't care what exactly is between the quotes; it'll just use it as-is.
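To make the byte-level view concrete, here is a small sketch that prints the bytes PHP actually stores for "あ". The character is written with byte escapes, on the assumption that a literal "あ" in a file saved as UTF-8 would yield the same three bytes.

```php
<?php
// "あ" (U+3042) as explicit UTF-8 bytes.
$string = "\xE3\x81\x82";

echo strlen($string), "\n";  // 3 (strlen counts bytes, not characters)
echo bin2hex($string), "\n"; // e38182

// Render each byte as eight bits.
$bits = [];
foreach (str_split($string) as $byte) {
    $bits[] = sprintf('%08b', ord($byte));
}
echo implode(' ', $bits), "\n"; // 11100011 10000001 10000010
```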

PHP generally does not care about string encoding; strings are binary data within PHP. So you must know the encoding of the data inside the string if encoding matters. The question is: does encoding matter in your case?

If you set a string variable's content as you did:

$string="ぁ";

It will contain UTF-8, assuming the file is saved as UTF-8. I originally claimed it was a byte sequence that is not a valid UTF-8 character, because my browser showed a question mark in its place; that turned out to be a missing font on my end.

It also shows that your editor is handling the file as UTF-8 or some other Unicode encoding. Just keep the following in mind: one file, one encoding. A string literal stored in the file is in the encoding of that file. Check which encoding your editor uses when saving the file; then you know the encoding of the string.

Let's just assume it is some valid UTF-8 like so (using a character my font supports):

$string="ä";

You can then do a binary comparison of the string later on:

if ( 'ä' === $string )
  # do your stuff

Because both literals are in the same file and PHP strings are binary data, this works with every encoding. So normally you don't need to re-encode (change the encoding of) the data as long as you use binary-safe functions, that is, functions that treat the string as raw bytes and leave its encoding untouched.

For regular expressions, encoding does play a role. That's why there is the u modifier: it tells PCRE to treat both the pattern and the subject as UTF-8. If the data is already UTF-8, you don't need to convert it before calling preg_match. With your code example, though, a regular expression is not necessary at all; a simple string comparison does the job.
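As a rough illustration of how the u modifier behaves, the sketch below matches "ぁ" against a valid UTF-8 subject and against a deliberately truncated one. Both the pattern and the subjects use escapes rather than literal characters, so the outcome does not depend on the file's encoding; the variable names are only illustrative.

```php
<?php
$valid   = "\xE3\x81\x81"; // "ぁ" (U+3041) as UTF-8 bytes
$invalid = "\xE3\x81";     // truncated sequence, not valid UTF-8

// \x{3041} is resolved by PCRE itself when the u modifier is set.
$pattern = '/\x{3041}/u';

$matchValid   = preg_match($pattern, $valid);   // 1: already UTF-8, no conversion needed
$matchInvalid = preg_match($pattern, $invalid); // false: subject is not valid UTF-8
$badUtf8      = (preg_last_error() === PREG_BAD_UTF8_ERROR);

var_dump($matchValid, $matchInvalid, $badUtf8);
```

With a subject that is not valid UTF-8, preg_match reports an error instead of returning 0, so checking for `false` separately from `0` is worthwhile.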

Summary:

$string="ä";
if ( 'ä' === $string )
  # do your stuff

If your string is not valid UTF-8, preg_match with the u modifier can't match it, which is why it may look like you need utf8_encode(). Try saving the PHP file as UTF-8 (using something like Notepad++) and it should work without it.

Summary:

The utf8_encode() function treats every byte of the given string as one ISO-8859-1 character and encodes it as UTF-8, no matter what encoding was actually used to store the file. Its purpose is to encode strings¹ that aren't UTF-8 yet.

1.- The correct use of this function is to pass it an ISO-8859-1 string. Why? Because the first 256 Unicode code points and ISO-8859-1 have the same characters at the same positions.

Source encoding   Char   Byte   utf8_encode() output   Correct UTF-8 for that char?
Windows-1252      €      80     C2 80                  No (€ is E2 82 AC in UTF-8)
ISO-8859-1        ¢      A2     C2 A2                  Yes

The function may appear to work with other encodings: it works if the string contains only characters whose byte values match ISO-8859-1 (e.g. in Windows-1252, positions 00-7F and A0-FF).
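The two cases can be checked with a short sketch, assuming a PHP version where utf8_encode() is still available (it is deprecated as of PHP 8.2, so the notice is silenced). The single-byte inputs are written as escapes to stand in for data read from an ISO-8859-1 or Windows-1252 source.

```php
<?php
// utf8_encode() is deprecated since PHP 8.2; hide the notice for this demo.
error_reporting(E_ALL & ~E_DEPRECATED);

$cent = "\xA2"; // "¢" in both ISO-8859-1 and Windows-1252
$euro = "\x80"; // "€" in Windows-1252, but a control code in ISO-8859-1

$centOut = bin2hex(utf8_encode($cent));
$euroOut = bin2hex(utf8_encode($euro));

echo $centOut, "\n"; // c2a2, the UTF-8 encoding of "¢" (U+00A2)
echo $euroOut, "\n"; // c280, i.e. U+0080, not "€" (U+20AC would be e282ac)
```

For Windows-1252 input, a conversion function that takes an explicit source encoding, such as mb_convert_encoding() or iconv(), is the usual choice.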

Keep in mind that if the function receives a string that is already UTF-8 (for example, a literal in a file saved as UTF-8), it will encode that string again and produce garbage.
