Unicode PHP source files
For a project I'm currently working on, I needed to add some Unicode characters to a PHP file.
So I needed to use a Unicode encoding, of course.
That made me wonder:
What prevents me from using Unicode for all my PHP files?
Nothing prevents you from using Unicode in all your PHP files. You may, however, need to edit your scripts if the chosen encoding interferes with how they process strings.
There are some things to remember when you work with UTF-8 encoded source files:
- Some editors add a BOM at the beginning of the file. This can corrupt the script output, so save your files without a BOM.
- strlen and other string functions count bytes, not characters, so they may not work as you expect. Use the multibyte string functions for string length, etc.: http://php.net/manual/en/book.mbstring.php
- Regular expressions require the u modifier to work with Unicode characters.
- Be careful when you work with files and pay attention to the current encoding: when a file does not contain a BOM (see the first point), an editor may open it in the system default encoding.
- Some source code tools may not work correctly with UTF-8 files (because the files do not contain a BOM, although some tools work incorrectly even when the files have one).
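To illustrate the points about string functions and regular expressions, here is a minimal sketch. It assumes the file is saved as UTF-8 without a BOM and that the mbstring extension is loaded; the variable names are only for illustration.

```php
<?php
// Sketch only: assumes this file is saved as UTF-8 without a BOM
// and that the mbstring extension is loaded.

$word = "naïve"; // 5 characters; "ï" occupies 2 bytes in UTF-8

$byteLength = strlen($word);             // counts bytes
$charLength = mb_strlen($word, 'UTF-8'); // counts characters (code points)

// Without the u modifier "." matches a single byte, so a 2-byte character
// does not fit the pattern; with u it matches one whole character.
$byteMatch    = preg_match('/^.$/', "ï");
$unicodeMatch = preg_match('/^.$/u', "ï");

echo "bytes: $byteLength, characters: $charLength\n";
echo "match without u: $byteMatch, with u: $unicodeMatch\n";
```

Passing the encoding to mb_strlen explicitly avoids depending on the configured default internal encoding.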
In my experience, it is sometimes better to store strings in resources (text files or similar) and avoid UTF-8 in code files, but sometimes it is fine. It depends on whether it causes you problems or not.
What's “Unicode encoding”?
Unicode is a character set; there are lots of encodings between Unicode and bytes, many of them mapping only a subset of possible characters.
When you want to use non-ASCII Unicode characters in a PHP script, the usual best choice of encoding is UTF-8, as it's an ASCII-superset encoding (i.e. the lower 128 values of each byte always mean the standard ASCII characters) that can still represent any Unicode character. PHP, like many other byte-oriented tools, can only reliably work with ASCII-superset encodings.
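As a rough illustration of why the ASCII-superset property matters for byte-oriented functions, consider the sketch below. It assumes the file is saved as UTF-8 without a BOM; the sample string is made up.

```php
<?php
// Sketch only: assumes this file is saved as UTF-8 without a BOM.

$csv = "café,naïve,日本語";

// explode() works on bytes, yet splitting on the ASCII comma is safe:
// in UTF-8 no byte of a multi-byte character falls in the ASCII range.
$parts = explode(',', $csv);

// "é" is encoded as two bytes, both 0x80 or above.
$hex = bin2hex("é");

print_r($parts);
echo $hex, "\n";
```

The same reasoning is why PHP can parse its own ASCII syntax in a UTF-8 file without knowing anything about the encoding.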
If by “Unicode encoding” you mean the thing that Notepad and other Windows tools call “Unicode”, that's quite a different proposition. This is a misleading name for what is correctly known as the UTF-16LE encoding. This encoding has a two-byte-per-code-unit width, which means, e.g., that normal ASCII characters come out with zero bytes between them. It's not an ASCII superset, so PHP and other byte-based tools can't do much with it directly.
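The zero bytes are easy to see by dumping the same text in both encodings. This is a small sketch that assumes the mbstring extension is loaded:

```php
<?php
// Sketch only: assumes the mbstring extension is loaded.

$utf8    = "Hi";
$utf16le = mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');

echo bin2hex($utf8), "\n";    // 4869
echo bin2hex($utf16le), "\n"; // 48006900
```

Each ASCII character is followed by a 00 byte in UTF-16LE, which is what breaks byte-oriented tools.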
When saving scripts in Windows-based editors, look to save in UTF-8 (without BOM), and serve your pages with a UTF-8 Content-Type charset. Although it's the default in-memory representation for Windows, Java and JavaScript, UTF-16LE is of pretty much zero use for storing files or serving web pages.
What prevents me from using Unicode for all my PHP files?
The specific encoding might. PHP does not treat the source file's encoding specially; it reads the file only as a binary sequence of bytes.
The only Unicode encoding that is compatible with PHP on the source-file level is UTF-8.
Take care not to save PHP files with the UTF-8 BOM. PHP treats it as ordinary text and outputs it, because it comes before the opening <?php
tag:
{UTF8-BOM}<?php
The output is invisible but has a byte length of three, causing either "headers already sent" errors or text nodes inserted into the DOM where they are not expected.
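If you suspect a file was saved with a BOM, checking for it is straightforward. The following is a minimal sketch; hasUtf8Bom() and stripUtf8Bom() are hypothetical helper names, not PHP built-ins, and the sample source string is made up.

```php
<?php
// Sketch only: hasUtf8Bom() and stripUtf8Bom() are hypothetical helpers.

const UTF8_BOM = "\xEF\xBB\xBF"; // the three bytes of the UTF-8 BOM

// Returns true when the given contents start with the UTF-8 BOM.
function hasUtf8Bom(string $contents): bool
{
    return strncmp($contents, UTF8_BOM, 3) === 0;
}

// Returns the contents with a leading UTF-8 BOM removed, if present.
function stripUtf8Bom(string $contents): string
{
    return hasUtf8Bom($contents) ? substr($contents, 3) : $contents;
}

$source = UTF8_BOM . "<?php echo 'hello';";
$clean  = stripUtf8Bom($source);
```

In practice you would probably read a script with file_get_contents(), pass it through such a helper, and write it back, or simply change the editor's save settings.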