Help with proper character encoding

2022-12-27 13:34 问答作者：

I have a HTML form that is sometimes submitted with accented characters: à, è, ì, ò, ù

I have a PHP script that exports these form submissions into CSV format, when I look at the CSV fo开发者_如何学Gormat in a text editor (vim or notepad for example) the characters look fine, but when opened with Open Office or Word, I get some funky results: ��

I am also passing these submission to salesforce and am getting an error: "The entity "Atilde" was referenced, but not declared."

What can I do to ensure portability of my CSV file? What's the proper way to handle the encoding?

My HTML file is content-type is set as: Content-Type: text/html; charset=utf-8 Data is being stored in MySQL as latin1_swedish_ci collation.

Total encoding confusion! :-)

The table character set

The MySQL table character set only determines what encoding MySQL should use internally, and thus the range of characters permitted.

If you set it to Latin-1 (aka ISO 8859-1), you will not be able to store international characters in your table.
Importantly, the character set does not affect the encoding MySQL uses when communicating with your PHP script.
The table collation specifies rules for sorting.

The connection character set

The MySQL connection character set determines the encoding you receive table data in (and should send data to MySQL in).

The encoding is set using SET NAMES, e.g. SET NAMES "utf8".
If this does not match the table encoding, MySQL automatically converts data on the fly.
If this does not match your page character set, you'll have to manually perform character set conversion in PHP, using e.g. utf8_encode or mb_convert_encoding.

Page character set

The page character set, specified using the Content-Type header, tells the browser how to interpret the PHP script output.

As an HTTP header, it is not saved when you save the file from within your browser. The information is thus not available to OpenOffice or other programs.

Recommendations

Ideally, you should use the same encoding in all three places, and ideally, that encoding should be UTF-8.

However, CSV will cause problems, since the file format does not include encoding information. It is thus up to the application to guess the encoding, and as you've seen, the guess will be wrong.

I don't know about OpenOffice, but Microsoft Office will assume the Windows "ANSI" encoding, which usually means Latin-1 (or CP1252 to be specific).
Microsoft Office will also cause problems in countries that use "," as a decimal separator, since Office then switches to using ";" as a field separator for CSV-files.

Your best bet is to use Latin-1 for the CSV-file. I'd still use UTF-8 for the table and connection character sets though, and also UTF-8 for HTML pages.

If you use UTF-8 for the connection character set (by executing SET NAMES "utf8" after connecting), you'll need to run the text through utf8_decode to convert to Latin-1.

That entity problem

I am also passing these submission to salesforce and am getting an error: "The entity "Atilde" was referenced, but not declared."

This sounds like you're passing HTML code in an XML context, and is unrelated to character sets. Try running the text through html_entity_decode.

Also, what document type have you set, is it?

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Try using the htmlentities() function for any text that is not showing correctly.

You may also want to have a look PHP Normalizer.

Make sure you are writing the CSV file as UTF-8. See http://www.php.net/manual/en/function.fwrite.php#55054 if you are unsure how to.

(Also, your sql table should be using utf8, not latin1)

It's up to you to decide which charset encoding you'll use for writing your CSV file (but, note, that must be a concious decision on your part).

Which charset encoding to use ? CSV does not defines a charset encoding - So I'd go for some Unicode charset, presumably UTF8. But some CSV consumers (eg Excel) might not be happy with it. If you are restricted to "western" langs, then latin1 or its variants (iso-8859-1 or iso-8859-15) might be more appropiate. But then (in any case, actually) you must think the conversion from user input to your particular encoding - and what to do if there are invalid characters.

(BTW: same consideration goes for the html-input-to-db conversion - you are using latin1 for your database, have you asked yourself what happens if the user types a non-latin1 character ? eg a japanese char ? ).

继续阅读：character-encoding php portability

Help with proper character encoding

The table character set

The connection character set

Page character set

Recommendations

That entity problem

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

The table character set

The connection character set

Page character set

Recommendations

That entity problem

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？