Website conversion help - UTF-8, Covering all the bases... functions, metas, and sql utf-8

You all did such an amazing job answering a question earlier that I thought I'd ask this one before I get too deep into my conversion, only to find out I did something wrong. I only have 3 pages in a website I'm making for myself. It has forms and an sqli db. I was told to use UTF-8 (I partially did, but not fully) lol. Ok, sounds cool. Now I want to fix it to be 100% UTF-8 aware, but I have already written about 1,900 lines of code in PHP, JS, and HTML without using multibyte functions. So here's my question. In my conversion I have done this (snippets of code from various places):

PHP

date_default_timezone_set('America/Toronto'); // sets the timezone to Eastern Time (Toronto)

HTML

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>

etc

SQL

(from cPanel interface) MySQL connection collation: utf8_general_ci

SQL DB (still in pre utf-8 mode)

username varchar(50) latin1_general_cs

companyname varchar(50) latin1_swedish_ci

fname varchar(25) latin1_swedish_ci

I have NO valuable data in the tables. I will be changing those to one of the following (I'm not sure which one however)...

utf8_general_ci or utf8_unicode_ci

While I would like to make the site available for foreign people, it's not a high priority BUT, since I'm doing it UTF-8 style it's probably already going to work for foreign languages.

My questions are...

1) I set my timezone, but I didn't set my locale in PHP because I have never done that. Do I need to do that? How do I do that for my Toronto/Canada location?

2) Is setting each page via a meta tag enough to make the entire page UTF-8?

3) By using the meta tag does that mean all my form fields are already being input as UTF-8 data? If not, how do I change it so they are.

4) Which one do I use for my DB? utf8_general_ci or utf8_unicode_ci

5) I NEED certain things to be case sensitive. I only see ci for utf8. Is this because a "Dave" is different than "dave" so using multibyte compares automatically compares case...??!?!?!

6) My DB currently has, say, 50 characters of storage for ASCII stuff. I assume that by switching to UTF-8 in the DB, 50 will be fine for English speakers like myself, but if some foreign person comes along and enters a bunch of unusual symbols, would I need to increase my storage by x4 to accommodate all the extra bytes for Unicode? I don't mind using up more storage, but I'm curious what the proper way to allocate this would be. And since it's a VARCHAR(50), would it really matter anyway? If the name is "Dave" it would be 4 characters. If it was some foreign name, "Dave" in symbols might be 12 characters! lol. So, if I allocate say 100 to the username field, that should do, since it's unlikely ALL characters would be 4 bytes. Or, just set it to x4 what I would for English and make them all VARCHARs to save space. When they enter data on the form I'll be using the multibyte length functions (I forget the exact function), so I would still be able to control how many characters would be input.

7) How can I test my unicode website? I have never used anything other than beautiful english :) lol. How can I switch my browser? to pretend like I'm from somewhere else and enter a pile of codes and see if my functions work once I re-write them to use mb_ (multibyte) functions. Or, is there nothing to switch over... I just type in ALT 245 or something and I get symbols?!?!? I don't know how to enter foreign test characters! It would suck to get english working only to have all foreign customers not able to enter a password because I didn't test my website enough :)

8) I know to use certain functions ctype, mb_ to handle unicode compares, strings, etc. Any surprises in store for me? Things that don't work as they should?

Yes... I'm wordy! :) I use Dreamweaver CS3 but that shouldn't matter. There are no UTF-8 characters embedded in my actual files.

Awaiting all your wisdom...


I'll start with some of the answers:

2) Your server should also send headers that indicate that the content sent is in UTF-8:

header('Content-Type: text/html; charset=UTF-8');

3) Browsers will send their data in UTF-8, yes. But attackers may not, so you should also pass the UTF-8 charset to htmlentities and similar HTML-encoding functions (see example exploit).
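As a rough illustration of point 3, here is a minimal sketch of passing the charset explicitly when encoding output. The helper name escape_utf8 is made up for this example; htmlentities and its ENT_QUOTES flag are standard PHP. When ENT_QUOTES is passed on its own (without ENT_SUBSTITUTE or ENT_IGNORE), a string that is not valid UTF-8 comes back as an empty string, which makes malformed input easy to spot.

```php
<?php
// Hypothetical helper (not from the original post): HTML-encode untrusted
// text, telling PHP explicitly that the text is UTF-8.
function escape_utf8(string $text): string
{
    // ENT_QUOTES escapes both single and double quotes.
    // The third argument names the charset of $text.
    return htmlentities($text, ENT_QUOTES, 'UTF-8');
}

$safe = escape_utf8('<b>"Dave" & co</b>');
$bad  = escape_utf8("abc\xC3");   // ends in a truncated multibyte sequence

echo $safe, "\n";                 // the markup is now inert text
var_dump($bad === '');            // invalid UTF-8 produced an empty string
```

An empty result for non-empty input is a hint that the submitted bytes were not UTF-8 and should probably be rejected rather than displayed.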

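5) A case-insensitive collation only means that comparisons, such as in a WHERE clause, ignore case; the stored text keeps its case. For case-sensitive comparisons MySQL also offers the binary collation utf8_bin.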

6) Actually, it is the other way around: a MySQL VARCHAR length counts characters, not bytes, so a utf8 VARCHAR(50) still holds 50 characters however many bytes each one takes ("Dave" is 4 chars, 4 bytes; "ǝʌɐp" is 4 chars, 7 bytes).
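To see the difference between characters and bytes from PHP, a small sketch like the following may help. It assumes the mbstring extension is loaded and PHP 7 or later (for the "\u{...}" escapes, which keep the source file pure ASCII). Note that the last letter of the flipped string is a plain ASCII "p", so the string encodes to 7 bytes.

```php
<?php
// Compare character counts (mb_strlen) with byte counts (strlen).
$ascii   = 'Dave';
$flipped = "\u{01DD}\u{028C}\u{0250}p";   // the string "ǝʌɐp"

foreach ([$ascii, $flipped] as $name) {
    printf(
        "%s: %d chars, %d bytes\n",
        $name,
        mb_strlen($name, 'UTF-8'),   // counts characters
        strlen($name)                // counts bytes
    );
}
```

Form-side length checks would use the character count, since that is what a user sees and types.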


1) About setting the locale: it can affect some string functions (e.g. strtoupper()); its purpose is to change the way some operations behave. For example, within a regular expression it changes which characters \w and \W (word characters) match. But as more and more applications move to Unicode, the need for this locale support is expected to die away.
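If you do want to set a locale, a minimal sketch might look like this. The locale names are assumptions: which ones exist depends entirely on what is installed on the server. setlocale() tries each candidate in order and returns the one it applied, or false if none worked.

```php
<?php
// Candidate locale names for Canadian English; these are guesses that
// vary by operating system. 'C' is the portable fallback.
$applied = setlocale(LC_ALL, 'en_CA.UTF-8', 'en_CA.utf8', 'en_CA', 'C');

if ($applied === false) {
    echo "None of the requested locales is installed\n";
} else {
    echo "Locale in use: {$applied}\n";
}
```

Checking the return value matters, because a locale that is not installed is silently skipped.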

7) W3C can help you a little bit.

About typing test characters and pretending to be a user from China or somewhere else:

index.php:

<head>
<meta charset="UTF-8"><!-- This tag sets the encoding used for text typed into a text area (if accept-charset="" is not specified).
If a typed character isn't part of the encoding, the character will be escaped** -->
</head>

<form method="POST" action="encode.php" accept-charset="UTF-8"><!-- accept-charset="" sets the encoding used to transmit the form's characters -->
<p><textarea name="input" maxlength="256" rows="5" cols="100"></textarea></p>
<p><button>Submit</button></p>
</form>

**Escaped Characters

Then in encode.php you can control your input with:

$input=$_POST["input"];
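One possible way to carry on from there, offered as a sketch and not as the answerer's own code: check that the submitted value really is UTF-8 and count its length in characters. The function name read_utf8_field is made up for this example; mb_check_encoding and mb_strlen are standard mbstring functions. The limit of 256 simply mirrors the maxlength on the textarea.

```php
<?php
// Hypothetical helper for encode.php: returns the submitted text if it is
// valid UTF-8 and at most $maxChars characters long, otherwise null.
function read_utf8_field(array $post, string $key, int $maxChars): ?string
{
    $value = $post[$key] ?? null;
    if (!is_string($value)) {
        return null;                      // missing, or an array was submitted
    }
    if (!mb_check_encoding($value, 'UTF-8')) {
        return null;                      // bytes are not valid UTF-8
    }
    if (mb_strlen($value, 'UTF-8') > $maxChars) {
        return null;                      // too long, counted in characters
    }
    return $value;
}

$input = read_utf8_field($_POST, 'input', 256);
```

For testing without a foreign keyboard layout, you can call the helper with escaped test strings, for example read_utf8_field(['input' => "\u{4F60}\u{597D}"], 'input', 256) for two Chinese characters, or paste non-English text from any web page into the form.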