charset-utf8 and character entities

I am proposing to convert my windows-1252 XHTML web pages to UTF-8.

I have the following character entities in my coding:

  • &#39; (apostrophe),
  • &#9658; (►, right pointer),
  • &#9668; (◄, left pointer).

If I change the charset and save the pages as UTF-8 using my editor:

  • the apostrophe remains in as a character entity;
  • the pointers are converted to symbols within the code (presumably because the entities are not supported in UTF-8?).

Questions:

  1. If I understand UTF-8 correctly, you don't need to use the entities and can type characters directly into the code. In that case, is it safe for me to replace &#39; with a typed-in apostrophe?

  2. Is it correct that the editor has placed the pointer symbols directly into my code, and will these be displayed reliably in modern browsers? It seems to be OK. Presumably I can't revert to the entities anyway if I use UTF-8?

Thanks.


It's charset, not chartset.

1) It depends on where the apostrophe is used. It's a valid ASCII character as well, so depending on the character's purpose (whether it's for display only, inside a DOM text node, or used in code) you may or may not be able to use a literal apostrophe.

2) If your editor is a modern editor, it will store text as UTF-8 byte sequences rather than as one byte per character. Most of the characters used in code are plain ASCII (and ASCII is a subset of UTF-8), so those characters take up one byte each. Other characters may take up two, three or even four bytes. They will still be displayed to you as one character, but the relationship between character and byte is no longer one-to-one.
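To make the byte counts concrete, here is a minimal Python sketch (Python and the sample characters are my own choice for illustration, not something the question uses) that reports how many bytes a character occupies once encoded as UTF-8:

```python
# Count the UTF-8 bytes needed for a few sample characters.
samples = ["'", "é", "►", "😀"]

def utf8_length(ch: str) -> int:
    """Return the number of bytes the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

for ch in samples:
    print(f"U+{ord(ch):04X} {ch!r}: {utf8_length(ch)} byte(s)")
```

The apostrophe stays at one byte, while the pointer symbol needs three, even though each is a single character on screen.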

Anyway, since all valid ASCII characters are exactly the same in ASCII, UTF-8 and even windows-1252, you should not see any problems using UTF-8. And you can still use numeric and named entities, because they are written using only those ASCII characters. You just don't have to.
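One way to check this claim, sketched in Python (the helper name `encoded_forms` is hypothetical): an entity is written entirely in ASCII, so it produces identical bytes in all three encodings, whereas a non-ASCII character such as é does not.

```python
entity = "&#9658;"  # written entirely in ASCII characters

def encoded_forms(text: str) -> set:
    """Return the distinct byte strings produced by the three encodings."""
    return {text.encode(enc) for enc in ("ascii", "cp1252", "utf-8")}

# Only one distinct byte string: the entity is the same in every encoding.
print(encoded_forms(entity))

# A non-ASCII character differs between windows-1252 and UTF-8.
print("é".encode("cp1252"), "é".encode("utf-8"))
```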

P.S. All modern browsers can handle UTF-8 just fine, but our definitions of "modern" may vary.


Entities have three purposes: Encoding characters it isn't possible to encode in the character encoding used (not relevant with UTF-8), encoding characters it is not convenient to type on a given keyboard, and encoding characters that are illegal unescaped.

&#9658; should always produce ► no matter what the encoding. If it doesn't, it's a bug elsewhere.

► directly in the source is fine in UTF-8. You can do either that or the entity, and it makes no difference.
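As a rough illustration, Python's standard `html.unescape` can stand in for what a conforming parser does with a numeric reference (this is a sketch in Python, not a statement about any particular browser): the decimal and hexadecimal forms both decode to the same character, independent of how the file itself is encoded.

```python
import html

# Decimal and hexadecimal references to U+25BA both decode to the same character.
decoded_decimal = html.unescape("&#9658;")
decoded_hex = html.unescape("&#x25BA;")

print(decoded_decimal, decoded_hex, decoded_decimal == "►")
```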

A literal ' is fine in most contexts, but not all. Both apostrophes in the following are allowed:

<span title="Jon's example">This is Jon's example</span>

But the apostrophe in the attribute value would have to be encoded in:

<span title='Jon&#x27;s example'>This is Jon's example</span>

because otherwise it would be taken as the ' that ends the attribute value.
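If the markup is generated by a program, the escaping can be left to a library. A minimal sketch using Python's standard `html.escape` (the variable names are only illustrative):

```python
import html

title = "Jon's example"

# quote=True (the default) also escapes " and ', so the value is safe
# inside either single- or double-quoted attributes.
escaped = html.escape(title, quote=True)
span = f"<span title='{escaped}'>This is Jon's example</span>"
print(span)
```

The apostrophe in the element's text content is left alone here on purpose, since only the one inside the single-quoted attribute needs escaping.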


Use entities if you copy/paste content from a word processor or if the code is an XML dialect. Use a macro in your text-editor to find/replace the common ones in one shot. Here is a simple list:

  • Half: ½ => &#189;
  • Acute Accent: é => &#233;
  • Ampersand: & => &#38;
  • Apostrophe: ' => &#39;
  • Backtick: ` => &#96;
  • Backslash: \ => &#92;
  • Bullet: • => &#8226;
  • Dollar Sign: $ => &#36;
  • Cents Sign: ¢ => &#162;
  • Ellipsis: … => &#8230;
  • Emdash: — => &#8212;
  • Endash: – => &#8211;
  • Left Quote: “ => &#8220;
  • Right Quote: ” => &#8221;
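Instead of an editor macro, the same replacement can be sketched in Python with the built-in `xmlcharrefreplace` error handler (the function name `to_numeric_entities` is hypothetical). Note the assumption: this only converts non-ASCII characters, so ASCII entries in the list above such as &, $ and \ are left untouched.

```python
def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

print(to_numeric_entities("½ café … “quoted” – done"))
```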

References

  • XML Entity Names