
Are unicode characters better or more semantic than the simple text versions?

When I copy/paste text from most sites and PDFs, the following characters are almost always in the Unicode equivalent:

  • double quote: " is “ and ” (&ldquo; and &rdquo;)
  • single quote: ' is ‘ and ’ (&lsquo; and &rsquo;)
  • ellipsis: ... is … (&hellip;)

I understand ones that can't be represented without Unicode like © and ¢, but even for those, I wonder.

When should you use these unicode equivalents? Are they more semantic than not using them? Are they better interpreted by devices (copy/paste/print)? I always find it annoying getting those quote and ellipsis characters because with textmate + programming, you don't use them.


When should you use these unicode equivalents? Are they more semantic than not using them?

Note that these are not “unicode equivalents”. Those characters are available in many character sets other than Unicode, and they are strictly distinct from the alternatives that you propose.

In typography, the left and right versions of the single and double quotation marks are correct. They provide the traditional appearance for those characters that has been used in print media for many years. The ellipsis character provides the correct spacing for an ellipsis that does not naturally occur when using consecutive full stop characters. So the reason all of these are used is to make the text appear correctly to human readers.
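To check that these really are separate characters and not just restyled versions of the ASCII ones, here is a minimal sketch in Python (my choice of language, the question doesn't name one). The helper name `describe` is made up for illustration; it prints each character's code point and its official Unicode name:

```python
import unicodedata

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Straight quotes first, then the typographic characters from the question.
for ch in ['"', "\u201c", "\u201d", "'", "\u2018", "\u2019", "\u2026"]:
    print(describe(ch))
```

Each one reports a different code point and name, for example U+0022 QUOTATION MARK versus U+201C LEFT DOUBLE QUOTATION MARK.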

Are they better interpreted by devices (copy/paste/print)?

Any system that uses any character set should be designed to correctly handle that character set. If the text is encoded in Unicode, then any recent system (from the last 15 years at least) should be able to handle it, since Unicode is the de facto standard character set for all modern systems.

Not all Unicode-conformant systems will be able to display all characters correctly. This will depend on the fonts available, and even the rendering system that uses the fonts. But any Unicode-conformant system will be able to transmit the characters unaltered (such as in a copy and paste operation).

I always find it annoying getting those quote and ellipsis characters because with textmate + programming, you don't use them.

It is unusual to copy English (or whatever language) text directly into a program without having to add separate delimiters to that text. But most modern programming languages will not have any difficulty handling the text once it is properly delimited.

Any systems that cannot handle Unicode correctly should be updated. Legacy character encodings will have no place in the future.


I think there's a simple explanation: MS Word converts these characters/sequences automatically as you type, and a lot of text on the internet has been copied from this word processor.

Most of the articles I get for my site from other authors are sent as .doc files and I have to convert them. Usually, they contain the characters you've mentioned.

I'd also add one more: the many different types of dashes used instead of the hyphen. And also the low opening double quote (as seen in some European languages).

I usually let them stay in the text (all my pages are Unicode). It's just important to remember them when playing around with regex etc. (the dashes especially can be tricky and hard to spot).
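If you do need plain ASCII for regex work or for pasting into code, something along these lines might help. It is only a sketch in Python: the mapping table and the names `ASCII_EQUIVALENTS`, `to_ascii_punctuation` and `ANY_DASH` are my own, and the table only covers the characters mentioned in this thread, so extend it for whatever your text actually contains.

```python
import re

# Hypothetical mapping, limited to the characters discussed above.
ASCII_EQUIVALENTS = str.maketrans({
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u201e": '"',    # low opening double quotation mark
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u2026": "...",  # horizontal ellipsis
    "\u2010": "-",    # hyphen
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2212": "-",    # minus sign
})

def to_ascii_punctuation(text: str) -> str:
    """Replace typographic punctuation with its plain ASCII counterpart."""
    return text.translate(ASCII_EQUIVALENTS)

# Alternative: leave the text alone and make the pattern accept every dash variant.
ANY_DASH = re.compile("[-\u2010\u2013\u2014\u2212]")

print(to_ascii_punctuation("\u201e1990\u20132000\u201c\u2026"))  # "1990-2000"...
print(ANY_DASH.split("1990\u20132000"))                          # ['1990', '2000']
```

Writing the characters as `\u` escapes in the source keeps the look-alike dashes from hiding in the editor.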


HTML entities serve a triple purpose:

  1. Being able to use characters that do not belong to the document character set, e.g., insert a euro symbol in an ISO-8859-1 document.

  2. Escape characters that have a special meaning in HTML, such as angle brackets.

  3. Make it easier to type characters that are not on your keyboard or are not supported by your editor, e.g. a copyright symbol.
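The three purposes above can be tried out with Python's standard `html` module. This is just an illustration of how entities map to characters and back, not a statement about how any particular site processes its pages:

```python
import html

# Purposes 1 and 3: a named entity stands in for a character that the
# document encoding or the keyboard cannot produce directly.
decoded = html.unescape("&euro; &copy; &ldquo;quoted&rdquo; &hellip;")
print(decoded)

# Purpose 2: characters with a special meaning in HTML are escaped so
# that they show up as text instead of being parsed as markup.
escaped = html.escape("<b> & </b>")
print(escaped)  # &lt;b&gt; &amp; &lt;/b&gt;
```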

Update:

My info is correct but I suspect I've answered the wrong question...


On the web, I would consider that markup adds semantic meaning, content does not. So it doesn't really matter which you use in this context.

Typographers would insist on “ and ”, whereas programmers don't care and just use regular old quotes ".

The key here is interoperability. There are different encoding schemes. As we've all been victim to, people paste content into an editor from Word, which traditionally uses the windows-1252 encoding (on Western-language systems). When you serve this content up via AJAX it usually breaks, because AJAX uses UTF-8 encoding by default.

Office 2010 now allows for the saving of documents in UTF-8 format. Also, databases have different Unicode encoding schemes. The best bet is to use UTF-8 end-to-end.
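Here is a rough sketch in Python of what goes wrong when the two ends disagree. I'm assuming the legacy side is windows-1252 (Python's `cp1252` codec); the helper `decode_or_report` is made up for the example.

```python
text = "It\u2019s here\u2026"  # right single quote and ellipsis

cp1252_bytes = text.encode("cp1252")  # one byte per character
utf8_bytes = text.encode("utf-8")     # three bytes for each of the two specials

def decode_or_report(data: bytes, encoding: str) -> str:
    """Decode data, or return a short message if the bytes are invalid."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"cannot decode as {encoding}: {exc.reason}"

# Legacy bytes read as UTF-8: the decoder rejects them outright.
print(decode_or_report(cp1252_bytes, "utf-8"))

# UTF-8 bytes read as windows-1252: no error, but the text is garbled.
print(decode_or_report(utf8_bytes, "cp1252"))

# Same encoding on both ends: the text survives intact.
print(decode_or_report(utf8_bytes, "utf-8"))
```

The middle case is the sneaky one, since nothing fails and the garbage only shows up on screen.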


When you copy-paste text that includes special characters, they will be left as they are. This is perfectly fine if the characters match the charset used by the webpage.

HTML entities are just a convenience for producing specific characters in any character set. Keyboards tend not to have keys to get symbols like ©, so the HTML entity is a shortcut.

I'm going to generalize and say that most of the time the content is UTF-8 (please correct me if I'm wrong). The copied characters are usually copied correctly and everything works great. If they aren't copied correctly, or the charset is subject to change, or you're after i18n support, go with the HTML or XML entities. Otherwise, leave them as they are; the browser will display them just fine.
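If you do go the entity route, the conversion doesn't have to be done by hand. As one possible approach (a sketch in Python, with a sample string I made up), encoding to ASCII with the `xmlcharrefreplace` error handler turns every non-ASCII character into a numeric character reference, which stays valid whatever charset the page ends up being served in:

```python
import html

text = "\u00a9 2010 \u201cExample\u201d\u2026"

# Characters outside ASCII become &#NNNN; references; ASCII passes through.
ascii_html = text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
print(ascii_html)  # &#169; 2010 &#8220;Example&#8221;&#8230;

# A browser, or html.unescape here, turns the references back into characters.
print(html.unescape(ascii_html) == text)  # True
```

Note that this does not escape `<`, `>` or `&`; that is a separate step.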
