Character \u260e

During web scraping, I got the character \u260e in Unicode. My output is "The Last Resort, â˜Ž +977 1 4700525". So instead of â˜Ž, there should be ☎.

How can I get it back to the telephone sign (☎), so that the output is "The Last Resort, ☎ +977 1 4700525"?

Krish


When you scraped a site, Python recognized a "☎" character and stored it in a string.

This character has code point U+260E. When characters are stored, however, they are stored as sequences of one or more bytes. What those bytes are depends on the encoding being used. In your case UTF-8 was probably used.

The UTF-8 encoding of this character is E2 98 8E (See http://www.fileformat.info/info/unicode/char/260e/index.htm).
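You can check those bytes yourself. This is a minimal sketch, assuming Python 3; the variable names are only for illustration.

```python
# The telephone sign is code point U+260E.
phone = "\u260e"

# Encoding turns the single character into three UTF-8 bytes.
utf8_bytes = phone.encode("utf-8")

print(hex(ord(phone)))                           # 0x260e
print(" ".join(f"{b:02X}" for b in utf8_bytes))  # E2 98 8E
```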

So now you have a byte sequence representing your character. What are you going to do with it? You are going to output it somewhere. Whatever displays that output has to convert the bytes back into characters, and to do that it needs an encoding. Let's say the encoding used is Windows-1252 (see http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT). In Windows-1252:

  • E2 is â
  • 98 is ˜
  • 8E is Ž

which is what you see. You need to write out your Python string in UTF-8. Or if you are writing to HTML, use DruvPathak's suggestion of using HTML character entity references, in this case

&#9742;

or

&#x260e;

I suspect what happened is that you did not specify an encoding when you wrote out your string and that Windows-1252 was the default. Or, maybe your browser was set to display Windows-1252 by default.
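The mix-up can be reproduced, and undone, in a few lines. This is a sketch assuming Python 3 and assuming the wrong decoder really was Windows-1252 (Python's "cp1252" codec); if a different codec was involved, the repair step needs that codec instead.

```python
original = "The Last Resort, \u260e +977 1 4700525"

# Encode correctly as UTF-8, then decode with the wrong codec.
# This is the mistake that produces the garbled output.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)

# Undo it: turn the wrong characters back into the original bytes,
# then decode those bytes as the UTF-8 they really are.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)
```

The repair only works while the garbled text is still intact. It is better to fix the place where the wrong encoding is applied than to patch strings afterwards.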

An interesting thing about sending data out in HTML is that you can do everything right and still not be safe. You can send a UTF-8 byte stream, set the charset in the HTTP Content-Type header to UTF-8, and put a meta tag in your HTML document stating that the page is encoded in UTF-8. If an end user's browser lets them override the encoding sent by the server, there is still a chance, I suppose, that the end user will see the data wrongly.

If you use character entity references, the browser will always show it properly.

It may be inconvenient, though, to use these entity references everywhere. Most people these days don't manually set their browser to override the encoding sent by the server.
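For the plain UTF-8 route, the key is to name the encoding when you write the file. Here is a minimal sketch, assuming Python 3; the file name last_resort.html and the temporary directory are placeholders, not anything from your scraper.

```python
import os
import tempfile

html = (
    "<!DOCTYPE html>\n"
    '<html><head><meta charset="utf-8"></head>\n'
    "<body>The Last Resort, \u260e +977 1 4700525</body></html>\n"
)

# Hypothetical output location; any writable path works.
path = os.path.join(tempfile.gettempdir(), "last_resort.html")

# Passing encoding explicitly avoids falling back to the platform
# default, which may be Windows-1252 on some systems.
with open(path, "w", encoding="utf-8") as f:
    f.write(html)

# Reading the raw bytes back shows the telephone sign stored as E2 98 8E.
with open(path, "rb") as f:
    raw = f.read()
print(b"\xe2\x98\x8e" in raw)  # True
```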

ADDENDUM

So let's say you have a unicode string and you want to produce a regular (non-unicode) string (of type str) containing HTML character entity references. Here is a full example script that illustrates a direct, though not necessarily the most Pythonic, way to do it:

def to_character_entity_reference_string(s):
    # Turn every character into a decimal reference such as &#9742;
    return "".join("&#" + str(ord(c)) + ";" for c in s)

print(to_character_entity_reference_string(u'काठमाण्डु'))

If you run this script, you get the output

&#2325;&#2366;&#2336;&#2350;&#2366;&#2339;&#2381;&#2337;&#2369;

You can put that output into a file and open it in a Web browser and you will see काठमाण्डु displayed as expected.

You can create variations on this base script so that characters with code points less than 128 are preserved while everything else becomes a character entity reference. You might also want to explore Python's encode and decode functions. And once again, the character entity references guard against people manually changing their browser settings to override your encodings. That protection is perfectly fine to have, but it may be considered overkill. End users that mess with these settings can be said to get what they deserve, so it is generally accepted to set things up to just encode everything in UTF-8, period. Nevertheless, it is nice to know about character entity references.
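One such variation might look like the sketch below, assuming Python 3. The function name to_entity_refs_keep_ascii is made up for this example. The second print uses the standard "xmlcharrefreplace" error handler of str.encode, which does the same replacement for you.

```python
def to_entity_refs_keep_ascii(s):
    """Replace only non-ASCII characters with decimal character references."""
    return "".join(c if ord(c) < 128 else "&#{};".format(ord(c)) for c in s)


text = "The Last Resort, \u260e +977 1 4700525"

# Hand-written version: ASCII stays readable, the rest becomes &#NNNN;
print(to_entity_refs_keep_ascii(text))
# The Last Resort, &#9742; +977 1 4700525

# Built-in version: encode to ASCII and let the error handler
# substitute a character reference for anything that does not fit.
builtin_result = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(builtin_result)
# The Last Resort, &#9742; +977 1 4700525
```

Neither version escapes &, < or >, so text that may contain markup characters still needs ordinary HTML escaping as a separate step.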


You can print these characters in your result page using HTML entities with the corresponding code.

e.g.: http://www.danshort.com/HTMLentities/index.php?w=dingb

Or use the string's encode function to encode it in the required encoding.
