Character \u260e

2023-04-02 04:43 Q&A author:

While scraping a web page, I got the character \u260e in Unicode. My output is "The Last Resort, â˜Ž +977 1 4700525". So instead of â˜Ž, there should be ☎.

How can I get it back to the telephone sign (☎), so that the output is "The Last Resort, ☎ +977 1 4700525"?

Krish

When you scraped a site, Python recognized a "☎" character and stored it in a string.

This character has code point U+260E. When characters are stored, however, they are stored as sequences of one or more bytes. What those bytes are depends on the encoding being used. In your case UTF-8 was probably used.

The UTF-8 encoding of this character is E2 98 8E (see http://www.fileformat.info/info/unicode/char/260e/index.htm).
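As a quick check of those three bytes, here is a minimal Python 3 sketch. The variable names are only illustrative.

```python
phone = "\u260e"  # the telephone sign, code point U+260E

# Encode the one-character string to bytes using UTF-8.
utf8_bytes = phone.encode("utf-8")

# Format each byte as two uppercase hex digits.
hex_bytes = " ".join(f"{b:02X}" for b in utf8_bytes)
print(hex_bytes)  # E2 98 8E
```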

So now you have a byte sequence representing your character. What are you going to do with it? You are going to output it somewhere. Whatever displays that output has to convert the bytes back into characters, and for that it needs an encoding. Let's say the encoding it uses is Windows-1252 (see http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT). Under that encoding:

E2 is â
98 is ˜
8E is Ž

which is what you see. You need to write out your Python string as UTF-8. Or, if you are writing HTML, follow DruvPathak's suggestion and use HTML character entity references, in this case either of:

&#x260e;

&#9742;
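The byte-by-byte mapping described above can be reproduced in Python 3. This is a sketch that assumes the garbling really was UTF-8 bytes read as Windows-1252 (Python's `cp1252` codec). The reverse step only works when every garbled character maps back to a single Windows-1252 byte, which is the case here.

```python
original = "The Last Resort, \u260e +977 1 4700525"

# Encode as UTF-8, then (wrongly) interpret the bytes as Windows-1252.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)  # the telephone sign shows up as three Windows-1252 characters

# Undo the mistake: recover the original bytes, then decode them as UTF-8.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)  # True
```

Repairing after the fact is a workaround. Writing the output as UTF-8 in the first place avoids the problem.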

I suspect what happened is that you did not specify an encoding when you wrote out your string and that Windows-1252 was the default. Or, maybe your browser was set to display Windows-1252 by default.
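If the output goes to a file, one way to rule out a platform default is to name the encoding explicitly. The sketch below assumes Python 3; the file name `output.txt` and the temporary directory are placeholders, not something from the question.

```python
import os
import tempfile

text = "The Last Resort, \u260e +977 1 4700525"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "output.txt")  # placeholder file name

    # Pass encoding= explicitly instead of relying on the locale default.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Read the raw bytes back to confirm what was actually written.
    with open(path, "rb") as f:
        raw = f.read()

print(raw)  # b'The Last Resort, \xe2\x98\x8e +977 1 4700525'
```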

An interesting thing about sending data out in HTML is that you can send out a UTF-8 byte stream, set the charset in the HTTP Content-Type header to UTF-8, and put a meta tag in your HTML document stating that the page is encoded in UTF-8. Even so, if an end user's browser lets him or her override the encoding sent by the server, there is a chance, I suppose, that the end user will see the data incorrectly.

If you use character entity references, the browser will show the character properly regardless, because the references consist only of ASCII characters, which UTF-8 and Windows-1252 encode identically.

It may be inconvenient, though, to use these entity references everywhere. Most people these days don't manually set their browser to override the encoding sent by the server.
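For the plain UTF-8 route, a rough Python 3 sketch of the pieces mentioned above might look like this. The function name `build_page` and the page content are made up for illustration, and how the header is actually sent depends on your web framework or server.

```python
import html


def build_page(body_text):
    """Return a small HTML document as UTF-8 bytes (hypothetical helper)."""
    page = (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "<head>\n"
        '<meta charset="utf-8">\n'  # in-document encoding declaration
        "<title>Example</title>\n"
        "</head>\n"
        "<body><p>" + html.escape(body_text) + "</p></body>\n"
        "</html>\n"
    )
    # The bytes sent over the wire must match the declared encoding.
    return page.encode("utf-8")


page_bytes = build_page("The Last Resort, \u260e +977 1 4700525")

# The matching HTTP header a server would send alongside the body.
content_type_header = "Content-Type: text/html; charset=utf-8"
```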

ADDENDUM

So let's say you have a Unicode string and you want to produce a plain ASCII-only string (of type str) containing HTML character entity references. Here is a full example script that illustrates a direct, though not necessarily the most Pythonic, way to do it:

def to_character_entity_reference_string(s):
    # Replace every character with a decimal reference such as &#9742;
    return "".join("&#" + str(ord(c)) + ";" for c in s)

print(to_character_entity_reference_string(u'काठमाण्डु'))

If you run this script, you get the output

&#2325;&#2366;&#2336;&#2350;&#2366;&#2339;&#2381;&#2337;&#2369;

You can put that output into a file, open it in a web browser, and you will see काठमाण्डु displayed as expected.
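If you would rather check the result without a browser, Python 3's standard `html.unescape` function decodes such references back into characters. A small sketch, using the output shown above:

```python
import html

references = "&#2325;&#2366;&#2336;&#2350;&#2366;&#2339;&#2381;&#2337;&#2369;"

# Turn the numeric references back into the characters they stand for.
decoded = html.unescape(references)
print(decoded == "काठमाण्डु")  # True
```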

You can create variations on this base script so that characters with code points less than 128 are preserved while everything else becomes a character entity reference. You might also want to explore Python's encode and decode methods. And once again, the character entity references guard against people manually changing their browser settings to override your encodings, which is of course just fine, but may be considered overkill. End users who mess with these settings can be said to get what they deserve, so it is generally accepted to just encode everything in UTF-8, period. Nevertheless, it is nice to know about character entity references.
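One such variation, sketched here for Python 3, leans on the built-in `xmlcharrefreplace` error handler of `str.encode`: ASCII characters pass through unchanged and everything else becomes a decimal reference. The function name is only illustrative. Note that this does not escape `&`, `<` or `>`, so it is not a substitute for HTML escaping.

```python
def to_ascii_with_references(s):
    """Keep ASCII as-is; replace other characters with &#NNNN; references."""
    # Encoding to ASCII fails for non-ASCII characters; the error handler
    # substitutes a numeric character reference instead of raising.
    return s.encode("ascii", "xmlcharrefreplace").decode("ascii")


result = to_ascii_with_references("The Last Resort, \u260e +977 1 4700525")
print(result)  # The Last Resort, &#9742; +977 1 4700525
```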

You can print such characters in your result page using the HTML entity with the given code.

For example: http://www.danshort.com/HTMLentities/index.php?w=dingb

Or use the string's encode method to encode it in the required encoding.
