python display unicode in html

2023-02-14 03:15

I'm writing a script to export my links and their titles from Chrome to HTML.

Chrome bookmarks are stored as JSON, in UTF-8 encoding.

Some titles are in Russian, so they are stored like this:

"name": "\u0425\u0430\u0431\u0440\ ..."

import codecs
f = codecs.open("chrome.json","r", "utf-8")
data = f.readlines()

urls = [] # for links
names = [] # for link titles

ind = 0

for i in data:
    if i.find('"url":') != -1:
        urls.append(i.split('"')[3])
        names.append(data[ind-2].split('"')[3])
    ind += 1

fw = codecs.open("chrome.html","w","utf-8")
fw.write("<html><body>\n")
for n in names:
    fw.write(n + '<br>')
    # print type(n) # this will return <type 'unicode'> for each url!
fw.write("</body></html>")

Now, in chrome.html those titles are displayed as \u0425\u0430\u0431...

How can I turn them back into Russian?

I'm using Python 2.5.

Edit: Solved!

s = '\u041f\u0440\u0438\u0432\u0435\u0442 world!'
type(s)
<type 'str'>

print s.decode('raw-unicode-escape').encode('utf-8')
Привет world!

That's what I needed: to convert a str of \u041f... into unicode.

f = open("chrome.json", "r")
data = f.readlines()
f.close()

urls = [] # for links
names = [] # for link titles

ind = 0

for i in data:
    if i.find('"url":') != -1:
        urls.append(i.split('"')[3])
        names.append(data[ind-2].split('"')[3])
    ind += 1

fw = open("chrome.html","w")
fw.write("<html><body>\n")
for n in names:
    fw.write(n.decode('raw-unicode-escape').encode('utf-8') + '<br>')
fw.write("</body></html>")
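
For readers on a newer interpreter, the same unescaping step can be sketched as below. This is only an illustration: it assumes each title is ASCII-only text containing literal \uXXXX sequences, as in the lines read from the bookmarks file, and `unescape_title` is a hypothetical helper name that does not appear in the original script.

```python
def unescape_title(escaped):
    """Turn literal \\uXXXX sequences in an ASCII-only string into characters."""
    # encode('ascii') yields bytes on Python 3 (on Python 2 it leaves an
    # ASCII str unchanged); raw-unicode-escape then interprets the \uXXXX runs.
    return escaped.encode('ascii').decode('raw-unicode-escape')


# Raw literal: the backslashes stay as-is, like the text read from the file.
title = unescape_title(r'\u041f\u0440\u0438\u0432\u0435\u0442 world!')
```

`title` now holds the decoded text. Write it out with an explicit UTF-8 encoding, as the script above does. A title that already contains non-ASCII characters would make `encode('ascii')` raise, so this sketch only covers the fully escaped case.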

By the way, it's not just Russian; non-ASCII characters are quite common in page names. Example:

name=u'Python Programming Language \u2013 Official Website'
url=u'http://www.python.org/'

As an alternative to fragile code like

urls.append(i.split('"')[3])
names.append(data[ind-2].split('"')[3])
# (1) relies on name being 2 lines before url
# (2) fails if there is a `"` in the name
# example: "name": "The \"Fubar\" website",

you could process the input file using the json module. For Python 2.5, you can get simplejson.

Here's a script that emulates yours:

try:
    import json
except ImportError: 
    import simplejson as json
import sys

def convert_file(infname, outfname):

    def explore(folder_name, folder_info):
        for child_dict in folder_info['children']:
            ctype = child_dict.get('type')
            name = child_dict.get('name')
            if ctype == 'url':
                url = child_dict.get('url')
                # print "name=%r url=%r" % (name, url)
                fw.write(name.encode('utf-8') + '<br>\n')
            elif ctype == 'folder':
                explore(name, child_dict)
            else:
                print "*** Unexpected ctype=%r ***" % ctype

    f = open(infname, 'rb')
    bmarks = json.load(f)
    f.close()
    fw = open(outfname, 'w')
    fw.write("<html><body>\n")
    for folder_name, folder_info in bmarks['roots'].iteritems():
        explore(folder_name, folder_info)
    fw.write("</body></html>")
    fw.close()    

if __name__ == "__main__":
    convert_file(sys.argv[1], sys.argv[2])

Tested using Python 2.5.4 on Windows 7 Pro.
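
To make the fragility point above concrete, here is a small sketch comparing the quote-splitting approach with a JSON parser on a single line. The sample line is made up for illustration, and wrapping it in braces is just a trick to parse one key/value pair in isolation; a real script would load the whole file as shown above.

```python
try:
    import json
except ImportError:
    import simplejson as json  # assumption: installed separately on Python 2.5

# One "name" line as it might appear in the bookmarks file (made-up sample).
line = r'"name": "The \"Fubar\" website \u2013 \u0425\u0430\u0431\u0440",'

# Splitting on '"' stops at the first escaped quote and never decodes \u escapes.
naive = line.split('"')[3]

# Wrapping the key/value pair in braces makes it a complete JSON object,
# so the parser handles both the escaped quotes and the \u escapes.
parsed = json.loads('{' + line.rstrip(',') + '}')['name']
```

Here `naive` ends up as a truncated fragment ending in a backslash, while `parsed` is the full title as a Unicode string.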

It's a JSON file, so read it using a JSON parser. That will give you a Unicode string directly, without you having to unescape it. This is going to be much more reliable (as well as simpler), since JSON strings are not the same format as Python strings.

(They're pretty similar and both use the \u format, but your current code will fall over badly for other escaped characters, not to mention that it relies on the exact attribute order and whitespace settings of a JSON file, which makes it very fragile indeed.)

import json, cgi, codecs

with open('chrome.json') as fp:
    bookmarks = json.load(fp)

with codecs.open('chrome.html', 'w', 'utf-8') as fp:
    fp.write(u'<html><body>\n')
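    for root in bookmarks[u'roots'].values():
        for child in root['children']:
            # Folders have no 'url' key, so only write actual bookmarks.
            if child.get(u'type') != u'url':
                continue
            fp.write(u'<a href="%s">%s</a>' % (
                cgi.escape(child[u'url'], True),  # True also escapes '"' for the attribute
                cgi.escape(child[u'name'])
            ))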
    fp.write(u'</body></html>')

Note also the use of cgi.escape to HTML-encode any < or & characters in the strings.
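
One caveat for newer interpreters: `cgi.escape` was removed in Python 3.8, and `html.escape` is its replacement. The sketch below shows the same escaping step in a way that should run on either version. `bookmark_link` is a hypothetical helper name and the sample URL and title are made up.

```python
try:
    from html import escape  # Python 3.2+
except ImportError:
    from cgi import escape   # Python 2, as in the answer above


def bookmark_link(url, name):
    """Return an <a> element with both values HTML-escaped."""
    # Passing True as the second argument also escapes '"', which matters
    # because the URL is placed inside a double-quoted attribute.
    return u'<a href="%s">%s</a>' % (escape(url, True), escape(name, False))


link = bookmark_link(u'http://example.com/?a=1&b=2', u'Q&A <draft>')
```

The second argument is passed explicitly because the two functions have different defaults: `cgi.escape` leaves quotes alone unless asked, while `html.escape` escapes them by default.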

I'm not sure where you're trying to display the Russian text, but in the interpreter you can do the following to see it:

s = '\u0425\u0430\u0431'
l = s.split('\u')
l.remove('')
for x in l:
    print(unichr(int(x, 16))),

This will give the following output:

Х а б

If you're storing it in HTML, you're better off leaving it as '\u0425...' until you need to convert it.

Hope this helps.

You could include the UTF-8 BOM, so Chrome knows to read the file as UTF-8, not ASCII:

fw = codecs.open("chrome.html","w","utf-8")
fw.write(codecs.BOM_UTF8.decode('utf-8'))
fw.write(u'你好')

Oh, but if you later read that file back in Python, remember to use 'utf-8-sig' to strip the BOM.

Maybe you need to encode the unicode into UTF-8, but I think codecs does that already, right?
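
The BOM round trip described above can be sketched as follows. This is an illustration rather than part of the original answer: it uses `io.open`, which assumes Python 2.6 or later (it is not available on 2.5), and writes to a throwaway temporary file.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'bom_demo.html')  # throwaway file

# 'utf-8-sig' writes the BOM itself, so no manual BOM_UTF8 write is needed.
with io.open(path, 'w', encoding='utf-8-sig') as fw:
    fw.write(u'\u4f60\u597d')

# Plain 'utf-8' keeps the BOM as a leading U+FEFF character...
with io.open(path, 'r', encoding='utf-8') as fr:
    with_bom = fr.read()

# ...while 'utf-8-sig' strips it on the way in.
with io.open(path, 'r', encoding='utf-8-sig') as fr:
    without_bom = fr.read()
```

Comparing `with_bom` and `without_bom` shows the single extra U+FEFF character that the 'utf-8-sig' codec removes.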
