How can I view a crawled Unicode Arabic string?

I have crawled some webpages using Python. I stripped the HTML tags and stored only some of the content of those pages, written out as repr(s). Most of those pages are not in English. How can I now view the crawled content in its original language?

For example, the crawler wrote only one line of some Arabic text to a txt file: u'\u0639\u0644\u0649'

But when I open the txt file in a text editor or browser, it looks exactly as above, so it is basically not human-readable.

Is there some easy way to render and display the string in Arabic?

Thanks,


>>> x= u'\u0639\u0644\u0649'
>>> open('x.html','w').write(x.encode('ascii','xmlcharrefreplace'))

Open x.html in a browser and it should display properly. Actual content:

&#1593;&#1604;&#1609;


The file is not human-readable because you used repr(s) to write the string to it. That is exactly what repr is supposed to produce: a programmer-readable representation, which in Python 2 shows non-ASCII characters as escape sequences.

If you want to store the text in a format readable by any (unicode-supporting) text editor and browser, you should save it in UTF-8 encoding:

import codecs

s = u'\u0639\u0644\u0649'
f = codecs.open('output.txt', 'w', 'utf-8')
f.write(s)
f.close()

Make sure you set your browser or editor encoding to UTF-8 if it doesn't get auto-detected.
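
As a variation on the snippet above, here is a minimal sketch that uses io.open instead of codecs.open and reads the file back to check the round trip. The file name output.txt is taken from the answer; the variable names restored and raw are only illustrative. io.open takes an encoding argument on both Python 2.6+ and Python 3.

```python
import io

text = u'\u0639\u0644\u0649'

# Write the text as UTF-8; the with-block closes the file automatically.
with io.open('output.txt', 'w', encoding='utf-8') as f:
    f.write(text)

# Read it back as text to confirm nothing was lost.
with io.open('output.txt', 'r', encoding='utf-8') as f:
    restored = f.read()

# Read the raw bytes: each of these three Arabic letters takes two bytes in UTF-8.
with io.open('output.txt', 'rb') as f:
    raw = f.read()
```

If restored equals text and raw is six bytes long, the file holds real UTF-8 text rather than escape sequences, and any UTF-8 aware editor should display it.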


>>> import ast
>>> print ast.literal_eval("u'\u0639\u0644\u0649'")
على

The terminal may not render the letters in the same order as the browser does, but the characters themselves are correct.
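
If the file already contains repr output, the same ast.literal_eval idea can be applied line by line to recover the text and save it as UTF-8. This is only a sketch: the file names crawled.txt and readable.txt are hypothetical, it assumes one repr'd string per line, and the first few lines just create a sample input that mimics what the crawler wrote.

```python
import ast
import io

# Create a sample input file (hypothetical name) holding one repr'd string per line.
with io.open('crawled.txt', 'w', encoding='ascii') as f:
    f.write(u"u'\\u0639\\u0644\\u0649'\n")

decoded = []
with io.open('crawled.txt', 'r', encoding='ascii') as src:
    for line in src:
        line = line.strip()
        if line:
            # literal_eval safely parses a string literal, unlike eval.
            decoded.append(ast.literal_eval(line))

# Save the recovered text as UTF-8 so editors and browsers can display it.
with io.open('readable.txt', 'w', encoding='utf-8') as dst:
    dst.write(u'\n'.join(decoded) + u'\n')
```

ast.literal_eval is used rather than eval because it only accepts literals, so a stray line in the crawled file cannot execute arbitrary code; a line that is not a valid literal raises an exception instead.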


>>> print u'\u0639\u0644\u0649'
على


As others suggested, viewing the file in a browser is a good idea.

  • Store it in UTF-8 (like open('x.html','w').write(x.encode('utf-8'))), as most browsers handle UTF-8 well.
  • In the browser, you may need to change View->Character Encoding to UTF-8.
  • You will need Arabic fonts on your machine so that the browser can display the characters.

That said, any file viewer or editor that can decode UTF-8 and has access to the fonts can do this for you (e.g. vim works fine on my machine).
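
One way to avoid changing the browser's encoding setting by hand is to declare the encoding inside the HTML file itself. The sketch below is an assumption on my part rather than something from the answers above: it wraps the text in a small page with a meta charset tag and a right-to-left direction attribute. The file name x.html follows the earlier examples.

```python
import io

text = u'\u0639\u0644\u0649'

# The meta tag tells the browser the file is UTF-8;
# dir="rtl" asks it to lay the paragraph out right to left.
page = (
    u'<!DOCTYPE html>\n'
    u'<html dir="rtl" lang="ar">\n'
    u'<head><meta charset="utf-8"></head>\n'
    u'<body><p>' + text + u'</p></body>\n'
    u'</html>\n'
)

with io.open('x.html', 'w', encoding='utf-8') as f:
    f.write(page)
```

If the crawled text can contain characters such as < or &, it should be HTML-escaped before being placed in the page; that step is omitted here for brevity.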
