How to view a crawled Unicode Arabic string?

2023-03-01 01:48

I have crawled some web pages using Python. I stripped out the HTML tags and stored only some of the content of those pages as repr(s). Most of those pages are not in English. How can I view the crawled content in its original language?

For example, the crawler wrote only one line of some Arabic text to a txt file: u'\u0639\u0644\u0649'

But when I open the txt file in a text editor or browser, it looks exactly as above, so it is basically not human-readable.

Is there some easy way to render and display the string in Arabic?

Thanks,

>>> x= u'\u0639\u0644\u0649'
>>> open('x.html','w').write(x.encode('ascii','xmlcharrefreplace'))

Open x.html in a browser and it should display properly. Actual content:

&#1593;&#1604;&#1609;
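To see where those numbers come from, here is a small sketch (written for Python 3, unlike the Python 2 session above; the variable names are only illustrative). Each reference is just the decimal code point of one character, and the standard library can turn the references back into text:

```python
import html

s = u'\u0639\u0644\u0649'

# 'xmlcharrefreplace' turns every character that ASCII cannot represent
# into &#N; where N is the character's decimal code point.
encoded = s.encode('ascii', 'xmlcharrefreplace')

# The same numbers, computed directly from the string.
code_points = [ord(ch) for ch in s]

# html.unescape (Python 3.4+) converts the references back to characters.
decoded = html.unescape(encoded.decode('ascii'))
```

The file stays pure ASCII this way, so it displays correctly even if the browser guesses the wrong encoding.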

You don't get human-readable text because you used repr(s) to write the string to the file, and that is exactly what repr is supposed to generate: a programmer-readable representation, which is not necessarily human-readable.
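A quick way to see the difference, as a sketch for Python 3: there, repr() leaves printable non-ASCII characters alone, so ascii() is used here as a stand-in for what Python 2's repr() produced (without the u prefix). The escaped form is an ordinary ASCII string that merely describes the three Arabic characters:

```python
s = u'\u0639\u0644\u0649'

# ascii() escapes every non-ASCII character, much like Python 2's repr().
escaped = ascii(s)

# The original string has 3 characters; the escaped form is made of
# quotes, backslashes, the letter u and hex digits.
original_length = len(s)
escaped_length = len(escaped)
only_ascii = all(ord(ch) < 128 for ch in escaped)
```

That escaped form is what ended up in the txt file, which is why every editor shows backslashes instead of Arabic.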

If you want to store the text in a format readable by any (unicode-supporting) text editor and browser, you should save it in UTF-8 encoding:

import codecs

s = u'\u0639\u0644\u0649'
f = codecs.open('output.txt', 'w', 'utf-8')
f.write(s)
f.close()

Make sure you set your browser or editor encoding to UTF-8 if it doesn't get auto-detected.
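If the pages are already crawled and stored as repr() output, the file can be converted afterwards instead of crawling again. This is only a sketch under some assumptions: each non-empty line of the file holds exactly one string literal such as u'\u0639\u0644\u0649', the code runs on Python 3.3 or later (which accepts the u prefix again), and the function name convert_repr_file is made up for illustration.

```python
import ast
import io


def convert_repr_file(src_path, dst_path):
    """Read one repr()-style string literal per line and write the text as UTF-8."""
    with io.open(src_path, 'r', encoding='ascii') as src, \
            io.open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # literal_eval only parses literals; it never executes code.
            text = ast.literal_eval(line)
            dst.write(text + u'\n')
```

Calling it as convert_repr_file('crawled.txt', 'readable.txt') (both file names are placeholders) should leave a UTF-8 file that any Unicode-aware editor can display.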

>>> import ast
>>> print ast.literal_eval("u'\u0639\u0644\u0649'")
على

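Well, that's not the order shown in the browser (Arabic is written right to left, and not every terminal handles that), but the characters are the right ones.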

>>> print u'\u0639\u0644\u0649'
على

As others suggested, it is not a bad idea to view the file in a browser.

Store it in UTF-8 (like open('x.html','w').write(x.encode('utf-8'))), as most browsers are well equipped to handle UTF-8.
In the browser, you may need to change View->Character Encoding to UTF-8.
You will need Arabic fonts on your machine, so the browser can use these to display the characters.

Having written this, any file viewer/editor that is capable of decoding utf-8 and has access to the fonts can do this for you (e.g. vim works fine on my machine).
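One way to avoid switching the encoding by hand is to declare it inside the page itself. The sketch below (Python 3; the function name write_html_page and the output path are made up for illustration) wraps the text in a minimal HTML page with a charset declaration and a right-to-left direction hint:

```python
import io
from html import escape

PAGE_TEMPLATE = (
    u'<!DOCTYPE html>\n'
    u'<html lang="ar" dir="rtl">\n'
    u'<head><meta charset="utf-8"></head>\n'
    u'<body><p>{body}</p></body>\n'
    u'</html>\n'
)


def write_html_page(text, path):
    """Write text into a small UTF-8 HTML page that declares its own encoding."""
    # escape() protects characters such as < and & in the crawled text.
    page = PAGE_TEMPLATE.format(body=escape(text))
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(page)
```

For example, write_html_page(u'\u0639\u0644\u0649', 'x.html') should produce a page that a browser renders as Arabic without any manual encoding setting, provided an Arabic font is installed.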
