How to view a crawled Unicode-escaped Arabic string?
I have crawled some webpages using Python. I stripped the HTML tags and stored only some of the content of those pages as repr(s). Most of those pages are not in English. How can I view the crawled content in its original language?
For example, the crawler wrote only one line of some Arabic text to a txt file: u'\u0639\u0644\u0649'
But when I open the txt file in a text editor or browser, it looks exactly as above, so it's basically not human-readable.
Is there some easy way to render and display the string in Arabic?
Thanks,
>>> x= u'\u0639\u0644\u0649'
>>> open('x.html','w').write(x.encode('ascii','xmlcharrefreplace'))
Open x.html in a browser and it should display properly. Actual content of the file:
&#1593;&#1604;&#1609;
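As a rough illustration of why this works, the sketch below (Python 3 syntax, with illustrative variable names) shows that the xmlcharrefreplace error handler swaps every character ASCII cannot represent for a decimal character reference, and that decoding those references gives the original text back, which is what the browser does when it renders the page.

```python
import html

text = u'\u0639\u0644\u0649'

# Every non-ASCII character becomes a reference of the form &#NNNN;
encoded = text.encode('ascii', 'xmlcharrefreplace')
print(encoded.decode('ascii'))

# html.unescape resolves the references again, much like a browser would
restored = html.unescape(encoded.decode('ascii'))
print(restored == text)
```

The resulting file is pure ASCII, so it displays correctly no matter which encoding the browser guesses.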
You don't get human-readable text because you used repr(s) to write the string to the file. That is exactly what repr is supposed to generate: a programmer-readable representation, which is not necessarily human-readable.
If you want to store the text in a format readable by any (unicode-supporting) text editor and browser, you should save it in UTF-8 encoding:
import codecs
s = u'\u0639\u0644\u0649'
f = codecs.open('output.txt', 'w', 'utf-8')
f.write(s)
f.close()
Make sure you set your browser or editor encoding to UTF-8 if it doesn't get auto-detected.
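If the file is meant for a browser, one way to avoid setting the encoding by hand is to let the page declare it. The following is only a sketch: build_html_page and output.html are made-up names, and it assumes the text contains no markup (the tags were already stripped). io.open accepts an encoding argument on both Python 2 and Python 3.

```python
import io

def build_html_page(text):
    """Wrap text in a minimal HTML page that declares its own encoding."""
    return (u'<!DOCTYPE html>\n'
            u'<html lang="ar" dir="rtl">\n'
            u'<head><meta charset="utf-8"></head>\n'
            u'<body><p>' + text + u'</p></body>\n'
            u'</html>\n')

page = build_html_page(u'\u0639\u0644\u0649')

# Write the page as UTF-8 so the bytes match the declared charset
with io.open('output.html', 'w', encoding='utf-8') as f:
    f.write(page)
```

The dir="rtl" attribute asks the browser to lay the paragraph out right to left, which suits Arabic text.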
>>> import ast
>>> print ast.literal_eval("u'\u0639\u0644\u0649'")
على
Well, the terminal may not show the letters in the same order as the browser does, but the characters are right.
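The same idea can be applied to a file that was already written with repr(s). This is a sketch under assumptions: decode_repr_line, crawled_lines and readable.txt are hypothetical names, and each line is assumed to hold exactly one repr. ast.literal_eval is used instead of eval because it only accepts literals, so a stray line of crawled text cannot execute code.

```python
import ast
import io

def decode_repr_line(line):
    """Turn one stored repr line back into the text it represents."""
    return ast.literal_eval(line.strip())

# Stand-in for the lines read from the crawler's txt file (hypothetical data)
crawled_lines = [r"u'\u0639\u0644\u0649'" + "\n"]

# Skip blank lines, decode the rest
decoded = [decode_repr_line(line) for line in crawled_lines if line.strip()]

# Save the readable text as UTF-8, one item per line
with io.open('readable.txt', 'w', encoding='utf-8') as out:
    for text in decoded:
        out.write(text + u'\n')
```

In practice crawled_lines would come from iterating over the opened txt file, and readable.txt can then be viewed in any UTF-8 capable editor.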
>>> print u'\u0639\u0644\u0649'
على
As others suggested, it is not a bad idea to view the file in a browser.
- Store it in UTF-8 (like open('x.html','w').write(x.encode('utf-8'))), as most browsers are well equipped to handle UTF-8.
- In the browser, you may need to change View->Character Encoding to UTF-8.
- You will need Arabic fonts on your machine, so the browser can use these to display the characters.
That said, any file viewer or editor that can decode UTF-8 and has access to the fonts can do this for you (e.g. vim works fine on my machine).