Convert encoding via iconv on Linux
I usually convert encodings with iconv,
but today I ran into something new to me.
The goal is to convert &#1575;&#1604;&#1581;&#1604;&#1602;&#1577; &#1575;&#1604;&#1579;&#1575;&#1604;&#1579;&#1577;
to its UTF-8 version: الحلقة الثالثة
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title> this text is from arabic language </title>
</head>
<body>
<p><span> &#1575;لحلقة الثالثة</span></p>
</body>
</html>
I tried source encodings such as ASCII, LATIN1 and windows-1252,
but with no luck.
How do I tell what kind of encoding this is so that I can convert it?
Both Google Translate and the Stack Overflow editor were able to detect it and convert it.
Another example: the website http://kanjidict.stc.cx/recode.php converts it correctly if I check "Assume HTML (default: handle as plain text)".
What am I missing, and what do those three sites do to convert it correctly?
Well,
after a day of work I found the command I was missing. It comes from a package called ascii2uni,
which I installed simply with: sudo apt-get install ascii2uni
After some testing I was able to convert a file to Unicode with this command:
ascii2uni -a D source.html > target.html
so the whole conversion can be done from the command line alone.
Cheers
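For anyone without ascii2uni at hand, roughly the same decimal-only conversion can be sketched in Python 3. This is only a sketch under a few assumptions: the names decode_decimal_refs and convert_file are made up for illustration, the input file is assumed to be ASCII or UTF-8 text, and only decimal references of the form &#1575; are touched, so markup-level entities such as &lt; stay as they are.

```python
import re
from pathlib import Path

# Matches decimal numeric character references such as &#1575;
DECIMAL_REF = re.compile(r'&#(\d+);')


def decode_decimal_refs(text):
    """Replace each decimal reference &#NNNN; with the character it names."""
    def repl(match):
        code = int(match.group(1))
        # Leave out-of-range and surrogate values untouched instead of failing
        if code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
            return match.group(0)
        return chr(code)
    return DECIMAL_REF.sub(repl, text)


def convert_file(source, target):
    """Read source, decode decimal references, write target as UTF-8."""
    # Assumption: the source file is plain ASCII or UTF-8 text
    text = Path(source).read_text(encoding='utf-8')
    Path(target).write_text(decode_decimal_refs(text), encoding='utf-8')
```

Calling convert_file('source.html', 'target.html') would then play the role of the ascii2uni command above. The charset declared in the meta tag is not changed by this sketch, so it would still need to be updated to utf-8 by hand.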
The idea is string substitution. In Python 3:
Parse decimal references only:
>>> import re
>>> s = r'&#65;&#223;&#254;'
>>> r = re.compile(r'&#(\d+);')
>>> r.sub(lambda m: chr(int(m.group(1))), s)
'Aßþ'
Parse both hex and decimal references:
>>> import re
>>> s = r'&#x41;&#223;&#xfe;'
>>> r = re.compile(r'&#(x?)([0-9a-fA-F]+);')
>>> r.sub(lambda m: chr(int(m.group(2), 16 if m.group(1) else 10)), s)
'Aßþ'
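Those numbers are HTML numeric character references: each &#NNNN; stands for the Unicode code point NNNN, so the file is really plain ASCII and iconv has nothing to convert. There are special functions, related to URL and HTML processing, that handle them, depending on which language you are using.
In PHP there is http://www.php.net/manual/en/function.htmlspecialchars-decode.php, although that function only decodes a handful of special characters (&amp;, &lt;, &gt; and the quotes); html_entity_decode is the one that handles numeric references like these. Other languages should have similar functions.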
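As one example of such a function in another language, Python 3 ships html.unescape in its standard library. The sketch below only illustrates the call; the sample strings are made up for the demonstration. Note that it decodes decimal, hex and named references alike.

```python
import html

# Decimal references for the first Arabic word from the question
arabic = '&#1575;&#1604;&#1581;&#1604;&#1602;&#1577;'
decoded_arabic = html.unescape(arabic)

# A mix of hex, decimal and named references
mixed = '&#x41;&#223;&thorn;'
decoded_mixed = html.unescape(mixed)

print(decoded_arabic)
print(decoded_mixed)
```

Because html.unescape also turns &lt; and &amp; back into < and &, running it over a whole HTML file can change the markup; it is safest on text content only.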
recode html..utf8
This should work too, but please make sure you read the usage manual for recode first: it recodes files in place unless told otherwise.