Translate unreadable Russian text
I'm trying to read documentation written in what I believe is Russian, but I'm not sure if what I'm seeing is even encoded correctly. The text looks something like this:
Ãåíåðèðóåò ìàòðèöó ñëó÷àéíûõ ÷èñåë â äèàïàçîíå îò -1 äî 1
(appears as several special A's and o's)
when opened in Firefox. In other programs it looks like this:
���������� ������� ��������� ����� � ��������� �� -1 �� 1
(appears as several question marks)
Is there any hope to translate this?
Decode as CP1251.
>>> print('Ãåíåðèðóåò ìàòðèöó ñëó÷àéíûõ ÷èñåë â äèàïàçîí'.encode('latin-1').decode('cp1251'))
Генерирует матрицу случайных чисел в диапазон
You need to determine which of multiple possible Cyrillic codesets was used - the linked site lists more than a dozen possibilities, of which ISO 8859-5 and CP-1251 are perhaps the most likely.
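One way to narrow it down is simply to try each candidate and look at the results. A minimal Python 3 sketch, assuming the garbled text is the original bytes misread as Latin-1; the candidate list and the helper name `try_codecs` are illustrative choices, not an exhaustive or standard set:

```python
# Candidate Cyrillic codecs to try (Python codec names); extend as needed.
CANDIDATES = ["cp1251", "iso8859_5", "koi8_r", "cp866", "mac_cyrillic"]


def try_codecs(mojibake, candidates=CANDIDATES):
    """Undo a Latin-1 misreading, then decode the bytes with each candidate."""
    raw = mojibake.encode("latin-1")  # recover the original single-byte values
    # errors="replace" keeps going when a byte is undefined in a given codec
    return {name: raw.decode(name, errors="replace") for name in candidates}


if __name__ == "__main__":
    sample = "Ãåíåðèðóåò ìàòðèöó ñëó÷àéíûõ ÷èñåë"
    for name, decoded in try_codecs(sample).items():
        print(f"{name:13} {decoded}")
```

Only one of the printed lines should look like plausible Russian; picking it out is left to the reader's eye (or to a translation site) rather than automated here.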
You may be able to get one of the translation web sites (Babelfish or Google, and no doubt others) to help. However, you may have to translate from the original codeset to UTF-8 to get it to work -- simply copying the bytes above did not work.
When copying the original text to a Mac, it was encoded as UTF-8:
0x0000: C3 83 C3 A5 C3 AD C3 A5 C3 B0 C3 A8 C3 B0 C3 B3 ................
0x0010: C3 A5 C3 B2 20 C3 AC C3 A0 C3 B2 C3 B0 C3 A8 C3 .... ...........
0x0020: B6 C3 B3 20 C3 B1 C3 AB C3 B3 C3 B7 C3 A0 C3 A9 ... ............
0x0030: C3 AD C3 BB C3 B5 20 C3 B7 C3 A8 C3 B1 C3 A5 C3 ...... .........
0x0040: AB 20 C3 A2 20 C3 A4 C3 A8 C3 A0 C3 AF C3 A0 C3 . .. ...........
0x0050: A7 C3 AE C3 AD C3 A5 20 C3 AE C3 B2 20 2D 31 20 ....... .... -1
0x0060: C3 A4 C3 AE 20 31 0A .... 1.
0x0067:
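The dump can be checked mechanically. A small Python 3 sketch using only the first 20 bytes shown above (one word); the variable names are illustrative. Decoding as UTF-8 gives the garbled characters, and re-encoding those as Latin-1 recovers one byte per character, which can then be read as CP-1251:

```python
# First 20 bytes of the hex dump (offsets 0x00-0x13): one garbled word.
dumped = bytes.fromhex("C3 83 C3 A5 C3 AD C3 A5 C3 B0 C3 A8 C3 B0 C3 B3 C3 A5 C3 B2")

garbled = dumped.decode("utf-8")      # the text as it was displayed
original = garbled.encode("latin-1")  # back to one byte per character
word = original.decode("cp1251")      # reinterpret as Windows Cyrillic

print(garbled)         # Ãåíåðèðóåò
print(original.hex())  # c3e5ede5f0e8f0f3e5f2
print(word)            # Генерирует
```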
So, to translate this with Perl, I used the Encode module first to convert the UTF-8 string back to Latin-1, and then I told Perl to treat the Latin-1 as if it was CP-1251 and convert that back to UTF-8:
#!/usr/bin/env perl
use Encode qw( from_to );
my $source = 'Ãåíåðèðóåò ìàòðèöó ñëó÷àéíûõ ÷èñåë â äèàïàçîíå îò -1 äî 1';
# from_to changes things 'in situ'
my $nbytes = from_to($source, "utf-8", "latin-1");
# print "$nbytes: $source\n";
$nbytes = from_to($source, "cp-1251", "utf-8");
print "$nbytes: $source\n";
The output is:
- 102: Генерирует матрицу случайных чисел в диапазоне от -1 до 1
Which Babelfish translates as:
- 102: It generates the matrix of random numbers in the range from -1 to 1
and Google translates as:
- 102: Generate a matrix of random numbers ranging from -1 to 1
The initial UTF-8 to Latin-1 translation was required because of the setup on my Mac (my terminal uses UTF-8 by default, etc): YMMV.
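If you have the original file rather than a pasted copy, the first step may not be needed at all. A rough Python 3 sketch, assuming the file really holds raw CP-1251 bytes; the function names and file names are hypothetical:

```python
import os
import tempfile


def read_cp1251(path):
    """Read a text file whose bytes are assumed to be CP-1251 encoded."""
    with open(path, encoding="cp1251") as handle:
        return handle.read()


def convert_to_utf8(src_path, dst_path):
    """Rewrite a CP-1251 file as UTF-8 so ordinary tools display it correctly."""
    text = read_cp1251(src_path)
    with open(dst_path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return len(text.encode("utf-8"))  # size of the converted text in bytes


if __name__ == "__main__":
    # Demonstration with a throwaway file standing in for the real documentation.
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "doc_cp1251.txt")
    dst = os.path.join(workdir, "doc_utf8.txt")
    with open(src, "wb") as handle:
        handle.write("Генерирует матрицу".encode("cp1251"))
    print(convert_to_utf8(src, dst))
```

If the converted file still looks wrong, the CP-1251 assumption is probably the thing to revisit.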