Delphi: Encoding strings the way Python does

I want to encode strings the way Python does.

The Python code is this:

def EncodeToUTF(inputstr):
  uns = inputstr.decode('iso-8859-2')
  utfs = uns.encode('utf-8')
  return utfs

This is very simple.

But in Delphi I don't understand how to do the encoding so that the correct character set is forced first, no matter which computer the code runs on.

I tried this test code to see the conversion:

procedure TForm1.Button1Click(Sender: TObject);
var
    w : WideString;
    buf : array[0..2048] of WideChar;
    i : integer;
    lc : Cardinal;
begin
    lc := GetThreadLocale;
    Caption := IntToStr(lc);
    StringToWideChar(Edit1.Text, buf, SizeOF(buf));
    w := buf;
    lc := MakeLCID(
        MakeLangID( LANG_ENGLISH, SUBLANG_ENGLISH_US),
        0);
    Win32Check(SetThreadLocale(lc));
    Edit2.Text := WideCharToString(PWideChar(w));
    Caption := IntToStr(AnsiCompareText(Edit1.Text, Edit2.Text));
end;

The input is "árvíztűrő tükörfúrógép", the Hungarian accent test phrase. The original locale lc is 1038 (Hungarian); the new lc is 1033.

But the result is always 0 (the strings are equal) and the accents are unchanged: I don't lose ŐŰ, which do not exist in English.

What am I doing wrong? How can I do the same thing as Python does?

Thanks for any help, links, etc.: dd


Windows uses codepage 28592 for ISO-8859-2. If you have a buffer containing ISO-8859-2 encoded bytes, then you have to decode the bytes to UTF-16 first, and then encode the result to UTF-8. Depending on which version of Delphi you are using, you can either:

1) on pre-D2009, use MultiByteToWideChar() and WideCharToMultiByte():

function EncodeToUTF(const inputstr: AnsiString): UTF8String;
var
  ret: Integer;
  uns: WideString;
begin
  Result := '';
  if inputstr = '' then Exit;
  ret := MultiByteToWideChar(28592, 0, PAnsiChar(inputstr), Length(inputstr), nil, 0);
  if ret < 1 then Exit;
  SetLength(uns, ret);
  MultiByteToWideChar(28592, 0, PAnsiChar(inputstr), Length(inputstr), PWideChar(uns), Length(uns));
  ret := WideCharToMultiByte(65001, 0, PWideChar(uns), Length(uns), nil, 0, nil, nil);
  if ret < 1 then Exit;
  SetLength(Result, ret);
  WideCharToMultiByte(65001, 0, PWideChar(uns), Length(uns), PAnsiChar(Result), Length(Result), nil, nil);
end;

2a) on D2009+, use SysUtils.TEncoding.Convert():

function EncodeToUTF(const inputstr: RawByteString): UTF8String;
var
  enc: TEncoding;
  buf: TBytes;
begin
  Result := '';
  if inputstr = '' then Exit;
  enc := TEncoding.GetEncoding(28592);
  try
    buf := TEncoding.Convert(enc, TEncoding.UTF8, BytesOf(inputstr));
    if Length(buf) > 0 then
      SetString(Result, PAnsiChar(@buf[0]), Length(buf));
  finally
    enc.Free;
  end;
end;

2b) on D2009+, alternatively declare a new string type with the code page attached, put your data into it, and assign it to a UTF8String variable. No manual encoding/decoding is needed; the RTL will handle everything for you:

type
  Latin2String = type AnsiString(28592);

var
  inputstr: Latin2String;
  outputstr: UTF8String;
begin
  // put the ISO-8859-2 encoded bytes into inputstr, then...
  outputstr := inputstr;
end;
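
To check any of the variants above, it helps to know the exact bytes Python produces. Here is a minimal Python 3 sketch of the function from the question. The name encode_to_utf8 is mine, and it takes bytes because Python 3 separates bytes from text:

```python
def encode_to_utf8(input_bytes: bytes) -> bytes:
    # Step 1: interpret the raw bytes as ISO-8859-2 (Latin-2) text.
    text = input_bytes.decode("iso-8859-2")
    # Step 2: serialize that text as UTF-8.
    return text.encode("utf-8")


phrase = "árvíztűrő tükörfúrógép"
latin2 = phrase.encode("iso-8859-2")  # one byte per character
utf8 = encode_to_utf8(latin2)         # accented letters become two bytes each

print(latin2.hex())
print(utf8.hex())
```

Feeding the same ISO-8859-2 bytes to the Delphi function and comparing the hex dumps should show whether the port matches.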


If you're using Delphi 2009 or newer, every input from the default VCL controls will be UTF-16, so there is no need to do any conversion on your input.

If you're using Delphi 2007 or older (as it seems), you are at the mercy of Windows, because the VCL is ANSI and Windows has a fixed code page that determines which characters can be used in, e.g., a TEdit.

You can change the system-wide default ANSI code page in the Control Panel, but that requires a reboot each time you do.
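
That the set of usable characters depends only on the code page can be illustrated outside Delphi. Here is a small Python 3 sketch, assuming the Hungarian ANSI code page is Windows-1250 (cp1250) and a Western system uses Windows-1252 (cp1252). The helper name representable is hypothetical:

```python
def representable(text: str, codec: str) -> bool:
    """Return True if every character of text exists in the given code page."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


phrase = "árvíztűrő tükörfúrógép"
for codec in ("cp1250", "iso-8859-2", "cp1252"):
    # cp1252 has no ő or ű, so only that one reports False
    print(codec, representable(phrase, codec))
```

On a machine whose ANSI code page is 1250, ő and ű fit into the ANSI text of a TEdit, which would explain why the test in the question loses nothing regardless of the thread locale.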

In Delphi 2007 you can use the TNT Unicode controls or a similar solution to get the text from the UI into your code.

In Delphi 2009 and newer there are also plenty of Unicode and character set handling routines in the RTL.

The conversion between character sets can be done with SysUtils.TEncoding:

http://docs.embarcadero.com/products/rad_studio/delphiAndcpp2009/HelpUpdate2/EN/html/delphivclwin32/SysUtils_TEncoding.html


The Python code in your question returns a string in UTF-8 encoding. To do this with pre-2009 Delphi versions you can use code similar to:

procedure TForm1.Button1Click(Sender: TObject);
var
  Src, Dest: string;
  Len: integer;
  buf : array[0..2048] of WideChar;
begin
  Src := Edit1.Text;
  Len := MultiByteToWideChar(CP_ACP, 0, PChar(Src), Length(Src), @buf[0], 2048);
  buf[Len] := #0;
  SetLength(Dest, 2048);
  SetLength(Dest, WideCharToMultiByte(CP_UTF8, 0, @buf[0], Len, PChar(Dest),
    2048, nil, nil));
  Edit2.Text := Dest;
end;

Note that this doesn't change the current thread locale; it simply passes the appropriate code page parameters to the API.
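
One caveat: CP_ACP means the system ANSI code page, which is not ISO-8859-2. A short Python 3 sketch, assuming CP_ACP resolves to Windows-1250 on a Hungarian system, shows where the two agree and where they differ:

```python
# The Hungarian accented letters have the same byte values in both code pages...
hungarian = "áéíóöőúüű"
same = hungarian.encode("cp1250") == hungarian.encode("iso-8859-2")
print(same)

# ...but other byte values map to different characters.
raw = b"\xb9"
as_latin2 = raw.decode("iso-8859-2")  # U+0161, s with caron
as_cp1250 = raw.decode("cp1250")      # U+0105, a with ogonek
print(hex(ord(as_latin2)), hex(ord(as_cp1250)))
```

So if the bytes really are ISO-8859-2 (read from a file, say) rather than text typed into a TEdit, pass 28592 instead of CP_ACP.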


There are encoding tools in the Open XML library. It has a cUnicodeCodecsWin32 unit with functions like EncodingToUTF16().

My code that converts between ISO Latin2 and UTF-8 looks like this:

  s2 := EncodingToUTF16('ISO-8859-2', s);
  s2utf8 := UTF16ToEncoding('UTF-8', s2);