Delphi XE AnsiStrings with escaped combining diacritical marks
What is the best way to convert a Delphi XE AnsiString containing escaped combining diacritical marks like "Fu\u0308rst" into a friendly WideString "Fürst"?
I am aware that this is not always possible for all combinations, but the common Latin blocks should be supported without building silly conversion tables on my own. I guess the solution can be found somewhere in the new Characters unit, but I don't get it.
I think you need to perform Unicode normalization on your string.
I don't know if there's a specific call in Delphi XE RTL to do this, but the WinAPI call NormalizeString should help you here, with mode NormalizationKC:
NormalizationKC
Unicode normalization form KC, compatibility composition. Transforms each base plus combining characters to the canonical precomposed equivalent and all compatibility characters to their equivalents. For example, the ligature ﬁ becomes f + i; similarly, A + ¨ + ﬁ + n becomes Ä + f + i + n.
Here is the complete code that solved my problem:
function Unescape(const s: AnsiString): string;
var
  i: Integer;
  j: Integer;
  c: Integer;
begin
  // Make result at least large enough. This prevents too many reallocs
  SetLength(Result, Length(s));
  i := 1;
  j := 1;
  while i <= Length(s) do begin
    if s[i] = '\' then begin
      if i < Length(s) then begin
        // escaped backslash?
        if s[i + 1] = '\' then begin
          Result[j] := '\';
          inc(i, 2);
        end
        // convert hex number to WideChar
        else if (s[i + 1] = 'u') and (i + 1 + 4 <= Length(s))
          and TryStrToInt('$' + string(Copy(s, i + 2, 4)), c) then begin
          inc(i, 6);
          Result[j] := WideChar(c);
        end else begin
          raise Exception.CreateFmt('Invalid code at position %d', [i]);
        end;
      end else begin
        raise Exception.Create('Unexpected end of string');
      end;
    end else begin
      Result[j] := WideChar(s[i]);
      inc(i);
    end;
    inc(j);
  end;
  // Trim result in case we reserved too much space
  SetLength(Result, j - 1);
end;
const
  NormalizationC = 1;
function NormalizeString(NormForm: Integer; lpSrcString: LPCWSTR; cwSrcLength: Integer;
  lpDstString: LPWSTR; cwDstLength: Integer): Integer; stdcall; external 'Normaliz.dll';
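function Normalize(const s: string): string;
var
  newLength: Integer;
begin
  Result := '';
  if s = '' then
    Exit;
  // In NormalizationC mode the result is normally no longer than the input string
  SetLength(Result, Length(s));
  newLength := NormalizeString(NormalizationC, PChar(s), Length(s), PChar(Result), Length(Result));
  // NormalizeString returns a value <= 0 on failure (invalid input or a too-small buffer)
  if newLength <= 0 then
    RaiseLastOSError;
  SetLength(Result, newLength);
end;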
function UnescapeAndNormalize(const s: AnsiString): string;
begin
  Result := Normalize(Unescape(s));
end;
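The Normalize function above sizes the output buffer from the input length. As a rough alternative sketch, NormalizeString can also be asked for an estimated buffer size by passing a nil destination with a length of 0. The program name NormalizeDemo and the helper name NormalizeNFC are made up for illustration, and the estimate is not guaranteed to be sufficient: the Windows documentation suggests retrying on ERROR_INSUFFICIENT_BUFFER, which this sketch simply reports as an error.

```delphi
program NormalizeDemo;
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

const
  NormalizationC = 1; // NORM_FORM value for canonical composition (NFC)

function NormalizeString(NormForm: Integer; lpSrcString: LPCWSTR; cwSrcLength: Integer;
  lpDstString: LPWSTR; cwDstLength: Integer): Integer; stdcall; external 'Normaliz.dll';

// Hypothetical helper: composes base + combining marks using NFC.
function NormalizeNFC(const s: string): string;
var
  estimated: Integer;
  written: Integer;
begin
  Result := '';
  if s = '' then
    Exit;
  // A nil destination with length 0 asks for an estimated buffer size
  estimated := NormalizeString(NormalizationC, PChar(s), Length(s), nil, 0);
  if estimated <= 0 then
    RaiseLastOSError;
  SetLength(Result, estimated);
  written := NormalizeString(NormalizationC, PChar(s), Length(s), PChar(Result), estimated);
  // A value <= 0 signals failure, e.g. invalid input or a too-small buffer
  if written <= 0 then
    RaiseLastOSError;
  // Trim to the number of characters actually written
  SetLength(Result, written);
end;

var
  decomposed: string;
  composed: string;
begin
  // 'u' followed by U+0308 COMBINING DIAERESIS
  decomposed := 'Fu'#$0308'rst';
  composed := NormalizeNFC(decomposed);
  Writeln(Length(decomposed), ' -> ', Length(composed));
end.
```

The two-call pattern costs an extra API call per string, but it avoids relying on the assumption that the composed form never outgrows the input.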
Thank you all! I am sure that my first experience with StackOverflow won't be my last one :-)
Are they always escaped like this? Always in a number of 4 digits?
How is the \ character itself escaped?
Assuming the \ character is escaped by \xxxx where xxxx is the code for the \ character, you can easily loop through the string:
function Unescape(s: AnsiString): WideString;
var
  i: Integer;
  j: Integer;
  c: Integer;
begin
  // Make result at least large enough. This prevents too many reallocs
  SetLength(Result, Length(s));
  i := 1; j := 1;
  while i <= Length(s) do
  begin
    // If a '\' is found, typecast the following 4 digit integer to widechar
    if s[i] = '\' then
    begin
      if (s[i+1] <> 'u') or not TryStrToInt(Copy(s, i+2, 4), c) then
        raise Exception.CreateFmt('Invalid code at position %d', [i]);
      Inc(i, 6);
      Result[j] := WideChar(c);
    end
    else
    begin
      Result[j] := WideChar(s[i]);
      Inc(i);
    end;
    Inc(j);
  end;
  // Trim result in case we reserved too much space
  SetLength(Result, j-1);
end;
Use it like this:
MessageBoxW(0, PWideChar(Unescape('\u0252berhaupt')), nil, MB_OK);
This code is tested in Delphi 2007, but should work in XE as well due to the explicit use of AnsiString and WideString.
[edit] Code is ok. Highlighter fails.
If I'm not mistaken, Delphi XE now supports regular expressions. I don't use them often, but they seem like a good way to parse the string and replace all escaped values. Maybe someone has a good example of how to do this in Delphi with regular expressions?
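One possible shape for the regular-expression approach, as a sketch only: it assumes the RegularExpressions unit that ships with Delphi XE (named System.RegularExpressions from XE2 on) and that escapes are either \\ or \u followed by exactly four hex digits. The names RegexUnescapeDemo, TEscapeReplacer and UnescapeWithRegex are invented for this example. It takes a string, so an AnsiString input would need a string(...) cast first.

```delphi
program RegexUnescapeDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, RegularExpressions;

type
  // TMatchEvaluator is a method pointer, so the callback lives in a class
  TEscapeReplacer = class
  public
    function ReplaceEscape(const Match: TMatch): string;
  end;

function TEscapeReplacer.ReplaceEscape(const Match: TMatch): string;
begin
  if Match.Value = '\\' then
    // An escaped backslash becomes a single backslash
    Result := '\'
  else
    // Group 1 holds the four hex digits; '$' makes StrToInt parse them as hex
    Result := Char(StrToInt('$' + Match.Groups[1].Value));
end;

// Hypothetical helper: replaces \\ and \uXXXX escapes in one pass.
function UnescapeWithRegex(const s: string): string;
var
  replacer: TEscapeReplacer;
begin
  replacer := TEscapeReplacer.Create;
  try
    // Delphi string literals do not treat '\' specially, so the regex engine
    // sees exactly: \\\\  (two backslashes)  or  \\u  plus four hex digits
    Result := TRegEx.Replace(s, '\\\\|\\u([0-9A-Fa-f]{4})', replacer.ReplaceEscape);
  finally
    replacer.Free;
  end;
end;

begin
  Writeln(Length(UnescapeWithRegex('Fu\u0308rst')));
end.
```

Unlike the loop above, this version leaves a malformed escape (for example a lone backslash) untouched instead of raising an exception, so it only fits if silent pass-through is acceptable. The result still contains the combining mark and would need normalizing afterwards.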
GolezTrol, you forgot the '$' prefix; without it TryStrToInt parses the digits as decimal instead of hexadecimal:
if (s[i+1] <> 'u') or not TryStrToInt('$'+Copy(s, i+2, 4), c) then