convert database field encoding in Jet Database / Delphi

I have a legacy application written in Delphi which uses a Jet Database as its back-end for storing data and I need to export the data to a new format.

Opening the database with MS Access (Windows) or MDBViewer (Linux), the "MEMO" fields (the equivalent of MySQL's TEXT) show only garbage resembling Asian characters. When I run the legacy application, the fields' contents display correctly.

Is there a way I can try every possible character encoding and convert the data to recover it (I'm comfortable with PHP and C#)? I read something about the BOM (byte-order mark) that might be related; any ideas?

Thanks!


Current MS Access versions (Jet 4.0 and later) store string values in Unicode (UTF-16LE, optionally compressed). Older ones simply followed the code page of the machine on which the text was entered.

Most encodings do indeed use some marker bytes to indicate the encoding of what follows. Whether or not you have the benefit of that, really depends on the legacy app. If that simply followed a single encoding, or relied on the machine's code page, then you'll have to do some clever recognizing yourself.

Quick checks

UTF-8

If there is a marker, it would be $EFBBBF. If there isn't, you can make an educated guess that it is UTF-8 when the bytes form valid UTF-8 sequences and runs of ASCII (0-127) characters can be seen in the string.
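As a minimal sketch of that check (Python here rather than Delphi, and assuming you have already extracted the raw memo bytes into `data`):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic UTF-8 check: a BOM is present, or the bytes decode cleanly."""
    if data.startswith(b"\xEF\xBB\xBF"):  # the $EFBBBF marker
        return True
    try:
        data.decode("utf-8", errors="strict")  # strict: any invalid sequence raises
        return True
    except UnicodeDecodeError:
        return False
```

Note that pure-ASCII data passes this check too, since ASCII is a subset of UTF-8; that is harmless because the decoded text is identical either way.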

UTF-16

Comes in two flavours: Little Endian (LE) and Big Endian (BE). For characters within the Basic Multilingual Plane, both use two bytes per character. The difference between the two is that for ASCII characters, one starts with a zero byte, the other ends with it.

If there is a marker, UTF-16LE is designated by $FFFE and UTF-16BE by $FEFF. If neither of those markers is present, having alternating zero and non-zero bytes in the memo field is a fair indication. Your first bet should be UTF-16LE, as that is the Windows standard and UTF-16BE is used a lot less. (For ASCII characters, UTF-16LE puts the zero byte second, UTF-16BE puts it first.)
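A sketch of both checks, again in Python on the raw memo bytes; the 50% zero-byte threshold is an assumption that works for mostly-ASCII text, not a fixed rule:

```python
def guess_utf16(data: bytes):
    """Return 'utf-16-le', 'utf-16-be', or None, using the BOM or zero-byte pattern."""
    if data.startswith(b"\xFF\xFE"):
        return "utf-16-le"
    if data.startswith(b"\xFE\xFF"):
        return "utf-16-be"
    if len(data) >= 2 and len(data) % 2 == 0:
        pairs = len(data) // 2
        even_zeros = sum(1 for b in data[0::2] if b == 0)  # first byte of each pair
        odd_zeros = sum(1 for b in data[1::2] if b == 0)   # second byte of each pair
        # Mostly-ASCII UTF-16LE has zeros in the second byte of each pair, BE in the first.
        if odd_zeros > pairs * 0.5 and even_zeros == 0:
            return "utf-16-le"
        if even_zeros > pairs * 0.5 and odd_zeros == 0:
            return "utf-16-be"
    return None
```

If this returns an encoding name, `data.decode(name)` should yield readable text.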

Other

If you can exclude UTF-8 and UTF-16, you could try to figure out whether one of the other UTF encodings was used. I wouldn't spend the time, though; chances are that the program simply relied on the machine's code page. Seeing as you are dealing with a lot of "Asian looking" characters, your best bet would be to check the MBCS code pages (Multi-Byte Character Set code pages). See MSDN for more details. As I have never dealt with them myself, I'm afraid I can't be of more help here though.
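One way to narrow that down is to decode the bytes under each candidate code page and keep only the ones that decode without errors. The list of candidates below is an assumption; the right one depends on the locale the legacy app ran under:

```python
# Common double-byte Windows code pages (assumed candidates, not an exhaustive list).
CANDIDATES = [
    "cp932",  # Shift-JIS (Japanese)
    "cp936",  # GBK (Simplified Chinese)
    "cp949",  # Korean
    "cp950",  # Big5 (Traditional Chinese)
]

def try_code_pages(data: bytes):
    """Return (encoding, text) pairs for every candidate that decodes cleanly."""
    results = []
    for enc in CANDIDATES:
        try:
            results.append((enc, data.decode(enc)))
        except UnicodeDecodeError:
            pass
    return results
```

A clean decode is necessary but not sufficient (several code pages may accept the same bytes), so a human still has to pick the result that reads sensibly.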

Trying encodings

If you do have to start trying out every encoding there is, you may want to have a look at the DIConvertors library. It's pretty good at converting between encodings. IIRC it can also recognize encodings, but even if not, it should help get you started with your own detection. It can be found at http://www.yunqa.de/delphi/doku.php/products/converters/index
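The brute-force approach itself is simple enough to script in any language: decode the same bytes under every candidate and let a human eyeball the results. A hedged sketch (the encoding list is just an illustration; pass whichever candidates apply):

```python
def brute_force(data: bytes, encodings):
    """Decode `data` under each candidate; failed decodes map to None."""
    out = {}
    for enc in encodings:
        try:
            out[enc] = data.decode(enc)
        except (UnicodeDecodeError, LookupError):  # LookupError: unknown codec name
            out[enc] = None
    return out

# Example: print the survivors for inspection.
for enc, text in brute_force("héllo".encode("utf-8"),
                             ["ascii", "utf-8", "latin-1"]).items():
    if text is not None:
        print(enc, "->", text)
```

Run this over a memo field whose correct contents you know from the legacy app; the encoding that reproduces that known text is almost certainly the one used throughout.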
