How to store Unicode data in a format that doesn't support UTF-8
Okay, here's yet another character encoding question, demonstrating my ignorance of all things Unicode.
I am reading data out of Microsoft Excel .xls files and storing it in ESRI shapefiles (.shp). For versions of Excel > 5.0, text in Excel files is stored as Unicode. However, Unicode (and specifically UTF-8) support for shapefiles is inconsistent, so I think I should not use it at all. Shapefiles do support old-school codepages, however.
What is the best practice in a situation where you must convert a Unicode string to a string in an unknown but specific codepage?
As I understand it, a Unicode string can include characters from multiple "codepages". I would assume, therefore, that I must somehow estimate the "best" codepage to use, and then convert all non-supported characters into their closest approximation in that codepage (or the dreaded ?). Is this the usual approach?
I can definitely use more than just the system codepage. Because .shp files use .dbf files to store their attribute data, at least all the codepages specified by the .dbf format should be supported (see the xBase format description). The supported codepages are: DOS USA, DOS Multilingual, Windows ANSI, Standard Macintosh, EE MS-DOS, Nordic MS-DOS, Russian MS-DOS, Icelandic MS-DOS, Kamenicky (Czech) MS-DOS, Mazovia (Polish) MS-DOS, Greek MS-DOS (437G), Turkish MS-DOS, Russian Macintosh, Eastern European Macintosh, Greek Macintosh, Windows EE, Russian Windows, Turkish Windows, Greek Windows.
In addition, some applications support the use of a *.cpg file which specifies additional codepages to use (although I understand support for UTF-8, and I suspect many other codepages, is limited).
Because I am trying to develop a general purpose tool, I can't assume anything about the content of the Unicode in the .xls files.
What is the best practice in a situation where you must convert a Unicode string to a string in an unknown but specific codepage?
Depends on the file format. If it supports Unicode "escape sequences" like XML's &#x20AC; or JSON's \u20AC, then use those, and you won't lose any information. If not, a different approach is required.
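As a rough illustration of the escape-sequence route, here is a minimal Python sketch. The sample string is made up, and note that Python's `xmlcharrefreplace` handler emits the decimal form (`&#8364;`) rather than the hexadecimal `&#x20AC;`; both refer to the same character.

```python
import json

text = "Price: 20 €"  # hypothetical sample value

# XML/HTML-style numeric character references for anything outside ASCII.
xml_safe = text.encode("ascii", errors="xmlcharrefreplace")

# json.dumps escapes non-ASCII characters as \uXXXX by default (ensure_ascii=True).
json_safe = json.dumps(text)

print(xml_safe)
print(json_safe)

# The round trip through JSON loses nothing.
assert json.loads(json_safe) == text
```

This only helps when the reader of the file understands the escapes; a .dbf field would just show the escape text literally.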
I would assume, therefore, that I must somehow estimate the "best" codepage to use,
Generally, on a non-Unicode system, you'd convert characters into whatever the default encoding is, not an arbitrary code page.
Edit: So you do get a choice of code pages:
ID   Name                         Code page
01h  DOS USA                      437
6Ah  Greek MS-DOS (437G)          737
02h  DOS Multilingual             850
64h  EE MS-DOS                    852
6Bh  Turkish MS-DOS               857
67h  Icelandic MS-DOS             861
65h  Nordic MS-DOS                865
66h  Russian MS-DOS               866
C8h  Windows EE                   1250
C9h  Russian Windows              1251
03h  Windows ANSI                 1252
CBh  Greek Windows                1253
CAh  Turkish Windows              1254
04h  Standard Macintosh           10000
98h  Greek Macintosh              10006
96h  Russian Macintosh            10007
68h  Kamenicky (Czech) MS-DOS
69h  Mazovia (Polish) MS-DOS
97h  Eastern European Macintosh
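If you are working in Python, the IDs above could be tied to codec names roughly like this. The dictionary and the `codec_for` helper are my own illustration, not part of any shapefile library; the three entries listed without a code page number are left out.

```python
# Language driver ID (the hex byte from the list above) -> Python codec name.
# The codec names are Python's standard names for those code page numbers.
LDID_TO_CODEC = {
    0x01: "cp437",
    0x6A: "cp737",
    0x02: "cp850",
    0x64: "cp852",
    0x6B: "cp857",
    0x67: "cp861",
    0x65: "cp865",
    0x66: "cp866",
    0xC8: "cp1250",
    0xC9: "cp1251",
    0x03: "cp1252",
    0xCB: "cp1253",
    0xCA: "cp1254",
    0x04: "mac_roman",     # code page 10000
    0x98: "mac_greek",     # code page 10006
    0x96: "mac_cyrillic",  # code page 10007
}

def codec_for(ldid):
    """Return the Python codec name for a language driver ID, or None if unmapped."""
    return LDID_TO_CODEC.get(ldid)

print(codec_for(0x03))  # cp1252
print(codec_for(0x68))  # None (Kamenicky is not mapped here)
```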
To choose a code page, I would recommend:
- Check if your data is plain ASCII. If so, it doesn't matter which code page you choose.
- If not, try to find a code page that can exactly represent your data (or if you can't, one that minimizes the unrepresentable characters). Try code page 1252 first, then the other 125x code pages. Don't bother with the DOS code pages unless you have box-drawing characters.
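The two steps above can be sketched in Python as follows. This is only one way to do it: the function names are hypothetical, the candidate list covers just the 125x pages, and ties are broken by list order so that 1252 wins when several pages fit equally well.

```python
# Candidate codecs in the suggested order: 1252 first, then the other 125x pages.
CANDIDATES = ["cp1252", "cp1250", "cp1251", "cp1253", "cp1254"]

def count_unencodable(text, codec):
    """Number of characters in text that codec cannot represent."""
    count = 0
    for ch in set(text):
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            count += text.count(ch)
    return count

def choose_code_page(text, candidates=CANDIDATES):
    """Return (codec, misses) for the candidate with the fewest unencodable characters."""
    if text.isascii():
        # Plain ASCII: every candidate works, so take the first one.
        return candidates[0], 0
    # min() keeps the earliest candidate on ties, preserving the preference order.
    best = min(candidates, key=lambda codec: count_unencodable(text, codec))
    return best, count_unencodable(text, best)

print(choose_code_page("hello"))
print(choose_code_page("Łódź"))
print(choose_code_page("café Привет"))
```

For a whole spreadsheet you would probably run this over the concatenation of all cell values, since a .dbf file gets one code page, not one per field.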
and then convert all non-supported characters into their closest approximation in that codepage (or the dreaded ?). Is this the usual approach?
It's the approach we take at work when we need to convert a UTF-8 file into windows-1252 or into EBCDIC. I used Unidecode to help generate the "closest approximations".
We do, however, only replace letters and digits, not punctuation. Replacing “” with "" would break a few file formats.
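Unidecode is a third-party package, so as a standard-library stand-in here is a rough sketch of the same idea using `unicodedata`. It is not what we run at work: it only strips diacritics from characters that have a Unicode decomposition, which is much weaker than Unidecode's transliteration tables. The `approximate` function is hypothetical. Letters and digits get an approximation where one exists; everything else that cannot be encoded, punctuation included, becomes `?` rather than an ASCII lookalike.

```python
import unicodedata

def approximate(text, codec):
    """Encode text to codec, approximating unencodable letters and digits."""
    out = []
    for ch in text:
        try:
            ch.encode(codec)
            out.append(ch)  # representable as is
            continue
        except UnicodeEncodeError:
            pass
        replacement = "?"
        # Only letters (L*) and digits/numbers (N*) are approximated.
        if unicodedata.category(ch)[0] in ("L", "N"):
            # Decompose (e.g. "ř" -> "r" + combining caron) and drop the combining marks.
            stripped = "".join(
                c for c in unicodedata.normalize("NFKD", ch)
                if not unicodedata.combining(c)
            )
            try:
                stripped.encode(codec)
                if stripped:
                    replacement = stripped
            except UnicodeEncodeError:
                pass  # no usable approximation, keep "?"
        out.append(replacement)
    return "".join(out).encode(codec)

print(approximate("Dvořák", "cp1252"))
print(approximate("Łódź", "cp1252"))
```

The second call shows the limitation: "Ł" has no decomposition, so it falls back to `?`, whereas a transliteration table would give "L".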
What language is your text in? If the characters are mostly ASCII, it's probably best to write the original UTF-8 encoded text as is. A non-UTF-8-aware program will still read the ASCII text correctly and will show garbled characters only for the non-ASCII ones.
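To see what that looks like in practice, here is a small Python sketch that decodes UTF-8 bytes the way a program assuming windows-1252 would. The sample string is made up.

```python
text = "Zürich café"  # hypothetical field value
raw = text.encode("utf-8")

# What a reader that assumes windows-1252 would display for the same bytes.
seen_by_legacy_reader = raw.decode("cp1252")
print(seen_by_legacy_reader)  # prints: ZÃ¼rich cafÃ©

# Each non-ASCII character here takes two bytes, which matters for
# fixed-width fields.
print(len(text), len(raw))  # prints: 11 13
```

The ASCII parts survive untouched, and a UTF-8-aware reader gets the original text back.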