Best way to store text with an undetermined code page in a MySQL database
I am currently writing an application (App1) which retrieves portions of text remotely from another application (let's call it App2). There are several instances of App2 around the world, and they all interpret their strings according to their local system code page. App2 is not Unicode-aware.
App1 retrieves the text from App2 without any hint as to the text's code page, but it is expected that at a later point, a manual process will be undertaken to select the code page to correctly interpret the text.
Previous attempts to automatically determine the code page of the text have failed.
In the meantime, pending the manual determination, this data must be stored in a MySQL database.
What is the best way to store this data? Specifically, what CHARSET and COLLATION would be best employed here?
I believe that MySQL will not tolerate inserting characters into a field if they are not valid for the field's charset.
It would be ideal if I could detect the code page and convert the data to Unicode before inserting into the database, but I am at a loss as to how this can be done consistently and reliably.
If you really do not know the character set, then you can only store it as binary data (in MySQL, a VARBINARY or BLOB column, which has no character set attached). That will preserve all the contents (nothing gets mangled). When it comes to trying to use it as text, you will have to guess the encoding.
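As a rough illustration of the idea, here is a minimal Python sketch. The sample bytes and the helper names (`to_hex_literal`, `decode_later`) are made up for this example, and cp1252 is only an assumed outcome of the manual selection step. The point is that the bytes are kept untouched and are decoded only once someone has picked a code page.

```python
# Hypothetical sample: bytes exactly as received from App2, encoding unknown.
raw = b"caf\xe9"


def to_hex_literal(data: bytes) -> str:
    """Render bytes as a hex literal (X'...'), handy for inspecting
    what would land in a binary column. Illustration only; a real
    application would normally pass the bytes as a query parameter."""
    return "X'" + data.hex().upper() + "'"


def decode_later(data: bytes, code_page: str) -> str:
    """Decode the stored bytes once a code page has been chosen manually.
    Raises UnicodeDecodeError if the bytes are invalid for that code page."""
    return data.decode(code_page)


hex_form = to_hex_literal(raw)

# Later, after the manual process has selected a code page (assumed here):
text = decode_later(raw, "cp1252")
```

Because nothing is converted at insert time, a wrong guess costs nothing: the same stored bytes can simply be decoded again with a different code page.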
What is the best way to store this data?
The only sane way is for App2 to send along information about which encoding the data is in.
Using that information, you could convert it to Unicode before inserting it into the database. That would be optimal.
All multi-byte libraries have functions to guess the encoding by looking at specific tell-tale byte values, but they are terribly unreliable, especially when the incoming data could have any encoding.
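To see why guessing is fragile, consider this small Python sketch. The byte value and the three candidate code pages are picked purely for illustration. The same byte is perfectly valid in each of them, so nothing in the data itself says which reading is the intended one.

```python
# One byte, valid in many single-byte code pages (illustrative choice).
ambiguous = b"\xe9"

# Candidate code pages, chosen only as examples.
candidates = ["cp1252", "cp1251", "cp1253"]

# Every candidate decodes without error, but each yields a different character:
# Western European, Cyrillic, and Greek respectively.
readings = {cp: ambiguous.decode(cp) for cp in candidates}

# ascii() avoids depending on the console's own encoding when printing.
for cp, ch in readings.items():
    print(cp, ascii(ch))
```

A detector can only rank such readings by statistical plausibility, which needs enough text to work with and still fails on short or mixed strings.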