How best to reduce primary key value?
I am developing a .NET application that supports Oracle, SQL Server and Informix as the data repositories. A problem with Informix is that one table (this is legacy stuff) has a primary key of 2048 characters, and Informix will not allow a PK of this width. So my initial solution is to have the app derive an MD5 value from the key value and use that as the primary key when inserting or looking up data. That works, but leaves me the immediate problem of 'upgrading' the data in existing databases, which for various reasons has to be done by means of a SQL script. Sadly Informix does not have a built-in MD5 function, so I will be hard pressed to write a SQL script to create the new PK column and populate it from the existing data.
So my question is: can anyone suggest a better way of significantly compressing a long string value, that will avoid this problem?
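For reference, the "derive a fixed-width key from the long value" idea described above might look like this (a minimal sketch in Python for illustration only; the real app is .NET, and the key value shown is hypothetical):

```python
import hashlib

def derive_pk(natural_key: str) -> str:
    """Derive a fixed-width 32-character hex MD5 digest from a long
    natural key, for use as the stored PK value."""
    return hashlib.md5(natural_key.encode("utf-8")).hexdigest()

# A hypothetical 2048-character legacy key (16 chars x 128).
key = "some-legacy-key-" * 128
print(derive_pk(key))       # always 32 hex characters, whatever the input length
```

The same input always produces the same digest, which is what makes lookups by the original value possible, but see the collision caveat in the answers below.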
Your approach is flawed: a PK must by definition be unique, and MD5 may generate collisions (duplicates).
Instead, consider using a surrogate PK (e.g. an identity column or a GUID).
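A sketch of the GUID flavour of this, in Python for illustration (the row layout is hypothetical): the surrogate is generated at insert time and is not derived from the row's data at all, so unlike a hash there is no pair of natural keys that maps to the same value.

```python
import uuid

def new_surrogate_pk() -> str:
    # A random version-4 GUID. Collisions are vanishingly unlikely,
    # and crucially the value does not depend on the row's contents.
    return str(uuid.uuid4())

row = {"pk": new_surrogate_pk(), "legacy_key": "the 2048-character value"}
print(row["pk"])   # e.g. '3f2b8c1a-....' (36 characters)
```

An identity column works the same way conceptually, with the database assigning a sequential integer instead of the application generating a GUID.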
can anyone suggest a better way of significantly compressing a long string value, that will avoid this problem
By definition you can't compress arbitrary strings and maintain uniqueness. Obviously if the strings have some structure that you know about, you could use this knowledge to create an application-specific compression algorithm.
In response to comments:
I also have a problem with surrogate keys, which bear no relation to the data being stored - bad database design
I know surrogate vs natural keys is a contentious subject, but surely your proposed MD5 hash is essentially a surrogate key? And in any case "all design is trade-off" so I wouldn't describe a database design as "bad" without some context. IMHO if there is no natural key shorter than 2048 characters, a surrogate key may well be a good option.
There are also performance tradeoffs to consider: with an MD5 or GUID surrogate PK you have the potential for page splits, as new rows will be inserted at effectively random positions in the index, rather than appended at the end as with an identity PK.
By what definition?
The key word is 'arbitrary'. A non-lossy compression algorithm such as ZIP does not guarantee to achieve a given compression ratio on all inputs - think of attempting to ZIP a ZIP archive.
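The point is easy to demonstrate: a lossless compressor can only shrink input that has structure to exploit. A quick illustration using Python's standard-library zlib on random (structureless) bytes:

```python
import os
import zlib

data = os.urandom(10_000)                 # arbitrary bytes: nothing to exploit
compressed = zlib.compress(data, level=9)

# Random input is incompressible; the "compressed" form comes out
# slightly LARGER because of the deflate framing overhead. This is the
# same effect as attempting to ZIP a ZIP archive.
print(len(data), len(compressed))
```

By the pigeonhole principle there are more distinct 2048-character strings than distinct shorter strings, so no scheme can map all of them to shorter values without collisions.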
In Informix, if you create a dbspace with large page sizes (you'd need to use 12, 14 or 16 KiB pages), you can create indexes on keys up to about 3 KiB in that dbspace (rule of thumb, 5 key values must fit on one index page).
But a key that big is probably not very efficient, to be polite about it. I'd be curious to see the breakdown of the columns in the PK and why they have to be so big that they add up to 2 KiB. Can you not use a surrogate of some sort?
I think you can split the key into two parts and store those parts in two columns, something like "id1" and "id2". Then you can create a composite primary key.
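A sketch of the split, in Python for illustration (the "id1"/"id2" column names are from the suggestion above; whether Informix accepts the resulting composite index still depends on the total key width, which splitting does not reduce):

```python
def split_key(natural_key: str, width: int = 1024):
    """Split a long natural key into two halves for columns id1 and id2."""
    return natural_key[:width], natural_key[width:]

def join_key(id1: str, id2: str) -> str:
    """Reassemble the original key from its two column values."""
    return id1 + id2

key = "k" * 2048                 # a hypothetical 2048-character key
id1, id2 = split_key(key)
assert join_key(id1, id2) == key # round-trips losslessly
```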