Is it crazy to bypass database case sensitivity issues by storing original string case AND lower case?

2023-01-21 03:42

I'm implementing a database where several tables have string data as candidate keys (eg: username) and will be correspondingly indexed. For these fields I want:

Case insensitivity when someone queries the table on those keys
The initially written case to be preserved somehow so that the application can present the data to the user with the original case used

I also want the database schema to be as database independent as possible, as the application code is not (or should not be) slaved to a particular RDBMS.

Also worth noting is that the vast majority of queries done on the database will be done by the application code, not via direct table access by the client.

In implementing this, I'm running into a lot of annoying issues. One is that not all RDBMS implement COLLATE (which is where case sensitivity appears to be tunable at schema level) in the same way. Another issue is that the collation and case sensitivity options can be set at multiple levels (server, database, table (?), column) and I can't guarantee to the application what setting it will get. Yet another issue is that COLLATE itself can get hairy because there is a heck of a lot more in there than simply case sensitivity (eg: Unicode options).

To avoid all of these headaches, what I'm considering is dodging the issue altogether by storing two columns for one piece of data. One column with the original case, another dropped to lower case by the application layer.

eg: Two of the fields in the table

user_name = "fredflintstone" (a unique index on this one)
orig_name = "FredFlintstone" (just data... no constraints)
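For concreteness, here is a minimal sketch of that two-column layout, using Python's built-in sqlite3 module purely as a stand-in RDBMS. The table name `users` and the helpers `normalize_key`, `create_schema`, `add_user` and `find_user` are made up for illustration; only `user_name` and `orig_name` come from the example above.

```python
import sqlite3


def normalize_key(name: str) -> str:
    # Application-layer normalization. lower() mirrors the plan described
    # above; it is an assumption that plain lowercasing is good enough.
    return name.lower()


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        """
        CREATE TABLE users (
            user_name TEXT NOT NULL UNIQUE,  -- normalized lookup key
            orig_name TEXT NOT NULL          -- exactly as the user typed it
        )
        """
    )


def add_user(conn: sqlite3.Connection, name: str) -> None:
    # Both columns are written together so they cannot drift apart
    # as long as every write goes through this function.
    conn.execute(
        "INSERT INTO users (user_name, orig_name) VALUES (?, ?)",
        (normalize_key(name), name),
    )


def find_user(conn: sqlite3.Connection, name: str):
    # The lookup normalizes its argument the same way, so the database
    # only ever performs an exact match on user_name.
    row = conn.execute(
        "SELECT orig_name FROM users WHERE user_name = ?",
        (normalize_key(name),),
    ).fetchone()
    return row[0] if row else None
```

With this shape, the database only ever sees exact-match comparisons on `user_name`, so its collation settings should not affect lookups as long as the column compares byte-for-byte.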

The pros and cons of this as I see it are:

Pros:

No ambiguity - the application code will manage the case conversions and I never need to worry about unit tests failing "mysteriously" when the underlying RDBMS/settings changes.
Searches on the index will be clean and never be slowed down by collation features or calls to LOWER() or anything (assuming such things slow down the index, which seems logical)

Cons:

Extra storage space required for the doubled-up data
It seems a bit brutish

I know it will work, but at the same time it smells wrong.

Is it insane/pointless to do this? Is there something I don't know that makes the case sensitivity issue less tricky than it seems to me at the moment?

Of course, decisions like this are always a trade-off, but I don't think this is necessarily "doubled-up data". Lowercasing a string can be a non-trivial operation, in particular if you go beyond ASCII, so the lowercased version of the string is not just "duplicate". It is somewhat related to the original string, but not more than that.
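To illustrate the non-ASCII point, here is a small Python sketch. The helper names `naive_key` and `folded_key` are hypothetical, and `str.casefold()` is shown only as one possible normalization, not as what any particular RDBMS does internally.

```python
def naive_key(s: str) -> str:
    # Plain lowercasing, as in the question.
    return s.lower()


def folded_key(s: str) -> str:
    # Unicode case folding, which is designed for caseless comparison.
    return s.casefold()


# German sharp s: lower() leaves it alone, casefold() expands it to "ss".
same_with_lower = naive_key("Straße") == naive_key("STRASSE")
same_with_casefold = folded_key("Straße") == folded_key("STRASSE")

# Lowercasing can even change the length of a string: the capital dotted I
# (U+0130) lowercases to "i" followed by a combining dot above.
dotted_i_lower_length = len("\u0130".lower())
```

Here `same_with_lower` is False while `same_with_casefold` is True, and `dotted_i_lower_length` is 2. Whichever function you pick, the stored key is a derived value rather than a copy, which is the point being made above.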

If you think of it as an analog to storing computed results in the DB, it becomes more natural.

The option of querying on UPPER(UserName) is another good solution, which avoids the second column. However, to use it you need at least a reliable UPPER function (where in particular you can control the locale that it uses for non-ASCII characters), and probably function-based indices for decent performance.

"Searches on the index will be clean and never be slowed down by collation features or calls to LOWER() or anything (assuming such things slow down the index, which seems logical)"

No, that's not logical. You can have indexes on deterministic functions, that is, functions whose result depends only on their inputs.

create index users_name on users(name); -- index on name
create index users_name_lower on users(lower(name)); -- index on the function result

Your RDBMS should be smart enough to know to use users_name_lower when it gets this query:

select * from users where lower(name) = ?

Without users_name_lower, yes, that would have to walk the table. With the functional index, it does the right thing.
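If you want to see this for yourself, here is a rough sketch using SQLite through Python's sqlite3 module. It assumes a SQLite build with expression-index support (3.9 or later); the helpers `query_plan` and `demo` are made up for the example, and other RDBMSs expose their plans differently.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params=()) -> str:
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(str(row[-1]) for row in rows)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO users (name) VALUES (?)",
        [("FredFlintstone",), ("BarneyRubble",)],
    )
    sql = "SELECT * FROM users WHERE lower(name) = ?"

    # Plan before the functional index exists: a full scan of the table.
    before = query_plan(conn, sql, ("fredflintstone",))

    # Index on the function result, same expression as in the WHERE clause.
    conn.execute("CREATE INDEX users_name_lower ON users (lower(name))")
    after = query_plan(conn, sql, ("fredflintstone",))

    rows = conn.execute(sql, ("fredflintstone",)).fetchall()
    return before, after, rows
```

Comparing `before` and `after` should show the plan switching from a table scan to a search on `users_name_lower`. Note that SQLite's built-in `lower()` only folds ASCII letters by default, so this says nothing about non-ASCII names.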

I've often seen data duplicated in this way for performance reasons. It lets you keep the original casing, which you'll need because you can't always guess what the casing should be (you can't be sure, for example, that every name begins with a capital letter). If the database doesn't support other ways to do this, such as functional indexes, then this is practical, not crazy. You can keep the two columns consistent by using triggers.
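As a sketch of the trigger idea, here is one way it could look in SQLite, driven from Python. Trigger syntax differs between RDBMSs, so treat this as an illustration only; the trigger names `users_key_ai` and `users_key_au` and the helper `open_db` are invented for the example, and `user_name` is left nullable so the trigger can fill it in after the insert.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (
    orig_name TEXT NOT NULL,   -- as typed by the user
    user_name TEXT UNIQUE      -- derived key, maintained by the triggers
);

-- Fill in the derived key whenever a row is inserted.
CREATE TRIGGER users_key_ai AFTER INSERT ON users
BEGIN
    UPDATE users SET user_name = lower(NEW.orig_name)
    WHERE rowid = NEW.rowid;
END;

-- Recompute the derived key whenever the original name changes.
CREATE TRIGGER users_key_au AFTER UPDATE OF orig_name ON users
BEGIN
    UPDATE users SET user_name = lower(NEW.orig_name)
    WHERE rowid = NEW.rowid;
END;
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn
```

The application then writes only `orig_name`. An insert whose lowercased name collides with an existing row fails the UNIQUE constraint inside the trigger, which aborts the insert. The trade-off is that the normalization now lives in the database, which cuts against the question's goal of doing it in application code.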

Suggest your search queries do something like this:

SELECT * FROM Users WHERE LOWER(UserName) = LOWER('fredFlintstone')
Or explicitly include a COLLATE clause in the query when case sensitivity should be ignored or respected.
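A per-query collation might look something like the sketch below, again using SQLite from Python as a stand-in. `NOCASE` and `BINARY` are SQLite's built-in collation names; other RDBMSs use different names, which is part of the portability problem the question describes. The helper functions are hypothetical.

```python
import sqlite3


def find_ignoring_case(conn: sqlite3.Connection, name: str):
    # NOCASE folds only the 26 ASCII letters in SQLite.
    return conn.execute(
        "SELECT name FROM users WHERE name = ? COLLATE NOCASE", (name,)
    ).fetchall()


def find_exact_case(conn: sqlite3.Connection, name: str):
    # BINARY forces a byte-for-byte comparison regardless of column defaults.
    return conn.execute(
        "SELECT name FROM users WHERE name = ? COLLATE BINARY", (name,)
    ).fetchall()
```

Spelling out the collation in the query means the result no longer depends on server, database or column defaults, at the cost of vendor-specific SQL.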

I'd consider the duplication of data for case sensitivity too onerous.
