Performance of case-insensitive search in Oracle database
My db background is on MS SQL Server side where text comparison in indexes and constraints is not case-sensitive (at least by default). So once you have a value "abc" assigned to a unique column, you can not store a second value "ABC" and if you search for "ABC" SQL Server will find "abc".
With Oracle things are different, so even with unique index on a text column you can store there both "abc" and "ABC", and if you search for "AbC" you won't get any result.
AFAIK prior to Oracle 10gR2 there was no way around it, now it's possible to set insensitive comparison per sesson which IMHO is not a good solution because everything depends on programmers' discipline.
But what's worst with case-sensitive lookup is that those who rewrite all searches as UPPER(some_column)=UPPER(some_text)
(and this is what many discussion threads recommend) end with table scan even when there is an index on some_column. Performance implication is disastereous: I just tested a simple search on a table with half a million rows, and search with UPPER function call took 20 times longer than the search with just a column identifier, thus confirming that the index is not used when doing function-based search.
Is it really so that the most standard techniqu开发者_高级运维e to do case-insensitive search in Oracle database is to apply UPPER/LOWER functions to search elements despite of bad performance? Or are there more elegant ways of addressing this issue?
Yes, use of of UPPER(some_column)=UPPER(some_text)
really is the best way, but you can create an index on UPPER(some_column)
. That should alleviate the problem.
I would say create "clean" fields based on your company's business logic for cleaning these fields (a company name or address, for example, will have a surprising amount of cleaning logic around furniture words, usps rules, etc., not to mention third party cleaning routines if used).
So, for important search fields, keep BOTH the raw (unclean) and clean versions. If your cleaning logic changes significantly over time, you can go back and reclean based on raw values. Your searches (assuming you're not using a fuzzy logic engine like Oracle Text or Lucene) would hit the clean values.
For all other fields (that do not merit separate clean versions), I usually perform a minimum level of scrubbing. Capitalization, trimming, strip control chars, reducing multiple spaces to 1 space, etc. is all part of a set of basic cleansing routines. These are usually done before the data is loaded (in the data build programs).
So, come up with a set of data standards and stick to it, but I wouldn't just throw any ole junk in the db. Try not to think of the db as a dumping ground of data, and you'll get better results for sure.
精彩评论