SQL: best way to store yes/no values? Concerned about performance in huge databases
I have some columns where I have to store basically yes/no values. For example, user status for active or inactive, or newsletter subscription status for subscribed or unsubscribed.
Well, I want to know (considering tables with a lot of records) whether the best way is to use a tinyint column, setting 1 for yes and 0 for no.
Is this the correct approach? Or is there no impact on the performance of db queries when using words like yes, no, active, inactive, subscribed, etc.?
Thanks in advance.
Semantically, I suggest you use `bit` if it's available to you. When looking at the column, any other developer can immediately determine that a boolean value is stored in it. If you don't have `bit`, try using `tinyint`. Ensuring that `1` is the only true value and `0` is the only false value will bring consistency. Otherwise, you could end up with a messy mixture of `true`/`false`, `yes`/`no`, `valid`/`invalid`, `y`/`n`, and/or `t`/`f`.
Comparing `bit` or `tinyint` values probably isn't slower than comparing strings, and even if it were, I can't imagine it having a significant effect on overall speed.
Is there something you don't like about the 'bit' data type?
The most commonly supported means is to use `CHAR(1)`. On most databases it takes the same amount of space as `BIT` (1 byte, assuming `BIT` is available), but it supports more values (26 if case insensitive, 52 if not) if there's any chance of needing more than two states. Unlike `BIT`, `CHAR(1)` is human readable. Also, `BIT` isn't supported on every database.
If your RDBMS supports bitmap indexes, go for BIT every time. If it doesn't, use whatever you want; there is really no difference between char(1) and tinyint (one byte each).
Are you just asking in general, what the most efficient way to store a yes/no flag is? Or do you have a performance problem at hand?
If so, when do you have the performance problem (specific queries, inserts, maintenance etc)? What kind of performance gain are you looking for? 2%? 10%? 50%?
Changing datatypes will likely result in only a minor improvement unless we are talking about several hundred million rows. I will give you an example. Let's say that whatever change you made shaved off 3 bytes per row, and the table contains 100,000,000 rows. That would be a saving of ~286 MB. Assuming the disk subsystem can deliver 100 MB/s, you have saved a whopping 3 seconds on a full table scan. Something tells me that the users would think 2 hours and 3 seconds vs 2 hours is much the same :)
My intuition would have said performance would be better with tinyints, but this post doesn't really bear that out. This SO post also offers some other interesting opinions.
I do think that performing analysis with data stored as numbers is typically easier than character data. What other programs are you going to have to interface with and use? For example, several of my analysis tools do not read character data at all, so we have to recode any data we receive in the format of "yes", "no", etc.