Database design - empty fields [closed]
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
开发者_如何学GoClosed 2 years ago.
Improve this questionI am currently debating an issue with my dev team. They believes that empty fields are bad news. For instance, if we have a customer details table that stores data for customers from different countries, and each country has a slightly different address configuration - plus 1-2 extra fields, e.g. French customer details may also store details for entry code, and floor/level plus title fields (madamme, etc.). South Africa would have a security number. And so on.
Given that we're talking about minor variances my idea is to put all of the fields into the table and use what is needed on each form.
My colleague believes we should have a separate table with extra data. E.g. customer_info_fr. But this seams to totally defeat the purpose of a combined table in the first place.
The argument is that empty fields / columns is bad - but I'm struggling to find justification in terms of database design principles for or against this argument and preferred solutions.
Another option is a separate mini EAV table that stores extra data with parent_id, key, val fields. Or to serialise extra data into an extra_data column in the main customer_data table.
I think I am confused because what I'm discussing is not covered by 3NF which is what I would typically use as a reference for how to structure data.
So my question specifically: -
If you have slight variances in data for each record (1-2 different fields for instance) what is the best way to proceed?
There is definitely a school of thought which holds that NULL fields are bad, in and of themselves. Relational theory demands that databases consist of facts, and NULLs are the absence of fact. So, a rigorously designed database would have no nullable columns.
Your colleague is proposing something which is on the road to 6th Normal Form, where all the tables consist of a primary key and at most one other column. Only in such a schema we wouldn't have tables called customer_info_fr
. That's not normalised. Many countries might include ENTRY_CODE in their addresses. So we would need address_entry_codes
and address_floor_numbers
. Not to mention address_building_number
and address_building_name
, as some places are identified by number and other by name.
It's completely accurate and truthful as a logical design. Alas from a physical perspective it is Teh Suck! The simplest query - select * from addresses
- becomes a multi-table join, and outer joins at that. Nullable columns are a way of reconciling ugly design with the hard truth, "you cannae break the laws of physics". Nullable columns allow us to combine disjoint data sets into a single table, albeit at the cost of handling nulls (they can affect data retrieval, index usage, maths, etc).
Some designs attempt to get around the use of nulls by applying magic values. That is, if we don't know the correct value for some column we inject a default value which is a value but also means "unknown". A common instance of this is date '9999-12-31'
as an open-ended TO_DATE in a FROM-TO date range. As long as everybody understands and adheres to the convention it's not a problem. It becomes a problem when some tables have date '9999-12-01'
or date '9999-01-31'
instead.
This is why magic values are not a robust solution. Consumers of our data need to know that -1
is the value we use for DofQ
in our stock control system when we don't know the real value. But at least it's obviously not a valid value. Choosing say 20
as a magic value is deadly because it could be a real DofQ
: we can no longer tell the actual values from the "don't knows".
So, given a choice between nulls and magic values, choose nulls.
I'd be interested in your colleague's justification as to why empty fields are bad. As far as I'm aware, empty or null fields aren't bad in and of themselves. If you have a lot of empty data values for a column that you are planning on putting an important index on, you may want to consider other options. This goes for any column where you have a lot of duplicate records actually and need an index, as duplicated records lower the cardinality of the column, making indexes less useful. In your case, I don't see it being an issue.
For this kind of data, you're likely using a VARCHAR or some kind of TEXT column anyway, which are variable length fields in the database. It doesn't matter if your field is full of data or empty, you're still going to incur the overhead of a variable-length column (which isn't worth worrying about in normal circumstances). So again, there's no difference to the RDBMS.
From the sounds of what you're designing, I think if you came up with a generic method of handling address variances in a single table, it would be the way to go. Your code and structure would be much simpler at the negligible (in my opinion) cost of some empty data fields.
That's what nullable fields are for: "Data not available/applicable".
SQL has a different notion of null than most programming languages, so SQL's null is often a misunderstood concept.
Whatever you do, do not go down the EAV route. This is a prescription for a poorly performing database, far, far worse than a few empty fields.
If you must have a separate related tables for the different situations, a lot of that will depend on how different the entities are and how they will be queried. If you will be querying across categories, you will find that joins to a bunch of tables to get all the data you may or may not need is a nightmare (I don't know if Germany will be in my result set so I join to the Germany details tables, oops didn't need to). It can be far simpler to handle nulls than to try it figure out which of many tables you need to join to (and to always remember to left join to those tables).
However, if you will never be querying across the entitites and the fields make sense separate, then put them in a separate table.
Nulls invariably add complexity to a data model because the behaviour of null in SQL rarely matches the maths, logic or reality that you intended to model with it. In other words, some queries return incorrect results, which you then need to compensate for with additional logic.
All information can be represented accurately without nulls. Since nulls add complexity it is sound design practice to begin your data model without them and then only add a null where you find some special reason to do so or where some database feature or limitation forces a null upon you.
I wouldn't overthink it. NULL can be used, but developers need to be careful using them.
I would prefer to have the Address be a long Text field in the database for any website that deals with multiple countries.
Most websites have Address Line1, Address Line 2, Postal/ZIP Code, City, State/Region, Country ... anything more than that (like EAV) would be overkill.
I wouldn't mind having the user interface show different labels near the text boxes for each country.
Entry code, floor/level, title fields, security number, and so on should fit in the address lines, the label near it, or a tip in the UI can indicate it.
精彩评论