
When are nulls "safe" in a column?

Is there a general rule of thumb for deciding between allowing nullable columns and normalizing to 3NF when designing a database? I have a table with a column that will mostly consist of nulls (85%), but the table is not over 10K records (not very big). It's mostly for logging and record keeping, so most transactions will be inserts and selects, with no updates. I'm trying to consider both performance and simplicity of design. Will there be a great benefit to either the denormalized or the normalized approach in this case? Do different RDBMSs behave differently?


There are three separate reasons to avoid NULLs. In order of importance (in my opinion, of course) they are:

  1. You want to correctly model your data.

  2. You want to provide a database that makes it easy to produce correct applications.

  3. You want to save disk space.

  4. You're concerned about performance.

Okay, there are four separate reasons!

Of the four, I'm very concerned about #1, pretty concerned about #2 (especially since I'm always the programmer, too) and I really don't care much about #3 both because disk space is cheap and also because #1 and #2 get rid of most NULLs in any database for me. As for #4, I never want to sacrifice correctness for performance — it doesn't matter how fast I get the answer if I can't be confident that it's correct. If I can't meet my performance goals in a well-modeled SQL database then an SQL database might be the wrong tool for the job.

So the most important question to me would be "does a single-table approach correctly model the data you want to store?" Also important is "will the single table approach make it too easy to write incorrect code against the database?" I want to force myself (or any programmer who comes after me) to have to put data in the correct places. Look at your data and answer those questions, and it should help you decide on the correct structure.
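As a hedged sketch of what "forcing data into the correct places" can look like (every name here is invented for illustration): a narrow logging table whose required facts are NOT NULL, with a genuinely optional detail moved to its own table instead of living as a mostly-NULL column.

-- Required facts: the database itself rejects incomplete rows.
CREATE TABLE event_log (
    event_id   INTEGER PRIMARY KEY,
    logged_at  TIMESTAMP   NOT NULL,  -- every event must have a time
    event_type VARCHAR(30) NOT NULL   -- and a type
);

-- Optional detail: a row exists only for the events that have one.
CREATE TABLE event_error_detail (
    event_id   INTEGER PRIMARY KEY REFERENCES event_log (event_id),
    error_code INTEGER NOT NULL
);

With this shape, an error code is either fully present (the child row exists and its value is NOT NULL) or fully absent; no code path can half-populate it.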


Nulls are not safe, so there is no point trying to make them safe or justifying them or re-defining them as "safe".

Self-Contradiction

When you state things like "If I should normalize and split into 2 or 3 others just to avoid a bunch of nulls in my table, or if I should keep one table and leave the nulls in to simplify my code and my design, and avoid extra joins. I was trying to be generic to see what's a good standard so we can apply it to various scenarios.", you are working at cross-purposes with yourself, on several different points, so no one can help you reasonably. The first thing to do is resolve your cross-purposes.

  • Either you want Standards (Normalisation, no Nulls, many small fast tables, ease of use and ease of extension), or you want simple code (a fat table, Nulls, poor performance, impossible to extend).

  • Either you want generic standards, or you want the shortest code blocks.

Justification

Now, being a human being, like the millions of coders before you, whatever you choose, you will justify. Just look at the conflicting and conflicted answers. They all make their choices, and then justify them.

One Standard Technical Answer

But you have asked a technical question, about a well-known subject that was answered by the giants of the industry over 30 years ago. Standards bodies have accepted these principles as Standards. There is only one technical answer. The other answers are justifications for non-technical and sub-standard methods.

  • Normalise. Not just to avoid a bunch of nulls in my table, but because if it isn't Normalised, it isn't a Database, it is a flat file.

  • Normalise because it gets rid of data duplication.

  • Normalise because Normalised Databases are much, much faster than flat files.

It is a matter of simple physics. Normalised rows are much shorter; therefore many more rows fit into the same disk block, or page, and therefore many more rows fit into any given memory (cache). It should be no surprise that this leads to faster processing and less I/O overall, for all users of the server (see the sketch after this list).

  • Normalise because the resulting database will be much smaller (more, smaller tables, but smaller overall)

  • And finally Normalised data will not have Nulls.

  • Nulls mean one of two things:

  • Either "optional" fields (well, they cannot properly be called columns, because then it is not a database), which means the data is not Normalised.

  • Or "missing/unknown value", which means you have no Data Integrity (again, a flat file, not a database); the data cannot be relied upon for analysis.

Sure, SQL is cumbersome with joins, but SQL is all we have, so deal with it. That means, learn to code joins, use cut-and-paste.
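Coding the join back is routine. Continuing with the hypothetical person/person_fax tables above:

-- A LEFT JOIN recovers the optional attribute for a report; people
-- without a fax number simply have no person_fax row. Any NULL appears
-- only in the result set, never in the stored data.
SELECT p.person_id,
       p.name,
       f.fax_number
FROM   person p
       LEFT JOIN person_fax f ON f.person_id = p.person_id;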

"Join Cost"

SQL was designed for Relational databases, not for flat files. That means many small tables, rather than fewer large tables. Joins are pedestrian for Relational databases; there is no point in "avoiding joins". There is a myth that "joins cost", but thus far no one has provided any evidence. All the SQL vendors have been enhancing their engines for over 25 years, hundreds of man-years by serious engineers, to ensure that joins cost nothing.

Now do not mix things up, do not misconstrue what I am saying:

  • The cost is in the size of the data sets being joined; whether indices can be used; the nature of the join; whether there is a DataType mismatch; the search arguments; etc. But the cost of the code itself required for joins (assuming we are joining on Keys), the "join cost", is nothing. Just check your statistics and query plans (see the example after this list).

  • And do not make your estimations based on your knowledge, which is, as evidenced, limited to joining fat flat files; to be sure, given the cost factors just explained, joining those monsters costs an awful lot.
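Checking the plan is one statement. Since the question is tagged MySQL, here is the MySQL form (other products have their own equivalents), using the hypothetical tables from earlier:

-- Ask the optimiser how it will execute the join: which indexes it will
-- use and how many rows it expects to examine.
EXPLAIN
SELECT p.person_id, p.name, f.fax_number
FROM   person p
       JOIN person_fax f ON f.person_id = p.person_id;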

SQL and Non-SQL

You have tagged your question "SQL" and "MySQL". SQL is a Standard published by ISO/IEC and ANSI. MySQL is not SQL. The treatment of Null is set out in the Standard. What MySQL does is non-standard, in both engines. In fact, what it did last year and what it will do this year are different, and non-standard.

To call a Non-SQL, "SQL", when SQL is a Standard, is a simple fraud. Just like calling a bunch of flat files a "database".

Point is, you will get one answer if your question was tagged "SQL" and a different answer if it was tagged "MySQL".

Un-normalised for Coder Convenience

The main reason coders should not be allowed to design "databases" is demonstrated beautifully in this thread. They have no concern for performance, or for ease of use for others. If we left it to them, they would design flat files full of Nulls, to "simplify" their code, and actually justify it.


Are you talking about tables allowing users to store NULL as opposed to an empty string, or a number?

If so, I would say that you should only allow NULL values in a column if NULL has a special meaning (not just equivalent to an empty string '' or 0).
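For example (using a hypothetical middle_name column), the two states stay distinguishable and are queried differently:

-- '' means "confirmed to have no middle name"; NULL means "unknown / never asked".
UPDATE person SET middle_name = ''   WHERE person_id = 1;  -- known: none
UPDATE person SET middle_name = NULL WHERE person_id = 2;  -- unknown

SELECT * FROM person WHERE middle_name = '';     -- confirmed no middle name
SELECT * FROM person WHERE middle_name IS NULL;  -- middle name unknown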


Logging tables should be denormalized as a general rule, since you are interested in the state of things at the time of the event and you don't want to just hold onto a foreign key to anything as it may have changed since then.
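A hedged sketch of that idea, with invented names: the log row snapshots the values as they were at event time, rather than pointing at rows that may change later.

-- Denormalized log: the price and the customer's name are copied into
-- the row, so the log still shows what was true at the moment of sale
-- even if the product's price or the customer's name changes afterwards.
CREATE TABLE sale_log (
    sale_id       INTEGER PRIMARY KEY,
    sold_at       TIMESTAMP     NOT NULL,
    product_id    INTEGER       NOT NULL,  -- kept for traceability,
    unit_price    DECIMAL(10,2) NOT NULL,  -- but the price is snapshotted
    customer_name VARCHAR(60)   NOT NULL
);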

However, I don't see what NULL has to do with normalization here?


Bottom line: if you want your data structure to truly be normalized to 3rd Normal Form, you must not have nullable columns.

If you live and work in the real world like the rest of us, you'll find nullable columns perfectly acceptable and "safe". Your database won't technically be normalized, but what data system truly is?

Very few.


NULL means 'missing or unknown'. It has little to do with normalization; it has to do with the domain of the attribute. If the attribute is required, it is non-null (e.g., EmployeeName is probably a required attribute of the Employee entity). If the attribute is optional or otherwise non-required (e.g., not all Employees have, or are willing to share with their boss, a home email address), then the column EmployeeHomeEmailAddress should be nullable.

Whether that column itself is properly a part of the table is dependent on its semantics. A nullable column does not necessarily (though it might) indicate a design issue — perhaps it should be its own entity in the model, with a 0:1 or 0:M cardinality to the parent entity. Perhaps it shouldn't. It all depends on the semantics of the entity and the attribute. But you can't just make a blanket assertion that nullable column == denormalized data: that way leads to madness.
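A hedged sketch of the two designs just described, in generic SQL (the DDL itself is invented; only the attribute names come from the discussion above):

-- Option A: the optional attribute as a nullable column on the parent.
CREATE TABLE Employee (
    EmployeeId               INTEGER PRIMARY KEY,
    EmployeeName             VARCHAR(60)  NOT NULL,  -- required attribute
    EmployeeHomeEmailAddress VARCHAR(120) NULL       -- optional attribute
);

-- Option B: the optional attribute as its own entity, 0:1 to Employee
-- (Employee would then omit the nullable column).
CREATE TABLE EmployeeHomeEmail (
    EmployeeId       INTEGER PRIMARY KEY REFERENCES Employee (EmployeeId),
    HomeEmailAddress VARCHAR(120) NOT NULL
);

Which option is correct depends, as the answer says, on the semantics of the entity and the attribute, not on a blanket rule.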


Nulls are "safe" when you don't always need an input. In fact, they may be preferred when there is no real default value. For example, let's say you're tracking inventory and the columns are id, length, width, height, and weight.

Maybe you don't get a measurement. The width shouldn't be 0, it should be null. This is particularly relevant if 0 is a valid width or value for what you are tracking.
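With a hypothetical inventory table, the two states stay distinguishable, and aggregates behave sensibly:

SELECT id FROM inventory WHERE width IS NULL;  -- not yet measured
SELECT id FROM inventory WHERE width = 0;      -- measured, genuinely zero

-- AVG ignores NULLs, so unmeasured items don't drag the average down;
-- storing 0 instead of NULL would silently skew it.
SELECT AVG(width) FROM inventory;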


SQL Server 2008 has sparse columns and filtered indexes to help address this situation.
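A minimal sketch of both features, assuming SQL Server 2008+ syntax (the table and column names are invented):

-- SPARSE stores NULLs in almost no space: a good fit for a column that
-- is NULL in ~85% of rows.
CREATE TABLE audit_log (
    entry_id   INT IDENTITY PRIMARY KEY,
    logged_at  DATETIME2 NOT NULL,
    error_code INT SPARSE NULL
);

-- A filtered index covers only the rows that actually have a value, so
-- the NULL majority never bloats the index.
CREATE NONCLUSTERED INDEX ix_audit_log_error_code
    ON audit_log (error_code)
    WHERE error_code IS NOT NULL;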


If I understand the question, you are referring to the normalisation of data that does not apply to every record in one table. So say you have a vehicle table in your database that looked like so:

vehicle
----------------------------
vehicleId int
makeId int not null references make(makeId)
modelId int not null references model(modelId)
numberOfWheels tinyint null
hullMaterialId int null references hullMaterial(hullMaterialId)

Then you could normalise out some of these null columns into "1 to zero or 1" tables to make better sense of the data:

roadvehicle
----------------------------
vehicleId int references vehicle(vehicleId)
numberOfWheels tinyint not null

waterVehicle
----------------------------
vehicleId int references vehicle(vehicleId)
hullMaterialId int not null references hullMaterial(hullMaterialId)

(Please pardon the much simplified example)
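Querying back is then an ordinary supertype/subtype join, for example:

-- All road vehicles with their wheel counts; the inner join naturally
-- excludes water vehicles, with no NULL tests needed.
SELECT v.vehicleId, rv.numberOfWheels
FROM   vehicle v
       JOIN roadvehicle rv ON rv.vehicleId = v.vehicleId;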

I believe you are considering the right things. To me, if it makes sense from a design and a domain point of view, then it should be done; but I'm not aware of any rule of thumb for this, just experience.


Some things I have learned with respect to NULL values -- trying to avoid a duplicate answer :-)

Relevant information is also presented in the Wikipedia article on NULL, in the Controversy section (and its surrounding context).

Not all RDBMS systems behave the same.

In SQL Server, NULL counts as a value for uniqueness in a UNIQUE column/index, so only one NULL is allowed. In SQLite, NULL may appear many times and does not count as a duplicate value. Apparently the SQL-92 specification is ambiguous in this regard.

NULL IS NULL -> true (standard and correct), but NULL = NULL -> ??? It should evaluate to UNKNOWN (treated as not true), but this depends upon the RDBMS and its settings! In SQL Server this can be controlled by the ANSI_NULLS setting, which can make a join misbehave if running under a different context. Depending upon the RDBMS, similar behavior can also apply to other operations. Please always use/support proper SQL equality!
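A quick demonstration of that point under standard (ANSI_NULLS-style) behavior, with a hypothetical table t:

-- NULL = NULL evaluates to UNKNOWN, which a WHERE clause treats as not
-- true, so this returns no rows even if t contains NULLs:
SELECT * FROM t WHERE some_col = NULL;   -- wrong: always empty

-- IS NULL is the correct test:
SELECT * FROM t WHERE some_col IS NULL;  -- returns the NULL rows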

Different RDBMSs may also use a different physical table layout, and when a column is NULLABLE it may mean that space is not reserved in the record (this is true for CHAR(100) or XML (the stub bit) in SQL Server, for instance). If the record needs to be updated after the initial creation (e.g., it was initially NULL), this can lead to excess fragmentation in certain circumstances. However, this should only be something to note, and I would be hesitant to make general decisions based on such an implementation detail, as there are other factors in play!

Make sure to define what NULL means

Unfortunately there are no NOTSET or INVALID values to complement NULL. Each column should have its particular meaning of NULL defined. Does it mean "invalid value", "value not specified", "value not known", etc.? It may require mixing NULL with non-NULL sentinel values (or not using NULL at all, or using additional state columns/relationships) to achieve the correct business logic, as in the sketch below.
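One hedged pattern for that last option (all names invented): a companion state column records why the value is absent, so NULL no longer has to carry several meanings at once.

CREATE TABLE employee_contact (
    employee_id  INTEGER PRIMARY KEY,
    home_email   VARCHAR(120) NULL,
    -- Disambiguates the NULL: was the address given, refused, or never requested?
    email_status VARCHAR(12)  NOT NULL
        CHECK (email_status IN ('provided', 'declined', 'not_asked'))
);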

Happy SQL'ing.


When designing a table, if you expect 85% of the values in a particular column to be null, it is better to keep that column in another table (based on normalization). The data block assigned for a table record may be calculated based on the bytes specified for each column (like the malloc function in C), and filling in an initially null value later makes the row grow, which can cause row chaining.

