Drowning in a Sea of Nulls
An application I inherited tracks lab test results performed on material samples. Data is stored in a single table (tblSampleData) with a primary key of SampleID and 235 columns representing potential test results. The problem is that only a few tests are performed per sample, so each row contains over 200 nulls. Actually, there is a second similar table (tblSamp开发者_StackOverflowleData2) with another 215 primarily null columns and a primary key of SampleID. The two tables have a one-to-one relationship and most SampleIDs have some data in both tables. For every SampleID, however, there are easily 400 null columns!
Is this bad database design? If so, which normal form rule is broken? How can I query this table to identify which groups of columns are typically filled together with data? My goal would be to have, say 45 tables with 10 columns and fewer null values. How can I do this? How do I avoid breaking existing applications?
The tables have about 200,000 sample records so far. Users are asking me to add more columns for more tests, but I'd rather build a new table. Is this wise?
I have seen articles / papers that indicate that simply having NULLs in the database breaks the first normal form.
From what I've gathered from your description of the database, a better design might be as follows:
A Sample table with fields that are always associated with a sample. For example,
Sample
------
SampleID
SampleDate
SampleSource
Then, a table of test types with one entry for each type of test that can be performed.
TestType
--------
TestTypeID
TestName
MaximumAllowedValue
Finally, have an intermediate table that represents the many-to-many relationship between the two above tables and holds the results for the tests.
TestResult
----------
SampleID
TestTypeID
TestResult
This would eliminate the null values because the TestResult table would only contain entries for the tests that were actually performed on each sample. I once designed a database for an almost identical purpose to what I believe you are doing and this is the approach I took.
You could use the well known Entity Attribute Value model (EAV). The description of when it is appropriate to use EAV fits quite well with your use case:
This data representation is analogous to space-efficient methods of storing a sparse matrix, where only non-empty values are stored.
One example of EAV modeling in production databases is seen with the clinical findings (past history, present complaints, physical examination, lab tests, special investigations, diagnoses) that can apply to a patient. Across all specialties of medicine, these can range in the hundreds of thousands (with new tests being developed every month). The majority of individuals who visit a doctor, however, have relatively few findings.
In your specific case:
- The entity is a material sample.
- The attribute is a test type.
- The value is the result of a test for a specific sample.
EAV has some serious drawbacks and creates a number of difficulties so it should only be applied when it is appropriate to do so. You should not use it if you need to return all test results for a specific sample in a single row.
It will be difficult to modify the database to use this structure without breaking existing applications.
I'm not sure the design is really that bad. NULL values should actually be relatively cheap to store. In SQL Server, there is an internal bit field (or fields) for each row that indicates which column values are NULL.
If the performance of the application doesn't need to be improved and the cost-benefit of refactoring due to changing the table schema isn't positive, why change it?
Just because no normal-form rules are broken doesn't mean it isn't bad database design. Generally you're better off with a design with smaller rows more tightly packed, because that way more rows can fit in a page so there is less work for the database to do. With the current design the database server is having to devote a lot of space to holding null values.
Avoiding breaking existing applications is the tough part, if the other applications need only read-access you could write a view that looks identical to the old table.
If you do change your table structure, I'd recommend having a view called tblSampleData which returns the same data as the table does now. That'll preserve some compatibility.
You probably don't even need a RDBMS for this data. Store your data in structured binary files or a DBM/ISAM table.
It's not normalized. Usually, lack of normalization is the source of all your problems. But in this case, lack of normalization is not the end of the world because this data is "read only", there's only one key, and it's not related to anything else. So update anomalies shouldn't be a worry. You only have to worry that the original data is consistent.
There's nothing too terribly wrong with all those NULLs if you treat NULLs as a "special value" with the same meaning across the entire app. Data was not collected. Data not available. Subject refused to answer question. Data is outlier. Data is pending. Data is known to be UNKNOWN. Subject said they didn't know... etc. you get the idea. Allowing NULLs for no defined reason with no defined meaning is terribly wrong.
I say normalize it. Either define special values and create one massive table. Or, leave NULLs for the VB and PHP programmers, and split it up your data properly. Create a VIEW to join the data back up if you need to support legacy code. From what you described, you're talking about a couple hours of work to get this thing correct. That's not such a bad deal.
Let's say you have test machine X with 40 measurement channels. If you know that on each test the testers will use just a few channels, you could change the design to:
tblTest: testId, testDate tblResult: testId, machineId, channelId, Result
You could always retrieve the pevious layout using a crosstab.
I d go with 1 main table, where you'd have 1 row per sample, it would contain all the columns that every sample should have:
Sample
-------
SampleID int auto increment PK
SampleComment
SampleDate
SampleOrigin
....
I'd then add one table for each different test or "class" of similar tests, and include all columns related to those (use the actual test name and not XYZ):
TestMethod_XYZ
---------------
SampleID int FK Sample.SampleID
MeltTemp
BurnTemp
TestPersonID
DateTested
...
TestMethod_ABC
---------------
SampleID int FK Sample.SampleID
MinImpactForce
TestPersonID
DateTested
....
TestMethod_MNO
---------------
SampleID int FK Sample.SampleID
ReactionYN
TimeToReact
ReactionType
TestPersonID
DateTested
...
When you search for a result, you would search the test method table that applies and join back to the actual sample table.
EAV is an option but the queries will kill you.
Is it an option to migrate the data to a NoSQL DB like MongoDB? I believe this will be the most efficient and easy way arround your problem. Since you mentioned you're basically doing CRUD queries NoSQL should be pretty efficient.
The current design is a poor one. In general a database with lots of NULL values is an indication of poor design, violating 4th normal form. But the biggest problem with the design is not a violation of normal principles but the fact that the addition of a new test type requires changes to the database structure rather than simply adding some data to several tables that "define" a test. Even worse, it requires structural changes to an existing table, rather than addition of new tables.
You can achieve perfect fourth normal form by adapting a key-value system as described by others. But you may be able to substantially improve the design of the database and still maintain your sanity (something hard to do when working with key-value systems without an ORM) by doing either of the following:
Attempt to discover the largest number of measurements required to represent any individual test. If there are different data types returned by the tests, you'll need to discover the largest number of values of each data type returned by the largest test. Create a table with those columns only, labelled Meas1, Meas2, etc. Instead of 400 columns you'll need, perhaps, 10. Or 40. Then create a set of tables that describe what each column "means" for each test. This information can be used to provide meaningful prompts and report column headers depending on the type of test being stored. This will not eliminate NULLs entirely, but will greatly reduce them and, as long as any new test can be "fit" into the number of measurements you specified, new test can be added as data rather than structural changes.
Discover the actual list of measurements for each test and create a separate table to hold the results of each one (basic information like test ID, who ran it, the time, etc still go into a single table). This is a multi-table inheritance pattern (I don't know if it has a real name). You still have to create a new "data" table for each new test, but now you wont be touching other existing production tables and you will be able to achieve perfect normal form.
I hope this provides some ideas to get started.
精彩评论