How would you structure your entity model for storing arbitrary key/value data with different data types?
I keep coming across scenarios where it will be useful to store a set of arbitrary data in a table using a per-row key/value model, rather than a rigid column/field model. The problem is, I want to store the values with their correct data type rather than converting everything to a string. This means I have to choose either a single table with multiple nullable columns, one for each data type, or a set of value tables, one for each data type. I'm also unsure as to whether I should use full third normal form and separate the keys into a separate table, referencing them via a foreign key from the value table(s), or if it would be better to keep things simple and store the string keys in the value table(s) and accept the duplication of strings.
Rigid Column Solution
This solution makes adding additional values a pain in a fluid environment because the table needs to be modified regularly.
MyTable
============================
ID   Key1     Key2     Key3
int  int      string   date
----------------------------
1    Value1   Value2   Value3
2    Value4   Value5   Value6
Single Table Solution
This solution allows simplicity via a single table. The querying code still needs to check for nulls to determine which data type the field is storing. A check constraint is probably also required to ensure only one of the value fields contains non-null data.
DataValues
=============================================================
ID   RecordID   Key     IntValue   StringValue   DateValue
int  int        string  int        string        date
-------------------------------------------------------------
1    1          Key1    Value1     NULL          NULL
2    1          Key2    NULL       Value2        NULL
3    1          Key3    NULL       NULL          Value3
4    2          Key1    Value4     NULL          NULL
5    2          Key2    NULL       Value5        NULL
6    2          Key3    NULL       NULL          Value6
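For concreteness, a minimal DDL sketch of this layout might look like the following (SQL Server-style types and the bracketed [Key] identifier are assumptions; adapt to your engine):

CREATE TABLE DataValues (
    ID          int          NOT NULL PRIMARY KEY,
    RecordID    int          NOT NULL,
    [Key]       varchar(50)  NOT NULL,
    IntValue    int          NULL,
    StringValue varchar(255) NULL,
    DateValue   date         NULL,
    -- ensure exactly one of the three value columns is non-null
    CHECK (
        (CASE WHEN IntValue    IS NOT NULL THEN 1 ELSE 0 END
       + CASE WHEN StringValue IS NOT NULL THEN 1 ELSE 0 END
       + CASE WHEN DateValue   IS NOT NULL THEN 1 ELSE 0 END) = 1
    )
);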
Multiple-Table Solution
This solution allows for more concise purposing of each table, though the code needs to know the data type in advance because it must query a different table for each data type. Indexing is probably simpler and more efficient because there are fewer columns that need indexing.
IntegerValues
===============================
ID   RecordID   Key     Value
int  int        string  int
-------------------------------
1    1          Key1    Value1
2    2          Key1    Value4

StringValues
===============================
ID   RecordID   Key     Value
int  int        string  string
-------------------------------
1    1          Key2    Value2
2    2          Key2    Value5

DateValues
===============================
ID   RecordID   Key     Value
int  int        string  date
-------------------------------
1    1          Key3    Value3
2    2          Key3    Value6
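A matching sketch for one of the typed tables (again just illustrative; StringValues and DateValues would differ only in the Value column's type):

CREATE TABLE IntegerValues (
    ID       int         NOT NULL PRIMARY KEY,
    RecordID int         NOT NULL,
    [Key]    varchar(50) NOT NULL,
    Value    int         NOT NULL
);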
How do you approach this problem? Which solution is better?
Also, should the key column be separated into its own table and referenced via a foreign key, or should it be kept in the value table and bulk-updated if for some reason the key name changes?
First, relational databases were not designed to store arbitrary data. The fundamentals of the relational model revolve around getting specifications for the nature of the data that will be stored.
Second, what you are suggesting is a variant of an Entity-Attribute-Value (EAV) model. The problem with EAVs comes in data integrity, reporting, performance and maintenance. They have their place, but they are analogous to drugs: used in limited quantity and narrow circumstances they can be beneficial; too much will kill you.
Writing queries against an EAV is a bear. Thus, if you are going to use an EAV, the only circumstance under which I've seen them be successful is to restrict their use such that no one is permitted to write a query that filters for a specific attribute, i.e., no one is ever permitted to write a query akin to Where AttributeName = 'Foo'. That means you can never filter, sort, calculate on, nor place a specific attribute in a specific place on a report. The EAV data is just a bag of categorized data which can be spewed out en masse on a report, but that's it. I've even seen people implement EAVs as Xml blobs.
Now, if you use the EAV in this light, and since it is just a blob of data, I would use the single-table approach. A significant benefit of the single-table approach is that you can add a check constraint that ensures you have one and only one value in the IntValue, StringValue or DateValue columns. The nulls will not cost you much, and if this is just a wad of data, they will not make any difference in performance. Furthermore, it will make your queries simpler in that you can use a simple case statement to return the String, Integer or DateValue.
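For example, a sketch of that kind of case statement, assuming the single-table layout from the question (the varchar casts are just to give the result a uniform type):

SELECT RecordID, [Key],
       CASE
           WHEN IntValue  IS NOT NULL THEN CAST(IntValue  AS varchar(255))
           WHEN DateValue IS NOT NULL THEN CAST(DateValue AS varchar(255))
           ELSE StringValue
       END AS Value
FROM DataValues;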
I can see many problems with the multi-table approach not the least of which is that there is nothing to prevent the same attribute from having multiple types of values (e.g. a row in IntegerValues and a row in StringValues). In addition, to get the data, you will always have to use three left joins which will make your queries more cumbersome to write.
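To illustrate, fetching a single attribute under the multi-table layout ends up looking something like this (the Records table for the parent rows is assumed here):

SELECT r.ID,
       i.Value AS IntValue,
       s.Value AS StringValue,
       d.Value AS DateValue
FROM Records r
LEFT JOIN IntegerValues i ON i.RecordID = r.ID AND i.[Key] = 'Key1'
LEFT JOIN StringValues  s ON s.RecordID = r.ID AND s.[Key] = 'Key1'
LEFT JOIN DateValues    d ON d.RecordID = r.ID AND d.[Key] = 'Key1';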
The cost of EAVs is discipline and vigilance. It requires discipline in your development team to never, ever, under any circumstances write a report or query against a specific attribute. Developers will get a lot of pressure from management to "just this one time" write something that filters for a specific attribute. Once you go down the dark path forever does it dominate your development and maintenance. The EAVs must remain a wad of data and nothing more. If you cannot maintain that kind of discipline in your development team, then I would not implement an EAV. I would require specification for any new column in the interest of avoiding a maintenance nightmare later. Once users do want to filter, sort, calculate on or put an attribute in a special place on a report, the attribute must become a first class column. If you are able to maintain discipline on their use, EAVs can work well for letting users store whatever information they want and deferring the time when you need to get specifications on the data elements until users want to use the attribute in a way mentioned earlier.
i prefer to keep keys and values all together in a single table. the database i am building right now collects data points about chinese characters in simple subject / predicate / object phrases; both subject and predicate are strings, but objects can have any type. there is some additional structural information in the table (such as type of predicates) but not much.
a special feature of my db structure is that the predicate is actually split into several partial keys. to see how that can be useful, let’s consider some datapoints for the character 人:
人 / reading / chinese / rén
人 / reading / japanese / on / jin
人 / reading / japanese / on / nin
人 / reading / japanese / kun / hito
人 / used-in / taiwan
人 / reading / prc
人 / reading / japan
人 / meaning / chinese / english / man; person; human
人 / meaning / japanese / english / man; person; human; quantifier for people
人 / form / strokecount / 2
人 / form / strokeorder / 34
each line represents one data point. the first element is the subject, the last the object, and in between are the predicate parts. there is a fixed number of columns (3 to 5 will most probably suffice—flat is better than nested) for the predicate parts; unused parts receive a NULL value. with this schema, it is easy to formulate sql statements that return all the facts about a given character, or all the japanese readings (both on and kun) concerning a number of given characters, or all the characters with at least 13 and at most 24 strokes, and so on and on:
subject  predicate1  predicate2   predicate3  ob_type  ob_int  ob_text        ob_bool
人       reading     chinese                  text             rén
人       reading     japanese     on          text             jin
人       reading     japanese     on          text             nin
人       reading     japanese     kun         text             hito
人       used-in     taiwan                   bool                            true
人       reading     prc                      bool                            true
人       reading     japan                    bool                            true
人       meaning     chinese      english     text             man; perso...
人       meaning     japanese     english     text             man; perso...
人       form        strokecount              int      2
人       form        strokeorder              int      34
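for illustration, the kind of sql i mean looks roughly like this (the table name facts is made up; adjust to taste):

-- all japanese readings (both on and kun) for some given characters:
SELECT subject, predicate3, ob_text
FROM facts
WHERE predicate1 = 'reading' AND predicate2 = 'japanese'
  AND subject IN ('人', '木');

-- all characters with at least 13 and at most 24 strokes:
SELECT subject
FROM facts
WHERE predicate1 = 'form' AND predicate2 = 'strokecount'
  AND ob_int BETWEEN 13 AND 24;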
the beauty of this approach is that without too much thinking and upfront planning, you can soon start to feed data into the table. when new facts appear, they will fit into this very general structure most of the time; when you discover that some predicates are awkward, it is not too difficult to collect the offending records and update them to carry your new favorite wording. no more schema migration. yay!
more specifically to answer your question, i’ve thought much about whether to put the values in a separate table and whether to actually represent the predicates in yet another table.
it is perfectly possible, but for my initial version, i found it more important to keep it simple; if at some point it turns out that storing all those repetitive strings hurts storage and performance (i mean i have strokecounts for ca. 70000 characters in my db, so the predicates alone are on the order of ( len( 'form' ) + len( 'strokecount' ) ) * 7e4 == 1e6 bytes just to spell out the predicate), i believe it will be relatively easy to migrate to a more sophisticated approach. alas, that also means you have to modify your queries.
when i hear people claim that of course you absolutely must keep those repetitive predicates and the disparate value types in separate tables i just smile politely. databases have been optimized for decades to cope with huge amounts of data and organize sparse tables efficiently, dammit. i mean this entire approach is against the very grain of what everyone tells you how to do it, so why not be bold.
in the end, i believe there are three main factors to help you decide how to structure the db:
1) can you come up with reasonable SQL statements that give you the answers you expect? (if not, you still have to decide whether you’ve hit upon one of SQL’s inherent limitations, one that may or may not be solvable with a different db schema).
2) do those queries perform well? i know from experience that ‘putting it the other way’ (in a few-MB sqlite db) can make a truly huge difference in performance, so even if you chose the one-big-table approach and get unsatisfactory query timings, it might be the schema that is at fault, but it might just as well be that choosing another way to query the same data could give you a tenfold speed gain.
3) scalability. scalability. scalability. this is a hard one, but maybe you know for sure that all you want to do is collect data about your personal marble collection. in that case, it’s hard to do it very wrong. if you promised to deliver data on any book ever published in your country to every desktop in the world in under one second, then it’s hard to do anything right. most real world scenarios are somewhere in between, so scalability means to ask: if this or that tool should turn out to be a performance bottleneck, will i be able to upgrade it, or, that failing, migrate to another product? i’d say the one-big-table approach is so simple, alternatives should be abundant.
ah, and maybe of interest to you: i am currently looking into redis, which is one of those NoSQLish db thingies. it looks quite interesting and easy to use. the one-big-table approach should be quite compatible with one of those CouchDB / MongoDB / whathaveyou ‘document oriented, schema-free’ databases that have become so popular.
I'd definitely go with the multi-table approach, which will make your queries simpler and rid you of lots of unused fields. If you wish to normalize your database, you should also separate the keys into an additional table but that's just a matter of preference.
The only better approach I can think of is using a document-oriented database like MongoDB.
Another alternative to a single-table design is to store the DataType in its own column, where you have an internal enum that differentiates the data types, e.g.:

1 - string
2 - int
3 - date
4 - etc.

Table
============================
ID   Key     Value   DataType
int  string  string  int
----------------------------
1    Key     Value   1
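A rough DDL sketch of that table (names and the tinyint choice are just illustrative):

CREATE TABLE DataValues (
    ID       int          NOT NULL PRIMARY KEY,
    [Key]    varchar(50)  NOT NULL,
    Value    varchar(255) NOT NULL,  -- everything stored as plain text
    DataType tinyint      NOT NULL   -- 1 = string, 2 = int, 3 = date, ...
);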
I recommend storing plain text in the value, as in The Pragmatic Programmer book: plain text will outlive all data types, and it allows you to perform any manipulation on it.
As Thomas said, there is always a discipline trade-off for EAVs. The discipline comes when you insert the data: ensure that the data type is validated and inserted as the correct type you expect.
When you query, you just parse depending on the data type in your code:
ie:

if (response.dataType == enum.date) { parseDate(response) }
else if (response.dataType == enum.int) { parseInt(response) }
// etc.
Worst case, if something fails in production and inserts the incorrect data type, you can just alter the data type in your db and your code should parse accordingly.
I just want to state/reiterate that EAVs should be used in moderation; there are certain trade-offs in going down this path, and I recommend reading up on them before continuing.
Store the value in raw binary
Store the value in a varbinary field by converting the data to its raw byte[] format.
Why converting everything to strings sucks:
Converting everything to strings is expensive because there is an actual conversion involved.
For example:
ints, doubles, floats, bools, dates, etc... all need to be cast to ASCII chars which can be expensive both in terms of storage and processing (especially since they'll need to be cast back).
Believe me, I've done this before by converting everything to double (I was only using scalar values) and it worked, but it sucked.
If you convert everything to byte[] there is no 'real' format conversion because you're copying the primitive versions of the types. The only addition is that you'll have to store the type too, so you'll know what to convert the data back to when you pull it out of the table.
You have two options for this:
- Add another column with the type
- Concatenate the type to the front of the varbinary by adding a byte containing bitflags to indicate the type
for example:
0000 0001 - int
0000 0010 - float
0000 0011 - double
0000 0100 - bool
0000 0101 - string
0000 0111 - date
- Option 1 adds another column, which you said you didn't want.
- Option 2 adds complexity. If you can handle bits and bytes it is the best/most efficient method. You use bitmath to set/read the type flags and you use an Array.Copy() and Convert.ToType() method to restore them back to their proper type.
Sometimes, knowing how to work in binary and byte[]s can be a huge advantage.
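As a sketch of option 2 (T-SQL syntax assumed; the table and column names are made up, 0x01 is the int flag from the list above, and + concatenates varbinary values):

CREATE TABLE RawValues (
    ID       int            NOT NULL PRIMARY KEY,
    RecordID int            NOT NULL,
    [Key]    varchar(50)    NOT NULL,
    Value    varbinary(256) NOT NULL  -- first byte = type flag, rest = raw value bytes
);

-- storing the int 42: prefix the type-flag byte, then the raw bytes of the value
INSERT INTO RawValues (RecordID, [Key], Value)
VALUES (1, 'Key1', 0x01 + CAST(42 AS varbinary(4)));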
Update
I'm not clear why this got down-voted. It satisfies the question's requirement of a simple per-row key/value relationship. It's also fast and efficient, considering that no unnecessary columns need to be added to the table and the values are stored in the binary representation of their 'native' format.
It adds an extra step to insert/extract items from the table but a normal dictionary/array would do the same on a lower level. The only major difference being, this method of storing values carries its type around with it.
The only real downsides to this method are:
- Values can't be searched in an efficient manner (because they need to be cast to be checked).
- It adds complexity because data needs to be cast into and out of binary format, and the type needs to be stored (which is still a marked improvement over adding a type column and casting everything as a string).
You could store the actual main entity value as a JSON blob, and then add IndexFieldName1/IndexFieldValue1 columns to support filtering and sorting, to get kind of the best of both worlds.
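A rough sketch of that shape (all names here are invented for illustration):

CREATE TABLE Entities (
    ID               int          NOT NULL PRIMARY KEY,
    Payload          varchar(max) NOT NULL,  -- the full entity as a JSON blob
    IndexFieldName1  varchar(50)  NULL,      -- promoted attribute used for filtering/sorting
    IndexFieldValue1 varchar(255) NULL
);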