开发者

Pros and cons of making database IDs consistent and "readable" [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance. Closed 11 years ago.

Question

Is it a good rule of thumb for database IDs to be "meaningless?" Conversely, are there significant benefits from having IDs structured in a way where they can be recognized at a glance? What are the pros and cons?

Background

I just had a debate with my coworkers about the consistency of the IDs in our database. We have a data-driven application that leverages spring so that we rarely ever have to change code. That means, if there's a problem, a data change is usually the solution.

My argument was that by making IDs consistent and readable, we save ourselves significant time and headaches, long term. Once the IDs are set, they don't have to change often and if done right, future changes won't be difficult. My coworkers position was that IDs should never matter. Encoding information into the ID violates DB design policies and keeping them orderly requires extra work that, "we don't have time for." I can't find anything online to support either position. So I'm turning to all the gurus here at SA!

Example

Imagine this simplified list of database records representing food in a grocery store, the first set represents data that has meaning encoded in the IDs, while the second does not:


ID's with meaning:

Type
1 Fruit
2 Veggie

Product
101 Apple
102 Banana
103 Orange
201 Lettuce
202 Onion
203 Carrot

Location
41 Aisle four top shelf
42 Aisle four bottom shelf
51 Aisle five top shelf
52开发者_JAVA技巧 Aisle five bottom shelf

ProductLocation
10141 Apple on aisle four top shelf
10241 Banana on aisle four top shelf
//just by reading the ids, it's easy to recongnize that these are both Fruit on Aisle 4

ID's without meaning:

Type
1 Fruit
2 Veggie

Product
1 Apple
2 Banana
3 Orange
4 Lettuce
5 Onion
6 Carrot

Location
1 Aisle four top shelf
2 Aisle four bottom shelf
3 Aisle five top shelf
4 Aisle five bottom shelf

ProductLocation
1 Apple on aisle four top shelf
2 Banana on aisle four top shelf
//given the IDs, it's harder to see that these are both fruit on aisle 4

Summary

What are the pros and cons of keeping IDs readable and consistent? Which approach do you generally prefer and why? Is there an accepted industry best-practice?

-------- edit ( helpful background info from comments, below ): --------

In our tables, the Primary Key is always an ID field containing a unique integer. At first, that integer was arbitrary. Over time, some of these IDs naturally took on meaning among developers/testers. During a recent refactor, certain developers also took time to make all IDs easier to recognize. It made everyone's job 100X easier. Some people (who don't actually use the data/code) vehemently disagreed for theoretical reasons. In practice, not one of those objections are holding true. Moreover, all developers using the data agree that it's now significantly easier to maintain.

I'm looking for (but haven't seen) a defensible argument against using immediately recognizable IDs in a data-centric environment.


Con: I just changed "Aisle Five top shelf" to "Aisle Six top shelf" so now I have to change its ID to 61 and now I have to change the ProductLocation ID for "Grapes on Aisle five top shelf" to be 10461 and oh god where else does the shelf location ID string show up in IDs in my database oh god whoever designed IDs to carry meaning ought to be shot it's four in the morning and everything's gone crazy and why does "Aisle seven bottom shelf" have an ID of 41 die die die.


There are several problems with using database IDs to encode information about a row. If you want your carrots to have an "ID" of 203, you should add a product_id column (for example) and put this information there instead. Why?

  1. By customizing your IDs, you have to add domain-specific code that manages your IDs and you can't rely on database features like auto-incrementing or UUIDs.
  2. If you ever have to change your classification, it will mess up your table relations, browser bookmarks, search engine results, etc.
  3. It's not common practice -- so when you put application- or domain-specific data into the ID field, it will be assumed by many that this is meaningless information, when it isn't. You will need a data dictionary (and you will have to ensure that people read the data dictionary) to note the fact that this is valuable information.

The only required purpose of an ID is to uniquely identify a row within a table. If it can provide good lookup performance, that's a bonus, and if it can be compactly stored, that's another bonus. But it shouldn't contain any information about the entity in the row it identifies, other than the unique identifier of that entity.


Well, given your 10141 "Apple is in aisle four", what happens when you end up with product 10 in aisle 1 on shelf 41? Or is that product 1 in aisle 014 on shelf 1, or is it product 101 in aisle 41 sitting on the floor because it's not on a shelf?

Once you start co-mingling data like that, you generally lose any ability to reliably extract the components. Human-readable keys are all nice and dandy, but you never destroy the individual IDs the human form is based on.


What do you mean by "readable"? IDs are typically just numbers. And what do you mean by "consistent"? IDs are typically just incrementing numbers; you can't get much more consistent than that. Why waste time and effort trying to encode information into the ID, when the information will already be present explicitly in the database? Who would make use of "orderly" IDs?


Here is my take on Surrogate keys. (or ID's if you want to call them that)

Surrogate keys have no business meaning. They are used to uniquely identify the row. But they do more than merely identify the row. They are also the "soul" of the row. It can't be changed or traded. If the surrogate follows the "soul" principle then when you delete the row, a new row will never take the dead row's value. The soul still belongs to the deleted row even after it's dead and gone.

I like my surrogates to be "souls", though that is not required to be a surrogate.

The advantage of a surrogate is it never needs to change. If 30 other tables have a foriegn key to your main table, you don't want to update all 30 when the main table's PK changes. You can still have a CANDIDATE key on that potentially changing value, but since it can change it is not the row's soul.

Surrogate keys are often auto-increment integers. This lends itself PERFECTLY for clustered indexes. Your table joins will be as good as they can possibly be. Natural keys tend to make horrible clustered indexes as new values are rarely sequential. Integers are small, fixed length data types for even faster matching.

If your name changes, you're still you. If you burn off your finger prints, you're still you. God is using a surrogate key, so I think it's OK for us to use them in our databases.

EDIT After reading your question more carefully, I think you are actually using "meaningless keys" just in the wrong way.

You have value "10141" to represent the apple/location association. That is the combination of 2 surrogates into 1 field. Keep them as separate fields "101" and "41" and make the PK on the combo of those fields. Keeping them separate will make it easier to search, index, table join, etc.

You are right, you don't need yet another surrogate on the mapping table. The combo of 2 surrogates is a surrogate in it's own right (though not a soul). Just express the combo in 2 separate columns, not combined into 1 column. END EDIT


Meaningful ids are not against "db design policies"!

Quite the opposite, it's exactly what the real relational databases were about from the day one. If your data contains some combination of attributes that is - from the business point of view - unique, NOT making it an ID will usually break the Boyce-Codd normal form. And bring the anomalies that go with it.

Unless the information encoded in ID is redundant with what's in other fields, just use it. If it is redundant, make a multiple column primary key. They are not very handy with ORMs, but in data driven applications they are a blessing.

ADDENDUM: (after the edit of original question)

In your case, for a data driven application, I would do:

Type
==========
Fruit
Veggie

Product
==========
Apple    Fruit
Banana   Fruit
Orange   Fruit
Lettuce  Veggie
Onion    Veggie
Carrot   Veggie

Isle
==========
4
5

Shelf
==========
top
bottom

Location
==========
4   top
4   bottom
5   top
5   bottom

ProductLocation
==========
Apple    4  top
Banana   4  top

With such setup:

  • the data is normalized
  • you can SEE the location of any product in the ProductLocation table - you can even see the shelf
  • no surrogates
  • depending on types of queries, this structure can actually perform better than other propositions, because it requires less joins (or it may be slower, because it requires more storage).
  • this will work best with RDBMSs that support an "on replace update" constraint.
  • if you want to treat names as ids, you probably need to add some column like 'display name". That's because people want to change what's displayed more often than they want to change identity of thigns.


The three principal criteria for key design are Familiarity, Simplicity and Stability. If you use keys that are familiar and simple then users will find them easier to recognise, remember and use. They will be less likely to make mistakes when entering and using key values and data quality and usability will usually be improved.

I suggest you resolve this question by asking your users which type of identifier they find easier to use, or even testing out different schemes with them if it's very important to you. In any case, developers alone shouldn't make that decision. Some organisations have departments or individuals responsible for defining the standard coding schemes to be used.


The ID may be meaningful to you but not necessarily the computer. The database software is not going to be intelligent enough to pick up on a pattern like that (unless you program it to do so, obviously, which will be more trouble than its worth) so really all you've done is create a potential problem for yourself down the road when there's a conflict with IDs that you didn't foresee.

I understand the point that you're trying to make but good database design involves making things as easy as possible for the database engine to write and read. You're better off setting indexes and studying your database performance to find areas where you can optimize.


Zooko's Triangle and the concept of petnames may be relevant here.


I thought about making this a comment but I'm afraid it might be too involved.

I think the consensus opinion is that in general IDs shouldn't have meaning. Perhaps if you were to constrain your question more to the specifics of your scenario the opinion would be different?

Based on your comments it sounded like you are doing a data load from a spreadsheet and I assume you are using the meaningful IDs as a way of determining the relationships between different data?

Is there a reason why you can't let the database handle auto-increment IDs but let the users (developers?) define codes. This way you can maintain referential integrity via foreign keys and also be properly normalized. If you really need to have some sort of quick look at the data then you could have a computed column with some sort of naming convention. It might even be more meaningful for your needs?

e.g.

Code Description
==== ===========
F    Fruit
V    Veggie

Product Code Product Type Product Description
============ ============ ===================
AP           F            Apple
BA           F            Banana

Location Code Location Description
============= ====================
AFTS          Aisle four top shelf
AFBS          Aisle four bottom shelf


Product Code Location 
============ ========
AP           AFTS 
BA           AFTS

Actually Location can be further normalized to aisles and shelves but you get the idea.

When the data is inserted into the database, IDs are created for each record and the relationships can be determined by the code and the foreign keys can be set to the proper ID. Your application can then deal with the IDs without knowledge of the codes.

So the Product Location would look something like:

Product ID Location ID
========== ===========
1          1 
2          1

If you still need something more descriptive you could just do a join in your SQL to get the codes or create a Computed Column or your app could map the IDs to the codes from a cache.

e.g.

Product ID Location ID ProductCode_LocationCode
========== =========== ========================
1          1           AP_AFTS
2          1           BA_AFTS

That would be a bit of a performance hit and I still don't really see the point but maybe that helps you?

OK, that was way too long. :)


I don't think it makes much difference. I always tend to reseed my id's when I have the opportunity, but that's just me. I suppose it would be useful have some order in your id's, if you refer to them in code [enum's for example], other than that I wouldn't worry about it.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜