simple table design question
I'm trying to think ahead a little and avoid myself some extra pain if possible.
I've had this problem in past applications and have usually opted for the most verbose approach but would like the opinions of a few others.
If you have a basic table such as below, is it wise, and/or more efficient to include a field which includes a calculation from the information which could be found from two other columns. IE:
+-----+---------+------------+-------+--开发者_运维问答------+-------+
| id | room_id | bookdate | price | people | total |
+-----+---------+------------+-------+--------+-------+
| 414 | 132 | 2010-03-01 | 14.55 | 2 | 29.10 |
| 415 | 132 | 2010-03-02 | 14.55 | 2 | 29.10 |
| 416 | 132 | 2010-03-03 | 14.55 | 2 | 29.10 |
+-----+---------+------------+-------+--------+-------+
The information in the last field could be extracted from the product of the previous two, therefore it is redundant and unnecessary. Are there instances when it could still be worthwhile to have it?
As a rule of thumb, I don't store values that can be calculated (especially ones that can be easily calculated) on the fly unless there is a performance issue and I need to save some processing time.
This is a classic tradeoff between performance and storage. I would recommend calculating the value until you need a performance boost.
Maybe create a table, that contains all fields except last one, and then create a view that contains all fields and counts the last fields automatically?
So the table would contain only these fields
+-----+---------+------------+-------+--------+
| id | room_id | bookdate | price | people |
+-----+---------+------------+-------+--------+
| 414 | 132 | 2010-03-01 | 14.55 | 2 |
And the definition of the view, that calculates the total is also very simple:
select *, price*people as total from rooms
(assuming that your table is called rooms
The general rule is that you shouldn't store what you can easily calculate, but if you've identified this area as being a performance bottleneck—by profiling your app, not guesswork—then do it.
If you decide to denormalize for read performance, you can add a check constraint to enforce consistency.
create table rooms (
price numeric,
people numeric,
total numeric check (total=price*people));
This will add slight overhead to inserts and updates.
I'm often in favor of a calculated field assuming you do it correctly by defining the field in the database as calculated. This way the calculation is always applicable no matter how the data changes. I'd only do this this though if you are going to need to get those calculations in reports that contain many records. Sure it's easy to write the formula inthe query, but if you calculate this number frequently you are wasting server resources (a calculated field only does the calc when the information changes) and possibly seriously slowing down the query if it must do the calc for millions of records for reports. A materialized view is also a good idea (becasue it will pre-calulate), but a regular view just gets you out of writing the calc multiple times, it doesn't have the performance benefit of a calculated field. On the other hand, I never create views if I don't need to (I.e. I can solve the problem so other way) as they can get you into real performance trouble when people start creating views on top of views. Don't use a hammer when a screwdriver is what you need.
Calculated fields are powerful tools when used properly and are often overlooked by database designers.
If you're doing this for convenience when writing queries, I would create a view that includes the total instead.
Otherwise, this is a matter of normalization. Sometimes denormalizing a table is acceptable. Denormalization, especially in an environment like a data warehouse, can be used to increase performance. However, it is important to make sure your data stays consistent. In other words, you need to make sure your total
field gets updated when either price
or people
change.
In practice I consider this a last resort, only to be used when other performance optimizations fall short. Plus, denormalization does not guarantee an improvement - depending on the amount of data and other factors, it could actually make things worse.
Note: a table cannot be 3NF (third normal form) until calculated fields are removed.
If you're worried about select performance (at least with a WHERE total = xx.xx) you could just add an index.
CREATE INDEX booking_total ON booking ((price*people));
That will change the query plan for SELECT * from booking where price*people = 58.2;
from this;
Seq Scan on booking (cost=0.00..299.96 rows=60 width=24) (actual time=0.015..2.926 rows=1 loops=1)
Filter: ((price * (people)::double precision) = 58.2::double precision)
Total runtime: 2.947 ms
to this
Bitmap Heap Scan on booking (cost=4.30..20.83 rows=5 width=24) (actual time=0.016..0.016 rows=1 loops=1)
Recheck Cond: ((price * (people)::double precision) = 58.2::double precision)
-> Bitmap Index Scan on booking_total (cost=0.00..4.29 rows=5 width=0) (actual time=0.009..0.009 rows=1 loops=1)
Index Cond: ((price * (people)::double precision) = 58.2::double precision)
Total runtime: 0.044 ms
PostgreSQL rocks :-)
I'd go ahead and put in the TOTAL field. From what I can see here there's no 'DISCOUNT' or similar field which might reduce the total, but I can imagine scenarios where the price * the number of people might not equal the total. You might want to consider a COMMENTS field or even a table to allow someone to note why the total doesn't match the product of the other fields.
Share and enjoy.
Basically i prefer not to have a "total" field, or any field that is calculated by other fields, not in the same table nor from other tables. If the price field will change, someone might "forget" to update the total field and you will end up with wrong data.
It is very easy to SELECT using this field: SELECT price,people,(price*people) AS total FROM some_table;
The only case i guess it is ok to keep a calculated field is when it takes long time to calculate it and it will overload the database on huge amount of data.
BR
It is generally considered bad practice to store fields that can be simply calculated from other fields in your table. The only time I would recommend it, would be when you need to store the result of a complex calculation and it is easier to store the calculated value than to recalculate the value every time - however in your case this doesn't seem necessary.
The other problem with calculated fields is that the original values used for the calculation can be changed without modifying the stored result, causing potential problems in your application.
As you can calculate the value -- quite easily in this case -- it is redundant. You should almost never store redundant data. This means that every place where you update either price or people, you must be sure to update total. If you forget to do this in even one place, the data is now inconsistent. So suppose you now have a record that says price=$10, people=3, total=$40. If you have different programs displaying information in different ways -- different totals or subsets or whatever -- the user could get different answers to the same question depending on how he asked it. While it's bad to get a wrong answer, it's even worse to sometimes get a right answer and sometimes a wrong answer, because then it may not be clear how to fix the problem. I mean, if I see that a certain customer shows 2 people when he should show 3, presumably there is some screen I can go to, overtype the 2 with a 3, click save or whatever, and it's fixed. But if it says $10 times 2 people = $30, where do I go to fix it? How?
You may say that the record is only updated in one place so there's no problem. But that's today. What if tomorrow you or some other programmer adds a new function to do a different sort of update?
I'm working on a system right now that is filled with redundant data. Basic information about each of our company's products is stored in an "item" table. For each unit in stock we have a stock record, and instead of simply referring to the item record, they copy all the data for each stock unit. When an item is sold, we copy all the data to the sale record. If something is returned, we copy all the data to the return record. Etc etc for several other record types. This causes endless trouble. We once had a problem where a user ran a query looking for items with certain characteristics and the list of hits included items that did not meet the search criteria. Why? Because the query finds all the item records that meet the search criteria, that tries to match those item records against stock records by part number ... but some of the stock records did not match the item record on other criteria for various reasons. Right now I'm working on fixing a problem where cost data is not always copied from stock records to sale records properly. I'd love to just redesign the database to eliminate all the redundant data, but that would be a huge project.
Sure, there are times when the performance penalty to recalculate some piece of data is just too high. Like, if you need to read thousands of transaction records to calculate the current balance, and you regularly want to display the current balance, that might just be too much of a performance burden and you're better to store it redundantly. But I would be very slow to do that sort of thing. Make sure it's really really a serious performance issue.
Multiplying two numbers together that are in a record that you're already reading anyway? No way. I can't imagine that that would cause any performance issues. If you're database engine can't multiply two numbers in a tiny percentage of the time that it takes to read a record, get a new database engine.
精彩评论