
Combine three tables into one, or too many columns?

I am tracking clicks over three time periods: the past day, past week and past month.

To do this, I have three tables:

  • An hourly table, with columns link_id, two other attributes, and hour_1 to hour_24, together with a computed column giving the sum

  • A weekday table, with columns link_id, two other attributes, and day_1 to day_7, together with a computed column giving the sum

  • A monthday table, as above, with columns day_1 to day_31

When a click comes in, I store its key attributes (href, description, etc.) in other tables, and insert or update the row(s) corresponding to the link_id in each of the above tables.

Each link can have several entries in each of the above hourly/weekday/monthday tables, depending on the two other attributes (e.g. where the user is sitting).

So if a user is Type A and sitting in X, three rows are created or incremented in the above tables: the first row records all clicks on that link over the time period, the second all clicks by "Type A" people, and the third all clicks by people in X.

I have designed it this way as I didn't want to have to move data around each hour/day/week/month. I just maintain pointers for "current hour" (1-24), "current day" (1-31) and "current weekday" (1-7), and write to the corresponding cells in the tables. When we enter a new period (e.g. "3pm-4pm"), I can just blank out that current column (e.g. hour_15), then start incrementing it for links as they come in. Every so often I can delete old rows which have fallen down to "all zero".
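In rough T-SQL, the scheme looks something like this (a sketch only; hourly_stats, attr1, attr2 and the @-parameters are placeholder names, not my real schema):

-- On entering the 3pm-4pm period, blank out the column about to be reused
UPDATE hourly_stats SET hour_15 = 0

-- Then, for each incoming click, increment the cell for the current hour
UPDATE hourly_stats
SET hour_15 = hour_15 + 1
WHERE link_id = @link_id AND attr1 = @attr1 AND attr2 = @attr2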

This way I shouldn't ever have to move around column data, which would likely be very expensive for what will potentially be tens of thousands of rows.

I will only be SELECTing either the current day/weekday/hour rows (prior to inserting/updating) or the TOP 20 values from the computed columns based on the attributes (and will likely cache these results for an hour or so).
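The "top 20" read would be something along these lines (placeholder names again, with total_clicks standing for the computed sum column):

SELECT TOP 20 link_id, total_clicks
FROM hourly_stats
WHERE attr1 = @attr1 AND attr2 = @attr2
ORDER BY total_clicks DESC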

After the tables populate, UPDATES will far exceed INSERTs as there aren't that many unique hrefs.

Three questions:

  • Is it OK to combine the three big tables into one big table of monthdays/weekdays/hours? This would give a table with 64 columns, which may be overkill. On the other hand, keeping them separate, as they are now, triples the number of INSERT/UPDATE statements needed. I don't know enough about SQL Server to know which is best.

  • Is this approach sensible? Most data sets I've worked with of course have a separate row per item and you would then sort by date -- but when tracking clicks from thousands of users this would give me many hundreds of thousands of rows, which I would have to cull very often, and ordering and summing them would be hideous. Once the tracker is proven, I have plans to roll the click listener out over hundreds of pages, so it needs to scale.

  • In terms of design, there is clearly some redundancy in having both weekdays and monthdays. However, this was the only way I could think of to maintain a pointer to a column, update it quickly, and still use a computed column. If I eliminated the weekdays table, I would need an additional computed column on the "monthdays" table that summed the previous 7 days (e.g. if today is the 21st, sum day_14 through day_20). That calculation would have to change every day, which I imagine would be expensive. Hence the additional "weekday" table for a simple static calculation, as sketched below. I value simple and fast calculations more highly than small data storage.
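For reference, the static weekday sum is just a computed column over a fixed set of columns, something like (placeholder names again):

ALTER TABLE weekday_stats
ADD total_clicks AS (day_1 + day_2 + day_3 + day_4 + day_5 + day_6 + day_7)

whereas the rolling seven-day version would have to reference the current date, and SQL Server will not let a computed column that uses GETDATE() be persisted or indexed.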

Thanks in advance!


Any time you see columns with numbers in their names, such as column_1, column_2, column_3..., your "horrible database design" flag should go up. (FYI, you are breaking 1NF here; specifically, you are repeating groups across columns.)

Now, it is possible that such an implementation can be acceptable (or even necessary) in production, but conceptually it is definitely wrong.

As Geert says, conceptually two tables will suffice. If performance is an issue you could denormalize the data for the weekly/monthly stats, but even then I would not model it as above; I would keep something like:

CREATE TABLE base_stats ( link_id INT, click_time DATETIME )
CREATE TABLE daily_stats ( link_id INT, period DATETIME, clicks INT )

You can always aggregate with

SELECT link_id, COUNT(*) AS clicks, CAST(click_time AS DATE) AS day
FROM base_stats
GROUP BY link_id, CAST(click_time AS DATE)

which can be run periodically to fill the daily_stats. If you want to keep it up to date you can implement it in triggers (or if you really must, do it on the application side). You can also denormalize the data on different levels if necessary (by creating more aggregate tables, or by introducing another column in the aggregated data table), but that might be premature optimization.
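For example, such a periodic job could look like this (a sketch assuming SQL Server 2008+; @cutoff is just an illustrative variable):

DECLARE @cutoff DATETIME = CAST(GETDATE() AS DATE)  -- midnight today

BEGIN TRANSACTION

-- Fold the detail rows into the daily aggregates...
INSERT INTO daily_stats (link_id, period, clicks)
SELECT link_id, CAST(click_time AS DATE), COUNT(*)
FROM base_stats
WHERE click_time < @cutoff
GROUP BY link_id, CAST(click_time AS DATE)

-- ...then purge the detail that was just aggregated
DELETE FROM base_stats
WHERE click_time < @cutoff

COMMIT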

The above design is much cleaner for future ad-hoc analysis (which will happen with stats). For other benefits, see the Wikipedia article on repeating groups.

EDIT: Even though the solution with the two tables base_stats and daily_stats was accepted, with the following strategy:

  • insert each click in base_stats
  • periodically aggregate the data from base_stats into daily_stats and purge the full detail

it might not be the optimal solution. Based on discussions and clarification of the requirements, it seems that the base_stats table is not necessary. The following approach should also be investigated:

CREATE TABLE period_stats ( link_id INT, period DATETIME, clicks INT, ... )

Updates are easy with

UPDATE period_stats 
SET clicks = clicks + 1 
WHERE period = @dateTime AND link_id = @url AND ...

The cost of updating this table, when it is properly indexed, is comparable to inserting rows into base_stats, and it is also easy to use for analysis:

SELECT link_id, SUM(clicks)
FROM period_stats
WHERE period between @dateTime1 AND @dateTime2
GROUP BY ...
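For the update to be cheap, the table needs an index that matches the WHERE clause; something like this (the index name is made up, and the attribute columns hidden behind the "..." would extend the key):

CREATE CLUSTERED INDEX ix_period_stats
ON period_stats (period, link_id)

This serves both the single-row seek in the UPDATE and the period range scan in the aggregate query.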


Denormalization as you have done in your database can be a good solution for some problems. In your case, however, I would not choose the above solution, mainly because you lose information that you might need in the future: maybe you will want to report on half-hour intervals at some point. Looking at your description, you could do with only two tables: Links (hrefs and descriptions) and Clicks on those links (containing the date and time of the click and maybe some other data). The drawback, of course, is that you have to store hundreds of thousands of records, and querying that amount of data can take a lot of time. If that is the case, you might consider storing aggregate data from these two tables in separate tables and updating those on a regular basis.
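A minimal sketch of those two tables (all names and types here are assumptions):

CREATE TABLE links (
    link_id     INT IDENTITY PRIMARY KEY,
    href        NVARCHAR(2000) NOT NULL,
    description NVARCHAR(500)  NULL
)

CREATE TABLE clicks (
    link_id    INT          NOT NULL REFERENCES links (link_id),
    click_time DATETIME     NOT NULL,
    user_type  CHAR(1)      NULL,  -- e.g. 'A'
    location   NVARCHAR(50) NULL   -- e.g. 'X'
)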


That design is really bad. Unreason's proposal is better.
If you want to make it nice and easy, you could as well have a single table with four fields:

   timeSlice  
   clickCount  
   location
   userType 

with timeSlice holding the date and time rounded down to the hour. All the rest can be deduced from that, and you would have only 24 * 365 * #locations * #types records per year.

Depending on the configuration and what is feasible, with this table design you could also accumulate the counts in memory and only update the table every 10 seconds, or any interval up to an hour, depending on the acceptable risk of losing data.
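Put together, a sketch of this design could be as follows (names are assumptions; @loc and @type stand for the incoming click's attributes, and concurrent writers would need MERGE or locking hints, omitted here):

CREATE TABLE click_counts (
    timeSlice  DATETIME     NOT NULL,  -- date and time rounded down to the hour
    location   NVARCHAR(50) NOT NULL,
    userType   CHAR(1)      NOT NULL,
    clickCount INT          NOT NULL DEFAULT 0,
    PRIMARY KEY (timeSlice, location, userType)
)

-- Round the current time down to the hour (a common T-SQL idiom)
DECLARE @slice DATETIME = DATEADD(HOUR, DATEDIFF(HOUR, 0, GETDATE()), 0)

-- Increment if the row exists, otherwise create it
UPDATE click_counts
SET clickCount = clickCount + 1
WHERE timeSlice = @slice AND location = @loc AND userType = @type

IF @@ROWCOUNT = 0
    INSERT INTO click_counts (timeSlice, location, userType, clickCount)
    VALUES (@slice, @loc, @type, 1)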
