Data warehouse for user data - design Q
How to best store user data vs date/time dimension? Usecase is I am trying to store user actions per day, per hour. Such as number of Shares, likes, friends etc. I have a time table and a date table. For time it is easy - i have each row =开发者_StackOverflow社区 user_id and colunms = 1 to 24 for each hour of the day. But problem is for dates. If i give each day = 1 colunm then i will have 365 colunms a year. I cannot archive the data way either because analytic needs past data too. What are the other strategies?
dimDate : 1 row per date
dimTime : 1 row per minute
At the beginning you have to state the "grain" of the fact table and then stick to it.
If the grain is one day, then TimeKey
always points to the key of "23:59".
If the grain is one hour, then TimeKey
points to entries of "HH:59".
If the grain is one minute, then TimeKey
points to the respective "HH:MM"
If the grain is 15 minutes, then TimeKey
points to the respective "HH:14", "HH:29", "HH:44", "HH:59"
And so on...
-- How many new friends did specific user gain
-- in first three months of years 2008, 2009 and 2010
-- between hours 3 and 5 in the morning
-- by day of week
-- not counting holidays ?
select
DayOfWeek
, sum(NewFriends) as FriendCount
from factUserAction as f
join dbo.dimUser as u on u.UserKey = f.UserKey
join dbo.dimDate as d on d.DateKey = f.DateKey
join dbo.dimTime as t on t.TimeKey = f.TimeKey
where CalendarYear between 2008 and 2010
and MonthNumberInYear between 1 and 3
and t.Hour between 3 and 5
and d.IsHoliday = 'no'
and UserEmail = 'john_doe@gmail.com'
group by DayOfWeek
order by DayOfWeek ;
You would store the date in the dimension, and then add computed fields such as day_of_year.
On the designs I've worked on, we've never had more granular time slices than day, but I can't see why one couldn't have a time dimension based on date-hour, as the grain?
user_activity_facts(
time_key references time_dimension(time_key)
,user_key references user_dimension(user_key)
,measure1
,measure2
,measure3
,primary key(time_key, user_key)
)
partition by range(time_key)(
...
)
精彩评论