开发者

Data warehouse for user data - design Q

How to best store user data vs date/time dimension? Usecase is I am trying to store user actions per day, per hour. Such as number of Shares, likes, friends etc. I have a time table and a date table. For time it is easy - i have each row =开发者_StackOverflow社区 user_id and colunms = 1 to 24 for each hour of the day. But problem is for dates. If i give each day = 1 colunm then i will have 365 colunms a year. I cannot archive the data way either because analytic needs past data too. What are the other strategies?


Data warehouse for user data - design Q

dimDate : 1 row per date
dimTime : 1 row per minute

At the beginning you have to state the "grain" of the fact table and then stick to it.

If the grain is one day, then TimeKey always points to the key of "23:59".

If the grain is one hour, then TimeKey points to entries of "HH:59".

If the grain is one minute, then TimeKey points to the respective "HH:MM"

If the grain is 15 minutes, then TimeKey points to the respective "HH:14", "HH:29", "HH:44", "HH:59"

And so on...

-- How many new friends did specific user gain
-- in first three months of years 2008, 2009 and 2010
-- between hours 3 and 5 in the morning
-- by day of week
-- not counting holidays ?

select
      DayOfWeek
    , sum(NewFriends) as FriendCount
from factUserAction as f
join dbo.dimUser    as u on u.UserKey = f.UserKey
join dbo.dimDate    as d on d.DateKey = f.DateKey
join dbo.dimTime    as t on t.TimeKey = f.TimeKey
where CalendarYear between 2008 and 2010
  and MonthNumberInYear between 1 and 3
  and t.Hour between 3 and 5
  and d.IsHoliday = 'no'
  and UserEmail = 'john_doe@gmail.com' 
group by DayOfWeek
order by DayOfWeek ;


You would store the date in the dimension, and then add computed fields such as day_of_year.

On the designs I've worked on, we've never had more granular time slices than day, but I can't see why one couldn't have a time dimension based on date-hour, as the grain?

user_activity_facts(
   time_key references time_dimension(time_key)
  ,user_key references user_dimension(user_key)
  ,measure1
  ,measure2
  ,measure3
  ,primary key(time_key, user_key)
)
partition by range(time_key)(
   ...
)
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜