What database table layout should I use for quick retrieval of aggregated / distinct data for a date range?

I am writing a web app to analyze my web server log.

I plan to have an SQL job, run daily, to denormalize my web server log into a SQL database, so that the web app does not read the raw web server log.

I would like for the web app user to input a date range, then have the web app return:

  • a table containing each browser in one column and in the next column the number of unique client IPs for that date range
  • a table containing each OS in one column and in the next column the number of unique client IPs for that date range
  • a table containing each browser + OS in one column and in the next column the number of unique client IPs for that date range

(You can see this idea in Google Analytics.)

We have about 100,000 unique client IPs per month, and I wish to hold denormalized data for a year (though many of these client IPs will be the same month-to-month).

  1. What table layout should I use to hold the denormalized information?
  2. What SQL queries should the web app use to retrieve the desired information efficiently?

(I am not asking how to have the SQL job write to these tables; this I can figure out.)


I would have the SQL job sum up the number of accesses and put the daily results into a table like [logsum]:

Table [logsum]:
sum_id (int / auto_increment)
sum_day (date)
sum_name (string)
sum_count (number)

and the denormalized log data into [logaccess]:

Table [logaccess]:
access_id (int / auto_increment)
access_day (date)
access_ip (string)
access_browser (string)
access_os (string)
access_click_count (int)
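
A minimal MySQL sketch of these two tables; the unique keys are my assumption, added so the job can keep exactly one row per day/IP and one row per day/name:

CREATE TABLE logaccess (
  access_id          INT AUTO_INCREMENT PRIMARY KEY,
  access_day         DATE NOT NULL,
  access_ip          VARCHAR(45) NOT NULL,   -- long enough for IPv6
  access_browser     VARCHAR(100),
  access_os          VARCHAR(100),
  access_click_count INT NOT NULL DEFAULT 0,
  UNIQUE KEY uq_day_ip (access_day, access_ip)
);

CREATE TABLE logsum (
  sum_id    INT AUTO_INCREMENT PRIMARY KEY,
  sum_day   DATE NOT NULL,
  sum_name  VARCHAR(100) NOT NULL,
  sum_count INT NOT NULL DEFAULT 0,
  UNIQUE KEY uq_day_name (sum_day, sum_name)
);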

SQL-job:

1) Add all log entries to [logaccess] and sum up clicks per IP and day:

//- upsert one row per (access_day, access_ip); relies on the unique key on those two columns
for each line in log
{
  info = parse(line)
  execute_sql("INSERT INTO logaccess
                 (access_day, access_ip, access_browser, access_os, access_click_count)
               VALUES
                 (CURDATE(), '" & info[IP] & "', '" & info[browser] & "', '" & info[OS] & "', 1)
               ON DUPLICATE KEY UPDATE
                 access_browser = VALUES(access_browser),
                 access_os = VALUES(access_os),
                 access_click_count = access_click_count + 1;")
}
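
If the raw log is first bulk-loaded into a staging table (for example with LOAD DATA INFILE), this step could also be done with one set-based statement instead of a per-line loop. A sketch, assuming a hypothetical staging table raw_log(log_day, ip, browser, os):

INSERT INTO logaccess (access_day, access_ip, access_browser, access_os, access_click_count)
SELECT log_day, ip, MAX(browser), MAX(os), COUNT(*)   -- MAX() just picks one browser/OS per IP, matching the one-row-per-IP layout
FROM raw_log
WHERE log_day = CURDATE()
GROUP BY log_day, ip
ON DUPLICATE KEY UPDATE
  access_click_count = access_click_count + VALUES(access_click_count);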

2) Sum up and save to [logsum]:

//- unique IPs per OS per day
res = execute_sql('SELECT access_day, access_os, COUNT(DISTINCT access_ip) AS c FROM logaccess GROUP BY access_day, access_os;');

//- write to [logsum] (REPLACE matches the unique key on (sum_day, sum_name))
for each record in res
{
   REPLACE INTO logsum SET
     sum_day=record['access_day'],
     sum_name=record['access_os'],
     sum_count=record['c'];
}


//- unique IPs per browser per day
res = execute_sql('SELECT access_day, access_browser, COUNT(DISTINCT access_ip) AS c FROM logaccess GROUP BY access_day, access_browser;');

//- write to [logsum]
for each record in res
{
   REPLACE INTO logsum SET
     sum_day=record['access_day'],
     sum_name=record['access_browser'],
     sum_count=record['c'];
}

//- unique IPs per day
res = execute_sql('SELECT access_day, COUNT(DISTINCT access_ip) AS c FROM logaccess GROUP BY access_day;')

//- write to [logsum]
for each record in res
{
   REPLACE INTO logsum SET
     sum_day=record['access_day'],
     sum_name='ip',
     sum_count=record['c'];
}

//- total clicks per day
res = execute_sql('SELECT access_day, SUM(access_click_count) AS c FROM logaccess GROUP BY access_day;')

//- write to [logsum]
for each record in res
{
   REPLACE INTO logsum SET
     sum_day=record['access_day'],
     sum_name='clicks',
     sum_count=record['c'];
}
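
Each of these loops could also be pushed into the database as a single INSERT ... SELECT, avoiding the row-by-row round trips. A sketch for the OS metric (the others follow the same pattern; it relies on the unique key on (sum_day, sum_name) assumed in the table sketch above):

INSERT INTO logsum (sum_day, sum_name, sum_count)
SELECT access_day, access_os, COUNT(DISTINCT access_ip)
FROM logaccess
WHERE access_day = CURDATE()   -- drop this line to recompute every stored day, as the loops above do
GROUP BY access_day, access_os
ON DUPLICATE KEY UPDATE sum_count = VALUES(sum_count);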

3) Clean up / delete entries older than one year:

DELETE FROM logaccess WHERE access_day < DATE_SUB(CURDATE(), INTERVAL 1 YEAR);
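
For the date-range reports the question asks for, note that unique client IPs over a range cannot be obtained by adding up the per-day counts in [logsum] (the same IP would be counted once per day), so the web app should run those against [logaccess] directly. A sketch, with :start and :end standing in for the user-supplied dates:

-- unique IPs per browser for the selected range
SELECT access_browser, COUNT(DISTINCT access_ip) AS unique_ips
FROM logaccess
WHERE access_day BETWEEN :start AND :end
GROUP BY access_browser;

-- unique IPs per OS for the selected range
SELECT access_os, COUNT(DISTINCT access_ip) AS unique_ips
FROM logaccess
WHERE access_day BETWEEN :start AND :end
GROUP BY access_os;

-- unique IPs per browser + OS combination for the selected range
SELECT CONCAT(access_browser, ' / ', access_os) AS browser_os,
       COUNT(DISTINCT access_ip) AS unique_ips
FROM logaccess
WHERE access_day BETWEEN :start AND :end
GROUP BY access_browser, access_os;

Simple per-day totals such as clicks can still come from [logsum] by summing sum_count over the same range.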

Maybe this helps you. Regards, Thomas
