What database table layout should I use for quick retrieval of aggregated / distinct data for a date range?
I am writing a web app to analyze my web server log.
I plan to have an SQL开发者_开发知识库 job, run daily, to denormalize my web server log in a SQL database, so that the web app does not read the raw web server log.
I would like for the web app user to input a date range, then have the web app return:
- a table containing each browser in one column and in the next column the number of unique client IPs for that date range
- a table containing each OS in one column and in the next column the number of unique client IPs for that date range
- a table containing each browser + OS in one column and in the next column the number of unique client IPs for that date range
(You can see this idea in Google Analytics.)
We have about 100,000 unique client IPs per month, and I wish to hold denormalized data for a year (though many of these client IPs will be the same month-to-month).
- What would be a table layout in which to put the denormalized information?
- What would be SQL queries for web app to efficiently retrieve the desired information?
(I am not asking how to have the SQL job write to these tables; this I can figure out.)
I would sum up the number of access by the SQL-job and put the daily result into table like [logsum]:
Table [logsum]:
sum_id (int / auto_increment)
sum_day (date)
sum_name (string)
sum_count (number)
and the denormalized-log-data to [logaccess]:
Table [logaccess]:
access_id (int / auto_increment)
access_day (date)
access_ip (string)
access_browser (string)
access_os (string)
access_click_count (int)
SQL-job:
1) Add all log-entries to [Loginfo] and sum-up clicks per IP and day
for each line in log
{
info = parse(line)
execute_sql('REPLACE logaccess
SET
access_day=date(),
access_ip='& info[IP] &',
access_browser'& info[browser] &',
access_os='& info[OS] &',
access_click_count=IF(ISNULL(access_click_count),0,access_click_count) + 1)
WHERE access_day=date() AND access_ip='& info[IP] &';')
}
2) Sum up and save to [logsum]:
//- get OS count per day
res = execute_sql('SELECT access_day, access_os, count(access_id) AS C FROM logaccess GROUP BY access_day, access_os;');
//- write to [logsum]
for each record in res
{
REPLACE logsum SET
sum_day=record['access_day'],
sum_name=record['access_os'],
sum_count=record['c']
WHERE sum_day=record['access_day'] AND sum_name=record['access_os'];
}
//- get browser count per Day
res = execute_sql('SELECT access_day, access_browser, count(access_id) AS C FROM logaccess GROUP BY access_day, access_browser;');
//- write to [logsum]
for each record in res
{
REPLACE logsum SET
sum_day=record['access_day'],
sum_name=record['access_browser'],
sum_count=record['c']
WHERE sum_day=record['access_day'] AND sum_name=record['access_browser'];
}
//- get IP count per Day
res = execute_sql('SELECT access_day, count(access_id) AS C FROM logaccess GROUP BY access_day;')
//- write to [logsum]
for each record in res
{
REPLACE logsum SET
sum_day=record['access_day'],
sum_name='ip',
sum_count=record['c']
WHERE sum_day=record['access_day'] AND sum_name='ip';
}
//- get click count per Day
res = execute_sql('SELECT access_day, sum(access_click_count) AS C FROM logaccess GROUP BY access_day;')
//- write to [logsum]
for each record in res
{
REPLACE logsum SET
sum_day=record['access_day'],
sum_name='clicks',
sum_count=record['c']
WHERE sum_day=record['access_day'] AND sum_name='clicks';
}
3) clean up / delete entrys older then 1 year:
DELETE FROM logaccess WHERE access_day<DATE_ADD(date(),INTERVAL 1 year)
maybe this helps you Regards Thomas
精彩评论