tagged dimensions in data warehouse
In my data warehouse I got a dimension where each record can have one or several tags. Are there any straight forward ways to model a dimension with different tags?
My initial idea was to store the comma separated list, and then use MySQL:s FIND_IN_SET() function to test for the presence of a tag. This allows me to slice the data.
But how should I model my scheme if I want to be able to group by tag?
Example:
Two products: Product A and product B. A is tagged with "foo", "bar". B is tagged with "bar", "baz".
Query: sales amounts, grouped by product tag. Group for tag "bar" must include sales for both products A and B:
foo -> sales for A
bar -> sales fo开发者_如何学运维r B and A
baz -> sales for C
For example you could store your data duplicated by tag.
If you have sales a=10 foo + bar and b=20 bar + baz you could store it like this.
sales_by_tag_facts
id, tag, sale_id, amount, is_primary_record
1 foo A 10 true
2 bar A 10 false
3 bar B 20 true
4 baz B 20 false
select sum(amount) from sales_by_tag_facts group by tag; // by tag
select sum(amount) from sales_by_tag_facts where is_primary_record=true; // without tag.
Why not shift the heavy lifting from reporting to the transaction booking
You can add a new table called:
TagTotal that stores the total amount per tag and updates with with every transaction via a BEFORE(AFTER)_UPDATE_EACH trigger.
Extra fields/table
If you have 2 extra fields in the product table:
product.amount decimal(10,2) running total of sales to date
product.last_sale date date of the last sale
And the tag totals table looks like this.
tag.id primary autoinc
tag.tagstr varchar(25)
tag.amount decimal(10,2)
tag.date_from date #dates to keep the running totals per month/week/day.
tag.date_to date
pseudo code
CREATE TRIGGER ai_sales_each AFTER INSERT ON sales FOR EACH ROW
BEGIN
UPDATE product SET product.amount = product.amount + new.amount,
product.last_sale = sale.date
END
CREATE TRIGGER au_product_each AFTER UPDATE ON product FOR EACH ROW
BEGIN
DECLARE AllTags VARCHAR(255);
DECLARE ThisTag VARCHAR(25);
IF old.tags <> new.tags THEN BEGIN
reorganize the running totals in the tagtotal table.
END; END IF;
SET AllTags = new.tags;
WHILE AllTags has more tags BEGIN
SET ThisTag = NextTag(AllTags);
UPDATE TagTotals SET amount = amount + new.amount
WHERE TagTotals.tagstr = ThisTag
AND new.last_date BETWEEN TagTotals.date_from AND TagTotals.date_to;
END; END WHILE;
Now if you want the sales totals per tag, you can just select from the tagtotals table.
Which will give instant results.
I would recommend against that, it goes against the normalization rules.
See I keep messing up 1NF
Or read up on the posts under the normalization tag.
Redesign suggestion for tables
If you make a tag and taglink table like so.
table tag
id autoincrement integer primary index
tag_str varchar(20) index unique not null
table taglink
id autoincrement integer primary index #gotta have an ID in my book.
tag_id integer not null
product_id integer not null
and you have a sales table something like this.
table product
id autoincement integer primary index
desc varchar(255)
barcode, price, whatever ...
select statement to find products per tag
Than you can lookup articles that match a tag as follows.
select * from product
inner join taglink on (product.id = taglink.product_id)
inner join tag on (taglink.tag_id = tag.id)
where tag.tag_str in ('foo','bar','baz');
select statement to list tags per product
select tag_str from tag
inner join taglink on (taglink.tag_id = tag.id)
inner join product on (taglink.product_id = product.id)
where product.barcode = '4548215' or product.desc like 'OMG Po%'
Adding new tags
To add a new tag, just
insert into tag (id, tag_str) values (
null /*remember autoincrement*/
,'Mytag');
linking a tag
To link a tag to a product
set @product_id = 10;
set @tag_id = 1;
...or...
select @product_id:= product.id from product where product.barcode = '1254851';
...
insert into taglink (id, product_id, tag_id) values (
null /*autoinc id*/
,@product_id
,@tag_id );
You can link an unlimited number of tags to a product and you don't slow your queries down with costly FIND_IN_SET
statements.
And you prevent duplicate tags.
And your database will be faster and smaller.
精彩评论