开发者

tagged dimensions in data warehouse

In my data warehouse I got a dimension where each record can have one or several tags. Are there any straight forward ways to model a dimension with different tags?

My initial idea was to store the comma separated list, and then use MySQL:s FIND_IN_SET() function to test for the presence of a tag. This allows me to slice the data.

But how should I model my scheme if I want to be able to group by tag?

Example:

Two products: Product A and product B. A is tagged with "foo", "bar". B is tagged with "bar", "baz".

Query: sales amounts, grouped by product tag. Group for tag "bar" must include sales for both products A and B:

foo -> sales for A
bar -> sales fo开发者_如何学运维r B and A
baz -> sales for C


For example you could store your data duplicated by tag.

If you have sales a=10 foo + bar and b=20 bar + baz you could store it like this.

    sales_by_tag_facts
    id, tag, sale_id, amount, is_primary_record
    1   foo  A        10     true
    2   bar  A        10     false
    3   bar  B        20     true
    4   baz  B        20     false

select sum(amount) from sales_by_tag_facts group by tag; // by tag
select sum(amount) from sales_by_tag_facts where is_primary_record=true; // without tag.


Why not shift the heavy lifting from reporting to the transaction booking

You can add a new table called:

TagTotal that stores the total amount per tag and updates with with every transaction via a BEFORE(AFTER)_UPDATE_EACH trigger.

Extra fields/table

If you have 2 extra fields in the product table:

product.amount    decimal(10,2) running total of sales to date
product.last_sale date          date of the last sale

And the tag totals table looks like this.

tag.id primary autoinc    
tag.tagstr varchar(25)     
tag.amount decimal(10,2)
tag.date_from date   #dates to keep the running totals per month/week/day.
tag.date_to   date

pseudo code

CREATE TRIGGER ai_sales_each AFTER INSERT ON sales FOR EACH ROW
BEGIN
  UPDATE product SET product.amount = product.amount + new.amount,
    product.last_sale = sale.date
END

CREATE TRIGGER au_product_each AFTER UPDATE ON product FOR EACH ROW
BEGIN
  DECLARE AllTags VARCHAR(255);
  DECLARE ThisTag VARCHAR(25);

  IF old.tags <> new.tags THEN BEGIN
    reorganize the running totals in the tagtotal table.
  END; END IF;

  SET AllTags = new.tags;
  WHILE AllTags has more tags BEGIN
    SET ThisTag = NextTag(AllTags);
    UPDATE TagTotals SET amount = amount + new.amount
      WHERE TagTotals.tagstr = ThisTag 
      AND new.last_date BETWEEN TagTotals.date_from AND TagTotals.date_to;
  END; END WHILE; 

Now if you want the sales totals per tag, you can just select from the tagtotals table.
Which will give instant results.


I would recommend against that, it goes against the normalization rules.
See I keep messing up 1NF
Or read up on the posts under the normalization tag.

Redesign suggestion for tables

If you make a tag and taglink table like so.

table tag
  id autoincrement integer primary index 
  tag_str varchar(20) index unique not null

table taglink
  id autoincrement integer primary index #gotta have an ID in my book.
  tag_id integer not null
  product_id integer not null

and you have a sales table something like this.

table product
  id autoincement integer primary index
  desc varchar(255)
  barcode, price, whatever ...

select statement to find products per tag

Than you can lookup articles that match a tag as follows.

select * from product
inner join taglink on (product.id = taglink.product_id)
inner join tag on (taglink.tag_id = tag.id)
where tag.tag_str in ('foo','bar','baz');

select statement to list tags per product

select tag_str from tag
inner join taglink on (taglink.tag_id = tag.id)
inner join product on (taglink.product_id = product.id)
where product.barcode = '4548215' or product.desc like 'OMG Po%'

Adding new tags

To add a new tag, just

insert into tag (id, tag_str) values (
  null /*remember autoincrement*/
  ,'Mytag');

linking a tag

To link a tag to a product

set @product_id = 10;
set @tag_id = 1;
...or...
select @product_id:= product.id from product where product.barcode = '1254851';
...
insert into taglink (id, product_id, tag_id) values (
  null /*autoinc id*/
  ,@product_id
  ,@tag_id );

You can link an unlimited number of tags to a product and you don't slow your queries down with costly FIND_IN_SET statements.
And you prevent duplicate tags.
And your database will be faster and smaller.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜