Data Mining Operation using SQL Query (Fuzzy Apriori Algorithm) - Coding it using SQL
So I have this Table:
Trans_ID Name Fuzzy_Value Total_Item
100 I1 0.33333333 3
100 I2 0.33333333 3
100 I5 0.33333333 3
200 I2 0.5 2
200 I5 0.5 2
300 I2 0.5 2
300 I3 0.5 2
400 I1 0.33333333 3
400 I2 0.33333333 3
400 I4 0.33333333 3
500 I1 0.5 2
500 I3 0.5 2
600 I2 0.5 2
600 I3 0.5 2
700 I1 0.5 2
700 I3 0.5 2
800 I1 0.25 4
800 I2 0.25 4
800 I3 0.25 4
800 I5 0.25 4
900 I1 0.33333333 3
900 I2 0.33333333 3
900 I3 0.33333333 3
1000 I1 0.2 5
1000 I2 0.2 5
1000 I4 0.2 5
1000 I6 0.2 5
1000 I8 0.2 5
And two blank Tables:
Table ITEMSET
"ITEM_SET" "Support"
Table Confidence
"ANTECEDENT" "CONSEQUENT"
I need to find FUZZY value for each item that occurs in each transaction:
I1 = Sum of (Fuzzy_Value from item I1 in trans 100 until 1000 which is trans: 100,400,500,700,800,900,1000)/Total Trans
-> (.33333333+0.33333333+0.5+0.5+0.25+0.33333333+0.2)/10 = 0.244999999
I2 = Sum of (Fuzzy_Value from item I2 in trans 100 - 1000 which is trans:100,200,300,400,600,800,900,1000)/Total Trans
-> (0.33333333+0.5+0.5+0.33333333+0.5+0.25+0.33333333)/10 = 0.274999999
I3 -> 0.258333333
I4 -> 0.103333333
I5 -> 0.058333333
I6 -> 0.02
I8 -> 0.02
E.g., I use minimum Support 10% -> 0.1
I need to remove I5,I6,I8 since it's value < 0.1 => prune stepthen store:
I1=0.244999999, I2=0.274999999, I3=0.258333333,I4=0.103333333 on new table 'ITEMSET'
2 COMBINATIONS
NOTE: This is the basic 1st step after this most likely need to use repeat or recursive, since the process will keep going on until no other 开发者_开发知识库item combination is possible
then from what's left I need to find K+1 itemset (which is 2 combination itemset) => join step{I1,I2} =Sum of (Fuzzy_Value from item I1 + I2 in trans 100 - 1000 which is trans:100,400,800,900,1000)/Total Trans
->(0.666666667+0.666666667+0.5+0.666666667+0.4)/9 = 0.29
*do the same for the rest*
{I1,I3} =(1+1+0.5+0.666666667)/9 = 0.316666667
{I1,I4} =(0.666666667+0.4)/9 = 0.106666667
{I2,I3} =(1+1+0.5+0.666666667)/9 = 0.316666667
{I2,I4} =(1+0.666666667+0.4)/9 =0.206666667
{I3,I4} =0
Then Do another Prune Step removing less than 0.1 value which is {I3,I4}
Store {I1,I2} = 0.29, {I1,I3} = 0.316666667, {I1,I4} =0.106666667, {I2,I3} = 0.316666667, {I2,I4} = 0.206666667 AT "ITEMSET" TABLE
3 COMBINATION
After that Do another JOIN STEP combining itemset that pass pruning
{I1,I2,I3} = Sum of (Fuzzy_Value from item I1 + I2 +I3 in trans 100 - 1000 which is trans:800,900)/Total Trans
-> 0.75+1 = 0.175
**Same for the rest**
{I1,I2,I4} = 1+0.6 = 0.16
{I2,I3,I4} = 0
Do another Prune Step removing less than 0.1 value which is {I1,I3,I4}
Store {I1,I2,I3} = 0.176 AND {I1,I2,I4} = 0,16 AT "ITEMSET" TABLE
4 COMBINATION
Combine itemset that pass pruning K+4 (4 combination)
{I1,I2,I3,I4} = 0
**since no transaction containing this item
after process stop since there's no possible combination left
At this point, ITEMSET database have:
ITEM_SET Support
{I1} 0.244999999
{I2} 0.274999999
{I3} 0.258333333
{I4} 0.103333333
{I1,I2} 0.29
{I1,I3} 0.316666667
{I1,I4} 0.106666667
{I2,I3} 0.316666667
{I2,I4} 0.206666667
{I1,I2,I3} 0.176
{I1,I2,I4} 0,16
How do I code that in sql? Thank you very much!
Note: You can add another table as needed.
Step 1:
CREATE TABLE ITEMSET
SELECT Name, SUM(Fuzzy_Value)/COUNT(*) Fuzzy_Value
FROM trans
GROUP BY ID
HAVING ROUND(SUM(Fuzzy_Value), 1) >= 0.1
Note the ROUND()
function - it's important, because you have values like .33333 that don't sum in a happy way.
Step 2:
ALTER TABLE ITEMSET ADD INDEX (Name)
SELECT a.Name Name1, b.Name Name2, SUM(Fuzzy_Value)/COUNT(*) Fuzzy_Value
FROM ITEMSET a JOIN ITEMSET b ON (a.Name != b.Name)
GROUP BY a.Name, b.Name
HAVING ROUND(SUM(Fuzzy_Value), 1) >= 0.1
Opps: I just noticed that you asked this half a year ago, so I guess there is no point in continuing. If you still need this answer leave a comment.
Try the following link use roll-up combinations oracle doc
Here is a bit different approach to this question. It differs because there is no writing to tables but giving
final result that then can be written according to conditions into one or more tables.
The first cte named "tbl" is just used for sample data creation.
The second one, "name_ids" is later used to generate different combinations of 2, 3 or more T_NAMEs (innermost query of step2, 3, 4)
The third one, "id_names" serves as a filtering dataset for different combinations that are still active and that have at least one common T_ID
All of the cte-s data are shown in the code along with the final result at the end.
Tested with Oracle 11g
WITH
tbl AS
(
Select 100 "T_ID", 'I1' "T_NAME", 0.33333333 "FUZZY_VALUE", 3 "TOTAL_ITEM" From DUAL UNION ALL
Select 100, 'I2', 0.33333333, 3 From Dual UNION ALL
Select 100, 'I5', 0.33333333, 3 From Dual UNION ALL
Select 200, 'I2', 0.5, 2 From Dual UNION ALL
Select 200, 'I5', 0.5, 2 From Dual UNION ALL
Select 300, 'I2', 0.5, 2 From Dual UNION ALL
Select 300, 'I3', 0.5, 2 From Dual UNION ALL
Select 400, 'I1', 0.33333333, 3 From Dual UNION ALL
Select 400, 'I2', 0.33333333, 3 From Dual UNION ALL
Select 400, 'I4', 0.33333333, 3 From Dual UNION ALL
Select 500, 'I1', 0.5, 2 From Dual UNION ALL
Select 500, 'I3', 0.5, 2 From Dual UNION ALL
Select 600, 'I2', 0.5, 2 From Dual UNION ALL
Select 600, 'I3', 0.5, 2 From Dual UNION ALL
Select 700, 'I1', 0.5, 2 From Dual UNION ALL
Select 700, 'I3', 0.5, 2 From Dual UNION ALL
Select 800, 'I1', 0.25, 4 From Dual UNION ALL
Select 800, 'I2', 0.25, 4 From Dual UNION ALL
Select 800, 'I3', 0.25, 4 From Dual UNION ALL
Select 800, 'I5', 0.25, 4 From Dual UNION ALL
Select 900, 'I1', 0.33333333, 3 From Dual UNION ALL
Select 900, 'I2', 0.33333333, 3 From Dual UNION ALL
Select 900, 'I3', 0.33333333, 3 From Dual UNION ALL
Select 1000, 'I1', 0.2, 5 From Dual UNION ALL
Select 1000, 'I2', 0.2, 5 From Dual UNION ALL
Select 1000, 'I4', 0.2, 5 From Dual UNION ALL
Select 1000, 'I6', 0.2, 5 From Dual UNION ALL
Select 1000, 'I8', 0.2, 5 From Dual
),
-- **************************************************************************************************************************
name_ids AS
( Select T_NAME, Count(T_NAME) OVER(Order By T_NAME ROWS BETWEEN UNBOUNDED PRECEDING And CURRENT ROW) "T_NAME_ID", LISTAGG(T_ID, ', ') WITHIN GROUP (Order By T_ID) "NAME_IDS"
From tbl
Group By T_NAME ),
-- -----------------------------------------------------------
-- R e s u l t f o r name_ids
--
-- T_NAME T_NAME_ID NAME_IDS
-- ------ ---------- -----------------------------------------
-- I1 1 100, 400, 500, 700, 800, 900, 1000
-- I2 2 100, 200, 300, 400, 600, 800, 900, 1000
-- I3 3 300, 500, 600, 700, 800, 900
-- I4 4 400, 1000
-- I5 5 100, 200, 800
-- I6 6 1000
-- I8 7 1000
--
-- -----------------------------------------------------------
id_names AS
( Select T_ID, LISTAGG(T_NAME, ', ') WITHIN GROUP (Order By T_NAME) "ID_NAMES", COUNT(DISTINCT T_ID) OVER() "TOT_NUM_TRANS"
From tbl
Group By T_ID ),
-- -----------------------------------------------------------
-- R e s u l t f o r id_names
--
-- T_ID ID_NAMES TOT_NUM_TRANS
-- ---------- -------------------------------- ------------
-- 100 I1, I2, I5 10
-- 200 I2, I5 10
-- 300 I2, I3 10
-- 400 I1, I2, I4 10
-- 500 I1, I3 10
-- 600 I2, I3 10
-- 700 I1, I3 10
-- 800 I1, I2, I3, I5 10
-- 900 I1, I2, I3 10
-- 1000 I1, I2, I4, I6, I8 10
-- ---------------------------------------------------------------------------------------------
step1 AS
( Select
1 "STEP", T_NAME "T_NAME", i.TOT_NUM_TRANS "TOT_NUM_TRANS",
Sum(FUZZY_VALUE) "FUZZ_SUM", Round(Sum(FUZZY_VALUE / i.TOT_NUM_TRANS), 8) "SUPPORT",
CASE WHEN Sum(FUZZY_VALUE / i.TOT_NUM_TRANS) < 0.1 THEN 'OUT' ELSE 'IN' END "STATUS"
From tbl
Inner Join (Select Max(TOT_NUM_TRANS) "TOT_NUM_TRANS" From id_names) i ON(1=1)
Group By T_NAME, i.TOT_NUM_TRANS
),
-- -----------------------------------------------------------
-- R e s u l t f o r step1
--
-- STEP T_NAME TOT_NUM_TRANS FUZZ_SUM SUPPORT STATUS
-- ---------- ------ ------------- ---------- ---------- ------
-- 1 I1 10 2.44999999 .245 IN
-- 1 I2 10 2.94999999 .295 IN
-- 1 I3 10 2.58333333 .25833333 IN
-- 1 I4 10 .53333333 .05333333 OUT
-- 1 I5 10 1.08333333 .10833333 IN
-- 1 I6 10 .2 .02 OUT
-- 1 I8 10 .2 .02 OUT
-- -------------------------------------------------------------------------------------------------------------------------
step2 AS
( Select STEP, T_NAME, TOT_NUM_TRANS, FUZZ_SUM, SUPPORT, STATUS
From
(
Select STEP, S_NAME "T_NAME", TOT_NUM_TRANS, Sum(FUZZY_VALUE) "FUZZ_SUM", Sum(SUPPORT) "SUPPORT", CASE WHEN Sum(SUPPORT) < 0.1 THEN 'OUT' ELSE 'IN' END "STATUS"
From ( Select RWN, STEP, T_NAME "S_NAME", FUZZY_VALUE, TOT_NUM_TRANS, SUPPORT
From ( Select ROW_NUMBER() OVER (PARTITION BY s.T_ID, t.T_NAME Order By s.T_ID, t.T_NAME) "RWN", s.STEP, s.T_NAME, i.TOT_NUM_TRANS "TOT_NUM_TRANS", s.T_ID, t.FUZZY_VALUE, Round(t.FUZZY_VALUE / i.TOT_NUM_TRANS, 8) "SUPPORT"
From(
Select
n.STEP, n.T_NAME, i.T_ID
From
id_names i
Inner Join
(
Select 2 "STEP", n1.T_NAME || ', ' || n2.T_NAME "T_NAME"
From name_ids n1
Inner Join
name_ids n2 ON(n1.T_NAME <> n2.T_NAME And n1.T_NAME_ID < n2.T_NAME_ID)
) n ON(1=1)
) s
Inner Join
tbl t ON(t.T_ID = s. T_ID And InStr(s.T_NAME, t.T_NAME) > 0)
Inner Join
id_names i ON(i.T_ID = t.T_ID And InStr(i.ID_NAMES, t.T_NAME) > 0 And
InStr(i.ID_NAMES, SubStr(s.T_NAME, 1, 2)) > 0 And InStr(i.ID_NAMES, SubStr(s.T_NAME, 5, 2)) > 0)
Inner Join
step1 s1 ON(INSTR((Select LISTAGG(T_NAME, ', ') WITHIN GROUP (Order By STATUS) From step1 Where STATUS = 'IN' Group By STATUS), SubStr(s.T_NAME, 1, 2)) > 0 And
INSTR((Select LISTAGG(T_NAME, ', ') WITHIN GROUP (Order By STATUS) From step1 Where STATUS = 'IN' Group By STATUS), SubStr(s.T_NAME, InStr(s.T_NAME, ', ', 1) + 2, 2)) > 0)
)
Where RWN = 1
)
Group By STEP, S_NAME, TOT_NUM_TRANS
)
),
-- -----------------------------------------------------------
-- R e s u l t f o r step2
--
-- STEP T_NAME TOT_NUM_TRANS FUZZ_SUM SUPPORT STATUS
-- ---------- ------ ------------- ---------- ---------- ------
-- 2 I1, I2 10 2.89999998 .28999998 IN
-- 2 I1, I3 10 2.58333333 .25833333 IN
-- 2 I1, I5 10 .58333333 .05833333 OUT
-- 2 I2, I3 10 2 .2 IN
-- 2 I2, I5 10 1 .1 IN
-- -------------------------------------------------------------------------------------------------------------------------
step3 AS
( Select STEP, T_NAME, TOT_NUM_TRANS, FUZZ_SUM, SUPPORT, STATUS
From
(
Select STEP, S_NAME "T_NAME", TOT_NUM_TRANS, Sum(FUZZY_VALUE) "FUZZ_SUM", Sum(SUPPORT) "SUPPORT", CASE WHEN Sum(SUPPORT) < 0.1 THEN 'OUT' ELSE 'IN' END "STATUS"
From
( Select RWN, STEP, T_NAME, T_NAME "S_NAME", FUZZY_VALUE, TOT_NUM_TRANS, SUPPORT
From
(
Select
ROW_NUMBER() OVER (PARTITION BY s.T_ID, t.T_NAME Order By s.T_ID, t.T_NAME) "RWN", s.STEP, s.T_NAME, i.TOT_NUM_TRANS "TOT_NUM_TRANS", s.T_ID, t.FUZZY_VALUE, Round(t.FUZZY_VALUE / i.TOT_NUM_TRANS, 8) "SUPPORT"
From
(
Select
n.STEP, n.T_NAME, i.T_ID
From
id_names i
Inner Join
(
Select 3 "STEP", n1.T_NAME || ', ' || n2.T_NAME || ', ' || n3.T_NAME "T_NAME"
From name_ids n1
Inner Join
name_ids n2 ON(n1.T_NAME <> n2.T_NAME And n1.T_NAME_ID < n2.T_NAME_ID)
Inner Join
name_ids n3 ON(n2.T_NAME <> n3.T_NAME And n2.T_NAME_ID < n3.T_NAME_ID)
) n ON(1=1)
) s
Inner Join
tbl t ON(t.T_ID = s. T_ID And InStr(s.T_NAME, t.T_NAME) > 0)
Inner Join
id_names i ON(i.T_ID = t.T_ID And InStr(i.ID_NAMES, t.T_NAME) > 0 And
InStr(i.ID_NAMES, SubStr(s.T_NAME, 1, 2)) > 0 And InStr(i.ID_NAMES, SubStr(s.T_NAME, 5, 2)) > 0)
Inner Join
step2 s2 ON(INSTR((Select LISTAGG(T_NAME, ', ') WITHIN GROUP (Order By STATUS) From step2 Where STATUS = 'IN' Group By STATUS), SubStr(s.T_NAME, 1, 6)) > 0 And
INSTR((Select LISTAGG(T_NAME, ', ') WITHIN GROUP (Order By STATUS) From step2 Where STATUS = 'IN' Group By STATUS), SubStr(s.T_NAME, InStr(s.T_NAME, ', ', 1) + 2, 6)) > 0 And
INSTR((Select LISTAGG(T_NAME, ', ') WITHIN GROUP (Order By STATUS) From step2 Where STATUS = 'IN' Group By STATUS), SubStr(s.T_NAME, InStr(s.T_NAME, ', ', 2) + 2, 6)) > 0)
)
Where RWN = 1
)
Group By STEP, S_NAME, TOT_NUM_TRANS
)
),
-- -----------------------------------------------------------
-- R e s u l t f o r step3
--
-- STEP T_NAME TOT_NUM_TRANS FUZZ_SUM SUPPORT STATUS
-- ---------- ---------- ------------- ---------- ---------- ------
-- 3 I1, I2, I3 10 3.48333331 .34833331 IN
-- 3 I1, I2, I5 10 .58333333 .05833333 OUT
-- -------------------------------------------------------------------------------------------------------------------------
step4 AS
(Select STEP, T_NAME, TOT_NUM_TRANS, FUZZ_SUM, SUPPORT, STATUS
From
( Select STEP, S_NAME "T_NAME", TOT_NUM_TRANS, Sum(FUZZY_VALUE) "FUZZ_SUM", Sum(SUPPORT) "SUPPORT", CASE WHEN Sum(SUPPORT) < 0.1 THEN 'OUT' ELSE 'IN' END "STATUS"
From
( Select RWN, STEP, T_NAME, T_NAME "S_NAME", FUZZY_VALUE, TOT_NUM_TRANS, SUPPORT
From
(
Select
ROW_NUMBER() OVER (PARTITION BY s.T_ID, t.T_NAME Order By s.T_ID, t.T_NAME) "RWN", s.STEP, s.T_NAME, i.TOT_NUM_TRANS "TOT_NUM_TRANS", s.T_ID, t.FUZZY_VALUE, Round(t.FUZZY_VALUE / i.TOT_NUM_TRANS, 8) "SUPPORT"
From
(
Select
n.STEP, n.T_NAME, i.T_ID
From
id_names i
Inner Join
(
Select 4 "STEP", n1.T_NAME || ', ' || n2.T_NAME || ', ' || n3.T_NAME || ', ' || n4.T_NAME "T_NAME"
From name_ids n1
Inner Join
name_ids n2 ON(n1.T_NAME <> n2.T_NAME And n1.T_NAME_ID < n2.T_NAME_ID)
Inner Join
name_ids n3 ON(n2.T_NAME <> n3.T_NAME And n2.T_NAME_ID < n3.T_NAME_ID)
Inner Join
name_ids n4 ON(n3.T_NAME <> n4.T_NAME And n3.T_NAME_ID < n4.T_NAME_ID)
) n ON(1=1)
) s
Inner Join
tbl t ON(t.T_ID = s. T_ID And InStr(s.T_NAME, t.T_NAME) > 0)
Inner Join
id_names i ON(i.T_ID = t.T_ID And InStr(i.ID_NAMES, t.T_NAME) > 0 And
InStr(i.ID_NAMES, SubStr(s.T_NAME, 1, 2)) > 0 And InStr(i.ID_NAMES, SubStr(s.T_NAME, 5, 2)) > 0 And InStr(i.ID_NAMES, SubStr(s.T_NAME, 9, 2)) > 0)
Inner Join
step3 s3 ON(INSTR((Select LISTAGG(T_NAME, ', ') WITHIN GROUP (Order By STATUS) From step3 Where STATUS = 'IN' Group By STATUS), SubStr(s.T_NAME, 1, 10)) > 0 And
INSTR((Select LISTAGG(T_NAME, ', ') WITHIN GROUP (Order By STATUS) From step3 Where STATUS = 'IN' Group By STATUS), SubStr(s.T_NAME, InStr(s.T_NAME, ', ', 1) + 2, 10)) > 0 And
INSTR((Select LISTAGG(T_NAME, ', ') WITHIN GROUP (Order By STATUS) From step3 Where STATUS = 'IN' Group By STATUS), SubStr(s.T_NAME, InStr(s.T_NAME, ', ', 2) + 2, 10)) > 0 And
INSTR((Select LISTAGG(T_NAME, ', ') WITHIN GROUP (Order By STATUS) From step3 Where STATUS = 'IN' Group By STATUS), SubStr(s.T_NAME, InStr(s.T_NAME, ', ', 3) + 2, 10)) > 0)
)
Where RWN = 1
)
Group By STEP, S_NAME, TOT_NUM_TRANS
)
)
-- -----------------------------------------------------------
-- R e s u l t f o r step4
--
-- no rows selected
-- -----------------------------------------------------------
-- *******************************************************
Select * From
(
Select * From step1 UNION ALL
Select * From step2 UNION ALL
Select * From step3 UNION ALL
Select * From step4
)
Order By
STEP, T_NAME
-- --------------------------------------------------------------------
-- R e s u l t all together
--
-- STEP T_NAME TOT_NUM_TRANS FUZZ_SUM SUPPORT STATUS
-- ---------- -------------- ------------- ---------- ---------- ------
-- 1 I1 10 2.44999999 .244999999 IN
-- 1 I2 10 2.94999999 .294999999 IN
-- 1 I3 10 2.58333333 .258333333 IN
-- 1 I4 10 .53333333 .053333333 OUT
-- 1 I5 10 1.08333333 .108333333 IN
-- 1 I6 10 .2 .02 OUT
-- 1 I8 10 .2 .02 OUT
-- 2 I1, I2 10 2.89999998 .28999998 IN
-- 2 I1, I3 10 2.58333333 .25833333 IN
-- 2 I1, I5 10 .58333333 .05833333 OUT
-- 2 I2, I3 10 2 .2 IN
-- 2 I2, I5 10 1 .1 IN
-- 3 I1, I2, I3 10 3.48333331 .34833331 IN
-- 3 I1, I2, I5 10 .58333333 .05833333 OUT
-- -------------------------------------------------------------------
Limiting factor for this answer is that the code works ok with T_NAME column that has length of 2 characters as in sample data. For different lengths there should be some changes in ON() clauses of joins with step2,3,... No time to deal with it now but that also can be solved in a way that it doesn't matter what the length of the column is. Regards...
精彩评论