how to find maximum frequent item sets from large transactional data file
I have the input file contains large amount of transactions like
Transaction ID Items
T1 Bread, milk, coffee, juice
T2 Juice, milk, coffee
T3 Bread, juice
T4 Coffee, milk
T5 Bread, Milk
T6 Coffee, Bread
T7 Coffee, Bread, Juice
T8 Bread, Milk, Juice
T9 Milk, Bread, Coffee,
T10 Bread
T11 Milk
T12 Milk, Coffee, Bread, Juice
i want the occurrence of every unique item like
Item Name Count
Bread 9
Milk 8
Coffee 7
Juice 6
alt text http://www.ade-technologies.com/OrderFP_Tree.jpg
and from that i want an a fp-tree now by traversing this tree i want the maximal frequent itemsets as follows
The basic idea of method is to dispose nodes in each “layer” from bottom to up. The concept of “layer” is different to the common concept of layer in a tree. Nodes in a “layer” mean the nodes correspond to the same item and be in a linked list from the “Head Table”. For nodes in a “l开发者_如何学Pythonayer” NBN method will be used to dispose the nodes from left to right along the linked list. To use NBN method, two extra fields will be added to each node in the ordered FP-Tree. The field tag of node N stores the information of whether N is maximal frequent itemset, and the field count’ stores the support count information in the nodes at left.
In Figure, the first node to be disposed is “juice: 2”. If the min_sup is equal to or less than 2 then “bread, milk, coffee, juice” is a maximal frequent itemset. Firstly output juice:2 and set the field tag of “coffee:3” as “false” (the field tag of each node is “true” initially ). Next check whether the right four itemsets juice:1 be the subset of juice:2. If the itemset one node “juice:1” corresponding to is the subset of juice:2 set the field tag of the node “false”. In the following process when the field tag of the disposed node is FALSE we can omit the node after the same tagging. If the min_sup is more than 2 then check whether the right four juice:1 is the subset of juice:2. If the itemset one node “juice:1” corresponding to is the subset of juice:2 then set the field count’ of the node with the sum of the former count’ and 2 After all the nodes “juice” disposed ,begin to dispose the node “coffee:3”.
Any suggestions or available source code, welcome.
thanks in advance
This can be done directly in SQL
CREATE TABLE dbo.TestTable
( FIELD1 VARCHAR(256) )
GO
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T1 Bread, milk, coffee, juice')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T2 Juice, milk, coffee')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T3 Bread, juice')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T4 Coffee, milk')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T5 Bread, Milk')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T6 Coffee, Bread')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T7 Coffee, Bread, Juice')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T8 Bread, Milk, Juice')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T9 Milk, Bread, Coffee,')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T10 Bread')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T11 Milk')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T12 Milk, Coffee, Bread, Juice')
GO
--CREATE INDEX TestIndex ON dbo.TestTable(FIELD1)
--GO
;WITH Numbers AS
(
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS N
FROM dbo.TestTable T1
CROSS JOIN dbo.TestTable T2
),
Base AS
(
SELECT SUBSTRING(FIELD1, 0, CHARINDEX(' ', FIELD1, 0)) AS TRANID,
UPPER(REPLACE(SUBSTRING(FIELD1, CHARINDEX(' ', FIELD1, 0)+1, DATALENGTH(FIELD1)), ' ', '')) AS ITEMS
FROM dbo.TestTable
),
Split AS
(
SELECT TRANID, ITEMS, N, SUBSTRING(ITEMS, N, CHARINDEX(',', ITEMS + ',', N) - N) AS ELEMENT
FROM Base
JOIN Numbers ON N <= DATALENGTH(Base.ITEMS) + 1
AND SUBSTRING(',' + Base.ITEMS, N, 1) = ','
)
SELECT ELEMENT, COUNT(*) AS TOTAL
FROM Split
GROUP BY ELEMENT
ORDER BY TOTAL DESC
This returns
BREAD 9
MILK 8
COFFEE 7
JUICE 6
1
The empty entry comes from the comma at the end of transaction T9
T9 Milk, Bread, Coffee,
精彩评论