how to find maximum frequent item sets from large transactional data file

2022-12-27 20:50 问答作者：

I have the input file contains large amount of transactions like

Transaction ID Items

T1 Bread, milk, coffee, juice
T2 Juice, milk, coffee
T3 Bread, juice
T4 Coffee, milk
T5 Bread, Milk
T6 Coffee, Bread
T7 Coffee, Bread, Juice
T8 Bread, Milk, Juice
T9 Milk, Bread, Coffee, 
T10 Bread
T11 Milk
T12 Milk, Coffee, Bread, Juice

i want the occurrence of every unique item like

Item Name Count
Bread  9
Milk   8
Coffee 7
Juice  6

alt text http://www.ade-technologies.com/OrderFP_Tree.jpg

and from that i want an a fp-tree now by traversing this tree i want the maximal frequent itemsets as follows

The basic idea of method is to dispose nodes in each “layer” from bottom to up. The concept of “layer” is different to the common concept of layer in a tree. Nodes in a “layer” mean the nodes correspond to the same item and be in a linked list from the “Head Table”. For nodes in a “l开发者_如何学Pythonayer” NBN method will be used to dispose the nodes from left to right along the linked list. To use NBN method, two extra fields will be added to each node in the ordered FP-Tree. The field tag of node N stores the information of whether N is maximal frequent itemset, and the field count’ stores the support count information in the nodes at left.

In Figure, the first node to be disposed is “juice: 2”. If the min_sup is equal to or less than 2 then “bread, milk, coffee, juice” is a maximal frequent itemset. Firstly output juice:2 and set the field tag of “coffee:3” as “false” (the field tag of each node is “true” initially ). Next check whether the right four itemsets juice:1 be the subset of juice:2. If the itemset one node “juice:1” corresponding to is the subset of juice:2 set the field tag of the node “false”. In the following process when the field tag of the disposed node is FALSE we can omit the node after the same tagging. If the min_sup is more than 2 then check whether the right four juice:1 is the subset of juice:2. If the itemset one node “juice:1” corresponding to is the subset of juice:2 then set the field count’ of the node with the sum of the former count’ and 2 After all the nodes “juice” disposed ,begin to dispose the node “coffee:3”.

Any suggestions or available source code, welcome.

thanks in advance

This can be done directly in SQL

CREATE TABLE dbo.TestTable
( FIELD1 VARCHAR(256) )
GO

INSERT INTO dbo.TestTable(FIELD1) VALUES
('T1 Bread, milk, coffee, juice')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T2 Juice, milk, coffee')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T3 Bread, juice')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T4 Coffee, milk')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T5 Bread, Milk')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T6 Coffee, Bread')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T7 Coffee, Bread, Juice')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T8 Bread, Milk, Juice')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T9 Milk, Bread, Coffee,')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T10 Bread')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T11 Milk')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T12 Milk, Coffee, Bread, Juice')
GO

--CREATE INDEX TestIndex ON dbo.TestTable(FIELD1)
--GO

;WITH Numbers AS
(
    SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS N
    FROM dbo.TestTable T1
    CROSS JOIN dbo.TestTable T2
),
Base AS
(
    SELECT SUBSTRING(FIELD1, 0, CHARINDEX(' ', FIELD1, 0)) AS TRANID,
    UPPER(REPLACE(SUBSTRING(FIELD1, CHARINDEX(' ', FIELD1, 0)+1, DATALENGTH(FIELD1)), ' ', '')) AS ITEMS
    FROM dbo.TestTable
),
Split AS
(
    SELECT TRANID, ITEMS, N, SUBSTRING(ITEMS, N, CHARINDEX(',', ITEMS + ',', N) - N) AS ELEMENT
    FROM Base 
    JOIN Numbers ON N <= DATALENGTH(Base.ITEMS) + 1
    AND SUBSTRING(',' + Base.ITEMS, N, 1) = ','
)
SELECT ELEMENT, COUNT(*) AS TOTAL
FROM Split
GROUP BY ELEMENT
ORDER BY TOTAL DESC

This returns

BREAD   9
MILK    8
COFFEE  7
JUICE   6
        1

The empty entry comes from the comma at the end of transaction T9

T9 Milk, Bread, Coffee,

继续阅读：.net data-mining sql sql-server

how to find maximum frequent item sets from large transactional data file

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？