Apriori, an Algorithm for Large Datasets: Python Implementation


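When working with short-text data, we first extract the high-frequency words and remove stop words. We then want to manually extract cross-combinations of those high-frequency words, so as to capture the combination relationships inside the short texts and give downstream tasks a basis for analysis.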

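Python already has mature libraries for this, and with a small amount of data they can be used directly. In most real projects, though, I expect the data to be large. One solution is sampling; the other is to improve the implementation of the algorithm itself.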

from apyori import apriori

data = [['豆奶','莴苣'],
        ['莴苣','尿布','葡萄酒','甜菜'],
        ['豆奶','尿布','葡萄酒','橙汁'],
        ['莴苣','豆奶','尿布','葡萄酒'],
        ['莴苣','豆奶','尿布','橙汁']]

# Keep itemsets of at most 2 items that occur in at least 60% of the transactions
result = list(apriori(transactions=data, min_support=0.6, max_length=2))
print(result)

Other apriori parameters:
min_support -- The minimum support of relations (float). Can be used to filter itemsets.
min_confidence -- The minimum confidence of relations (float). Can be used to filter the results.
min_lift -- The minimum lift of relations (float).
max_length -- The maximum length of the relation (integer), i.e. the largest itemset size.

Result:
[RelationRecord(items=frozenset({'尿布'}), support=0.8, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'尿布'}), confidence=0.8, lift=1.0)]),
 RelationRecord(items=frozenset({'莴苣'}), support=0.8, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'莴苣'}), confidence=0.8, lift=1.0)]),
 RelationRecord(items=frozenset({'葡萄酒'}), support=0.6, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'葡萄酒'}), confidence=0.6, lift=1.0)]),
 RelationRecord(items=frozenset({'豆奶'}), support=0.8, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'豆奶'}), confidence=0.8, lift=1.0)]),
 RelationRecord(items=frozenset({'莴苣', '尿布'}), support=0.6, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'莴苣', '尿布'}), confidence=0.6, lift=1.0), OrderedStatistic(items_base=frozenset({'尿布'}), items_add=frozenset({'莴苣'}), confidence=0.7499999999999999, lift=0.9374999999999998), OrderedStatistic(items_base=frozenset({'莴苣'}), items_add=frozenset({'尿布'}), confidence=0.7499999999999999, lift=0.9374999999999998)]),
 RelationRecord(items=frozenset({'葡萄酒', '尿布'}), support=0.6, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'葡萄酒', '尿布'}), confidence=0.6, lift=1.0), OrderedStatistic(items_base=frozenset({'尿布'}), items_add=frozenset({'葡萄酒'}), confidence=0.7499999999999999, lift=1.2499999999999998), OrderedStatistic(items_base=frozenset({'葡萄酒'}), items_add=frozenset({'尿布'}), confidence=1.0, lift=1.25)]),
 RelationRecord(items=frozenset({'豆奶', '尿布'}), support=0.6, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'豆奶', '尿布'}), confidence=0.6, lift=1.0), OrderedStatistic(items_base=frozenset({'尿布'}), items_add=frozenset({'豆奶'}), confidence=0.7499999999999999, lift=0.9374999999999998), OrderedStatistic(items_base=frozenset({'豆奶'}), items_add=frozenset({'尿布'}), confidence=0.7499999999999999, lift=0.9374999999999998)]),
 RelationRecord(items=frozenset({'莴苣', '豆奶'}), support=0.6, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'莴苣', '豆奶'}), confidence=0.6, lift=1.0), OrderedStatistic(items_base=frozenset({'莴苣'}), items_add=frozenset({'豆奶'}), confidence=0.7499999999999999, lift=0.9374999999999998), OrderedStatistic(items_base=frozenset({'豆奶'}), items_add=frozenset({'莴苣'}), confidence=0.7499999999999999, lift=0.9374999999999998)])]
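The support, confidence, and lift values in this output can be checked by hand. The following is a minimal sketch in plain Python using the standard definitions; the helper names `support`, `confidence`, and `lift` are chosen for illustration and are not part of apyori.

```python
# Hand-check of the metrics reported above, with no third-party packages.
# The helper names below are illustrative, not apyori API.
data = [['豆奶', '莴苣'],
        ['莴苣', '尿布', '葡萄酒', '甜菜'],
        ['豆奶', '尿布', '葡萄酒', '橙汁'],
        ['莴苣', '豆奶', '尿布', '葡萄酒'],
        ['莴苣', '豆奶', '尿布', '橙汁']]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(base, add, transactions):
    """support(base ∪ add) / support(base) for the rule base -> add."""
    joint = support(set(base) | set(add), transactions)
    return joint / support(base, transactions)


def lift(base, add, transactions):
    """confidence(base -> add) / support(add); 1.0 means independence."""
    return confidence(base, add, transactions) / support(add, transactions)


# Rule {'葡萄酒'} -> {'尿布'} from the output above
print(support({'葡萄酒', '尿布'}, data))       # 0.6
print(confidence({'葡萄酒'}, {'尿布'}, data))  # 1.0
print(lift({'葡萄酒'}, {'尿布'}, data))        # 1.25
```

Every transaction that contains 葡萄酒 also contains 尿布, which is why the confidence of that rule is 1.0 while the reverse rule only reaches about 0.75.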

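The apyori package has relatively few stars and contributors, and its documentation is not very thorough. I recommend the mlxtend package instead: it has far more stars and contributors, its documentation is comprehensive, and in addition to support, confidence, and lift its metrics also include leverage and conviction.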
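To make the two extra metrics concrete, here is a small sketch that computes leverage and conviction in plain Python from their usual definitions: leverage(A -> C) = support(A ∪ C) - support(A) * support(C), and conviction(A -> C) = (1 - support(C)) / (1 - confidence(A -> C)). It does not call mlxtend; the function names are illustrative assumptions, and the data is the same toy example used above.

```python
import math

# Same toy transactions as in the apyori example.
data = [['豆奶', '莴苣'],
        ['莴苣', '尿布', '葡萄酒', '甜菜'],
        ['豆奶', '尿布', '葡萄酒', '橙汁'],
        ['莴苣', '豆奶', '尿布', '葡萄酒'],
        ['莴苣', '豆奶', '尿布', '橙汁']]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def leverage(base, add, transactions):
    """Observed joint support minus the support expected under independence."""
    joint = support(set(base) | set(add), transactions)
    return joint - support(base, transactions) * support(add, transactions)


def conviction(base, add, transactions):
    """(1 - support(add)) / (1 - confidence); infinite when confidence is 1."""
    conf = support(set(base) | set(add), transactions) / support(base, transactions)
    if conf >= 1.0:
        return math.inf  # the rule never fails in this data
    return (1 - support(add, transactions)) / (1 - conf)


print(leverage({'尿布'}, {'葡萄酒'}, data))
print(conviction({'尿布'}, {'葡萄酒'}, data))
print(conviction({'葡萄酒'}, {'尿布'}, data))
```

For the rule 尿布 -> 葡萄酒 this gives a leverage of about 0.12 and a conviction of about 1.6, while 葡萄酒 -> 尿布 has infinite conviction because its confidence is 1.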

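Association analysis with mlxtend
The code for the improved version of the algorithm will be provided later, once it has been developed and validated.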
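For reference, the standard level-wise Apriori procedure that any such improvement would start from can be sketched in plain Python. This is not the author's improved version, only a minimal baseline under the usual textbook formulation; the function name `apriori_frequent_itemsets` and its parameters are chosen here for illustration.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support, max_length=None):
    """Return {frozenset: support} for every itemset with support >= min_support."""
    tx = [frozenset(t) for t in transactions]
    n = len(tx)

    # Level 1: count single items.
    counts = {}
    for t in tx:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c / n for s, c in counts.items() if c / n >= min_support}
    frequent = dict(current)

    k = 2
    while current and (max_length is None or k <= max_length):
        prev = set(current)
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count step: one pass over the transactions per level.
        counts = {c: sum(1 for t in tx if c <= t) for c in candidates}
        current = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


data = [['豆奶', '莴苣'],
        ['莴苣', '尿布', '葡萄酒', '甜菜'],
        ['豆奶', '尿布', '葡萄酒', '橙汁'],
        ['莴苣', '豆奶', '尿布', '葡萄酒'],
        ['莴苣', '豆奶', '尿布', '橙汁']]

itemsets = apriori_frequent_itemsets(data, min_support=0.6, max_length=2)
for items, sup in sorted(itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(items), sup)
```

On the toy data this yields the same eight itemsets as the apyori call above (four single items and four pairs). The count step rescans all transactions at every level, which is the main cost on large data and the natural place to look for improvements.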

Feel free to share; when reposting, please credit the source: 内存溢出

Original article: http://outofmemory.cn/zaji/5699710.html
