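When working with short-text data, we first extract the high-frequency words and remove stop words. We then want to extract cross-combinations of those high-frequency words, so as to capture the co-occurrence relationships inside each short text and give downstream tasks a basis for analysis.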
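The preprocessing described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_transactions`, the whitespace tokenizer, and the sample texts are hypothetical and not from the original project (Chinese text would need a real word segmenter in place of `str.split`). Each short text becomes one "transaction" containing only the high-frequency words it uses, which is the input shape an association-rule miner expects.

```python
from collections import Counter


def build_transactions(texts, stop_words, top_n):
    """Turn short texts into transactions of high-frequency words.

    Hypothetical helper for illustration; tokenization is a plain
    whitespace split, which is an assumption, not the author's method.
    """
    # Tokenize and drop stop words.
    tokenized = [[w for w in text.split() if w not in stop_words]
                 for text in texts]

    # Count in how many texts each word appears (document frequency).
    counts = Counter(w for tokens in tokenized for w in set(tokens))
    frequent = {w for w, _ in counts.most_common(top_n)}

    # Keep only frequent words; skip texts that contain none of them.
    transactions = []
    for tokens in tokenized:
        kept = sorted(set(tokens) & frequent)
        if kept:
            transactions.append(kept)
    return transactions


texts = [
    "cheap phone fast delivery",
    "phone screen broken",
    "fast delivery good phone",
    "the delivery was slow",
]
transactions = build_transactions(texts, stop_words={"the", "was"}, top_n=3)
```

The resulting list of lists can be passed to a frequent-itemset miner in the same way as the `data` variable in the example further down.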
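Python already has mature libraries for this, and with a small dataset they can be used directly. Most real projects, however, are likely to involve large datasets. One workaround is sampling; the other is to improve the algorithm's implementation.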
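As a rough illustration of the sampling option, the sketch below uses reservoir sampling (Algorithm R) to draw a fixed-size uniform sample from a stream of transactions without loading them all into memory. The function name `reservoir_sample` and the parameter values are assumptions made for this example; the original post does not say which sampling scheme was used.

```python
import random


def reservoir_sample(transactions, k, seed=0):
    """Uniformly sample k items from an iterable of unknown length.

    Hypothetical helper implementing reservoir sampling (Algorithm R).
    """
    rng = random.Random(seed)
    reservoir = []
    for i, transaction in enumerate(transactions):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(transaction)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = transaction
    return reservoir


# Stand-in stream: 1000 one-item transactions.
stream = ([f"item{n}"] for n in range(1000))
sample = reservoir_sample(stream, k=50, seed=42)
```

Support values estimated on a sample are approximations, so itemsets close to the `min_support` threshold may be kept or dropped differently than on the full data.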
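# Example with the apyori library
from apyori import apriori

# Each inner list is one transaction.
# Items: 豆奶 = soy milk, 莴苣 = lettuce, 尿布 = diapers,
#        葡萄酒 = wine, 甜菜 = beets, 橙汁 = orange juice
data = [
    ['豆奶', '莴苣'],
    ['莴苣', '尿布', '葡萄酒', '甜菜'],
    ['豆奶', '尿布', '葡萄酒', '橙汁'],
    ['莴苣', '豆奶', '尿布', '葡萄酒'],
    ['莴苣', '豆奶', '尿布', '橙汁'],
]

# Keep itemsets with support >= 0.6 that contain at most 2 items.
result = list(apriori(transactions=data, min_support=0.6, max_length=2))
result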
# Other apriori parameters:
#   min_support    -- The minimum support of relations (float); used to filter itemsets.
#   min_confidence -- The minimum confidence of relations (float); used to filter rules.
#   min_lift       -- The minimum lift of relations (float).
#   max_length     -- The maximum length of the relation (integer), i.e. the largest itemset size.

Result:
[RelationRecord(items=frozenset({'尿布'}), support=0.8, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'尿布'}), confidence=0.8, lift=1.0)]),
 RelationRecord(items=frozenset({'莴苣'}), support=0.8, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'莴苣'}), confidence=0.8, lift=1.0)]),
 RelationRecord(items=frozenset({'葡萄酒'}), support=0.6, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'葡萄酒'}), confidence=0.6, lift=1.0)]),
 RelationRecord(items=frozenset({'豆奶'}), support=0.8, ordered_statistics=[OrderedStatistic(items_base=frozenset(), items_add=frozenset({'豆奶'}), confidence=0.8, lift=1.0)]),
 RelationRecord(items=frozenset({'莴苣', '尿布'}), support=0.6, ordered_statistics=[
     OrderedStatistic(items_base=frozenset(), items_add=frozenset({'莴苣', '尿布'}), confidence=0.6, lift=1.0),
     OrderedStatistic(items_base=frozenset({'尿布'}), items_add=frozenset({'莴苣'}), confidence=0.7499999999999999, lift=0.9374999999999998),
     OrderedStatistic(items_base=frozenset({'莴苣'}), items_add=frozenset({'尿布'}), confidence=0.7499999999999999, lift=0.9374999999999998)]),
 RelationRecord(items=frozenset({'葡萄酒', '尿布'}), support=0.6, ordered_statistics=[
     OrderedStatistic(items_base=frozenset(), items_add=frozenset({'葡萄酒', '尿布'}), confidence=0.6, lift=1.0),
     OrderedStatistic(items_base=frozenset({'尿布'}), items_add=frozenset({'葡萄酒'}), confidence=0.7499999999999999, lift=1.2499999999999998),
     OrderedStatistic(items_base=frozenset({'葡萄酒'}), items_add=frozenset({'尿布'}), confidence=1.0, lift=1.25)]),
 RelationRecord(items=frozenset({'豆奶', '尿布'}), support=0.6, ordered_statistics=[
     OrderedStatistic(items_base=frozenset(), items_add=frozenset({'豆奶', '尿布'}), confidence=0.6, lift=1.0),
     OrderedStatistic(items_base=frozenset({'尿布'}), items_add=frozenset({'豆奶'}), confidence=0.7499999999999999, lift=0.9374999999999998),
     OrderedStatistic(items_base=frozenset({'豆奶'}), items_add=frozenset({'尿布'}), confidence=0.7499999999999999, lift=0.9374999999999998)]),
 RelationRecord(items=frozenset({'莴苣', '豆奶'}), support=0.6, ordered_statistics=[
     OrderedStatistic(items_base=frozenset(), items_add=frozenset({'莴苣', '豆奶'}), confidence=0.6, lift=1.0),
     OrderedStatistic(items_base=frozenset({'莴苣'}), items_add=frozenset({'豆奶'}), confidence=0.7499999999999999, lift=0.9374999999999998),
     OrderedStatistic(items_base=frozenset({'豆奶'}), items_add=frozenset({'莴苣'}), confidence=0.7499999999999999, lift=0.9374999999999998)])]
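To make the numbers in this output easier to read, the sketch below recomputes support, confidence, and lift for one rule directly from the five transactions, using the standard definitions. The helper names `support` and `rule_metrics` are introduced here for illustration only; this is not how apyori is implemented internally.

```python
data = [
    ['豆奶', '莴苣'],
    ['莴苣', '尿布', '葡萄酒', '甜菜'],
    ['豆奶', '尿布', '葡萄酒', '橙汁'],
    ['莴苣', '豆奶', '尿布', '葡萄酒'],
    ['莴苣', '豆奶', '尿布', '橙汁'],
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def rule_metrics(base, add, transactions):
    """Return (support, confidence, lift) for the rule base -> add."""
    both = support(set(base) | set(add), transactions)
    confidence = both / support(base, transactions)
    lift = confidence / support(add, transactions)
    return both, confidence, lift


# Rule: 葡萄酒 (wine) -> 尿布 (diapers)
rule_support, rule_confidence, rule_lift = rule_metrics({'葡萄酒'}, {'尿布'}, data)
```

Every transaction that contains 葡萄酒 also contains 尿布, which is why this rule reaches a confidence of 1.0 in the output. The entries with an empty `items_base` simply report the itemset's own support as the confidence, so their lift is always 1.0.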
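The apyori module has relatively few stars and contributors, and its documentation is not very thorough. I recommend the mlxtend module instead: it has far more stars and contributors, its documentation is comprehensive, and beyond support, confidence, and lift it also reports leverage and conviction.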
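For readers unfamiliar with the two extra metrics, here is a minimal pure-Python sketch of leverage and conviction using their usual textbook definitions, applied to the same five transactions as the apyori example. The function names are hypothetical and this is not mlxtend's code; it only shows what the numbers mean.

```python
import math

data = [
    ['豆奶', '莴苣'],
    ['莴苣', '尿布', '葡萄酒', '甜菜'],
    ['豆奶', '尿布', '葡萄酒', '橙汁'],
    ['莴苣', '豆奶', '尿布', '葡萄酒'],
    ['莴苣', '豆奶', '尿布', '橙汁'],
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def leverage(base, add, transactions):
    """support(base and add) minus the support expected under independence."""
    both = support(set(base) | set(add), transactions)
    return both - support(base, transactions) * support(add, transactions)


def conviction(base, add, transactions):
    """(1 - support(add)) / (1 - confidence(base -> add))."""
    both = support(set(base) | set(add), transactions)
    confidence = both / support(base, transactions)
    if math.isclose(confidence, 1.0):
        # The rule never fails in the data, so conviction is infinite.
        return math.inf
    return (1 - support(add, transactions)) / (1 - confidence)


# Rule: 尿布 (diapers) -> 葡萄酒 (wine)
lev = leverage({'尿布'}, {'葡萄酒'}, data)
conv = conviction({'尿布'}, {'葡萄酒'}, data)
```

A leverage of 0 means the two sides co-occur exactly as often as independence would predict, while a conviction above 1 means the rule fails less often than it would if the two sides were independent.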
Association analysis with mlxtend
The code for the improved-algorithm approach will be provided later, once it has been developed and validated.