Writing vectorized operations and avoiding explicit loops can significantly improve speed.

Import the necessary packages
from fuzzywuzzy import fuzz
import pandas as pd
import numpy as np

Create a DataFrame from the first list
dataframecolumn = pd.DataFrame(["apple", "tb"])
dataframecolumn.columns = ['Match']

Create a DataFrame from the second list
compare = pd.DataFrame(["adfad", "apple", "asple", "tab"])
compare.columns = ['compare']

Merge: build the Cartesian product (cross join) by introducing a shared key
dataframecolumn['Key'] = 1
compare['Key'] = 1
combined_dataframe = dataframecolumn.merge(compare, on="Key", how="left")
# Drop exact matches so that only near-matches remain
combined_dataframe = combined_dataframe[~(combined_dataframe.Match == combined_dataframe.compare)]

Vectorization
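def partial_match(x, y):
    # fuzz.ratio returns an integer similarity score from 0 to 100
    return fuzz.ratio(x, y)

partial_match_vector = np.vectorize(partial_match)

Apply the vectorized function and filter on a score threshold to get the desired result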
combined_dataframe['score'] = partial_match_vector(combined_dataframe['Match'], combined_dataframe['compare'])
combined_dataframe = combined_dataframe[combined_dataframe.score >= 80]

Result
+-------+-----+---------+-------+
| Match | Key | compare | score |
+-------+-----+---------+-------+
| apple |   1 | asple   |    80 |
| tb    |   1 | tab     |    80 |
+-------+-----+---------+-------+
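For a quick sanity check without installing anything, the same cross-join-and-threshold logic can be sketched with the standard library alone. This is only a sketch under stated assumptions: `difflib.SequenceMatcher` stands in for `fuzz.ratio` (scores can differ from fuzzywuzzy's on other inputs), and the helper names `similarity` and `fuzzy_pairs` are hypothetical, not part of any library.

```python
from difflib import SequenceMatcher
from itertools import product


def similarity(a, b):
    """Return a 0-100 similarity score (assumed stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def fuzzy_pairs(matches, candidates, threshold=80):
    """Pair every item in `matches` with every item in `candidates`
    (a Cartesian product), skip identical strings, and keep pairs
    whose score is at least `threshold`."""
    results = []
    for m, c in product(matches, candidates):
        if m == c:
            continue  # mirrors the exact-match filter applied after the merge
        score = similarity(m, c)
        if score >= threshold:
            results.append((m, c, score))
    return results


pairs = fuzzy_pairs(["apple", "tb"], ["adfad", "apple", "asple", "tab"])
for match, candidate, score in pairs:
    print(match, candidate, score)
```

On the two sample lists this keeps the same two pairs as the table above. It loops explicitly rather than going through `np.vectorize`, so it is meant for checking results on small inputs, not as a faster alternative.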