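Say I have two datasets, base and payment.
Base is:
[ID, timestamp, value]
Payment is:
[payment_ID, value, gateway]
I want to reconcile base against payment. The ideal result is:
[ID, timestamp, value, payment_ID, gateway, probability]
Basically, it should tell me the most probable payment_ID for a given base entry. The match should take both the datetime and the value into account. I would be happy with only the highest probability, but second and third suggestions would not bother me either.
So far I have read a bit about fuzzy matching, similarity learning, cosine similarity and the like, but I can't seem to apply any of it to my problem.
I thought about doing something like the following by hand: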
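    for base_entry in base:
        for payment_entry in payment:
            value_difference = abs(base_entry['value'] - payment_entry['value'])
            time_difference = abs(base_entry['timestamp'] - payment_entry['timestamp'])
            # 0.1 is a placeholder threshold for both differences
            if value_difference <= 0.1 and time_difference <= 0.1:
                # if the difference is small, then tell me the payment_ID
                print(base_entry['ID'], payment_entry['payment_ID'])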
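The thing is, this looks like a really "dumb" approach: conflicts can arise when several payment entries satisfy the condition, and I would have to fine-tune the parameters by hand to get good results.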
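To make the conflict concrete, here is a minimal sketch of the threshold idea that collects every payment inside the tolerances instead of stopping at the first one, then sorts the candidates by closeness. The function name `find_candidates`, the tolerance parameters, the sample records and the presence of a `timestamp` field on payments are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

def find_candidates(base_entry, payments,
                    max_value_diff=0.1,
                    max_time_diff=timedelta(minutes=60)):
    """Return (value_diff, time_diff, payment_ID) for every payment in tolerance."""
    candidates = []
    for payment in payments:
        value_diff = abs(base_entry["value"] - payment["value"])
        time_diff = abs(base_entry["timestamp"] - payment["timestamp"])
        if value_diff <= max_value_diff and time_diff <= max_time_diff:
            candidates.append((value_diff, time_diff, payment["payment_ID"]))
    # Closest value first, then closest time
    return sorted(candidates)

# Hypothetical sample data
base_entry = {"ID": 1, "timestamp": datetime(2020, 1, 1, 10, 0), "value": 100.0}
payments = [
    {"payment_ID": "p3", "timestamp": datetime(2020, 1, 1, 15, 0), "value": 100.0},
    {"payment_ID": "p2", "timestamp": datetime(2020, 1, 1, 10, 30), "value": 100.05},
    {"payment_ID": "p1", "timestamp": datetime(2020, 1, 1, 10, 5), "value": 100.0},
]

candidates = find_candidates(base_entry, payments)
print([payment_id for _, _, payment_id in candidates])
```

With this sample data both p1 and p2 pass the thresholds, so the thresholds alone do not decide which payment belongs to the entry; the sort order is what picks p1.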
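I'm hoping to find a smarter, more automated way to reconcile these two datasets.
Does anyone have suggestions on how to approach this?
Edit: my current state: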
    import datetime

    import pandas as pd

    orders = pd.read_excel("Orders.xlsx")
    pmts = pd.read_excel("Payments.xlsx")
    pmts['date'] = pd.to_datetime(pmts.date)
    orders['data'] = pd.to_datetime(orders.data)

    # Turn every payment row into a dict with an empty list for its matches
    payment_list = []
    for index, row in pmts.iterrows():
        new_entry = {}
        ts = row['date']
        new_entry['ID'] = row['ID']
        new_entry['date'] = ts.to_pydatetime()
        new_entry['value'] = row['value']
        new_entry['types'] = row['pmt']
        new_entry['results'] = []
        payment_list.append(new_entry)

    # Same for the orders (the columns have different names in this sheet)
    order_list = []
    for index, row in orders.iterrows():
        new_entry = {}
        ts = row['data']
        new_entry['ID'] = row['ID1']
        new_entry['date'] = ts.to_pydatetime()
        new_entry['value'] = row['valor']
        new_entry['types'] = row['nome']
        new_entry['results'] = []
        order_list.append(new_entry)

    delta_ref = datetime.timedelta(minutes=60)  # maximum allowed time gap

    for each_entry in order_list:
        for each_payment in payment_list:
            delta_value = each_entry['value'] - each_payment['value']
            try:
                delta_time = abs(each_entry['date'] - each_payment['date'])
            except TypeError:
                continue  # the dates cannot be subtracted; skip this pair
            if delta_value == 0 and delta_time < delta_ref:
                # store [payment type, time gap, payment ID] on the order
                results = [each_payment['types'], delta_time, each_payment['ID']]
                each_entry['results'].append(results)
                # and the order ID on the payment
                each_payment['results'].append(each_entry['ID'])

    # Passing the path directly avoids ExcelWriter.save(), which newer pandas removed
    orders2 = pd.DataFrame(order_list)
    orders2.to_excel('OrdersList.xlsx')
    pmts2 = pd.DataFrame(payment_list)
    pmts2.to_excel('PaymentList.xlsx')
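OK, now I've got something. It returns all entries that have the same value and a time delta smaller than x (60 minutes in this case). It would be better if it gave me only the most likely result, and it does not yet give the probability that a match is correct (same amount, short time gap). I'll keep trying.

Best answer:

The simplest approach is probably to pick the base/payment pair with the smallest difference. For example: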
    base_data = [...]     # all base data
    payment_data = [...]  # all payment data

    def prop_diff(a, b, props):
        # this iterates through all specified properties and sums the
        # absolute differences, so positive and negative gaps cannot cancel
        # (each property must be numeric, e.g. a timestamp in epoch seconds)
        return sum(abs(a[prop] - b[prop]) for prop in props)

    def join_data(base, payment):
        # you need to implement your merging strategy here;
        # merging the two dicts is only a placeholder
        return {**base, **payment}

    results = []  # where we will store our merged results
    working_payment = payment_data.copy()
    for base in base_data:
        if not working_payment:
            break  # no payments left to match
        # check the difference between this base entry and each remaining payment
        diffs = [prop_diff(base, payment, ['value', 'timestamp'])
                 for payment in working_payment]
        # find the index of the payment with the minimum difference
        min_idx = min(range(len(diffs)), key=diffs.__getitem__)
        # append the joined pair, then remove the selected payment
        results.append(join_data(base, working_payment[min_idx]))
        del working_payment[min_idx]

    print(results)
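The basic idea is to compute the total difference between entries of the two lists and pick the pair with the smallest difference. Here I copy payment_data so that the original is not destroyed, and once a payment has been matched to a base entry and the result appended, that payment entry is removed from the working copy.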
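One thing to watch with a summed difference is that value and timestamp are in different units, so one of them can dominate the total. Below is a minimal sketch of the same greedy idea with each difference divided by a scale first, plus a relative score for the chosen payment. The names `distance`, `greedy_match`, `value_scale`, `time_scale` and `score`, the sample records, and the `timestamp` field on payments are assumptions for illustration. The score is only a heuristic weight among the remaining candidates, not a calibrated probability.

```python
import math
from datetime import datetime

def distance(base, payment, value_scale=1.0, time_scale=3600.0):
    """Scaled absolute value gap plus scaled time gap (time_scale is in seconds)."""
    dv = abs(base["value"] - payment["value"]) / value_scale
    dt = abs((base["timestamp"] - payment["timestamp"]).total_seconds()) / time_scale
    return dv + dt

def greedy_match(base_data, payment_data):
    remaining = list(payment_data)  # work on a copy; the input list is untouched
    results = []
    for base in base_data:
        if not remaining:
            break  # no payments left to match
        dists = [distance(base, p) for p in remaining]
        best = min(range(len(dists)), key=dists.__getitem__)
        # Relative weight of the best candidate; shifting by the minimum
        # distance keeps exp() from underflowing to zero for every candidate
        total = sum(math.exp(-(d - dists[best])) for d in dists)
        payment = remaining.pop(best)  # each payment is used at most once
        results.append({
            "ID": base["ID"],
            "timestamp": base["timestamp"],
            "value": base["value"],
            "payment_ID": payment["payment_ID"],
            "gateway": payment["gateway"],
            "score": 1.0 / total,
        })
    return results

# Hypothetical sample data
base_data = [
    {"ID": 1, "timestamp": datetime(2020, 1, 1, 10, 0), "value": 100.0},
    {"ID": 2, "timestamp": datetime(2020, 1, 1, 12, 0), "value": 50.0},
]
payment_data = [
    {"payment_ID": "p1", "timestamp": datetime(2020, 1, 1, 12, 5),
     "value": 50.0, "gateway": "card"},
    {"payment_ID": "p2", "timestamp": datetime(2020, 1, 1, 10, 2),
     "value": 100.0, "gateway": "transfer"},
]

matches = greedy_match(base_data, payment_data)
for m in matches:
    print(m["ID"], m["payment_ID"], m["gateway"], round(m["score"], 3))
```

Because the matching is greedy, the result depends on the order of the base entries: an early entry can take a payment that would have fit a later entry better. The scales are the tuning knobs here; with the defaults, a gap of one currency unit counts the same as a gap of one hour.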