Reconciling records (dates and values): given two datasets with multiple features, how do I get the most likely matches?



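Say I have two datasets, base and payment.

Base is:

[ id, timestamp, value ]

Payment is:

[ payment_id, timestamp, value, gateway ]

I want to reconcile base against payment. The ideal result is:

[ id, timestamp, value, payment_id, gateway, probability ]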

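Basically, it should tell me the most likely payment_id for a given base entry. The match should take both the datetime and the value into account. I would be happy with only the highest-probability match, but second and third suggestions would not bother me either.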

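So far I have read a bit about fuzzy matching, similarity learning, cosine similarity and so on, but I can't seem to apply any of it to my problem.
I thought of doing something like the following manually: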

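    # Rough idea only: assumes value and timestamp are plain numbers.
    for base_entry in base:
        for payment_entry in payment:
            value_difference = abs(base_entry['value'] - payment_entry['value'])
            time_difference = abs(base_entry['timestamp'] - payment_entry['timestamp'])
            if value_difference <= 0.1 and time_difference <= 0.1:
                # If the difference is small, then tell me the payment_id.
                print(base_entry['id'], payment_entry['payment_id'])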

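The thing is, this looks like a really "dumb" approach: it may run into conflicts if more than one payment entry matches the condition, and I would have to fine-tune the parameters by hand to get good results.

I am hoping to find a smarter, more automatic way to reconcile these two datasets.

Does anyone have suggestions on how to approach this?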
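One way to picture the output asked for above (a most likely payment_id with a score, plus second and third suggestions) is to turn a combined value/time distance into a weight and normalize the weights over all candidate payments. The following is only a sketch under assumptions that are not in the question: the function name `rank_payments`, the parameters `value_scale` and `time_scale`, and the `exp(-distance)` weighting are all hypothetical choices, and the resulting "probability" is a normalized score, not a calibrated probability.

```python
import math
from datetime import datetime


def rank_payments(base_entry, payments, value_scale=1.0, time_scale=3600.0, top_k=3):
    """Rank candidate payments for one base entry (hypothetical helper).

    value_scale and time_scale (in seconds) are assumed tuning knobs that
    make value and time differences comparable before they are added.
    """
    scored = []
    for payment in payments:
        dv = abs(base_entry["value"] - payment["value"]) / value_scale
        seconds = abs((base_entry["timestamp"] - payment["timestamp"]).total_seconds())
        dt = seconds / time_scale
        # Weight is 1.0 for identical records and shrinks towards 0 as they diverge.
        scored.append((math.exp(-(dv + dt)), payment))

    total = sum(weight for weight, _ in scored)
    if total == 0:
        return []  # no candidates, or every candidate is hopelessly far away

    best = sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]
    return [
        {
            "id": base_entry["id"],
            "payment_id": payment["payment_id"],
            "gateway": payment["gateway"],
            "probability": weight / total,  # normalized score, not calibrated
        }
        for weight, payment in best
    ]


# Made-up sample data for illustration only.
base = {"id": 1, "timestamp": datetime(2022, 6, 4, 12, 0), "value": 100.0}
payments = [
    {"payment_id": "A", "timestamp": datetime(2022, 6, 4, 12, 5), "value": 100.0, "gateway": "card"},
    {"payment_id": "B", "timestamp": datetime(2022, 6, 4, 18, 0), "value": 100.0, "gateway": "card"},
    {"payment_id": "C", "timestamp": datetime(2022, 6, 4, 12, 1), "value": 250.0, "gateway": "transfer"},
]
ranked = rank_payments(base, payments)
for row in ranked:
    print(row)
```

The scales still have to be chosen, so this does not remove the manual tuning entirely; it only replaces a hard yes/no threshold with a ranking.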

Edit: my current state:

    import datetime

    import pandas as pd

    orders = pd.read_excel("Orders.xlsx")
    pmts = pd.read_excel("Payments.xlsx")
    pmts['date'] = pd.to_datetime(pmts['date'])
    orders['data'] = pd.to_datetime(orders['data'])

    # Convert every payment row into a plain dict.
    payment_list = []
    for index, row in pmts.iterrows():
        new_entry = {}
        ts = row['date']
        new_entry['ID'] = row['ID']
        new_entry['date'] = ts.to_pydatetime()
        new_entry['value'] = row['value']
        new_entry['types'] = row['pmt']
        new_entry['results'] = []
        payment_list.append(new_entry)

    # Convert every order row into a plain dict.
    order_list = []
    for index, row in orders.iterrows():
        new_entry = {}
        ts = row['data']
        new_entry['ID'] = row['ID1']
        new_entry['date'] = ts.to_pydatetime()
        new_entry['value'] = row['valor']
        new_entry['types'] = row['nome']
        new_entry['results'] = []
        order_list.append(new_entry)

    # Compare every order with every payment.
    delta_ref = datetime.timedelta(minutes=60)
    for each_entry in order_list:
        for each_payment in payment_list:
            delta_value = each_entry['value'] - each_payment['value']
            try:
                delta_time = abs(each_entry['date'] - each_payment['date'])
            except TypeError:
                # Skip pairs whose dates cannot be subtracted.
                continue
            # Keep pairs with the same value and less than 60 minutes apart.
            if delta_value == 0 and delta_time < delta_ref:
                results = [each_payment['types'], delta_time, each_payment['ID']]
                each_entry['results'].append(results)
                each_payment['results'].append(each_entry['ID'])

    # The context manager saves and closes each file on exit.
    orders2 = pd.DataFrame(order_list)
    with pd.ExcelWriter('OrdersList.xlsx') as writer:
        orders2.to_excel(writer)

    pmts2 = pd.DataFrame(payment_list)
    with pd.ExcelWriter('PaymentList.xlsx') as writer:
        pmts2.to_excel(writer)

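OK, now I am getting somewhere. It returns all entries that have the same value and a time delta smaller than x (60 minutes in this case). It does not yet give me only the most likely result, nor the probability that a match is correct (same amount, shorter time gap). Will keep trying.

Best answer:

The simplest approach is probably to pick the base/payment pair with the smallest difference. For example: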

    base_data = [...]  # all base data
    payment_data = [...]  # all payment data

    def prop_diff(a, b, props):
        # Iterate through all specified properties and sum the absolute
        # differences. Assumes every property is numeric (for example,
        # timestamps stored as epoch seconds).
        return sum(abs(a[prop] - b[prop]) for prop in props)

    def join_data(base, payment):
        # You need to implement your merging strategy here. As a placeholder,
        # merge both dicts and let the base entry win on shared keys.
        return {**payment, **base}

    results = []  # where we will store our merged results
    working_payment = payment_data.copy()
    for base in base_data:
        if not working_payment:
            break  # no payments left to match
        # check the difference between this base entry and each remaining payment
        diffs = [prop_diff(base, payment, ['value', 'timestamp'])
                 for payment in working_payment]
        # find the index of the payment with the minimum difference
        min_idx = min(range(len(diffs)), key=diffs.__getitem__)
        # append the result of the joined entries
        results.append(join_data(base, working_payment[min_idx]))
        del working_payment[min_idx]  # remove the selected payment
    print(results)

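The basic idea is to compute the total difference between entries of the two lists and pick the pair with the smallest difference. Here I copy payment_data so the original is not modified, and once a payment has been matched to a base entry and the result appended, that payment entry is removed from the working copy so it cannot be matched again.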
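Because the loop above hands out payments in the order of base_data, an early base entry can take a payment that fits a later entry better. A possible variation, sketched here and not part of the answer itself, is to score all base/payment pairs first, sort them by cost, and assign pairs from cheapest to most expensive. The names `pair_cost`, `match_one_to_one`, `value_scale`, `time_scale` and `max_cost` are hypothetical, and the sample records are made up. This is still a greedy heuristic: it does not guarantee the minimum total cost, which is the classic assignment problem (SciPy's `scipy.optimize.linear_sum_assignment` addresses that, but it is not used here).

```python
from datetime import datetime


def pair_cost(base, payment, value_scale=1.0, time_scale=3600.0):
    """Combined difference of one pair; the scales are assumed tuning knobs."""
    dv = abs(base["value"] - payment["value"]) / value_scale
    seconds = abs((base["timestamp"] - payment["timestamp"]).total_seconds())
    return dv + seconds / time_scale


def match_one_to_one(base_data, payment_data, max_cost=None):
    """Greedily assign each payment to at most one base entry, cheapest pairs first."""
    pairs = sorted(
        (pair_cost(b, p), i, j)
        for i, b in enumerate(base_data)
        for j, p in enumerate(payment_data)
    )
    used_base, used_payment, matches = set(), set(), {}
    for cost, i, j in pairs:
        if max_cost is not None and cost > max_cost:
            break  # every remaining pair is even worse, so stop here
        if i in used_base or j in used_payment:
            continue  # one side of this pair is already matched
        used_base.add(i)
        used_payment.add(j)
        matches[base_data[i]["id"]] = payment_data[j]["payment_id"]
    return matches


# Made-up sample data for illustration only.
base_data = [
    {"id": 1, "timestamp": datetime(2022, 6, 4, 12, 0), "value": 100.0},
    {"id": 2, "timestamp": datetime(2022, 6, 4, 12, 10), "value": 100.0},
    {"id": 3, "timestamp": datetime(2022, 6, 4, 12, 0), "value": 500.0},
]
payment_data = [
    {"payment_id": "A", "timestamp": datetime(2022, 6, 4, 12, 9), "value": 100.0},
    {"payment_id": "B", "timestamp": datetime(2022, 6, 4, 12, 1), "value": 100.0},
    {"payment_id": "C", "timestamp": datetime(2022, 6, 4, 20, 0), "value": 499.0},
]
matches = match_one_to_one(base_data, payment_data, max_cost=1.0)
print(matches)
```

With `max_cost` set, base entries without a plausible payment stay unmatched instead of being forced onto whatever payment is left over.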


Original article: http://outofmemory.cn/langs/1199557.html
