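I don't think there is a way to vectorize this operation that would be significantly faster than a Python loop. (At least, not if you want to stick to just Python, pandas and numpy.)
However, you can improve the performance of this operation by simplifying your code. Your implementation uses a lot of if statements and a lot of DataFrame indexing. These are relatively costly operations.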
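Here is a modification of your script that includes two functions: add_signal_l(df) and add_lagged(df). The first is your code, just wrapped up in a function. The second uses a simpler function to achieve the same result. It is still a Python loop, but it works on numpy arrays and uses bitwise operators.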
    import numpy as np
    import pandas as pd
    import datetime

    #-----------------------------------------------------------------------
    # Create the test DataFrame

    # Data frame with input and desired output in column signal_d
    df = pd.DataFrame({'condition_A':list('00001100000110'),
                       'condition_B':list('01110011111000'),
                       'signal_d':list('00001111111110')})
    colnames = list(df)
    df[colnames] = df[colnames].apply(pd.to_numeric)
    datelist = pd.date_range(datetime.datetime.today().strftime('%Y-%m-%d'), periods=14).tolist()
    df['dates'] = datelist
    df = df.set_index(['dates'])

    #-----------------------------------------------------------------------

    def add_signal_l(df):
        # Solution using a for loop with nested ifs in column signal_l
        # NOTE: .ix was removed in pandas 1.0, so this function only runs
        # on older pandas versions. It is kept as in the question because
        # the timings below were measured with it.
        df['signal_l'] = df['condition_A'].copy(deep = True)
        i=0
        for observations in df['signal_l']:
            if df.ix[i,'condition_A'] == 1:
                df.ix[i,'signal_l'] = 1
            else:
                # Signal previously triggered by condition_A
                # AND kept "alive" by condition_B:
                if df.ix[i - 1,'signal_l'] & df.ix[i,'condition_B'] == 1:
                    df.ix[i,'signal_l'] = 1
                else:
                    df.ix[i,'signal_l'] = 0
            i = i + 1

    def compute_lagged_signal(a, b):
        # x[i] is 1 if a[i] is 1, or if the previous signal was 1 and b[i] is 1.
        x = np.empty_like(a)
        x[0] = a[0]
        for i in range(1, len(a)):
            x[i] = a[i] | (x[i-1] & b[i])
        return x

    def add_lagged(df):
        # Work on the underlying numpy arrays instead of indexing the DataFrame.
        df['lagged'] = compute_lagged_signal(df['condition_A'].values, df['condition_B'].values)
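To make the recurrence easier to check by hand, here is a minimal pure-Python sketch of the same rule, x[i] = a[i] | (x[i-1] & b[i]), applied to the test data from the script. The name lagged_signal and the plain lists are assumptions made for illustration only; this is not part of the original script and uses no pandas or numpy.

```python
def lagged_signal(a, b):
    """Return x with x[0] = a[0] and x[i] = a[i] | (x[i-1] & b[i])."""
    if not a:
        return []
    x = [a[0]]
    for i in range(1, len(a)):
        # The signal is on if a[i] triggers it, or if it was already on
        # and b[i] keeps it alive.
        x.append(a[i] | (x[i - 1] & b[i]))
    return x

# Same test data as in the script, as lists of 0/1 integers.
condition_a = [int(c) for c in '00001100000110']
condition_b = [int(c) for c in '01110011111000']
expected = [int(c) for c in '00001111111110']  # the signal_d column

result = lagged_signal(condition_a, condition_b)
print(result == expected)  # True
```

The single bitwise expression replaces the nested if statements: a[i] covers the "triggered" branch, and x[i-1] & b[i] covers the "kept alive" branch.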
Here is a timing comparison of the two functions, run in an IPython session:
    In [85]: df
    Out[85]:
                condition_A  condition_B  signal_d
    dates
    2017-06-09            0            0         0
    2017-06-10            0            1         0
    2017-06-11            0            1         0
    2017-06-12            0            1         0
    2017-06-13            1            0         1
    2017-06-14            1            0         1
    2017-06-15            0            1         1
    2017-06-16            0            1         1
    2017-06-17            0            1         1
    2017-06-18            0            1         1
    2017-06-19            0            1         1
    2017-06-20            1            0         1
    2017-06-21            1            0         1
    2017-06-22            0            0         0

    In [86]: %timeit add_signal_l(df)
    8.45 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [87]: %timeit add_lagged(df)
    137 µs ± 581 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
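As you can see, add_lagged(df) is much faster: 137 µs versus 8.45 ms per call, roughly 60 times faster on this 14-row example.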