You could do:
df['consecutive'] = df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count
to get:
   Count  consecutive
0      1            1
1      0            0
2      1            2
3      1            2
4      0            0
5      0            0
6      1            3
7      1            3
8      1            3
9      0            0
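The trick is that comparing each value with its predecessor, df.Count != df.Count.shift(), is True exactly at the first row of every run, so its cumulative sum assigns a distinct ID to each run; transform('size') then broadcasts the run length back onto every row of that run. A step-by-step sketch, assuming the same sample data as the output above:

import pandas as pd

df = pd.DataFrame({'Count': [1, 0, 1, 1, 0, 0, 1, 1, 1, 0]})

run_start = df.Count != df.Count.shift()              # True at the first row of each run
run_id = run_start.cumsum()                           # 1, 2, 3, 3, 4, 4, 5, 5, 5, 6
run_len = df.Count.groupby(run_id).transform('size')  # length of the run each row belongs to
df['consecutive'] = run_len * df.Count                 # zero out rows where Count == 0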
From here, you can apply any threshold:
threshold = 2
df['consecutive'] = (df.consecutive >= threshold).astype(int)
to get:
   Count  consecutive
0      1            0
1      0            0
2      1            1
3      1            1
4      0            0
5      0            0
6      1            1
7      1            1
8      1            1
9      0            0
Or, in a single step:
(df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count >= threshold).astype(int)
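If you need this repeatedly, the one-liner can be wrapped in a small helper; this is only a sketch, and the name consecutive_flag is my own, not part of the original answer:

import pandas as pd

def consecutive_flag(counts, threshold):
    # Hypothetical helper: 1 for rows in a run of 1s of length >= threshold, else 0
    run_id = (counts != counts.shift()).cumsum()
    run_len = counts.groupby(run_id).transform('size')
    return (run_len * counts >= threshold).astype(int)

df = pd.DataFrame({'Count': [1, 0, 1, 1, 0, 0, 1, 1, 1, 0]})
df['consecutive'] = consecutive_flag(df.Count, threshold=2)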
In terms of efficiency, the pandas approach gives a significant speed-up as the problem size grows:
df = pd.concat([df for _ in range(1000)])

%timeit (df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count >= threshold).astype(int)
1000 loops, best of 3: 1.47 ms per loop
compared with the plain-Python loop:
%%timeit
l = []
for k, g in groupby(df.Count):
    size = sum(1 for _ in g)
    if k == 1 and size >= 2:
        l = l + [1]*size
    else:
        l = l + [0]*size
pd.Series(l)

10 loops, best of 3: 76.7 ms per loop
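For completeness, a setup sketch for reproducing the comparison; the 10-row frame and the threshold are assumptions taken from the example above, and the loop version additionally needs groupby from itertools:

import pandas as pd
from itertools import groupby  # required by the plain-Python loop above

threshold = 2
df = pd.DataFrame({'Count': [1, 0, 1, 1, 0, 0, 1, 1, 1, 0]})
df = pd.concat([df for _ in range(1000)])  # the scaled-up frame used for the timings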