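Approach #1: Using NumPy broadcasting, we can compute the element-wise absolute differences between the two input arrays and, with an appropriate threshold, filter the unwanted elements out of A. For the given sample inputs, a threshold of 90 appears to work.
Thus, we would have an implementation like so -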
    thresh = 90
    Aout = A[(np.abs(A[:,None] - B) < thresh).any(1)]
Sample run -
    In [69]: A
    Out[69]: array([   2,  100,  300,  793, 1300, 1500, 1810, 2400])

    In [70]: B
    Out[70]: array([   4,  305,  789, 1234, 1890])

    In [71]: A[(np.abs(A[:,None] - B) < 90).any(1)]
    Out[71]: array([   2,  300,  793, 1300, 1810])
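As a rough illustration of why this approach can become memory-hungry, the sketch below spells out the intermediate array that broadcasting creates for the sample inputs. The name `diffs` is introduced here only for illustration. The intermediate has one row per element of A and one column per element of B, so its size grows with len(A) * len(B).

```python
import numpy as np

A = np.array([2, 100, 300, 793, 1300, 1500, 1810, 2400])
B = np.array([4, 305, 789, 1234, 1890])
thresh = 90

# A[:, None] has shape (8, 1); subtracting B (shape (5,)) broadcasts to (8, 5).
diffs = np.abs(A[:, None] - B)

# Row i is True if A[i] lies within thresh of at least one element of B.
mask = (diffs < thresh).any(axis=1)
Aout = A[mask]

print(diffs.shape)  # (8, 5)
print(Aout)         # [   2  300  793 1300 1810]
```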
Approach #2: Based on this post, here's a memory-efficient approach using np.searchsorted, which could be crucial for large arrays -
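    import numpy as np

    def searchsorted_filter(a, b, thresh):
        choices = np.sort(b) # if b is already sorted, skip it
        # Index of the nearest element of choices at or above each a (clipped to the last index)
        lidx = np.searchsorted(choices, a, 'left').clip(max=choices.size-1)
        # Index of the nearest element of choices at or below each a (clipped to the first index)
        ridx = (np.searchsorted(choices, a, 'right')-1).clip(min=0)
        cl = np.take(choices,lidx) # Or choices[lidx]
        cr = np.take(choices,ridx) # Or choices[ridx]
        # Keep the elements of a whose nearer neighbour in b is closer than thresh
        return a[np.minimum(np.abs(a - cl), np.abs(a - cr)) < thresh]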
Sample run -
    In [95]: searchsorted_filter(A,B, thresh = 90)
    Out[95]: array([   2,  300,  793, 1300, 1810])
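To make the idea behind the searchsorted approach more concrete, here is a minimal sketch that returns the distance from each element of A to its nearest element of B instead of filtering. The helper name `nearest_distance` is hypothetical and not part of the original answer; it reuses the same two lookups (the neighbour at or above and the neighbour at or below) and only ever allocates arrays of length len(A).

```python
import numpy as np

def nearest_distance(a, b):
    # Hypothetical helper: distance from each element of a to its nearest element of b.
    choices = np.sort(b)
    # Neighbour at or above each a, clipped so values past the end use the last element.
    lidx = np.searchsorted(choices, a, 'left').clip(max=choices.size - 1)
    # Neighbour at or below each a, clipped so values before the start use the first element.
    ridx = (np.searchsorted(choices, a, 'right') - 1).clip(min=0)
    return np.minimum(np.abs(a - choices[lidx]), np.abs(a - choices[ridx]))

A = np.array([2, 100, 300, 793, 1300, 1500, 1810, 2400])
B = np.array([4, 305, 789, 1234, 1890])

dist = nearest_distance(A, B)
print(dist)            # [  2  96   5   4  66 266  80 510]
print(A[dist < 90])    # [   2  300  793 1300 1810]
```

Thresholding `dist` gives the same selection as the broadcasting version on these inputs, without building the len(A) x len(B) intermediate.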
Runtime test
    In [104]: A = np.sort(np.random.randint(0,100000,(1000)))

    In [105]: B = np.sort(np.random.randint(0,100000,(400)))

    In [106]: out1 = A[(np.abs(A[:,None] - B) < 10).any(1)]

    In [107]: out2 = searchsorted_filter(A,B, thresh = 10)

    In [108]: np.allclose(out1, out2)  # Verify results
    Out[108]: True

    In [109]: %timeit A[(np.abs(A[:,None] - B) < 10).any(1)]
    100 loops, best of 3: 2.74 ms per loop

    In [110]: %timeit searchsorted_filter(A,B, thresh = 10)
    10000 loops, best of 3: 85.3 µs per loop
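Update, January 2018: a further performance boost
We can avoid the second call, np.searchsorted(..., 'right'), by reusing the indices obtained from np.searchsorted(..., 'left'), and we can also skip the absolute-value computations, like so -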
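    import numpy as np

    def searchsorted_filter_v2(a, b, thresh):
        N = len(b)

        choices = np.sort(b) # if b is already sorted, skip it

        # Neighbour at or above each a; entries past the end fall back to the last element
        l = np.searchsorted(choices, a, 'left')
        l_invalid_mask = l==N
        l[l_invalid_mask] = N-1
        left_offset = choices[l]-a
        left_offset[l_invalid_mask] *= -1  # flip sign so the offset stays non-negative

        # Neighbour below each a: one step left of l, unless a matched choices[l] exactly
        r = (l - (left_offset!=0))
        r_invalid_mask = r<0
        r[r_invalid_mask] = 0
        r += l_invalid_mask
        right_offset = a-choices[r]
        right_offset[r_invalid_mask] *= -1  # flip sign so the offset stays non-negative

        out = a[(left_offset < thresh) | (right_offset < thresh)]
        return out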
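To see what the two offsets represent, the steps of the function can be traced on the small sample arrays from Approach #1. This is only an illustrative walk-through using the same variable names, not a replacement for the function: `left_offset` ends up as the distance to the neighbour at or above each element, `right_offset` as the distance to the neighbour at or below, and both are non-negative by construction, which is why no np.abs call is needed.

```python
import numpy as np

a = np.array([2, 100, 300, 793, 1300, 1500, 1810, 2400])
b = np.array([4, 305, 789, 1234, 1890])
thresh = 90

N = len(b)
choices = np.sort(b)

# Step 1: first index where choices >= a; equals N when a exceeds every element.
l = np.searchsorted(choices, a, 'left')
l_invalid_mask = l == N
l[l_invalid_mask] = N - 1
left_offset = choices[l] - a
left_offset[l_invalid_mask] *= -1
print(left_offset)   # [  2 205   5 441 590 390  80 510]

# Step 2: step one index to the left, except on an exact match (offset 0).
r = l - (left_offset != 0)
r_invalid_mask = r < 0
r[r_invalid_mask] = 0
r += l_invalid_mask  # undo the step for elements beyond the last choice
right_offset = a - choices[r]
right_offset[r_invalid_mask] *= -1
print(right_offset)  # [  2  96 296   4  66 266 576 510]

# Step 3: keep elements that are close enough on either side.
out = a[(left_offset < thresh) | (right_offset < thresh)]
print(out)           # [   2  300  793 1300 1810]
```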
Updated timings to test the further speedup -
    In [388]: np.random.seed(0)
         ...: A = np.random.randint(0,1000000,(100000))
         ...: B = np.unique(np.random.randint(0,1000000,(40000)))
         ...: np.random.shuffle(B)
         ...: thresh = 10
         ...:
         ...: out1 = searchsorted_filter(A, B, thresh)
         ...: out2 = searchsorted_filter_v2(A, B, thresh)
         ...: print np.allclose(out1, out2)
    True

    In [389]: %timeit searchsorted_filter(A, B, thresh)
    10 loops, best of 3: 24.2 ms per loop

    In [390]: %timeit searchsorted_filter_v2(A, B, thresh)
    100 loops, best of 3: 13.9 ms per loop
Digging deeper -
    In [396]: a = A; b = B

    In [397]: N = len(b)
         ...:
         ...: choices = np.sort(b) # if b is already sorted, skip it
         ...:
         ...: l = np.searchsorted(choices, a, 'left')

    In [398]: %timeit np.sort(B)
    100 loops, best of 3: 2 ms per loop

    In [399]: %timeit np.searchsorted(choices, a, 'left')
    100 loops, best of 3: 10.3 ms per loop
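It seems that searchsorted and sort take up almost all of the runtime (together about 12.3 ms of the 13.9 ms total), and both appear to be essential to this method. So it doesn't look like it can be improved much further while staying with this sort-based approach.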
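For readers who want to repeat this breakdown outside IPython, here is a minimal sketch using the standard-library timeit module on inputs shaped like those in the timing test. The `n_runs` value and the use of `np.random.RandomState` are assumptions made for this sketch, and absolute numbers will differ from machine to machine.

```python
import timeit
import numpy as np

# Inputs shaped like the ones in the timing test (seeded for repeatability).
rng = np.random.RandomState(0)
a = rng.randint(0, 1000000, 100000)
b = np.unique(rng.randint(0, 1000000, 40000))
rng.shuffle(b)
choices = np.sort(b)

n_runs = 20  # assumed repeat count; raise it for steadier numbers

# Average wall-clock time per call for each of the two dominant steps.
t_sort = timeit.timeit(lambda: np.sort(b), number=n_runs) / n_runs
t_search = timeit.timeit(
    lambda: np.searchsorted(choices, a, 'left'), number=n_runs) / n_runs

print("np.sort:         %.2f ms per call" % (t_sort * 1e3))
print("np.searchsorted: %.2f ms per call" % (t_search * 1e3))
```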