Selecting close matches from one array based on another reference array


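Approach #1: With NumPy broadcasting, we can compute the element-wise absolute differences between the two input arrays and use a suitable threshold to filter out the unwanted elements of A. For the given sample inputs, a threshold of 90 seems to work.

Thus, we would have an implementation like so: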

    thresh = 90
    Aout = A[(np.abs(A[:,None] - B) < thresh).any(1)]

Sample run:

    In [69]: A
    Out[69]: array([   2,  100,  300,  793, 1300, 1500, 1810, 2400])

    In [70]: B
    Out[70]: array([   4,  305,  789, 1234, 1890])

    In [71]: A[(np.abs(A[:,None] - B) < 90).any(1)]
    Out[71]: array([   2,  300,  793, 1300, 1810])
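To make the broadcasting step concrete, here is a minimal sketch that rebuilds the sample inputs and splits the one-liner into its intermediate arrays. The names `diff` and `mask` are illustrative and do not appear in the original answer.

```python
import numpy as np

A = np.array([2, 100, 300, 793, 1300, 1500, 1810, 2400])
B = np.array([4, 305, 789, 1234, 1890])
thresh = 90

# A[:, None] has shape (8, 1); broadcasting it against B (shape (5,))
# produces an (8, 5) matrix of pairwise absolute differences.
diff = np.abs(A[:, None] - B)

# Keep a row of A if at least one element of B lies within thresh of it.
mask = (diff < thresh).any(axis=1)
out = A[mask]

print(diff.shape)
print(out)
```

The intermediate `diff` holds one entry per (A, B) pair, which is why this approach gets expensive as both arrays grow.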

Approach #2: Building on an earlier post, here is a memory-efficient approach using np.searchsorted, which could be crucial for large arrays:

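    import numpy as np

    def searchsorted_filter(a, b, thresh):
        choices = np.sort(b)  # if b is already sorted, skip it
        # Index of the first choice >= each element of a (clipped to the last index)
        lidx = np.searchsorted(choices, a, 'left').clip(max=choices.size-1)
        # Index of the last choice <= each element of a (clipped to the first index)
        ridx = (np.searchsorted(choices, a, 'right')-1).clip(min=0)
        cl = np.take(choices, lidx)  # or choices[lidx]
        cr = np.take(choices, ridx)  # or choices[ridx]
        # Keep the elements whose nearer neighbour lies within thresh
        return a[np.minimum(np.abs(a - cl), np.abs(a - cr)) < thresh]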

Sample run:

    In [95]: searchsorted_filter(A,B, thresh = 90)
    Out[95]: array([   2,  300,  793, 1300, 1810])
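As an illustration of what the two `np.searchsorted` calls contribute, the following sketch traces the sample inputs through the same steps and exposes the distance from each element of A to its closest element of B. The variable name `nearest` is an assumption made for this walk-through; B is already sorted here, so the sort is skipped.

```python
import numpy as np

A = np.array([2, 100, 300, 793, 1300, 1500, 1810, 2400])
B = np.array([4, 305, 789, 1234, 1890])  # already sorted

# Index of the first element of B that is >= each element of A;
# values beyond B's maximum are clipped to the last index.
lidx = np.searchsorted(B, A, side='left').clip(max=B.size - 1)

# Index of the last element of B that is <= each element of A;
# values below B's minimum are clipped to the first index.
ridx = (np.searchsorted(B, A, side='right') - 1).clip(min=0)

# Distance to the nearer of the two neighbours.
nearest = np.minimum(np.abs(A - B[lidx]), np.abs(A - B[ridx]))

print(nearest)
print(A[nearest < 90])
```

Only two candidates per element of A are ever inspected, so the temporary arrays stay the same length as A instead of growing with len(A) * len(B).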

Runtime test

    In [104]: A = np.sort(np.random.randint(0,100000,(1000)))

    In [105]: B = np.sort(np.random.randint(0,100000,(400)))

    In [106]: out1 = A[(np.abs(A[:,None] - B) < 10).any(1)]

    In [107]: out2 = searchsorted_filter(A,B, thresh = 10)

    In [108]: np.allclose(out1, out2)  # Verify results
    Out[108]: True

    In [109]: %timeit A[(np.abs(A[:,None] - B) < 10).any(1)]
    100 loops, best of 3: 2.74 ms per loop

    In [110]: %timeit searchsorted_filter(A,B, thresh = 10)
    10000 loops, best of 3: 85.3 µs per loop
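Besides runtime, the memory argument for the searchsorted approach can be estimated without allocating anything. The helper below is a hypothetical illustration (it is not part of the original answer) and assumes 64-bit integers; it counts only the difference matrix, so the real peak for the broadcasting approach would be somewhat higher once the boolean mask is included.

```python
import numpy as np

def broadcast_temp_bytes(n_a, n_b, dtype=np.int64):
    """Bytes needed for the (n_a, n_b) difference matrix of the broadcasting approach."""
    return n_a * n_b * np.dtype(dtype).itemsize

small = broadcast_temp_bytes(1000, 400)        # sizes used in the timing above
large = broadcast_temp_bytes(100000, 40000)    # a larger hypothetical case

print(small / 1e6, "MB")
print(large / 1e9, "GB")
```

For the sizes timed above the temporary is a few megabytes, whereas the larger case would need tens of gigabytes, which is where the searchsorted variant becomes the practical choice.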

January 2018 update: further performance boost

We can avoid the second call, np.searchsorted(..., 'right'), by reusing the indices obtained from np.searchsorted(..., 'left'), and we can also avoid the absolute-value computations, like so:

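    def searchsorted_filter_v2(a, b, thresh):
        N = len(b)
        choices = np.sort(b)  # if b is already sorted, skip it

        # l: index of the first choice >= each element of a
        l = np.searchsorted(choices, a, 'left')
        l_invalid_mask = l==N               # a exceeds every choice
        l[l_invalid_mask] = N-1
        left_offset = choices[l]-a
        left_offset[l_invalid_mask] *= -1   # flip sign so the offset is positive

        # r: index of the last choice <= each element of a, derived from l
        r = (l - (left_offset!=0))          # step back one unless a matched exactly
        r_invalid_mask = r<0                # a is below every choice
        r[r_invalid_mask] = 0
        r += l_invalid_mask                 # where l was clamped, N-1 is already the lower neighbour
        right_offset = a-choices[r]
        right_offset[r_invalid_mask] *= -1  # flip sign so the offset is positive

        out = a[(left_offset < thresh) | (right_offset < thresh)]
        return out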
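The idea behind dropping the second search can be checked in isolation: for a sorted reference array without duplicates, the index that `np.searchsorted(..., 'right') - 1` would return equals the 'left' index when the value is an exact match, and the 'left' index minus one otherwise. The sketch below uses made-up query values and the illustrative names `exact` and `derived`.

```python
import numpy as np

choices = np.array([4, 305, 789, 1234, 1890])  # sorted, no duplicates (assumed)
a = np.array([2, 305, 793, 2400])              # below range, exact hit, in between, above range

left = np.searchsorted(choices, a, side='left')
right_minus_1 = np.searchsorted(choices, a, side='right') - 1

# True where a[i] is itself present in choices; the index is clamped
# so that values above the maximum do not read out of bounds.
exact = choices[np.minimum(left, choices.size - 1)] == a

# Lower-neighbour index obtained from the 'left' result alone.
derived = np.where(exact, left, left - 1)

print(left)
print(right_minus_1)
print(derived)
```

The out-of-range values (-1 for elements below every choice) still need the clamping that the function above performs with its two invalid masks.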

Updated timings to test the further speedup:

    In [388]: np.random.seed(0)
         ...: A = np.random.randint(0,1000000,(100000))
         ...: B = np.unique(np.random.randint(0,1000000,(40000)))
         ...: np.random.shuffle(B)
         ...: thresh = 10
         ...:
         ...: out1 = searchsorted_filter(A, B, thresh)
         ...: out2 = searchsorted_filter_v2(A, B, thresh)
         ...: print(np.allclose(out1, out2))
    True

    In [389]: %timeit searchsorted_filter(A, B, thresh)
    10 loops, best of 3: 24.2 ms per loop

    In [390]: %timeit searchsorted_filter_v2(A, B, thresh)
    100 loops, best of 3: 13.9 ms per loop

Digging deeper:

    In [396]: a = A; b = B

    In [397]: N = len(b)
         ...:
         ...: choices = np.sort(b) # if b is already sorted, skip it
         ...:
         ...: l = np.searchsorted(choices, a, 'left')

    In [398]: %timeit np.sort(B)
    100 loops, best of 3: 2 ms per loop

    In [399]: %timeit np.searchsorted(choices, a, 'left')
    100 loops, best of 3: 10.3 ms per loop

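It seems that searchsorted and sort account for almost all of the runtime (about 10.3 ms and 2 ms of the 13.9 ms total), and both are essential to this method. So, as long as we stay with this sort-based approach, there does not seem to be room for further improvement.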

Source: 内存溢出, http://outofmemory.cn/zaji/5646866.html
