python – 在numpyscipy中为稀疏的矩阵添加一个非常重复的矩阵？_python

概述我正在尝试在NumPy / Scipy中实现一个函数来计算单个(训练)向量和大量其他(观察)向量之间的 Jensen-Shannon divergence.观察向量存储在非常大的(500,000×65536) Scipy sparse matrix中(密集矩阵不适合存储器). 作为算法的一部分,我需要为每个观察向量Oi计算T Oi,其中T是训练向量.我无法使用NumPy的常规广播规则找到一种方法, 我正在尝试在NumPy / Scipy中实现一个函数来计算单个(训练)向量和大量其他(观察)向量之间的 Jensen-Shannon divergence.观察向量存储在非常大的(500,000×65536) Scipy sparse matrix中(密集矩阵不适合存储器).

作为算法的一部分,我需要为每个观察向量Oi计算T Oi,其中T是训练向量.我无法使用NumPy的常规广播规则找到一种方法,因为稀疏矩阵似乎不支持那些(如果T保留为密集阵列,Scipy会尝试使稀疏矩阵首先密集,哪些运行内存不足;如果我将T变成稀疏矩阵,则T Oi失败,因为形状不一致).

目前,我正在采取将训练向量平铺为500,000×65536稀疏矩阵的非常低效的步骤：

training = sp.csr_matrix(training.astype(np.float32))tindptr = np.arange(0,len(training.indices)*observations.shape[0]+1,len(training.indices),dtype=np.int32)tindices = np.tile(training.indices,observations.shape[0])tdata = np.tile(training.data,observations.shape[0])mtraining = sp.csr_matrix((tdata,tindices,tindptr),shape=observations.shape)

但是当它只存储~1500个“真实”元素时,它占用了大量的内存(大约6GB).构建起来也很慢.

我试图通过使用strIDe_tricks使CSR矩阵的indptr变得聪明,数据成员不会在重复数据上使用额外的内存.

training = sp.csr_matrix(training)mtraining = sp.csr_matrix(observations.shape,dtype=np.int32)tdata = training.datavdata = np.lib.strIDe_tricks.as_strIDed(tdata,(mtraining.shape[0],tdata.size),(0,tdata.itemsize))indices = training.indicesvindices = np.lib.strIDe_tricks.as_strIDed(indices,indices.size),indices.itemsize))mtraining.indptr = np.arange(0,len(indices)*mtraining.shape[0]+1,len(indices),dtype=np.int32)mtraining.data = vdatamtraining.indices = vindices

但是这不起作用,因为跨步视图mtraining.data和mtraining.indices是错误的形状(根据this answer,没有办法使它成为正确的形状).尝试使用.flat迭代器使它们看起来平坦失败,因为它看起来不像数组(例如它没有dtype成员),并且使用flatten()方法最终制作副本.

有没有办法完成这项工作？

解决方法你甚至没有考虑过的另一个选择是自己以稀疏格式实现总和,这样你就可以充分利用数组的周期性.如果你滥用scipy的稀疏矩阵的这种特殊行为,这可能很容易做到：

>>> a = sps.csr_matrix([1,2,3,4])>>> a.dataarray([1,4])>>> a.indicesarray([0,1,3])>>> a.indptrarray([0,4])>>> b = sps.csr_matrix((np.array([1,4,5]),...                     np.array([0,0]),5])),shape=(1,4))>>> b<1x4 sparse matrix of type '<type 'numpy.int32'>'    with 5 stored elements in Compressed Sparse Row format>>>> b.todense()matrix([[6,4]])

因此,您甚至不必在训练向量和观察矩阵的每一行之间寻找巧合来添加它们：只需用正确的指针填充所有数据,并且需要求和的东西将得到求和何时访问数据.

编辑

鉴于第一个代码的速度很慢,您可以按如下方式将内存换成速度：

def csr_add_sparse_vec(sps_mat,sps_vec) :    """Adds a sparse vector to every row of a sparse matrix"""    # No checks done,but both arguments should be sparse matrices in CSR    # format,both should have the same number of columns,and the vector    # should be a vector and have only one row.    rows,cols = sps_mat.shape    nnz_vec = len(sps_vec.data)    nnz_per_row = np.diff(sps_mat.indptr)    longest_row = np.max(nnz_per_row)    old_data = np.zeros((rows * longest_row,),dtype=sps_mat.data.dtype)    old_cols = np.zeros((rows * longest_row,dtype=sps_mat.indices.dtype)    data_IDx = np.arange(longest_row) < nnz_per_row[:,None]    data_IDx = data_IDx.reshape(-1)    old_data[data_IDx] = sps_mat.data    old_cols[data_IDx] = sps_mat.indices    old_data = old_data.reshape(rows,-1)    old_cols = old_cols.reshape(rows,-1)    new_data = np.zeros((rows,longest_row + nnz_vec,dtype=sps_mat.data.dtype)    new_data[:,:longest_row] = old_data    del old_data    new_cols = np.zeros((rows,dtype=sps_mat.indices.dtype)    new_cols[:,:longest_row] = old_cols    del old_cols    new_data[:,longest_row:] = sps_vec.data    new_cols[:,longest_row:] = sps_vec.indices    new_data = new_data.reshape(-1)    new_cols = new_cols.reshape(-1)    new_pointer = np.arange(0,(rows + 1) * (longest_row + nnz_vec),longest_row + nnz_vec)    ret = sps.csr_matrix((new_data,new_cols,new_pointer),shape=sps_mat.shape)    ret.eliminate_zeros()    return ret

它没有以前那么快,但它可以在大约1秒内完成10,000行：

In [2]: aOut[2]: <10000x65536 sparse matrix of type '<type 'numpy.float64'>'    with 15000000 stored elements in Compressed Sparse Row format>In [3]: bOut[3]: <1x65536 sparse matrix of type '<type 'numpy.float64'>'    with 1500 stored elements in Compressed Sparse Row format>In [4]: csr_add_sparse_vec(a,b)Out[4]: <10000x65536 sparse matrix of type '<type 'numpy.float64'>'    with 30000000 stored elements in Compressed Sparse Row format>In [5]: %timeit csr_add_sparse_vec(a,b)1 loops,best of 3: 956 ms per loop

编辑此代码非常非常慢

def csr_add_sparse_vec(sps_mat,cols = sps_mat.shape    new_data = sps_mat.data    new_pointer = sps_mat.indptr.copy()    new_cols = sps_mat.indices    aux_IDx = np.arange(rows + 1)    for value,col in itertools.izip(sps_vec.data,sps_vec.indices) :        new_data = np.insert(new_data,new_pointer[1:],[value] * rows)        new_cols = np.insert(new_cols,[col] * rows)        new_pointer += aux_IDx    return sps.csr_matrix((new_data,shape=sps_mat.shape)

总结

以上是内存溢出为你收集整理的python – 在numpy / scipy中为稀疏的矩阵添加一个非常重复的矩阵？全部内容，希望文章能够帮你解决python – 在numpy / scipy中为稀疏的矩阵添加一个非常重复的矩阵？所遇到的程序开发问题。

如果觉得内存溢出网站内容还不错，欢迎将内存溢出网站推荐给程序员好友。

欢迎分享，转载请注明来源：内存溢出

原文地址: http://outofmemory.cn/langs/1205170.html

python – 在numpyscipy中为稀疏的矩阵添加一个非常重复的矩阵？

发表评论

评论列表（0条）