[GPU Acceleration Series] PyCUDA (1): Getting Started with Basic Operations

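PyCUDA provides access to NVIDIA's CUDA parallel computing API from Python.

For an introduction and installation instructions, see the official PyCUDA documentation and the pycuda page on PyPI.

This article covers: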

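- Querying GPU information through PyCUDA.
- Converting between NumPy arrays and gpuarray.
- Performing basic operations with gpuarray.
- Performing element-wise operations with ElementwiseKernel.
- Performing reduce operations with InclusiveScanKernel and ReductionKernel.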

The examples in this article were run in a Jupyter Notebook on a machine with a GPU, with the following packages imported:

    import sys
    from time import time
    from functools import reduce

    import numpy as np
    import pandas as pd
    import matplotlib
    from matplotlib import pyplot as plt
    from IPython.core.interactiveshell import InteractiveShell

    import pycuda
    import pycuda.autoinit  # creates a CUDA context on import
    import pycuda.driver as drv
    from pycuda import gpuarray
    from pycuda.elementwise import ElementwiseKernel
    from pycuda.scan import InclusiveScanKernel
    from pycuda.reduction import ReductionKernel

    # Display the value of every expression in a cell, not only the last one.
    InteractiveShell.ast_node_interactivity = "all"
    print(f'The version of PyCUDA: {pycuda.VERSION}')
    print(f'The version of Python: {sys.version}')

Output:

The version of PyCUDA: (2019, 1, 2)
The version of Python: 3.6.6 |Anaconda, Inc.| (default, Oct  9 2018, 12:34:16)
[GCC 7.3.0]

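Querying GPU information

Querying the GPU is a very basic operation. The most commonly needed pieces of information include the device name, the amount of device memory, and the number of cores.

Define the function: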

    def query_device():
        drv.init()
        print('CUDA device query (PyCUDA version) \n')
        print(f'Detected {drv.Device.count()} CUDA Capable device(s) \n')
        for i in range(drv.Device.count()):

            gpu_device = drv.Device(i)
            print(f'Device {i}: {gpu_device.name()}')
            compute_capability = float('%d.%d' % gpu_device.compute_capability())
            print(f'\t Compute Capability: {compute_capability}')
            print(f'\t Total Memory: {gpu_device.total_memory()//(1024**2)} megabytes')

            # The following will give us all remaining device attributes as seen
            # in the original deviceQuery.
            # We set up a dictionary as such so that we can easily index
            # the values using a string descriptor.

            device_attributes_tuples = gpu_device.get_attributes().items()
            device_attributes = {}

            for k, v in device_attributes_tuples:
                device_attributes[str(k)] = v

            num_mp = device_attributes['MULTIPROCESSOR_COUNT']

            # Cores per multiprocessor is not reported by the GPU!
            # We must use a lookup table based on compute capability.
            # See the following:
            # http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities

            cuda_cores_per_mp = {5.0: 128, 5.1: 128, 5.2: 128, 6.0: 64, 6.1: 128, 6.2: 128}[compute_capability]

            print(f'\t ({num_mp}) Multiprocessors, ({cuda_cores_per_mp}) CUDA Cores / Multiprocessor: {num_mp*cuda_cores_per_mp} CUDA Cores')

            # Already printed above, so drop it before listing the rest.
            device_attributes.pop('MULTIPROCESSOR_COUNT')

            for k in device_attributes.keys():
                print(f'\t {k}: {device_attributes[k]}')

Run the GPU query by calling query_device(). It prints:

CUDA device query (PyCUDA version)

Detected 1 CUDA Capable device(s)

Device 0: Tesla P100-PCIE-16GB
	 Compute Capability: 6.0
	 Total Memory: 16280 megabytes
	 (56) Multiprocessors, (64) CUDA Cores / Multiprocessor: 3584 CUDA Cores
	 ASYNC_ENGINE_COUNT: 2
	 CAN_MAP_HOST_MEMORY: 1
	 CLOCK_RATE: 1328500
	 COMPUTE_CAPABILITY_MAJOR: 6
	 COMPUTE_CAPABILITY_MINOR: 0
	 COMPUTE_MODE: DEFAULT
	 CONCURRENT_KERNELS: 1
	 ECC_ENABLED: 1
	 GLOBAL_L1_CACHE_SUPPORTED: 1
	 GLOBAL_MEMORY_BUS_WIDTH: 4096
	 GPU_OVERLAP: 1
	 INTEGRATED: 0
	 KERNEL_EXEC_TIMEOUT: 0
	 L2_CACHE_SIZE: 4194304
	 LOCAL_L1_CACHE_SUPPORTED: 1
	 MANAGED_MEMORY: 1
	 MAXIMUM_SURFACE1D_LAYERED_LAYERS: 2048
	 MAXIMUM_SURFACE1D_LAYERED_WIDTH: 32768
	 MAXIMUM_SURFACE1D_WIDTH: 32768
	 MAXIMUM_SURFACE2D_HEIGHT: 65536
	 MAXIMUM_SURFACE2D_LAYERED_HEIGHT: 32768
	 MAXIMUM_SURFACE2D_LAYERED_LAYERS: 2048
	 MAXIMUM_SURFACE2D_LAYERED_WIDTH: 32768
	 MAXIMUM_SURFACE2D_WIDTH: 131072
	 MAXIMUM_SURFACE3D_DEPTH: 16384
	 MAXIMUM_SURFACE3D_HEIGHT: 16384
	 MAXIMUM_SURFACE3D_WIDTH: 16384
	 MAXIMUM_SURFACECUBEMAP_LAYERED_LAYERS: 2046
	 MAXIMUM_SURFACECUBEMAP_LAYERED_WIDTH: 32768
	 MAXIMUM_SURFACECUBEMAP_WIDTH: 32768
	 MAXIMUM_TEXTURE1D_LAYERED_LAYERS: 2048
	 MAXIMUM_TEXTURE1D_LAYERED_WIDTH: 32768
	 MAXIMUM_TEXTURE1D_LINEAR_WIDTH: 134217728
	 MAXIMUM_TEXTURE1D_MIPMAPPED_WIDTH: 16384
	 MAXIMUM_TEXTURE1D_WIDTH: 131072
	 MAXIMUM_TEXTURE2D_ARRAY_HEIGHT: 32768
	 MAXIMUM_TEXTURE2D_ARRAY_NUMSLICES: 2048
	 MAXIMUM_TEXTURE2D_ARRAY_WIDTH: 32768
	 MAXIMUM_TEXTURE2D_GATHER_HEIGHT: 32768
	 MAXIMUM_TEXTURE2D_GATHER_WIDTH: 32768
	 MAXIMUM_TEXTURE2D_HEIGHT: 65536
	 MAXIMUM_TEXTURE2D_LINEAR_HEIGHT: 65000
	 MAXIMUM_TEXTURE2D_LINEAR_PITCH: 2097120
	 MAXIMUM_TEXTURE2D_LINEAR_WIDTH: 131072
	 MAXIMUM_TEXTURE2D_MIPMAPPED_HEIGHT: 32768
	 MAXIMUM_TEXTURE2D_MIPMAPPED_WIDTH: 32768
	 MAXIMUM_TEXTURE2D_WIDTH: 131072
	 MAXIMUM_TEXTURE3D_DEPTH: 16384
	 MAXIMUM_TEXTURE3D_DEPTH_ALTERNATE: 32768
	 MAXIMUM_TEXTURE3D_HEIGHT: 16384
	 MAXIMUM_TEXTURE3D_HEIGHT_ALTERNATE: 8192
	 MAXIMUM_TEXTURE3D_WIDTH: 16384
	 MAXIMUM_TEXTURE3D_WIDTH_ALTERNATE: 8192
	 MAXIMUM_TEXTURECUBEMAP_LAYERED_LAYERS: 2046
	 MAXIMUM_TEXTURECUBEMAP_LAYERED_WIDTH: 32768
	 MAXIMUM_TEXTURECUBEMAP_WIDTH: 32768
	 MAX_BLOCK_DIM_X: 1024
	 MAX_BLOCK_DIM_Y: 1024
	 MAX_BLOCK_DIM_Z: 64
	 MAX_GRID_DIM_X: 2147483647
	 MAX_GRID_DIM_Y: 65535
	 MAX_GRID_DIM_Z: 65535
	 MAX_PITCH: 2147483647
	 MAX_REGISTERS_PER_BLOCK: 65536
	 MAX_REGISTERS_PER_MULTIPROCESSOR: 65536
	 MAX_SHARED_MEMORY_PER_BLOCK: 49152
	 MAX_SHARED_MEMORY_PER_MULTIPROCESSOR: 65536
	 MAX_THREADS_PER_BLOCK: 1024
	 MAX_THREADS_PER_MULTIPROCESSOR: 2048
	 MEMORY_CLOCK_RATE: 715000
	 MULTI_GPU_BOARD: 0
	 MULTI_GPU_BOARD_GROUP_ID: 0
	 PCI_BUS_ID: 0
	 PCI_DEVICE_ID: 4
	 PCI_DOMAIN_ID: 0
	 STREAM_PRIORITIES_SUPPORTED: 1
	 SURFACE_ALIGNMENT: 512
	 TCC_DRIVER: 0
	 TEXTURE_ALIGNMENT: 512
	 TEXTURE_PITCH_ALIGNMENT: 32
	 TOTAL_CONSTANT_MEMORY: 65536
	 UNIFIED_ADDRESSING: 1
	 WARP_SIZE: 32

Here we find a single GPU device, a Tesla P100-PCIE-16GB, with 16 GB of device memory and 3584 CUDA cores.
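The core count is not read from the device: it is the number of multiprocessors multiplied by a per-architecture constant. The following is a minimal sketch of that arithmetic which runs without a GPU. The names `CORES_PER_MP` and `total_cuda_cores` are hypothetical, the table holds only the six entries used in the query function, and keying it by the `(major, minor)` tuple rather than a float is a design choice of this sketch, not something the original code does.

```python
# Cores per multiprocessor, keyed by (major, minor) compute capability.
# Only the six entries from the query function are listed; any other
# architecture would need its own entry.
CORES_PER_MP = {
    (5, 0): 128, (5, 1): 128, (5, 2): 128,
    (6, 0): 64, (6, 1): 128, (6, 2): 128,
}


def total_cuda_cores(compute_capability, num_mp):
    """Return num_mp * cores-per-multiprocessor, or None if the capability is not in the table."""
    cores_per_mp = CORES_PER_MP.get(compute_capability)
    if cores_per_mp is None:
        return None
    return num_mp * cores_per_mp


# Tesla P100 from the output above: compute capability 6.0, 56 multiprocessors.
print(total_cuda_cores((6, 0), 56))   # 3584
# A capability that is missing from this table yields None instead of a KeyError.
print(total_cuda_cores((7, 0), 80))   # None
```

Using `dict.get` means a device with an unlisted compute capability produces `None` rather than stopping the whole query with an exception.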

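Converting between NumPy arrays and gpuarray

The GPU has its own memory, separate from the host's main memory. It is also called device memory.

A NumPy array lives on the CPU (the host side), whereas a gpuarray lives on the GPU (the device side). The two often need to be converted into each other, which means transferring data between CPU memory and GPU memory.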

    host_data = np.array([1, 2, 3, 4, 5], dtype=np.float32)
    device_data = gpuarray.to_gpu(host_data)   # host -> device
    device_data_x2 = 2 * device_data           # computed on the GPU
    host_data_x2 = device_data_x2.get()        # device -> host
    print(host_data_x2)

Output:

[ 2.  4.  6.  8. 10.]

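When converting, specify the element type explicitly via dtype whenever possible to avoid unnecessary performance loss. NumPy otherwise stores floating-point data as 64-bit values, which take twice the memory of float32 and so double the amount of data to transfer.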
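One reason the dtype matters is simply the volume of data copied between host and device. The sketch below is a back-of-the-envelope calculation using only the standard library. The helper name `transfer_megabytes` and the array length are hypothetical, and it measures size only, not the arithmetic speed difference between single and double precision on a given GPU.

```python
import struct

# Bytes per element for the two floating-point types.
# "=f" and "=d" are the struct codes for standard-size 32-bit and 64-bit floats.
BYTES_PER_ELEMENT = {
    "float32": struct.calcsize("=f"),
    "float64": struct.calcsize("=d"),
}


def transfer_megabytes(num_elements, dtype_name):
    """Size in MiB of one host-to-device copy of num_elements values."""
    return num_elements * BYTES_PER_ELEMENT[dtype_name] / (1024 ** 2)


n = 1_000_000  # hypothetical array length, for illustration only
for name in ("float32", "float64"):
    print(f"{name}: {transfer_megabytes(n, name):.2f} MiB")
```

For the same number of elements, the float64 copy is exactly twice the size of the float32 copy.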

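Basic operations with gpuarray

Element-wise operations are inherently parallelizable. When performing them, gpuarray automatically spreads the work across the GPU's cores.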

    x_host = np.array([1, 2, 3], dtype=np.float32)
    y_host = np.array([1, 1, 1], dtype=np.float32)
    z_host = np.array([2, 2, 2], dtype=np.float32)
    x_device = gpuarray.to_gpu(x_host)
    y_device = gpuarray.to_gpu(y_host)
    z_device = gpuarray.to_gpu(z_host)

    # Each pair evaluates the same expression on the host and on the device.
    x_host + y_host
    (x_device + y_device).get()

    x_host ** z_host
    (x_device ** z_device).get()

    x_host / x_host
    (x_device / x_device).get()

    z_host - x_host
    (z_device - x_device).get()

    z_host / 2
    (z_device / 2).get()

    x_host - 1
    (x_device - 1).get()

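Output:

array([2., 3., 4.], dtype=float32)
array([2., 3., 4.], dtype=float32)
array([1., 4., 9.], dtype=float32)
array([1., 4., 9.], dtype=float32)
array([1., 1., 1.], dtype=float32)
array([1., 1., 1.], dtype=float32)
array([ 1.,  0., -1.], dtype=float32)
array([ 1.,  0., -1.], dtype=float32)
array([1., 1., 1.], dtype=float32)
array([1., 1., 1.], dtype=float32)
array([0., 1., 2.], dtype=float32)
array([0., 1., 2.], dtype=float32)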

Performance comparison

    def simple_speed_test():
        host_data = np.float32(np.random.random(50000000))

        # Time the multiplication on the CPU.
        t1 = time()
        host_data_2x = host_data * np.float32(2)
        t2 = time()

        print(f'total time to compute on CPU: {t2 - t1}')

        device_data = gpuarray.to_gpu(host_data)

        # Time the same multiplication on the GPU (transfers excluded).
        t1 = time()
        device_data_2x = device_data * np.float32(2)
        t2 = time()

        from_device = device_data_2x.get()

        print(f'total time to compute on GPU: {t2 - t1}')
        print(f'Is the host computation the same as the GPU computation? : {np.allclose(from_device, host_data_2x)}')

    simple_speed_test()

On the first run, the output looks something like this:

total time to compute on CPU: 0.14141535758972168
total time to compute on GPU: 2.010883092880249
Is the host computation the same as the GPU computation? : True

On subsequent runs, the output looks something like this:

total time to compute on CPU: 0.1373155117034912
total time to compute on GPU: 0.0006959438323974609
Is the host computation the same as the GPU computation? : True

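This is because in PyCUDA the GPU code is usually compiled by the nvcc compiler the first time the program runs, and only then invoked by PyCUDA. That compilation time is the extra overhead seen on the first run.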

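ElementwiseKernel: element-wise operations

Let's first look at Python's built-in map function.

map calls its first argument, function, on every element of the given sequence and returns an iterator over the results (in Python 2, map returns a list). We use list() to turn the iterator into a list so that we can inspect the result.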

    list(map(lambda x: x + 10, [1, 2, 3, 4, 5]))

Output:

[11, 12, 13, 14, 15]

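ElementwiseKernel is very similar to the map function.

ElementwiseKernel lets you define a custom kernel for element-wise operations. Using it requires embedding CUDA C code.

A kernel can be understood here simply as a function that CUDA runs directly on the GPU.

Here is the code: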

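    gpu_2x_ker = ElementwiseKernel(
            "float *in, float *out",
            "out[i] = 2 * in[i];",
            "gpu_2x_ker"
        )

    def elementwise_kernel_example():
        host_data = np.float32(np.random.random(50000000))
        t1 = time()
        host_data_2x = host_data * np.float32(2)
        t2 = time()
        print(f'total time to compute on CPU: {t2 - t1}')

        device_data = gpuarray.to_gpu(host_data)
        # allocate memory for output
        device_data_2x = gpuarray.empty_like(device_data)

        t1 = time()
        gpu_2x_ker(device_data, device_data_2x)
        t2 = time()
        from_device = device_data_2x.get()
        print(f'total time to compute on GPU: {t2 - t1}')
        print(f'Is the host computation the same as the GPU computation? : {np.allclose(from_device, host_data_2x)}')

    elementwise_kernel_example()
    elementwise_kernel_example()
    elementwise_kernel_example()
    elementwise_kernel_example()
    elementwise_kernel_example()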

Output:

total time to compute on CPU: 0.13545799255371094
total time to compute on GPU: 0.4059629440307617
Is the host computation the same as the GPU computation? : True
total time to compute on CPU: 0.13948774337768555
total time to compute on GPU: 0.0001266002655029297
Is the host computation the same as the GPU computation? : True
total time to compute on CPU: 0.1357274055480957
total time to compute on GPU: 0.0001552104949951172
Is the host computation the same as the GPU computation? : True
total time to compute on CPU: 0.13451647758483887
total time to compute on GPU: 0.0001761913299560547
Is the host computation the same as the GPU computation? : True
total time to compute on CPU: 0.1362597942352295
total time to compute on GPU: 0.00011849403381347656
Is the host computation the same as the GPU computation? : True

Again, the first run shows the overhead of nvcc compilation.

Parameters of ElementwiseKernel:

class pycuda.elementwise.ElementwiseKernel(arguments, operation, name="kernel", keep=False, options=[], preamble="")

- arguments: the argument list the kernel takes.
- operation: the CUDA C code embedded in the kernel.
- name: the name given to the kernel.

gpuarray.empty_like allocates device memory with the same shape and dtype as device_data.
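The index `i` in the operation string is supplied by the kernel: the snippet is evaluated once per element. The following CPU-only sketch mimics that behavior with plain Python lists, to make the role of `i` and of the pre-allocated output concrete. `emulate_elementwise` and `double_op` are hypothetical names, and the sequential loop only imitates the result, not how PyCUDA schedules threads.

```python
def emulate_elementwise(operation, n):
    """Run operation(i) once for every index i in range(n).

    On the GPU the indices are distributed over many threads; this CPU
    stand-in simply loops, which gives the same result.
    """
    for i in range(n):
        operation(i)


# Plain Python lists standing in for the device arrays.
in_data = [0.5, 1.5, 2.5, 3.5]
out_data = [0.0] * len(in_data)      # plays the role of gpuarray.empty_like


def double_op(i):
    # Python counterpart of the CUDA C snippet "out[i] = 2 * in[i];"
    out_data[i] = 2 * in_data[i]


emulate_elementwise(double_op, len(in_data))
print(out_data)                                          # [1.0, 3.0, 5.0, 7.0]
print(out_data == list(map(lambda x: 2 * x, in_data)))   # True
```

The final comparison shows the connection to map: the per-index snippet produces exactly what mapping the same function over the input would.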

Reduce operations with InclusiveScanKernel and ReductionKernel

Let's first look at the reduce function from Python's standard functools package.

    reduce(lambda x, y: x + y, [1, 2, 3, 4])

Output:

10

Unlike map, reduce applies a binary operation iteratively and outputs only a single value.

We will use InclusiveScanKernel and ReductionKernel to implement reduce-like operations.

InclusiveScanKernel

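InclusiveScanKernel is similar to reduce, except that it does not output a single value: its output has the same shape as its input.

For a sum, the output is the sequence of running totals: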

    seq = np.array([1, 2, 3, 4], dtype=np.int32)
    seq_gpu = gpuarray.to_gpu(seq)
    sum_gpu = InclusiveScanKernel(np.int32, "a+b")
    print(sum_gpu(seq_gpu).get())
    print(np.cumsum(seq))   # NumPy reference result

Output:

[ 1  3  6 10]
[ 1  3  6 10]
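On the CPU, the standard library's `itertools.accumulate` plays the same role as an inclusive scan, which makes the relationship to reduce easy to check. This is only a host-side analogy and says nothing about how the GPU kernel is implemented.

```python
from functools import reduce
from itertools import accumulate
from operator import add

seq = [1, 2, 3, 4]

running_totals = list(accumulate(seq, add))   # inclusive scan: every partial result
total = reduce(add, seq)                      # reduction: only the final result

print(running_totals)                 # [1, 3, 6, 10]
print(total)                          # 10
print(running_totals[-1] == total)    # True
```

The last element of the scan equals the result of the reduction, which is why a scan can be used to obtain reduce-like results.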

Finding the maximum (the maximum bubbles toward the end):

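    # Assumed input: the original definition of seq is missing from this copy.
    # These values reproduce the output shown after this code.
    seq = np.array([1, 100, -3, -10000, 4, 10000, 66, 14, 21], dtype=np.int32)
    seq_gpu = gpuarray.to_gpu(seq)
    max_gpu = InclusiveScanKernel(np.int32, "a > b ? a : b")
    seq_max_bubble = max_gpu(seq_gpu)
    print(seq_max_bubble)
    print(seq_max_bubble.get()[-1])
    print(np.max(seq))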

Output:

[    1   100   100   100   100 10000 10000 10000 10000]
10000
10000

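For a > b ? a : b, we can picture a front-to-back traversal (in reality it runs in parallel): each current element cur is compared with the already-updated element before it, and the larger of the two is assigned to cur.

In this way the maximum "bubbles" toward the end, and we simply take the last element.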
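The front-to-back picture is sequential, so it may help to see how a scan can be organized into passes whose updates are independent of one another. The sketch below emulates a Hillis-Steele style scan on the CPU. This is a well-known textbook scheme and is not claimed to be the algorithm PyCUDA uses internally. The function name and the input sequence are hypothetical.

```python
from itertools import accumulate


def stepwise_inclusive_scan(values, op):
    """Inclusive scan done in about log2(n) passes (Hillis-Steele style).

    Within one pass every position reads only the previous pass's values,
    so on a GPU all positions could be updated at the same time.
    op must be associative.
    """
    current = list(values)
    offset = 1
    while offset < len(current):
        previous = current
        current = [
            op(previous[i - offset], previous[i]) if i >= offset else previous[i]
            for i in range(len(previous))
        ]
        offset *= 2
    return current


# Hypothetical input, chosen only for illustration.
seq = [3, 7, -2, 7, 50, 1, 49, 0]
bubbled = stepwise_inclusive_scan(seq, max)

print(bubbled)                                  # [3, 7, 7, 7, 50, 50, 50, 50]
print(bubbled == list(accumulate(seq, max)))    # True
print(bubbled[-1] == max(seq))                  # True
```

For eight elements the loop runs three passes (offsets 1, 2 and 4) instead of seven sequential steps, and the result matches the sequential running maximum.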

ReductionKernel

In effect, ReductionKernel is like running an ElementwiseKernel followed by a parallel scan kernel.

An example that computes the inner product of two vectors:

    a_host = np.array([1, 2, 3], dtype=np.float32)
    b_host = np.array([4, 5, 6], dtype=np.float32)
    print(a_host.dot(b_host))   # NumPy reference result

    dot_prod = ReductionKernel(np.float32, neutral="0", reduce_expr="a+b",
                               map_expr="x[i]*y[i]", arguments="float *x, float *y")
    a_device = gpuarray.to_gpu(a_host)
    b_device = gpuarray.to_gpu(b_host)
    print(dot_prod(a_device, b_device).get())

Output:

32.0
32.0

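First, map_expr is evaluated for each pair of corresponding elements of the two vectors. The results are then combined with reduce_expr (neutral is the initial value), which yields the inner product of the two vectors.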
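The same two stages can be written with Python's map and reduce, which may make the roles of the three parameters easier to see. This is a CPU-side analogy only. It assumes the reduction is a sum, as a dot product requires, and it uses plain lists in place of the device arrays.

```python
from functools import reduce
from operator import add, mul

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

# Stage 1: counterpart of map_expr "x[i]*y[i]", one product per index.
mapped = list(map(mul, a, b))

# Stage 2: counterpart of a summing reduce_expr, starting from the neutral value 0.
dot = reduce(add, mapped, 0.0)

print(mapped)   # [4.0, 10.0, 18.0]
print(dot)      # 32.0

# With no elements at all, the result is just the neutral value.
print(reduce(add, [], 0.0))   # 0.0
```

The empty-input case shows why a neutral value is needed: it is what the reduction returns when there is nothing to combine, and it must leave other values unchanged under the reduction (0 for a sum).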

That wraps up this first look at some basic PyCUDA operations.

Original author: 雨先生
Original link: https://www.cnblogs.com/noluye/p/11465389.html
License: Creative Commons Attribution-NonCommercial 4.0 International

References

- PyCUDA official documentation
- pycuda on PyPI
- "Hands-On GPU Programming with Python and CUDA" by Dr. Brian Tuomanen
- CUDA Series Study (5), Basic GPU Algorithms: Reduce, Scan, Histogram (summary)
