This book is long, and much of it overlaps with other CUDA books, so I will only translate the highlights. Time is money — let's learn CUDA together. Corrections are welcome.
Since I haven't had time to read chapters 1 and 2 carefully, we start from chapter 3.
I don't like depending on the book's setup, so I won't use its header files; I rewrite every program myself, and skip the ones that are too boring.
// hello.cu
#include <stdio.h>
#include <cuda.h>

int main(void) {
    printf("Hello, World!\n");
    return 0;
}
Strictly speaking, this first program is not really a CUDA program — it merely includes the CUDA header. Compile it with: nvcc hello.cu -o hello
Run it with: ./hello
It launches no work on the GPU at all.
The second program:

#include <stdio.h>
#include <cuda.h>

__global__ void kernel(void) {}

int main(void) {
    kernel<<<1,1>>>();
    printf("Hello, World!\n");
    return 0;
}
This program calls a kernel function. The __global__ qualifier means the function is called from the CPU (host) but executed on the GPU (device).
As for what the parameters inside the triple angle brackets mean — see the next chapter.
#include <stdio.h>
#include <cuda.h>

__global__ void add(int a, int b, int *c) {
    *c = a + b;
}

int main(void)
{
    int c;
    int *dev_c;
    cudaMalloc((void**)&dev_c, sizeof(int));
    add<<<1,1>>>(2, 7, dev_c);
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("2 + 7 = %d\n", c);
    cudaFree(dev_c);
    return 0;
}
cudaMalloc() allocates memory on the GPU. cudaMemcpy() copies data between host and device: with cudaMemcpyDeviceToHost it copies results from the GPU back to the CPU, and with cudaMemcpyHostToDevice it copies input data from the CPU to the GPU.
cudaFree() releases GPU memory — the same idea as free() on the CPU, only the target is device memory.
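In real code all of these runtime calls can fail (for example when the device runs out of memory), and each returns a cudaError_t. Below is a minimal error-checking sketch of my own — the CHECK macro is a hypothetical helper, not something from the book — using only calls we already saw plus cudaGetErrorString():

```cuda
// errcheck.cu — sketch of checking CUDA runtime return codes
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

// Hypothetical helper: abort with a readable message on any CUDA error.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

int main(void) {
    int *dev_c;
    CHECK(cudaMalloc((void**)&dev_c, sizeof(int)));
    CHECK(cudaFree(dev_c));
    printf("All CUDA calls succeeded.\n");
    return 0;
}
```

Wrapping every runtime call this way costs nothing and turns silent garbage results into immediate, located failures.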
The highlight of this chapter (for me) is section 3.3, querying the GPU (device).
The idea: if you don't have the spec sheet for your GPU, or can't be bothered to pull the card out and look, or simply want your program to run on a wider range of hardware, you can query the GPU's parameters programmatically.
I'll skip the padding and cover only the useful parts.
Many machines today have more than one GPU, especially systems built for GPU computing, so we can use

int count;
cudaGetDeviceCount(&count);

to get the number of CUDA devices in the system.
Then, through the cudaDeviceProp structure, we can read each device's properties.
The definition below is from CUDA 3.0.
This structure is declared by the CUDA headers, so you can use it directly in your own programs without defining it yourself.
struct cudaDeviceProp {
    char name[256];               // device name
    size_t totalGlobalMem;        // global memory size in bytes
    size_t sharedMemPerBlock;     // maximum shared memory available to a thread block, in bytes; shared among all blocks resident on a multiprocessor
    int regsPerBlock;             // maximum number of 32-bit registers available to a thread block; shared among all blocks resident on a multiprocessor
    int warpSize;                 // warp size, in threads
    size_t memPitch;              // maximum pitch in bytes allowed for memory copies involving regions allocated with cudaMallocPitch()
    int maxThreadsPerBlock;       // maximum number of threads per block
    int maxThreadsDim[3];         // maximum size of each dimension of a block
    int maxGridSize[3];           // maximum size of each dimension of a grid
    size_t totalConstMem;         // constant memory size in bytes
    int major;                    // major compute capability number
    int minor;                    // minor compute capability number
    int clockRate;                // clock frequency
    size_t textureAlignment;      // alignment requirement for textures
    int deviceOverlap;            // whether the device can execute a cudaMemcpy() and a kernel concurrently
    int multiProcessorCount;      // number of multiprocessors on the device
    int kernelExecTimeoutEnabled; // whether there is a runtime limit on kernel execution
    int integrated;               // whether the GPU is an integrated GPU
    int canMapHostMemory;         // whether the device can map host memory into the CUDA device address space
    int computeMode;              // compute mode
    int maxTexture1D;             // maximum size of 1D textures
    int maxTexture2D[2];          // maximum dimensions of 2D textures
    int maxTexture3D[3];          // maximum dimensions of 3D textures
    int maxTexture2DArray[3];     // maximum dimensions of 2D texture arrays
    int concurrentKernels;        // whether the GPU supports executing multiple kernels concurrently
};
Example program:
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

int main()
{
    int i;
    int count;
    cudaGetDeviceCount(&count);
    printf("The count of CUDA devices:%d\n", count);

    cudaDeviceProp prop;
    for (i = 0; i < count; i++)
    {
        cudaGetDeviceProperties(&prop, i);
        printf("\n---General Information for device %d---\n", i);
        printf("Name of the cuda device: %s\n", prop.name);
        printf("Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("Clock rate: %d\n", prop.clockRate);
        printf("Device copy overlap(simultaneously perform a cudaMemcpy() and kernel execution): ");
        if (prop.deviceOverlap)
            printf("Enabled\n");
        else
            printf("Disabled\n");
        printf("Kernel execution timeout(whether there is a runtime limit for kernels executed on this device): ");
        if (prop.kernelExecTimeoutEnabled)
            printf("Enabled\n");
        else
            printf("Disabled\n");

        printf("\n---Memory Information for device %d ---\n", i);
        printf("Total global mem in bytes: %ld\n", prop.totalGlobalMem);
        printf("Total constant Mem: %ld\n", prop.totalConstMem);
        printf("Max mem pitch for memory copies in bytes: %ld\n", prop.memPitch);
        printf("Texture Alignment: %ld\n", prop.textureAlignment);

        printf("\n---MP Information for device %d---\n", i);
        printf("Multiprocessor count: %d\n", prop.multiProcessorCount);
        printf("Shared mem per mp(block): %ld\n", prop.sharedMemPerBlock);
        printf("Registers per mp(block):%d\n", prop.regsPerBlock);
        printf("Threads in warp:%d\n", prop.warpSize);
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("Max thread dimensions in a block:(%d,%d,%d)\n", prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("Max blocks dimensions in a grid:(%d,%d,%d)\n", prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        printf("\n");

        printf("\nIs the device an integrated GPU:");
        if (prop.integrated)
            printf("Yes!\n");
        else
            printf("No!\n");

        printf("Whether the device can map host memory into CUDA device address space:");
        if (prop.canMapHostMemory)
            printf("Yes!\n");
        else
            printf("No!\n");

        printf("Device's computing mode:%d\n", prop.computeMode);

        printf("\nThe maximum size for 1D textures:%d\n", prop.maxTexture1D);
        printf("The maximum dimensions for 2D textures:(%d,%d)\n", prop.maxTexture2D[0], prop.maxTexture2D[1]);
        printf("The maximum dimensions for 3D textures:(%d,%d,%d)\n", prop.maxTexture3D[0], prop.maxTexture3D[1], prop.maxTexture3D[2]);
        // printf("The maximum dimensions for 2D texture arrays:(%d,%d,%d)\n", prop.maxTexture2DArray[0], prop.maxTexture2DArray[1], prop.maxTexture2DArray[2]);

        printf("Whether the device supports executing multiple kernels within the same context simultaneously:");
        if (prop.concurrentKernels)
            printf("Yes!\n");
        else
            printf("No!\n");
    }

    return 0;
}
Output:
The count of CUDA devices:1
---General Information for device 0---
Name of the cuda device: GeForce GTX 470
Compute capability: 2.0
Clock rate: 1215000
Device copy overlap(simultaneously perform a cudaMemcpy() and kernel execution): Enabled
Kernel execution timeout(whether there is a runtime limit for kernels executed on this device): Enabled
---Memory Information for device 0 ---
Total global mem in bytes: 1341325312
Total constant Mem: 65536
Max mem pitch for memory copies in bytes: 2147483647
Texture Alignment: 512
---MP Information for device 0---
Multiprocessor count: 14
Shared mem per mp(block): 49152
Registers per mp(block):32768
Threads in warp:32
Max threads per block: 1024
Max thread dimensions in a block:(1024,1024,64)
Max blocks dimensions in a grid:(65535,65535,65535)
Is the device an integrated GPU:No!
Whether the device can map host memory into CUDA device address space:Yes!
Device's computing mode:0
The maximum size for 1D textures:65536
The maximum dimensions for 2D textures:(65536,65535)
The maximum dimensions for 3D textures:(2048,2048,2048)
Whether the device supports executing multiple kernels within the same context simultaneously:Yes!
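The point of querying these properties programmatically is that you can then pick a suitable device at runtime instead of hard-coding one. Along the lines of the book's section 3.3, the sketch below fills in only the cudaDeviceProp fields we care about and lets cudaChooseDevice() find the closest match — here, at least compute capability 1.3 — then selects it with cudaSetDevice():

```cuda
// choosedev.cu — pick a device by desired properties at runtime
#include <stdio.h>
#include <string.h>
#include <cuda.h>

int main(void) {
    cudaDeviceProp prop;
    int dev;

    cudaGetDevice(&dev);
    printf("Current CUDA device: %d\n", dev);

    // Zero the struct, then set only the fields we require;
    // cudaChooseDevice() returns the best-matching device ID.
    memset(&prop, 0, sizeof(cudaDeviceProp));
    prop.major = 1;
    prop.minor = 3;
    cudaChooseDevice(&dev, &prop);
    printf("Closest match to compute capability 1.3: device %d\n", dev);

    cudaSetDevice(dev);  // subsequent CUDA calls target this device
    return 0;
}
```

On a single-GPU machine like the GTX 470 above this will simply report device 0, but on a multi-GPU system it saves you from launching kernels on a device that lacks the features you need.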
Reference: "CUDA by Example"