Setting up a TensorFlow 2.6 GPU development environment on Windows 11 (RTX 3060)


Outline
  • Introduction
  • Windows native approach
    • Main steps
    • Installing CUDA locally
    • Installing cuDNN locally
    • Key features
    • Environment variable configuration
    • Building the Anaconda environment
  • WSL 2 Docker approach
    • Overview
    • Deep learning with Docker on WSL vs. the native approach
    • Main steps
      • 1. Install the WSL 2 compatible Windows NVIDIA driver
      • 2. Install Docker and the NVIDIA container runtime in WSL 2
  • References


Introduction

CUDA® is a parallel computing platform and programming model invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU).

CUDA was developed with several design goals in mind:
Provide a small set of extensions to standard programming languages, like C, that enable a straightforward implementation of parallel algorithms. With CUDA C/C++, programmers can focus on the task of parallelization of the algorithms rather than spending time on their implementation.
Support heterogeneous computation where applications use both the CPU and GPU. Serial portions of applications are run on the CPU, and parallel portions are offloaded to the GPU. As such, CUDA can be incrementally applied to existing applications. The CPU and GPU are treated as separate devices that have their own memory spaces. This configuration also allows simultaneous computation on the CPU and GPU without contention for memory resources.

CUDA-capable GPUs have hundreds of cores that can collectively run thousands of computing threads. These cores have shared resources including a register file and a shared memory. The on-chip shared memory allows parallel tasks running on these cores to share data without sending it over the system memory bus.
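
To make the CPU/GPU split concrete for the stack used in this article, here is a minimal TensorFlow sketch (my own illustration, not part of the CUDA documentation); it assumes a working TensorFlow GPU install such as the one built later in this guide:

import tensorflow as tf

# Serial or lightweight work can stay on the CPU ...
with tf.device("/CPU:0"):
    a = tf.random.normal([1000, 1000])
    b = tf.random.normal([1000, 1000])

# ... while the data-parallel matrix multiply is offloaded to the GPU.
with tf.device("/GPU:0"):
    c = tf.matmul(a, b)

print(c.device)  # e.g. /job:localhost/replica:0/task:0/device:GPU:0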

Windows native approach

The main difficulty of installing on Windows is matching package versions. Don't rush; the key is to keep going back to the official sites to check.

Main steps

The following NVIDIA® software must be installed on the system (a quick version-check sketch follows this list):

  • NVIDIA® GPU driver - CUDA® 11.2 requires 450.80.02 or higher.

  • CUDA® Toolkit - TensorFlow supports CUDA® 11.2 (TensorFlow 2.5.0 and later).

  • CUPTI, which ships with the CUDA® Toolkit.

  • cuDNN SDK 8.1.0 (see the cuDNN versions page).
    (Optional) TensorRT 6.0 to improve latency and throughput for inference on some models.
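
A quick way to cross-check these requirements once TensorFlow is installed (a small sketch, assuming a CUDA build of TensorFlow) is to ask TensorFlow itself which CUDA and cuDNN versions it was built against:

import tensorflow as tf

# Reports the CUDA / cuDNN versions this TensorFlow wheel was compiled against,
# which is what the locally installed toolkit and cuDNN SDK have to match.
build = tf.sysconfig.get_build_info()
print("CUDA:", build.get("cuda_version"))
print("cuDNN:", build.get("cudnn_version"))
print("GPUs visible:", tf.config.list_physical_devices("GPU"))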

Installing CUDA locally

I selected the Win 11 build. The naming is a bit puzzling - what does it tell us? My guess is that the Windows 11 and Windows 10 kernels are not really different. (In other words, the Windows 11 upgrade changed less than advertised...)

CUDA Toolkit 11.5 Downloads

After the installation completes:

PS C:\Users\season> nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_Sep_13_20:11:50_Pacific_Daylight_Time_2021
Cuda compilation tools, release 11.5, V11.5.50
Build cuda_11.5.r11.5/compiler.30411180_0

Documentation:

https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/

Installing cuDNN locally

https://developer.nvidia.com/zh-cn/cudnn

https://docs.nvidia.com/deeplearning/cudnn/index.html

Find the matching version:



From the official site you can see that cuDNN ships dedicated optimizations for a number of models and has been repackaged into smaller libraries.

What's new in cuDNN 8

cuDNN 8 is optimized for A100 GPUs, delivering up to 5x out-of-the-box performance over V100 GPUs, and includes new optimizations and APIs for applications such as conversational AI and computer vision. It has been redesigned for ease of use and application integration while giving developers greater flexibility.

Highlights of cuDNN 8 include:

  • Tuned for peak performance on NVIDIA A100 GPUs, including the new TensorFloat-32, FP16, and FP32
  • A redesigned low-level API gives direct access to cuDNN kernels for finer control and performance tuning
  • A backward-compatibility layer keeps supporting cuDNN 7.x, so developers can transition smoothly to the new cuDNN 8 API
  • New optimizations for computer vision, speech, and language-understanding networks
  • New APIs for fusing operators to accelerate convolutional neural networks
  • cuDNN 8 now ships as six smaller libraries for finer-grained integration into applications. Developers can download cuDNN directly or pull it from framework containers on NGC.

Key features
  • Tensor Core acceleration for all popular convolutions, including 2D, 3D, grouped, depthwise-separable, and dilated convolutions with NHWC and NCHW inputs and outputs (see the sketch after this list)
  • Kernels optimized for many computer vision and speech models, including ResNet, ResNext, SSD, MaskRCNN, Unet, VNet, BERT, GPT-2, Tacotron2, and WaveGlow
  • Support for FP32, FP16, and TF32 floating-point formats as well as INT8 and UINT8 integer formats
  • Arbitrary dimension ordering, striding, and sub-regions for 4D tensors, which makes integration into any neural network implementation straightforward and speeds up fused operations on any CNN architecture
  • cuDNN is supported on Windows and Linux with Ampere, Turing, Volta, Pascal, Maxwell, and Kepler GPU architectures in the data center and on mobile GPUs.
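
To see the cuDNN convolution path from the TensorFlow side, here is a minimal sketch (my own illustration, not from the cuDNN documentation); it assumes the GPU environment built later in this guide and uses Keras mixed precision so that FP16 Tensor Cores are engaged on an Ampere GPU such as the RTX 3060:

import tensorflow as tf

# Optional: FP16 compute engages Tensor Cores on Ampere/Turing GPUs.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# A toy NHWC convolution; on a CUDA build of TensorFlow it is dispatched to cuDNN.
conv = tf.keras.layers.Conv2D(filters=64, kernel_size=3, padding="same")
images = tf.random.normal([8, 224, 224, 3])   # batch, height, width, channels (NHWC)

with tf.device("/GPU:0"):
    features = conv(images)

print(features.shape, features.dtype)          # (8, 224, 224, 64) float16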

Frameworks that use cuDNN

Building the Anaconda environment
# Configure conda

conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --set show_channel_urls yes
conda config --show  # show the conda configuration


# Configure pip
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

# Create a new environment
conda create --name nlp_tf2 python=3.9

# Install tensorflow-gpu

pip install tensorflow-gpu==2.6.2


When installing TensorFlow, pip is recommended; the conda packages may not be accurate, so use pip for this step. (To be fair, I simply never tried the conda route.)

(nlp_tf2) C:\Users\season>pip install tensorflow==
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
ERROR: Could not find a version that satisfies the requirement tensorflow== (from versions:
 2.5.0rc0, 2.5.0rc1, 2.5.0rc2, 2.5.0rc3, 2.5.0, 2.5.1, 2.5.2, 2.6.0rc0, 2.6.0rc1, 2.6.0rc2, 2.6.0, 2.6.1, 2.6.2, 2.7.0rc0, 2.7.0rc1, 2.7.0)
ERROR: No matching distribution found for tensorflow==

(nlp_tf2) C:\Users\season>pip install tensorflow-gpu==
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
ERROR: Could not find a version that satisfies the requirement tensorflow-gpu== (from versions: 
2.5.0, 2.5.1, 2.5.2, 2.6.0, 2.6.1, 2.6.2, 2.7.0rc0, 2.7.0rc1, 2.7.0)
ERROR: No matching distribution found for tensorflow-gpu==

https://tensorflow.google.cn/install/pip#windows_1
According to the table on the official page above, with Python 3.9 you should install TensorFlow 2.6.

Environment variable configuration

Setting the environment variables on the cmd command line, as below, means they have to be set again before every run, but the upside is that multiple CUDA versions can coexist without polluting the global environment variables. (A Python-side alternative is sketched right after these SET commands.)

SET PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.5\bin;%PATH%
SET PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.5\extras\CUPTI\lib64;%PATH%
SET PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.5\include;%PATH%
SET PATH=C:\cuDNN\cudnn-11.5-windows-x64-v8.3.0.98\cuda\bin;%PATH%
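
As an alternative to the SET commands, the same directories can be added from Python before importing TensorFlow. This is only a sketch under the assumption that CUDA 11.5 and cuDNN live in the paths shown above; adjust them to your machine:

import os

# Assumed local install paths - mirror the SET commands above, adjust as needed.
cuda_bin = r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.5\bin"
cupti_lib = r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.5\extras\CUPTI\lib64"
cudnn_bin = r"C:\cuDNN\cudnn-11.5-windows-x64-v8.3.0.98\cuda\bin"

for path in (cuda_bin, cupti_lib, cudnn_bin):
    os.add_dll_directory(path)                                # Python 3.8+ on Windows
    os.environ["PATH"] = path + os.pathsep + os.environ["PATH"]

import tensorflow as tf                                       # import after the DLL paths are set
print(tf.config.list_physical_devices("GPU"))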

Verifying the installation

(nlp_tf2) C:\Users\season>python
Python 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.


>>> import tensorflow as tf
>>> tf.reduce_sum(tf.random.normal([1000, 1000]))

2021-11-23 01:18:34.892308: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-23 01:18:35.377735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3495 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6




>>> version = tf.__version__
>>> gpu_ok = tf.test.is_gpu_available()

WARNING:tensorflow:From <stdin>:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2021-11-24 23:56:25.051249: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /device:GPU:0 with 3272 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6
>>> print("tf version:",version,"nuse GPU",gpu_ok)
tf version: 2.6.2
use GPU True
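
As the warning above says, tf.test.is_gpu_available() is deprecated. Here is a minimal sketch of the recommended replacement (my addition, not part of the original session log), which also turns on on-demand GPU memory allocation:

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print("tf version:", tf.__version__)
print("GPUs:", gpus)

# Optional: allocate GPU memory on demand instead of reserving it all up front.
# Must be called before the GPU is first used in this process.
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)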



WSL 2 Docker approach

The official docs now mostly recommend the Docker route, so let's give Docker a try. Most installation tutorials online are Docker-based as well (NVIDIA provides an official guide), so we might as well keep up with the times.

Here is the WSL kernel version on my machine:

season@season:~$ uname -r
5.10.16.3-microsoft-standard-WSL2

Overview

Official documentation:

  • https://docs.nvidia.com/cuda/wsl-user-guide/index.html

Deep learning with Docker on WSL vs. the native approach

The PyTorch MNIST test is a deliberately small, toy machine-learning example, and it highlights how important it is to keep the GPU busy in order to get satisfactory WSL2 performance. As on native Linux, the smaller the workload, the more likely it is that performance degrades because of the overhead of launching GPU work; this degradation is more pronounced on WSL2 and does not scale the same way as on native Linux.

As the chart in the post below shows, with a small batch size much of the time goes into CUDA calls: at batch size 8, the runtime is about 138% of native CUDA. Increase the batch size so CUDA stays fully busy and performance gets close to native (a rough timing sketch follows the link below).

https://developer.nvidia.com/blog/leveling-up-cuda-performance-on-wsl2-with-new-enhancements/
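
To get a feel for this launch-overhead effect yourself, here is a rough TensorFlow-based timing sketch (my own, not the PyTorch MNIST benchmark from the linked post); absolute numbers will vary, but the per-sample cost should drop sharply as the batch size grows:

import time
import tensorflow as tf

def per_sample_us(batch_size, steps=100):
    """Time a simple matrix multiply and report microseconds per sample."""
    x = tf.random.normal([batch_size, 1024])
    w = tf.random.normal([1024, 1024])
    _ = tf.matmul(x, w)                      # warm-up: exclude first-call overhead
    start = time.perf_counter()
    for _ in range(steps):
        y = tf.matmul(x, w)
    _ = y.numpy()                            # force execution to finish before stopping the clock
    elapsed = time.perf_counter() - start
    return 1e6 * elapsed / (steps * batch_size)

for bs in (8, 64, 512):
    print(f"batch size {bs:4d}: {per_sample_us(bs):8.2f} us/sample")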

CUDA driver on WSL:
https://developer.nvidia.com/cuda/wsl/download

Main steps

1. Install the WSL 2 compatible Windows NVIDIA driver


For some reason the download page doesn't really explain what this driver has to do with WSL... I already had a driver installed, so it's not clear what this step actually added.

Note: inside WSL 2, the CUDA Toolkit has to be installed with the following script (on the CUDA download page, WSL-Ubuntu is selected as its own target option):

wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.5.1/local_installers/cuda-repo-wsl-ubuntu-11-5-local_11.5.1-1_amd64.deb
sudo dpkg -i cuda-repo-wsl-ubuntu-11-5-local_11.5.1-1_amd64.deb
sudo apt-key add /var/cuda-repo-wsl-ubuntu-11-5-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda

Once it is installed, run a test. The Black-Scholes model (B-S model for short) is a mathematical model for pricing financial instruments.

cd /usr/local/cuda-11.5/samples/4_Finance/BlackScholes

season@season:/usr/local/cuda-11.5/samples/4_Finance/BlackScholes$ ll
total 60
drwxr-xr-x  4 root root  4096 Nov 24 23:07 ./
drwxr-xr-x 10 root root  4096 Nov 24 23:07 ../
drwxr-xr-x  2 root root  4096 Nov 24 23:07 .vscode/
-rw-r--r--  1 root root  8382 Sep 21 01:38 BlackScholes.cu
-rw-r--r--  1 root root  2787 Sep 21 01:38 BlackScholes_gold.cpp
-rw-r--r--  1 root root  3646 Sep 21 01:38 BlackScholes_kernel.cuh
-rw-r--r--  1 root root 13454 Sep 21 01:38 Makefile
-rw-r--r--  1 root root  1859 Sep 21 01:38 NsightEclipse.xml
drwxr-xr-x  2 root root  4096 Nov 24 23:07 doc/
-rw-r--r--  1 root root   189 Sep 21 01:38 readme.txt
season@season:/usr/local/cuda-11.5/samples/4_Finance/BlackScholes$ sudo make BlackScholes
>>> GCC Version is greater or equal to 5.1.0 <<<
/usr/local/cuda-11.5/bin/nvcc -ccbin g++ -I../../common/inc  -m64    -maxrregcount=16 --threads 0 --std=c++11 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -o BlackScholes.o -c BlackScholes.cu
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
ptxas warning : For profile sm_86 adjusting per thread register count of 16 to lower bound of 24
ptxas warning : For profile sm_75 adjusting per thread register count of 16 to lower bound of 24
ptxas warning : For profile sm_70 adjusting per thread register count of 16 to lower bound of 24
ptxas warning : For profile sm_80 adjusting per thread register count of 16 to lower bound of 24
/usr/local/cuda-11.5/bin/nvcc -ccbin g++ -I../../common/inc  -m64    -maxrregcount=16 --threads 0 --std=c++11 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -o BlackScholes_gold.o -c BlackScholes_gold.cpp
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
/usr/local/cuda-11.5/bin/nvcc -ccbin g++   -m64      -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -o BlackScholes BlackScholes.o BlackScholes_gold.o
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
mkdir -p ../../bin/x86_64/linux/release
cp BlackScholes ../../bin/x86_64/linux/release
season@season:/usr/local/cuda-11.5/samples/4_Finance/BlackScholes$ ./BlackScholes
[./BlackScholes] - Starting...
GPU Device 0: "Ampere" with compute capability 8.6

Initializing data...
...allocating CPU memory for options.
...allocating GPU memory for options.
...generating input data in CPU mem.
...copying input data to GPU mem.
Data init done.

Executing Black-Scholes GPU kernel (512 iterations)...
Options count             : 8000000
BlackScholesGPU() time    : 0.261508 msec
Effective memory bandwidth: 305.918207 GB/s
Gigaoptions per second    : 30.591821

BlackScholes, Throughput = 30.5918 GOptions/s, Time = 0.00026 s, Size = 8000000 options, NumDevsUsed = 1, Workgroup = 128

Reading back GPU results...
Checking the results...
...running CPU calculations.

Comparing the results...
L1 norm: 1.741792E-07
Max absolute error: 1.192093E-05

Shutting down...
...releasing GPU memory.
...releasing CPU memory.
Shutdown done.

[BlackScholes] - Test Summary

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Test passed
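
For comparison with the CUDA sample above, here is a small vectorized Black-Scholes pricer written with TensorFlow (my own sketch, not part of the CUDA samples); it runs the same kind of embarrassingly parallel math on whichever device TensorFlow picks:

import tensorflow as tf

def norm_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + tf.math.erf(x / tf.sqrt(2.0)))

@tf.function
def black_scholes(spot, strike, rate, vol, t):
    """Vectorized Black-Scholes prices for European calls and puts."""
    sqrt_t = tf.sqrt(t)
    d1 = (tf.math.log(spot / strike) + (rate + 0.5 * vol * vol) * t) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    discount = tf.exp(-rate * t)
    call = spot * norm_cdf(d1) - strike * discount * norm_cdf(d2)
    put = strike * discount * norm_cdf(-d2) - spot * norm_cdf(-d1)
    return call, put

n = 8_000_000                                   # same option count as the CUDA sample
spot = tf.random.uniform([n], 5.0, 30.0)
strike = tf.random.uniform([n], 1.0, 100.0)
t = tf.random.uniform([n], 0.25, 10.0)
call, put = black_scholes(spot, strike, rate=0.02, vol=0.30, t=t)
print(call[:3].numpy(), put[:3].numpy())
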
2. Install Docker and the NVIDIA container runtime in WSL 2

Install standard Docker:

curl https://get.docker.com | sh

which produces the following output:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 18617  100 18617    0     0  25294      0 --:--:-- --:--:-- --:--:-- 25294
# Executing docker install script, commit: 93d2499759296ac1f9c510605fef85052a2c32be

WSL DETECTED: We recommend using Docker Desktop for Windows.
Please get Docker Desktop from https://www.docker.com/products/docker-desktop


You may press Ctrl+C now to abort this script.
+ sleep 20
+ sudo -E sh -c apt-get update -qq >/dev/null
+ sudo -E sh -c DEBIAN_FRONTEND=noninteractive apt-get install -y -qq apt-transport-https ca-certificates curl >/dev/null
+ sudo -E sh -c curl -fsSL "https://download.docker.com/linux/ubuntu/gpg" | gpg --dearmor --yes -o /usr/share/keyrings/docker-archive-keyring.gpg
+ sudo -E sh -c echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu focal stable" > /etc/apt/sources.list.d/docker.list
+ sudo -E sh -c apt-get update -qq >/dev/null
+ sudo -E sh -c DEBIAN_FRONTEND=noninteractive apt-get install -y -qq --no-install-recommends  docker-ce-cli docker-scan-plugin docker-ce >/dev/null
+ version_gte 20.10
+ [ -z  ]
+ return 0
+ sudo -E sh -c DEBIAN_FRONTEND=noninteractive apt-get install -y -qq docker-ce-rootless-extras >/dev/null

================================================================================

To run Docker as a non-privileged user, consider setting up the
Docker daemon in rootless mode for your user:

    dockerd-rootless-setuptool.sh install

Visit https://docs.docker.com/go/rootless/ to learn about rootless mode.


To run the Docker daemon as a fully privileged service, but granting non-root
users access, refer to https://docs.docker.com/go/daemon-access/

WARNING: Access to the remote API on a privileged Docker daemon is equivalent
         to root access on the host. Refer to the 'Docker daemon attack surface'
         documentation for details: https://docs.docker.com/go/attack-surface/

================================================================================

After installing it, I noticed that this newly arrived component also uses quite a bit of memory.

Install the NVIDIA Container Toolkit:

$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)

$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -

$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

$ curl -s -L https://nvidia.github.io/libnvidia-container/experimental/$distribution/libnvidia-container-experimental.list | sudo tee /etc/apt/sources.list.d/libnvidia-container-experimental.list

$ sudo apt-get update

$ sudo apt-get install -y nvidia-docker2

Test 1: a simple container

season@season:~$ sudo service docker stop
[sudo] password for season:
 * Docker already stopped - file /var/run/docker-ssd.pid not found.
season@season:~$ sudo service docker start
 * Starting Docker: docker                                                                                       [ OK ]

season@season:~$ sudo docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
Unable to find image 'nvcr.io/nvidia/k8s/cuda-sample:nbody' locally
nbody: Pulling from nvidia/k8s/cuda-sample
d519e2592276: Pull complete
d22d2dfcfa9c: Pull complete
b3afe92c540b: Pull complete
b25f8d7adb24: Pull complete
ddb025f124b9: Pull complete
fe72fda9c19e: Pull complete
c6a265e4ffa3: Pull complete
c931a9542ebf: Pull complete
f7eb321dd245: Pull complete
d67fd954fbd5: Pull complete
Digest: sha256:a2117f5b8eb3012076448968fd1790c6b63975c6b094a8bd51411dee0c08440d
Status: Downloaded newer image for nvcr.io/nvidia/k8s/cuda-sample:nbody
Run "nbody -benchmark [-numbodies=]" to measure performance.
        -fullscreen       (run n-body simulation in fullscreen mode)
        -fp64             (use double precision floating point values for simulation)
        -hostmem          (stores simulation data in host memory)
        -benchmark        (run benchmark to measure performance)
        -numbodies=    (number of bodies (>= 1) to run in simulation)
        -device=       (where d=0,1,2.... for the CUDA device to use)
        -numdevices=   (where i=(number of CUDA devices > 0) to use for simulation)
        -compare          (compares simulation results running once on the default GPU and once on the CPU)
        -cpu              (run n-body simulation on the CPU)
        -tipsy= (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Ampere" with compute capability 8.6

> Compute 8.6 CUDA device: [NVIDIA GeForce RTX 3060 Laptop GPU]
30720 bodies, total time for 10 iterations: 25.952 ms
= 363.636 billion interactions per second
= 7272.727 single-precision GFLOP/s at 20 flops per interaction
season@season:~$

Test 2: Jupyter Notebooks


References

Installing WSL 2 (the official documentation explains it clearly)

  • https://docs.microsoft.com/zh-cn/windows/wsl/install

Set up WSL2 + CUDA + Docker for deep-learning development on Windows in 5 steps

  • https://zhuanlan.zhihu.com/p/408403790

Windows+WSL2+CUDA+Docker

  • https://blog.csdn.net/fleaxin/article/details/108911522

TensorFlow official GPU support documentation

  • https://tensorflow.google.cn/install/gpu

Official CUDA installation guide for Windows

  • https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/
