A concrete example: the new laptop arrived with an RTX 3050 Ti (4 GB VRAM) and otherwise fairly average specs. After installing the basic software, it was time to set up deep learning.
Problem description:
After some struggling I remembered how to install the GPU driver. According to the version table, TensorFlow 2.0 pairs with CUDA 10.0 and cuDNN 7.4. After the install succeeded, this is what it showed (note that CUDA is 10.0 while my driver supports up to CUDA 11.4; in principle the toolkit version just has to be no higher than what the driver allows):
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:04_Central_Daylight_Time_2018
Cuda compilation tools, release 10.0, V10.0.130
nvidia-smi
Sat Apr 16 13:06:29 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 472.47 Driver Version: 472.47 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... WDDM | 00000000:01:00.0 Off | N/A |
| N/A 41C P8 6W / N/A | 107MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 12672 C+G ...mmandCenterBackground.exe N/A |
+-----------------------------------------------------------------------------+
After the installation, configuring tensorflow-gpu 2.0 in PyCharm produced an error saying the GPU driver was incompatible.
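A quick way to see whether this TF build can find the GPU at all is to list the physical devices from a Python console (a minimal sketch, assuming the tensorflow-gpu 2.0 setup described here):
import tensorflow as tf

# Prints an empty list when TF cannot use the GPU
# (wrong CUDA/cuDNN version or an incompatible driver).
print(tf.config.experimental.list_physical_devices('GPU'))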
Cause analysis:
The installed toolkit was CUDA 10, which is most likely too old: the RTX 30 series (Ampere, compute capability 8.6) is only supported from CUDA 11.0 onward, so CUDA 10 has no kernels for this card.
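One way to confirm the card's architecture from Python is TensorFlow's device_lib (a sketch; the same compute capability also shows up in TF's startup log when the GPU is detected):
from tensorflow.python.client import device_lib

# Each GPU entry's description ends with something like "compute capability: 8.6",
# the Ampere architecture that CUDA 10.x cannot generate kernels for.
for d in device_lib.list_local_devices():
    if d.device_type == 'GPU':
        print(d.physical_device_desc)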
Solution:
So I uninstalled CUDA again, installed CUDA 11.1, configured cuDNN 7.6.5, and the test passed. But when actually running a model it still could not use the GPU memory. Along the way I had glimpsed on Baidu that 30-series cards cannot be set up with CUDA versions below 11, but since I was set on running TF 2.0 at the time, I did not think much of it.
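For reference, the kind of quick check that can pass even though real training later fails is usually just running one small op on the GPU; a minimal sketch, assuming a TF 2.x build:
import tensorflow as tf

# Executing one op on the GPU exercises the CUDA/cuDNN kernels, but a convolution
# inside a full model can still fail even when this small matmul succeeds.
with tf.device('/GPU:0'):
    a = tf.random.normal([1024, 1024])
    b = tf.random.normal([1024, 1024])
    print(tf.matmul(a, b).shape)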
Problem description:
After more struggling I started downgrading TensorFlow, settling on tensorflow 1.14. It started out fine, and running the demos went okay. When reproducing a mask-detection demo, I found an open-source labeled dataset and open-source code and set about replicating it. After working through dataset and code compatibility issues, just as things were heading toward success, training told me it could not use the GPU:
Epoch 1/50
2022-04-16 13:17:51.918104: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] shape_optimizer failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
2022-04-16 13:17:51.959926: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] remapper failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
2022-04-16 13:17:52.102568: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] layout failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
2022-04-16 13:17:52.317837: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] shape_optimizer failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
2022-04-16 13:17:52.351374: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] remapper failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
2022-04-16 13:17:54.651630: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 831.81MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2022-04-16 13:17:54.651962: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 831.81MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2022-04-16 13:17:54.708460: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 760.50MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2022-04-16 13:17:54.708782: W tensorflow/core/common_runtime/bfc_allocator.cc:237] Allocator (GPU_0_bfc) ran out of memory trying to allocate 760.50MiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2022-04-16 13:17:55.002098: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
File "E:/depthLearning1.14/train.py", line 198, in <module>
_main()
File "E:/depthLearning1.14/train.py", line 73, in _main
callbacks=[logging, checkpoint])
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas SGEMM launch failed : m=692224, n=32, k=64
[[{{node conv2d_3/convolution}}]]
[[loss/add_74/_1451]]
(1) Internal: Blas SGEMM launch failed : m=692224, n=32, k=64
[[{{node conv2d_3/convolution}}]]
0 successful operations.
0 derived errors ignored.
Process finished with exit code 1
I figured I might as well try the CPU:
import os
import tensorflow as tf

# The machine only has GPU 0, so exposing only device '1' leaves TensorFlow with no
# visible GPU and it falls back to the CPU ('-1' would do the same explicitly).
os.environ["CUDA_VISIBLE_DEVICES"] = '1'

# GPU memory settings (only matter when a GPU is visible); add near the top of the script:
config = tf.ConfigProto()
config.gpu_options.allow_growth = True                    # allocate memory on demand
config.gpu_options.per_process_gpu_memory_fraction = 0.5  # cap at 50% of GPU memory
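For completeness, the ConfigProto only takes effect once it is attached to the session Keras uses; a minimal sketch, assuming TF 1.x with standalone Keras as in the traceback above:
import tensorflow as tf
import keras.backend as K

# Hand a configured session to Keras so the GPU options actually apply.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))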
And, what do you know, with the CPU it started training straight away:
Create YOLOv3 model with 9 anchors and 2 classes.
Load weights model_data/yolo_weights.h5.
Freeze the first 249 layers of total 252 layers.
WARNING:tensorflow:From E:\depthLearning1.14\venv\lib\site-packages\keras\backend\tensorflow_backend.py:3080: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From E:\depthLearning1.14\venv\lib\site-packages\keras\optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.
Train on 839 samples, val on 93 samples, with batch size 16.
WARNING:tensorflow:From E:\depthLearning1.14\venv\lib\site-packages\keras\callbacks.py:850: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.
WARNING:tensorflow:From E:\depthLearning1.14\venv\lib\site-packages\keras\callbacks.py:853: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.
Epoch 1/50
2022-04-16 13:16:50.892058: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] shape_optimizer failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
2022-04-16 13:16:50.931636: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] remapper failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
2022-04-16 13:16:51.290190: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] shape_optimizer failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
2022-04-16 13:16:51.325787: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] remapper failed: Invalid argument: Subshape must have computed start >= end since stride is negative, but is 0 and 2 (computed from start 0 and end 9223372036854775807 over shape with rank 2 and stride-1)
1/52 [..............................] - ETA: 5:17 - loss: 8804.4365
2/52 [>.............................] - ETA: 4:32 - loss: 8518.9624
Solution:
Either go straight to the latest TensorFlow release, or set up a dual boot or a virtual machine and move to Ubuntu.
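If going the latest-TensorFlow route (a sketch, assuming a recent TF 2.x release built against CUDA 11.x), it also helps on a 4 GB card to enable memory growth so TF does not reserve all VRAM up front:
import tensorflow as tf

# List detected GPUs and let TF allocate VRAM on demand instead of grabbing it all.
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
print(gpus)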
A comment from an experienced user:
As things stand, the Windows platform is not supported; these cards only work with CUDA 11.0 and above. Installing a lower TF version will succeed, but training a network throws errors. I tried many times and it does not work. There are tutorials online for installing tf1.15 on 30-series cards under Linux, which I have not tried. Since all my code is written for 1.x, the only option is an older GPU.
I only saw this comment after trying everything myself. I was sick about it; all I can do now is go try a virtual machine.