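When training YOLOv5, the losses are reported as nan and the validation metrics as 0.
Training on the CPU works fine; the nan and 0 values only appear on the GPU. This is usually caused by the CUDA version or by the graphics card itself.
GTX 16XX series cards show this problem when used with newer CUDA versions.
For example, my own setup: 飞行堡垒7 锐龙版 laptop (Ryzen edition), GPU: GTX 1650, PyTorch 1.11.0. The problem appears with CUDA 11.3, and I also tried CUDA 11.5 with the same result. The training log looks like this: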
AutoAnchor: 6.13 anchors/target, 1.000 Best Possible Recall (BPR). Current anchors are a good fit to dataset
Image sizes 640 train, 640 val
Using 0 dataloader workers
Logging results to runs\train\exp7
Starting training for 100 epochs...
Epoch gpu_mem box obj cls labels img_size
0/99 1.88G nan nan nan 10 640: 100%|██████████| 14/14 [00:35<00:00, 2.52s/it]
D:837\anaconda3\envs\pytorch\lib\site-packages\torch\optim\lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████| 7/7 [00:07<00:00, 1.09s/it]
all 106 0 0 0 0 0
Epoch gpu_mem box obj cls labels img_size
1/99 1.96G nan nan nan 104 640: 7%|▋ | 1/14 [00:02<00:38, 2.92s/it]
Process finished with exit code -1
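To catch this early instead of reading the console by eye, the progress lines can be checked for nan losses. The following is a minimal sketch, assuming the column order shown in the log above (epoch, gpu_mem, box, obj, cls, labels, img_size). The function names `parse_losses` and `has_nan` are hypothetical helpers and are not part of YOLOv5.

```python
import math


def parse_losses(line: str):
    """Return (box, obj, cls) from one YOLOv5 progress line.

    Assumes whitespace-separated columns in the order printed above:
    epoch, gpu_mem, box, obj, cls, labels, img_size, ...
    """
    fields = line.split()
    # fields[0] is "0/99", fields[1] is "1.88G"; the three losses follow.
    return tuple(float(value) for value in fields[2:5])


def has_nan(losses) -> bool:
    """True if any of the loss values is nan."""
    return any(math.isnan(value) for value in losses)


if __name__ == "__main__":
    bad = "0/99 1.88G nan nan nan 10 640: 100%"
    good = "0/99 1.85G 0.1244 0.0515 0.06827 10 640: 100%"
    for line in (bad, good):
        losses = parse_losses(line)
        print(losses, "-> nan detected" if has_nan(losses) else "-> ok")
```

If the first epoch already reports nan for all three losses, as in the log above, there is no point letting the remaining epochs run.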
The fix is to switch CUDA to version 10.2. Download it directly from the archive:
CUDA Toolkit Archive | NVIDIA Developer: https://developer.nvidia.com/cuda-toolkit-archive
cuDNN download (choose the version that matches your CUDA version):
cuDNN Archive | NVIDIA Developer: https://developer.nvidia.com/rdp/cudnn-archive#a-collapse51b
After installing CUDA, copy the contents of the cuDNN package into C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2. Adjust this path to match your own installation.
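The copy step can also be scripted. This is only a sketch under assumptions: it assumes the extracted cuDNN archive contains `bin`, `include` and `lib` subfolders (check your own download), and the function name `copy_cudnn` and the example source path are hypothetical. It needs Python 3.8 or later for `dirs_exist_ok`, and writing under C:\Program Files normally requires an administrator prompt.

```python
import shutil
from pathlib import Path


def copy_cudnn(cudnn_root: Path, cuda_root: Path) -> list:
    """Merge the cuDNN subfolders into the CUDA install directory.

    Only subfolders that actually exist in cudnn_root are copied.
    Existing files with the same name in cuda_root are overwritten.
    Returns the names of the subfolders that were copied.
    """
    copied = []
    for sub in ("bin", "include", "lib"):  # assumed archive layout
        src = cudnn_root / sub
        if src.is_dir():
            shutil.copytree(src, cuda_root / sub, dirs_exist_ok=True)
            copied.append(sub)
    return copied


# Example call (the source path is a placeholder; adjust both paths):
# copy_cudnn(
#     Path(r"C:\Downloads\cudnn\cuda"),
#     Path(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2"),
# )
```

Copying by hand in the file explorer does the same thing; the script is only useful if you switch CUDA versions repeatedly.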
Then install the cu102 build of PyTorch:
pip install torch==1.10.1+cu102 torchvision==0.11.2+cu102 torchaudio==0.10.1 -f https://download.pytorch.org/whl/torch_stable.html
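Before retraining, it may be worth confirming that the environment really picked up the cu102 build. The sketch below prints the PyTorch version, the CUDA version it was built for, and the GPU name. The helper names (`is_gtx_16xx`, `split_build_tag`, `describe_environment`) are hypothetical; the only PyTorch calls used are `torch.__version__`, `torch.version.cuda`, `torch.cuda.is_available()` and `torch.cuda.get_device_name()`.

```python
import re

# Hypothetical helpers for illustration; not part of YOLOv5 or PyTorch.
GTX_16XX = re.compile(r"GTX\s*16\d{2}", re.IGNORECASE)


def is_gtx_16xx(device_name: str) -> bool:
    """True if the device name looks like a GTX 16xx card, e.g. 'GTX 1650'."""
    return bool(GTX_16XX.search(device_name))


def split_build_tag(torch_version: str):
    """Split '1.10.1+cu102' into ('1.10.1', 'cu102'); tag is '' if absent."""
    base, _, tag = torch_version.partition("+")
    return base, tag


def describe_environment() -> str:
    """Summarise the installed PyTorch build and the first visible GPU."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    base, tag = split_build_tag(torch.__version__)
    label = tag or "no build tag"
    if not torch.cuda.is_available():
        return f"torch {base} ({label}): CUDA not available"
    name = torch.cuda.get_device_name(0)
    return (f"torch {base} ({label}), built for CUDA {torch.version.cuda}, "
            f"GPU: {name}, GTX 16xx: {is_gtx_16xx(name)}")


if __name__ == "__main__":
    print(describe_environment())
```

After the pip command above, the version should carry the `cu102` tag and the CUDA version should read 10.2; if it still shows an 11.x build, the old package is probably still installed in the active environment.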
Now go back and run the training program again:
AutoAnchor: 6.13 anchors/target, 1.000 Best Possible Recall (BPR). Current anchors are a good fit to dataset
Image sizes 640 train, 640 val
Using 0 dataloader workers
Logging results to runs\train\exp10
Starting training for 100 epochs...
Epoch gpu_mem box obj cls labels img_size
0/99 1.85G 0.1244 0.0515 0.06827 10 640: 100%|██████████| 14/14 [01:59<00:00, 8.53s/it]
Class Images Labels P R mAP@.5 mAP@.5:.95: 100%|██████████| 7/7 [00:11<00:00, 1.62s/it]
all 106 433 0.00107 0.00842 0.000487 0.000122
Epoch gpu_mem box obj cls labels img_size
1/99 1.96G 0.1171 0.06178 0.06603 63 640: 50%|█████ | 7/14 [00:30<00:30, 4.37s/it]
With that, the problem is solved: the losses are now finite and validation finds the labels.