While reading, I discovered that PyTorch actually ships with a built-in bottleneck-checking tool: torch.utils.bottleneck
Usage:
python -m torch.utils.bottleneck /path/to/script_under_test.py
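A concrete invocation might look like the line below (train.py and --epochs 1 are placeholders for your own script and its arguments; anything after the script path is forwarded to the script). Since bottleneck re-runs the whole script under several profilers, it helps to keep the run short, e.g. one epoch or a few batches:

python -m torch.utils.bottleneck train.py --epochs 1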
Sample output:
--------------------------------------------------------------------------------
  Environment Summary
--------------------------------------------------------------------------------
PyTorch 1.7.1+cu101 DEBUG compiled w/ CUDA 10.1
Running with Python 3.8 and CUDA 10.1.243

`pip3 list` truncated output:
numpy==1.19.2
torch==1.7.1+cu101
torch-cluster==1.5.9
torch-geometric==1.6.3
torch-scatter==2.0.6
torch-sparse==0.6.9
torch-spline-conv==1.2.1
torchaudio==0.7.2
torchsummary==1.5.1
torchvision==0.8.2+cu101
--------------------------------------------------------------------------------
  cProfile output
--------------------------------------------------------------------------------
705723930 function calls (591115631 primitive calls) in 252.079 seconds

Ordered by: internal time
List reduced from 960 to 15 due to restriction <15>

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
114583910/48  99.432  0.000  197.735  4.119  /usr/lib/python3.8/copy.py:128(deepcopy)
9648/48  40.553  0.004  197.735  4.119  /usr/lib/python3.8/copy.py:200(_deepcopy_list)
229172740  25.338  0.000  25.338  0.000  {method 'get' of 'dict' objects}
24  16.446  0.685  16.446  0.685  {method 'run_backward' of 'torch._C._Enginebase' objects}
114605951  11.720  0.000  11.720  0.000  {built-in method builtins.id}
114574262  11.361  0.000  11.361  0.000  /usr/lib/python3.8/copy.py:182(_deepcopy_atomic)
117258341  9.575  0.000  9.575  0.000  {method 'append' of 'list' objects}
576  5.353  0.009  5.353  0.009  {built-in method equal}
1658851  4.531  0.000  4.531  0.000  {method 'uniform' of 'numpy.random.mtrand.RandomState' objects}
100  4.132  0.041  4.132  0.041  {built-in method __new__ of type object at 0x908780}
1005442  3.727  0.000  3.727  0.000  {method 'randint' of 'numpy.random.mtrand.RandomState' objects}
2080353  2.508  0.000  2.508  0.000  {built-in method builtins.min}
1918  1.864  0.001  2.046  0.001  /home3/jmh/scnet/TransformerFlow/data_builder.py:292(augment_expoential)
1  1.658  1.658  2.691  2.691  {built-in method _pickle.load}
9600  1.564  0.000  1.940  0.000  /home3/jmh/scnet/TransformerFlow/data_builder.py:132(refine_packet_length)
--------------------------------------------------------------------------------
  autograd profiler output (CPU mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total

Name  Self CPU %  Self CPU  CPU total %  CPU total  CPU time avg  Self CUDA  Self CUDA %  CUDA total  CUDA time avg  # of Calls
EmbeddingBackward  6.68%  664.353ms  6.68%  664.353ms  664.353ms  0.000us  NaN  0.000us  0.000us  1
aten::embedding_backward  6.68%  664.345ms  6.68%  664.345ms  664.345ms  0.000us  NaN  0.000us  0.000us  1
aten::embedding_dense_backward  6.68%  664.340ms  6.68%  664.340ms  664.340ms  0.000us  NaN  0.000us  0.000us  1
EmbeddingBackward  6.67%  663.371ms  6.67%  663.371ms  663.371ms  0.000us  NaN  0.000us  0.000us  1
aten::embedding_backward  6.67%  663.367ms  6.67%  663.367ms  663.367ms  0.000us  NaN  0.000us  0.000us  1
aten::embedding_dense_backward  6.67%  663.363ms  6.67%  663.363ms  663.363ms  0.000us  NaN  0.000us  0.000us  1
EmbeddingBackward  6.66%  663.205ms  6.66%  663.205ms  663.205ms  0.000us  NaN  0.000us  0.000us  1
aten::embedding_backward  6.66%  663.200ms  6.66%  663.200ms  663.200ms  0.000us  NaN  0.000us  0.000us  1
EmbeddingBackward  6.66%  663.198ms  6.66%  663.198ms  663.198ms  0.000us  NaN  0.000us  0.000us  1
aten::embedding_dense_backward  6.66%  663.196ms  6.66%  663.196ms  663.196ms  0.000us  NaN  0.000us  0.000us  1
aten::embedding_backward  6.66%  663.192ms  6.66%  663.192ms  663.192ms  0.000us  NaN  0.000us  0.000us  1
aten::embedding_dense_backward  6.66%  663.189ms  6.66%  663.189ms  663.189ms  0.000us  NaN  0.000us  0.000us  1
EmbeddingBackward  6.66%  663.103ms  6.66%  663.103ms  663.103ms  0.000us  NaN  0.000us  0.000us  1
aten::embedding_backward  6.66%  663.098ms  6.66%  663.098ms  663.098ms  0.000us  NaN  0.000us  0.000us  1
aten::embedding_dense_backward  6.66%  663.095ms  6.66%  663.095ms  663.095ms  0.000us  NaN  0.000us  0.000us  1

Self CPU time total: 9.952s
CUDA time total: 0.000us
--------------------------------------------------------------------------------
  autograd profiler output (CUDA mode)
--------------------------------------------------------------------------------
top 15 events sorted by cpu_time_total

Because the autograd profiler uses the CUDA event API, the CUDA time column reports approximately max(cuda_time, cpu_time). Please ignore this output if your code does not use CUDA.

Name  Self CPU %  Self CPU  CPU total %  CPU total  CPU time avg  Self CUDA  Self CUDA %  CUDA total  CUDA time avg  # of Calls
EmbeddingBackward  6.71%  161.735ms  6.71%  161.735ms  161.735ms  511.000us  6.63%  511.000us  511.000us  1
aten::embedding_backward  6.71%  161.720ms  6.71%  161.720ms  161.720ms  508.500us  6.60%  508.500us  508.500us  1
aten::embedding_dense_backward  6.71%  161.708ms  6.71%  161.708ms  161.708ms  503.500us  6.53%  503.500us  503.500us  1
EmbeddingBackward  6.67%  160.718ms  6.67%  160.718ms  160.718ms  550.000us  7.14%  550.000us  550.000us  1
EmbeddingBackward  6.67%  160.708ms  6.67%  160.708ms  160.708ms  496.000us  6.44%  496.000us  496.000us  1
EmbeddingBackward  6.67%  160.677ms  6.67%  160.677ms  160.677ms  496.000us  6.44%  496.000us  496.000us  1
EmbeddingBackward  6.66%  160.661ms  6.66%  160.661ms  160.661ms  528.000us  6.85%  528.000us  528.000us  1
EmbeddingBackward  6.66%  160.504ms  6.66%  160.504ms  160.504ms  494.000us  6.41%  494.000us  494.000us  1
EmbeddingBackward  6.66%  160.453ms  6.66%  160.453ms  160.453ms  552.000us  7.16%  552.000us  552.000us  1
EmbeddingBackward  6.66%  160.446ms  6.66%  160.446ms  160.446ms  496.000us  6.44%  496.000us  496.000us  1
EmbeddingBackward  6.65%  160.314ms  6.65%  160.314ms  160.314ms  488.000us  6.33%  488.000us  488.000us  1
EmbeddingBackward  6.65%  160.290ms  6.65%  160.290ms  160.290ms  512.000us  6.65%  512.000us  512.000us  1
EmbeddingBackward  6.65%  160.238ms  6.65%  160.238ms  160.238ms  576.000us  7.48%  576.000us  576.000us  1
EmbeddingBackward  6.65%  160.237ms  6.65%  160.237ms  160.237ms  480.000us  6.23%  480.000us  480.000us  1
EmbeddingBackward  6.64%  160.183ms  6.64%  160.183ms  160.183ms  514.000us  6.67%  514.000us  514.000us  1

Self CPU time total: 2.411s
CUDA time total: 7.705ms
The tool analyzes both CPU execution time and GPU execution time: the report combines a cProfile summary with autograd profiler output in CPU and CUDA modes.
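The two autograd profiler sections in the report come from the same profiler you can also drive by hand. Below is a minimal sketch of producing a comparable table for a single code region, assuming a PyTorch 1.7-era API; the tensor size and the matmul are purely illustrative:

import torch

x = torch.randn(1000, 1000)

# use_cuda=True would additionally record CUDA kernel times via CUDA events
with torch.autograd.profiler.profile(use_cuda=False) as prof:
    y = x.mm(x)

# same kind of table as in the report: rows sorted by total CPU time
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=15))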
From this bottleneck report it is easy to see that deepcopy is currently one of my program's bottlenecks.
So the next step is to optimize those deep copies.
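The data-builder code is not shown here, so the snippet below is only a hypothetical sketch of the kind of change involved: the cProfile rows point at copy.deepcopy recursing over large nested Python lists, and when the leaves are plain numbers that recursion can be replaced by a one-level copy, or avoided entirely by keeping the batch as a numpy array and using its copy():

import copy
import numpy as np

def clone_batch_slow(batch):
    # pattern the report points at: a full recursive copy of a nested list
    return copy.deepcopy(batch)

def clone_batch_fast(batch):
    # assumes batch is a list of lists of plain numbers: fresh inner lists
    # are enough, no per-element recursion is needed
    return [list(row) for row in batch]

def clone_batch_numpy(batch_array):
    # assumes the batch already lives in a numpy array: .copy() is one
    # contiguous memcpy instead of millions of Python-level calls
    return batch_array.copy()

batch = [[float(i + j) for j in range(4)] for i in range(3)]
assert clone_batch_fast(batch) == clone_batch_slow(batch)
assert clone_batch_numpy(np.asarray(batch)).tolist() == batch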