DLPrimitives Blog
Pytorch Training Benchmarks
I managed to train several networks in pytorch with the opencl/dlprimitives backend, and the results are promising.
Below are the results for several nVidia GPUs, compared against a tf2/keras baseline. Unlike in previous benchmarks, I included the previously missing time of the Adam optimiser, which turned out to be significant (it isn't very efficient in pytorch).
I also added times for the AMD rx 6600xt; unfortunately there is no baseline I can use, since AMD hasn't released ROCm for RDNA yet.
Absolute Performance
Batch size: 16 images 224x224, time in ms. Lower is better.
Framework | Vendor | GPU | alexnet | resnet18 | resnet50 | vgg16 | mobilenet |
---|---|---|---|---|---|---|---|
pytorch/opencl | AMD | rx 6600xt | 56.846 | 109.850 | 258.973 | 365.305 | 163.732 |
dlprimitives | AMD | rx 6600xt | 36.954 | 65.241 | 194.398 | 308.763 | 99.862 |
pytorch/cuda | Nvidia | rtx 2060s | 27.592 | 38.624 | 114.074 | 179.580 | 49.624 |
pytorch/opencl | Nvidia | rtx 2060s | 50.108 | 82.021 | 223.651 | 462.964 | 129.145 |
dlprimitives | Nvidia | rtx 2060s | 39.829 | 67.960 | 187.398 | 439.053 | 90.229 |
tf2/cuda | Nvidia | rtx 2060s | 29.165 | 55.523 | 147.999 | 156.714 | 102.596 |
pytorch/cuda | Nvidia | gtx 1080 | 38.310 | 44.382 | 137.754 | 232.824 | 63.324 |
pytorch/opencl | Nvidia | gtx 1080 | 54.828 | 85.016 | 301.898 | 411.928 | 173.885 |
dlprimitives | Nvidia | gtx 1080 | 38.804 | 71.147 | 264.286 | 374.168 | 134.650 |
tf2/cuda | Nvidia | gtx 1080 | 35.592 | 69.071 | 189.994 | 197.333 | 128.526 |
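For reference, per-iteration numbers like the ones above are normally collected with a warm-up phase followed by averaging over many iterations, so that one-time costs such as kernel compilation don't skew the result. A minimal, framework-agnostic sketch of such a harness (the names here are illustrative, not the actual benchmark code):

```python
import time

def benchmark_ms(step, warmup=5, iters=50):
    """Return the average wall-clock time of step() in milliseconds.

    Runs `warmup` untimed iterations first so one-time costs
    (kernel compilation, allocator warm-up) are excluded.
    """
    for _ in range(warmup):
        step()
    start = time.perf_counter()
    for _ in range(iters):
        step()
    return (time.perf_counter() - start) * 1000.0 / iters

# Dummy "training step" standing in for forward/backward/optimiser.
avg_ms = benchmark_ms(lambda: sum(i * i for i in range(10000)))
```

With a GPU framework you would also have to synchronize the device before reading the clock, since kernel launches are asynchronous.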
Relative Performance
Comparison of TF/Cuda with pytorch + opencl/dlprimitives and dlprimitives alone:
Baseline | tested | GPU | alexnet | resnet18 | resnet50 | vgg16 | mobilenet |
---|---|---|---|---|---|---|---|
tf2/cuda | dlprimitives | gtx 1080 | 92% | 97% | 72% | 53% | 95% |
tf2/cuda | pt/opencl | gtx 1080 | 65% | 81% | 63% | 48% | 74% |
tf2/cuda | dlprimitives | rtx 2060s | 73% | 82% | 79% | 36% | 114% |
tf2/cuda | pt/opencl | rtx 2060s | 58% | 68% | 66% | 34% | 79% |
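The relative numbers above are simply the baseline time divided by the tested time (higher is better). A quick sketch of that computation, using the gtx 1080 rows of the absolute-performance table:

```python
# Times in ms, taken from the absolute-performance table (gtx 1080 rows).
nets         = ["alexnet", "resnet18", "resnet50", "vgg16", "mobilenet"]
tf2_cuda     = [35.592, 69.071, 189.994, 197.333, 128.526]
dlprimitives = [38.804, 71.147, 264.286, 374.168, 134.650]

# Relative performance: baseline time / tested time, as a percentage.
rel = {n: round(100 * b / t) for n, b, t in zip(nets, tf2_cuda, dlprimitives)}
print(rel)  # {'alexnet': 92, 'resnet18': 97, 'resnet50': 72, 'vgg16': 53, 'mobilenet': 95}
```

These match the dlprimitives/gtx 1080 row of the relative table.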
Summary
Aside from VGG, most of the results are quite reassuring.
Notes
Why do I compare against TF2/cuda as the baseline? Pytorch is the faster framework, but since TF is good enough for most users, I want to show that I get performance that is close enough to it.