DLPrimitives Blog
Development Blog
Posts in category ‘Benchmarks’.
Pytorch Training Benchmarks
I managed to train several networks in pytorch with the opencl/dlprimitives backend, and the results are promising.
Below are the results for several nVidia GPUs, compared to the tf2/keras baseline. Unlike in previous benchmarks, I fixed the missing time of the Adam optimizer, which turned out to be significant (it isn't very efficient in pytorch).
I also added times for the AMD RX 6600 XT; unfortunately there is no baseline I can use, since AMD hasn't released ROCm for RDNA yet.
Absolute Performance
Batch size: 16 images 224x224, time in ms. Lower is better.
Framework | Vendor | GPU | alexnet | resnet18 | resnet50 | vgg16 | mobilenet |
---|---|---|---|---|---|---|---|
pytorch/opencl | AMD | rx 6600xt | 56.846 | 109.850 | 258.973 | 365.305 | 163.732 |
dlprimitives | AMD | rx 6600xt | 36.954 | 65.241 | 194.398 | 308.763 | 99.862 |
pytorch/cuda | Nvidia | rtx 2060s | 27.592 | 38.624 | 114.074 | 179.580 | 49.624 |
pytorch/opencl | Nvidia | rtx 2060s | 50.108 | 82.021 | 223.651 | 462.964 | 129.145 |
dlprimitives | Nvidia | rtx 2060s | 39.829 | 67.960 | 187.398 | 439.053 | 90.229 |
tf2/cuda | Nvidia | rtx 2060s | 29.165 | 55.523 | 147.999 | 156.714 | 102.596 |
pytorch/cuda | Nvidia | gtx 1080 | 38.310 | 44.382 | 137.754 | 232.824 | 63.324 |
pytorch/opencl | Nvidia | gtx 1080 | 54.828 | 85.016 | 301.898 | 411.928 | 173.885 |
dlprimitives | Nvidia | gtx 1080 | 38.804 | 71.147 | 264.286 | 374.168 | 134.650 |
tf2/cuda | Nvidia | gtx 1080 | 35.592 | 69.071 | 189.994 | 197.333 | 128.526 |
Relative Performance
Comparison of the tf2/cuda baseline against pytorch/opencl and dlprimitives alone:
Baseline | tested | GPU | alexnet | resnet18 | resnet50 | vgg16 | mobilenet |
---|---|---|---|---|---|---|---|
tf2/cuda | dlprimitives | gtx 1080 | 92% | 97% | 72% | 53% | 95% |
tf2/cuda | pt/opencl | gtx 1080 | 65% | 81% | 63% | 48% | 74% |
tf2/cuda | dlprimitives | rtx 2060s | 73% | 82% | 79% | 36% | 114% |
tf2/cuda | pt/opencl | rtx 2060s | 58% | 68% | 66% | 34% | 79% |
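The relative numbers above are just the ratio of baseline to tested times from the absolute-performance table. A minimal sketch of the calculation (values copied from the GTX 1080 rows):

```python
# Relative performance = baseline time / tested time, as a percentage.
# Times are ms per batch, taken from the absolute-performance table above.

def relative_perf(baseline_ms: float, tested_ms: float) -> int:
    """Percentage of baseline performance achieved by the tested framework."""
    return round(100 * baseline_ms / tested_ms)

# tf2/cuda (baseline) vs dlprimitives (tested) on the GTX 1080:
tf2_1080 = {"alexnet": 35.592, "resnet18": 69.071, "resnet50": 189.994,
            "vgg16": 197.333, "mobilenet": 128.526}
dlprim_1080 = {"alexnet": 38.804, "resnet18": 71.147, "resnet50": 264.286,
               "vgg16": 374.168, "mobilenet": 134.650}

for net in tf2_1080:
    print(net, relative_perf(tf2_1080[net], dlprim_1080[net]))
# Reproduces the tf2/cuda vs dlprimitives GTX 1080 row: 92, 97, 72, 53, 95
```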
Summary
Aside from VGG, most of the results are very reassuring.
Notes
Why do I compare to TF2/cuda as the baseline? Pytorch is the faster framework. However, since TF is good enough for most users, I want to show that I get performance that is close enough.
Comparing Green and Red Apples
TL;DR
- OpenCL-based DLPrimitives is almost as fast as cuDNN-based TF in inference and close enough in training.
- Framework matters: TF is much slower than pytorch.
- The AMD RX 6600 XT is faster than the NVidia GTX 1080 and RTX 2060S, by a margin similar to the difference in GFlops between these cards.

While dlprimitives isn't as fast as the best cuDNN-based solution (pytorch), its performance makes it more than useful for platform-independent deep learning.
How to Compare Different GPUs
Comparing deep learning software performance on NVidia and AMD GPUs isn't as simple as you may think. There are too many factors:
- No two GPUs have identical specs. The major parameters are GFlops and memory bandwidth, as most DL algorithms are either compute-limited (like dense and convolution layers) or bandwidth-limited (like batch normalization or activations).
- Both companies provide libraries optimized for their GPUs: MIOpen and cuDNN. While both are highly optimized and provide similar functionality, they don't deliver similar performance.
The situation becomes even more complex when it comes to the RDNA architecture: AMD hasn't released support for these GPUs in their DL stack for more than two years. Even so, I decided to check it using dlprimitives.
Base Line
Note that we compare 3 different GPUs that have similar performance, within reasonable margins: the AMD RX 6600 XT, NVidia GTX 1080, and NVidia RTX 2060 Super.
The basic flops performance was measured using a custom kernel.
gpu | GFlops | GB/s |
---|---|---|
6600xt | 9,937 | 216 |
1080 | 8,970 | 242 |
2060s | 8,263 | 396 |
The flops performance of modern GPUs can be calculated as clock * cores * 2. However, the actual clock depends on the specific model and thermal conditions, so the manual measurements above serve as the baseline, while the theoretical expected flops below are calculated using the median clock observed during the benchmarks.
gpu | Cores | Clock Mhz | Exp GFlops | Exp GB/s |
---|---|---|---|---|
6600xt | 2048 | 2655 | 10,875 | 256 |
1080 | 2560 | 1809 | 9,262 | 320 |
2060s | 2176 | 1905 | 8,290 | 448 |
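The Exp GFlops column follows directly from the cores * clock * 2 formula (one fused multiply-add per core per clock, counted as two flops). A quick sketch reproducing it, up to rounding:

```python
# Expected GFlops = cores * clock(MHz) * 2 / 1000
# (one fused multiply-add per core per clock, counted as 2 flops)

def expected_gflops(cores: int, clock_mhz: int) -> float:
    return cores * clock_mhz * 2 / 1000

gpus = {
    "6600xt": (2048, 2655),
    "1080":   (2560, 1809),
    "2060s":  (2176, 1905),
}

for name, (cores, clock_mhz) in gpus.items():
    print(f"{name}: {expected_gflops(cores, clock_mhz):,.0f} GFlops")
```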
So GPU performance varies: the 2060s has 17-24% fewer GFlops than the 6600xt but much higher memory throughput, which helps in bandwidth-limited algorithms like batch normalization or the depthwise separable convolutions of mobilenet. The 1080 has 10-15% lower GFlops but 12% more bandwidth.
Testing Methodology
Three frameworks were tested using a 64-image batch:
- pytorch 1.8 using cuda+cudnn
- keras/tensorflow 2.5 using cuda+cudnn
- dlprimitives, the OpenCL-based solution

Since there is no ROCM version of TF or pytorch that supports AMD's RDNA GPUs, only dlprimitives was tested on the 6600 XT, expecting results similar to other GPUs in the same class.
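The exact benchmark harness isn't shown here, but per-batch times of this kind are typically measured with a few warmup iterations followed by a timed, averaged loop, with an explicit synchronization point, since GPU APIs queue work asynchronously. A minimal sketch, assuming a `step` callable that runs one batch (the names here are illustrative, not the actual benchmark code):

```python
import time

def benchmark(step, warmup=5, iters=20, synchronize=lambda: None):
    """Return the average time per call of `step` in milliseconds.

    `step` runs one training/inference batch; `synchronize` should block
    until all queued GPU work has finished, since GPU APIs return before
    the kernels actually complete.
    """
    for _ in range(warmup):          # let clocks ramp up and caches warm
        step()
    synchronize()                    # drain queued work before timing
    start = time.perf_counter()
    for _ in range(iters):
        step()
    synchronize()                    # include all queued work in the total
    return (time.perf_counter() - start) * 1000 / iters
```

With pytorch/cuda, `synchronize` would be `torch.cuda.synchronize`; with an OpenCL backend, finishing the command queue plays the same role.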
Training Times
Measured in ms per batch, lower is better.
Framework | gpu | alexnet | resnet18 | resnet50 | vgg16 | mobilenet |
---|---|---|---|---|---|---|
dlprim | 6600xt | 83.73 | 231.2 | 716.2 | 1157.2 | 414.35 |
dlprim | 1080 | 93.03 | 262.1 | 926.6^ | 1348.9 | 614.02 |
dlprim | 2060s | 116.41 | 252.3 | 705.2^ | 1681.3 | 355.21 |
keras/tf2 | 1080 | 70.56 | 200.6 | 684.4^ | 633.1 | 437.84 |
keras/tf2 | 2060s | 70.00 | 172.2 | 520.0^ | 553.1 | 344.55 |
pytorch | 1080 | 62.37 | 151.4 | 518.0 | 780.9 | 229.20 |
pytorch | 2060s | 41.11 | 121.2 | 377.8 | 621.1^ | 143.23 |
^) Using half batch x32 twice, due to GPU memory limits
Observations:
- DLPrimitives reaches 67% of TensorFlow's performance on NVidia GPUs. The biggest gap is in VGG; excluding VGG, it reaches 75% of TF performance.
- TF reaches 77% of pytorch performance. VGG is the exception, where TF is actually faster than pytorch; excluding VGG, TF's relative performance drops to 67%.
- DLPrimitives runs 24% faster on the AMD RX 6600 XT than on the GTX 1080, while the raw GFlops differ by only 10-17% depending on the measurement strategy.
- DLPrimitives runs 15% faster on the AMD RX 6600 XT than on the RTX 2060S. Notably, the major drop happens on mobilenet, which depends heavily on memory bandwidth due to its depthwise separable convolutions.
Inference Times
Measured in ms per batch, lower is better.
Framework | gpu | alexnet | resnet18 | resnet50 | vgg16 | mobilenet |
---|---|---|---|---|---|---|
dlprim | 6600xt | 34.28 | 63.57 | 185.72 | 277.97 | 102.84 |
dlprim | 1080 | 28.03 | 63.57 | 274.27 | 309.28 | 131.74 |
dlprim | 2060s | 47.52 | 81.09 | 210.97 | 428.34 | 97.80 |
keras/tf2 | 1080 | 40.55 | 80.64 | 199.38 | 189.07 | 109.85 |
keras/tf2 | 2060s | 47.95 | 75.73 | 165.31 | 174.27 | 93.01 |
pytorch | 1080 | 16.36 | 43.17 | 144.88 | 226.40 | 60.13 |
pytorch | 2060s | 9.65 | 33.27 | 107.56 | 172.47 | 35.55 |
Observations:
- DLPrimitives reaches 90% of TensorFlow's performance on NVidia GPUs. The biggest gap is in VGG; excluding VGG, it reaches 99% of TF performance.
- TF reaches 61% of pytorch performance. VGG is again the case where TF does comparatively well; excluding VGG, TF's relative performance drops to 49%.
- DLPrimitives runs 14% faster on the AMD RX 6600 XT than on the GTX 1080, and 26% faster than on the RTX 2060S; this differs somewhat from the training results.
Summary and Conclusions
- There is a huge difference between DL frameworks: pytorch is faster than TensorFlow by a large margin.
- DLPrimitives provides decent performance comparable to TF, losing ~25% of performance in training and ~10% in inference.
- The 6600 XT gives decent dlprimitives performance, comparable to the nVidia 1080/2060s, with an improvement gap comparable to the difference in GFlops between the cards.