DLPrimitives Blog
Development Blog
Comparing Green and Red Apples
TL;DR
- OpenCL-based DLPrimitives is almost as fast as cuDNN-based TF in inference and close enough in training.
- The framework matters: TF is much slower than pytorch.
- AMD 6600 XT is faster than NVidia 1080 and 2060S by a margin that is similar to the difference in GFlops of these cards.
Although dlprimitives isn't as fast as the best cuDNN-based solution (pytorch), its performance makes it more than useful for platform-independent deep learning.
How to Compare Different GPUs
Comparing deep learning software performance on NVidia and AMD GPUs isn't as simple as you might think.
There are too many factors:
- No two GPUs have identical specs. The major parameters are GFlops and memory bandwidth, as most DL algorithms are either compute limited (like dense and conv layers) or bandwidth limited (like batch normalization or activation).
- Both companies provide libraries optimized for their GPUs: MIOpen and cuDNN. While both are highly optimized and provide similar functionality, they don't have similar performance.
Now the situation becomes even more complex when it comes to the RDNA architecture. AMD hasn't released support of their DL stack for these GPUs for more than two years.
Even so, I decided to check it using dlprimitives.
Base Line
Note that we compare 3 different GPUs that have similar performance within reasonable margins:
AMD RX 6600 XT, NVidia GTX 1080, NVidia RTX 2060 Super.
The basic flops performance was measured using a custom kernel.
gpu | GFlops | GB/s |
---|---|---|
6600xt | 9,937 | 216 |
1080 | 8,970 | 242 |
2060s | 8,263 | 396 |
Flops performance of modern GPUs can be calculated as clock * cores * 2. However, the clock depends on the specific model and its thermal performance, so two baselines are used: the manual measurements above, and the theoretical expected flops calculated from the median clock observed during benchmarks.
gpu | Cores | Clock MHz | Exp GFlops | Exp GB/s |
---|---|---|---|---|
6600xt | 2048 | 2655 | 10,875 | 256 |
1080 | 2560 | 1809 | 9,262 | 320 |
2060s | 2176 | 1905 | 8,290 | 448 |
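The expected GFlops column follows from the clock * cores * 2 formula. A minimal sketch of that arithmetic; the `expected_gflops` helper and the dictionary layout are illustrative names, not part of dlprimitives:

```python
# Cores and median clock (MHz) copied from the table above.
GPUS = {
    "6600xt": {"cores": 2048, "clock_mhz": 2655},
    "1080": {"cores": 2560, "clock_mhz": 1809},
    "2060s": {"cores": 2176, "clock_mhz": 1905},
}


def expected_gflops(cores, clock_mhz):
    """Theoretical peak, counting 2 flops (one multiply-add) per core per cycle."""
    return cores * clock_mhz * 2 / 1000.0  # MFlops -> GFlops


for name, spec in GPUS.items():
    print(f"{name}: {expected_gflops(**spec):,.2f} GFlops")
```

The results agree with the table once rounded; the 2060s comes out at 8,290.56, so the table value of 8,290 appears to be truncated rather than rounded.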
So GPU performance varies. Although the 2060s has 17-24% fewer GFlops than the 6600xt, it has much higher memory throughput, which helps in bandwidth-limited algorithms like batch normalization or the depthwise separable convolutions of mobilenet. The 1080 has 10-15% lower GFlops but 12% more bandwidth.
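The percentage ranges quoted here can be rechecked from the two tables: the lower bound comes from the measured GFlops and the upper bound from the expected GFlops. A small sketch, with `percent_lower` and `gflops_gap` as hypothetical helper names:

```python
# (measured, expected) GFlops copied from the two tables above.
GFLOPS = {
    "6600xt": (9937, 10875),
    "1080": (8970, 9262),
    "2060s": (8263, 8290),
}
# Measured memory bandwidth in GB/s from the first table.
MEASURED_GBS = {"6600xt": 216, "1080": 242, "2060s": 396}


def percent_lower(value, reference):
    """How many percent `value` falls below `reference`."""
    return (1.0 - value / reference) * 100.0


def gflops_gap(gpu, reference="6600xt"):
    """Gap to the reference GPU on (measured, expected) GFlops, in percent."""
    measured = percent_lower(GFLOPS[gpu][0], GFLOPS[reference][0])
    expected = percent_lower(GFLOPS[gpu][1], GFLOPS[reference][1])
    return measured, expected


for gpu in ("1080", "2060s"):
    measured, expected = gflops_gap(gpu)
    print(f"{gpu}: {measured:.1f}% to {expected:.1f}% fewer GFlops than 6600xt")

extra = (MEASURED_GBS["1080"] / MEASURED_GBS["6600xt"] - 1.0) * 100.0
print(f"1080: {extra:.1f}% more measured bandwidth than 6600xt")
```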
Testing Methodology
Three frameworks were tested using a batch of 64 images:
- pytorch/1.8 using cuda+cudnn
- keras/tensorflow 2.5 using cuda+cudnn
- OpenCL based solution dlprimitives.
Since there is no ROCm version of TF or pytorch that supports AMD's RDNA GPUs, only dlprimitives was tested on the 6600 XT, with the expectation of results similar to other GPUs in the same class.
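The timing loop itself is not described in this post, so the following is only a sketch of how a ms-per-batch figure could be collected in a framework-neutral way. The warm-up count, the iteration count and the use of the median are all assumptions, and `step` is a hypothetical callable that runs one batch and returns only after the GPU has finished its work:

```python
import time
from statistics import median


def ms_per_batch(step, warmup=5, iterations=20):
    """Median wall-clock time of `step()` in milliseconds.

    `step` must block until the device is idle (for example by ending with
    a device synchronization); otherwise only the kernel launch is timed.
    """
    for _ in range(warmup):  # let caches, lazy initialization and clocks settle
        step()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        step()
        samples.append((time.perf_counter() - start) * 1000.0)
    return median(samples)


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a GPU.
    print(f"{ms_per_batch(lambda: time.sleep(0.01)):.1f} ms per batch")
```

With a real framework, `step` would wrap one forward pass (inference) or one forward, backward and update pass (training) on a 64-image batch.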
Training Times
Measured in ms per batch, lower is better.
Framework | gpu | alexnet | resnet18 | resnet50 | vgg16 | mobilenet |
---|---|---|---|---|---|---|
dlprim | 6600xt | 83.73 | 231.2 | 716.2 | 1157.2 | 414.35 |
dlprim | 1080 | 93.03 | 262.1 | 926.6^ | 1348.9 | 614.02 |
dlprim | 2060s | 116.41 | 252.3 | 705.2^ | 1681.3 | 355.21 |
keras/tf2 | 1080 | 70.56 | 200.6 | 684.4^ | 633.1 | 437.84 |
keras/tf2 | 2060s | 70.00 | 172.2 | 520.0^ | 553.1 | 344.55 |
pytorch | 1080 | 62.37 | 151.4 | 518.0 | 780.9 | 229.20 |
pytorch | 2060s | 41.11 | 121.2 | 377.8 | 621.1^ | 143.23 |
^) Using half batch x32 twice, due to GPU memory limits
Observations:
- DLPrimitives has 67% of TensorFlow performance on NVidia GPUs. The biggest difference is in VGG; without VGG it reaches 75% of TF performance.
- TF has 77% of pytorch performance. The biggest outlier is VGG, where TF is actually faster; without VGG, TF drops to 67% of pytorch performance.
- DLPrimitives runs 24% faster on the AMD RX 6600 XT than on the GTX 1080, although the raw GFlops differ by only 10-17%, depending on the measurement strategy.
- DLPrimitives runs 15% faster on the AMD RX 6600 XT than on the RTX 2060S. Notably, the major drop happens on mobilenet, which is highly dependent on memory bandwidth because of its depthwise separable convolutions.
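How the framework percentages were aggregated is not stated. Averaging the per-network time ratios over both NVidia GPUs comes within about one percentage point of the quoted figures, so that is the assumption in this sketch; `relative_performance` is an illustrative helper name:

```python
from statistics import mean

NETS = ["alexnet", "resnet18", "resnet50", "vgg16", "mobilenet"]

# Training ms per batch on the NVidia GPUs, copied from the table above.
TRAIN_MS = {
    ("dlprim", "1080"): [93.03, 262.1, 926.6, 1348.9, 614.02],
    ("dlprim", "2060s"): [116.41, 252.3, 705.2, 1681.3, 355.21],
    ("keras/tf2", "1080"): [70.56, 200.6, 684.4, 633.1, 437.84],
    ("keras/tf2", "2060s"): [70.00, 172.2, 520.0, 553.1, 344.55],
    ("pytorch", "1080"): [62.37, 151.4, 518.0, 780.9, 229.20],
    ("pytorch", "2060s"): [41.11, 121.2, 377.8, 621.1, 143.23],
}


def relative_performance(framework, baseline, skip=()):
    """Mean of baseline_time / framework_time over both GPUs and all nets.

    A value of 0.75 means `framework` reaches 75% of `baseline` speed.
    """
    ratios = []
    for gpu in ("1080", "2060s"):
        fw_times = TRAIN_MS[(framework, gpu)]
        base_times = TRAIN_MS[(baseline, gpu)]
        for net, t_fw, t_base in zip(NETS, fw_times, base_times):
            if net not in skip:
                ratios.append(t_base / t_fw)
    return mean(ratios)


for fw, base in (("dlprim", "keras/tf2"), ("keras/tf2", "pytorch")):
    full = relative_performance(fw, base)
    no_vgg = relative_performance(fw, base, skip=("vgg16",))
    print(f"{fw} vs {base}: {full:.1%} overall, {no_vgg:.1%} without VGG")
```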
Inference Times
Measured in ms per batch, lower is better.
Framework | gpu | alexnet | resnet18 | resnet50 | vgg16 | mobilenet |
---|---|---|---|---|---|---|
dlprim | 6600xt | 34.28 | 63.57 | 185.72 | 277.97 | 102.84 |
dlprim | 1080 | 28.03 | 63.57 | 274.27 | 309.28 | 131.74 |
dlprim | 2060s | 47.52 | 81.09 | 210.97 | 428.34 | 97.80 |
keras/tf2 | 1080 | 40.55 | 80.64 | 199.38 | 189.07 | 109.85 |
keras/tf2 | 2060s | 47.95 | 75.73 | 165.31 | 174.27 | 93.01 |
pytorch | 1080 | 16.36 | 43.17 | 144.88 | 226.40 | 60.13 |
pytorch | 2060s | 9.65 | 33.27 | 107.56 | 172.47 | 35.55 |
Observations:
- DLPrimitives has 90% of TensorFlow performance on NVidia GPUs. The biggest difference is in VGG; without VGG it reaches 99% of TF performance.
- TF has 61% of pytorch performance. The biggest outlier is VGG, where TF matches or beats pytorch; without VGG, TF drops to 49% of pytorch performance.
- DLPrimitives runs 14% faster on the AMD RX 6600 XT than on the GTX 1080, and 26% faster than on the RTX 2060S. This is somewhat different from the training results.
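The GPU-to-GPU figures can be rechecked from the dlprim rows of the inference table. The sketch below assumes the speedup is the plain average of per-network time ratios, which gives roughly 14% and 26%; `mean_speedup` is an illustrative helper name:

```python
from statistics import mean

# dlprim inference ms per batch, copied from the table above
# (alexnet, resnet18, resnet50, vgg16, mobilenet).
INFER_MS = {
    "6600xt": [34.28, 63.57, 185.72, 277.97, 102.84],
    "1080": [28.03, 63.57, 274.27, 309.28, 131.74],
    "2060s": [47.52, 81.09, 210.97, 428.34, 97.80],
}


def mean_speedup(fast_gpu, slow_gpu):
    """Average of slow_time / fast_time over the five networks."""
    pairs = zip(INFER_MS[fast_gpu], INFER_MS[slow_gpu])
    return mean(slow / fast for fast, slow in pairs)


for other in ("1080", "2060s"):
    gain = mean_speedup("6600xt", other) - 1.0
    print(f"6600xt vs {other}: {gain:.1%} faster on average")
```

Note that the average hides large per-network swings: on alexnet the 1080 is faster than the 6600xt, while on resnet50 it is much slower.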
Summary and Conclusions
- There is a huge difference between DL frameworks. Pytorch is faster than TensorFlow by a large margin.
- DLPrimitives provides decent performance that is comparable to TF (losing ~25% of performance in training and 10% in inference).
- It seems that the 6600XT gives decent performance with dlprimitives, comparable to the NVidia 1080/2060s, with a performance improvement that is comparable to the difference in GFlops.