Comparing Green and Red Apples

Friday, September 10, 2021, by artyom; Posted in: Benchmarks

TL;DR

  • OpenCL-based DLPrimitives is almost as fast as cuDNN-based TF in inference and close enough in training.
  • The framework matters: TF is much slower than PyTorch.
  • AMD 6600 XT is faster than NVidia 1080 and 2060S by a margin similar to the difference in GFlops between these cards.

Although dlprimitives isn't as fast as the best cuDNN-based solution, PyTorch, its performance makes it more than useful for platform-independent deep learning.

How to Compare Different GPUs

Comparing deep learning software performance on NVidia and AMD GPUs isn't as simple as you may think.

There are too many factors:

  1. No two GPUs have identical specs. The major parameters are GFlops and memory bandwidth, as most DL algorithms are either compute limited (like dense and conv layers) or bandwidth limited (like batch normalization or activation).
  2. Both companies provide libraries optimized for their GPUs: MIOpen and cuDNN. While they are highly optimized and provide similar functionality, they don't have similar performance.

The situation becomes even more complex when it comes to the RDNA architecture. For more than two years, AMD hasn't released support for these GPUs in its DL stack.

Even so, I decided to try to check it using dlprimitives.

Baseline

Note that we compare 3 different GPUs that have similar performance within reasonable margins:

AMD RX 6600 XT, NVidia GTX 1080, NVidia RTX 2060 Super.

The basic flops performance was measured using a custom kernel:

GPU      Measured GFlops   Measured GB/s
6600xt   9,937             216
1080     8,970             242
2060s    8,263             396

The flops performance of modern GPUs can be calculated as clock * cores * 2. However, the clock depends on the specific model and on thermal performance, so two baselines are used: the manually measured values above, and the theoretical expected flops calculated using the median clock observed during the benchmarks.

GPU      Cores   Clock (MHz)   Expected GFlops   Expected GB/s
6600xt   2048    2655          10,875            256
1080     2560    1809          9,262             320
2060s    2176    1905          8,290             448

So GPU performance varies. Although the 2060s has 17-24% fewer GFlops than the 6600xt, it has much higher memory throughput, which helps in bandwidth-limited algorithms like batch normalization or the depthwise separable convolutions used in mobilenet. The 1080 has 10-15% fewer GFlops but 12% more measured bandwidth.
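The expected GFlops column and the relative gaps quoted above can be rechecked with a few lines of Python. This is only a sketch built from the numbers in the two tables; the function names are made up for illustration, and the factor of 2 is the usual assumption of one fused multiply-add per core per clock.

```python
# Specs copied from the tables above: (cores, median clock in MHz, measured GFlops).
GPUS = {
    "6600xt": (2048, 2655, 9937),
    "1080": (2560, 1809, 8970),
    "2060s": (2176, 1905, 8263),
}

def expected_gflops(cores, clock_mhz):
    """Theoretical peak, assuming one FMA (2 flops) per core per clock."""
    return cores * clock_mhz * 2 / 1000.0  # cores * MHz * 2 is MFlops

def deficit_percent(value, reference):
    """How much lower `value` is than `reference`, in percent."""
    return (1.0 - value / reference) * 100.0

expected = {name: expected_gflops(c, mhz) for name, (c, mhz, _) in GPUS.items()}
measured = {name: m for name, (_, _, m) in GPUS.items()}

for name in ("1080", "2060s"):
    by_measurement = deficit_percent(measured[name], measured["6600xt"])
    by_formula = deficit_percent(expected[name], expected["6600xt"])
    print(f"{name}: {by_measurement:.0f}-{by_formula:.0f}% fewer GFlops than 6600xt")
```

The lower bound of each range comes from the measured values and the upper bound from the clock-based estimate.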

Testing Methodology

Three frameworks were tested using a batch of 64 images:

  1. pytorch/1.8 using cuda+cudnn
  2. keras/tensorflow 2.5 using cuda+cudnn
  3. OpenCL based solution dlprimitives.

Since there is no ROCm version of TF or PyTorch that supports AMD's RDNA GPUs, only dlprimitives was tested on the 6600 XT, with the expectation of getting results similar to those of other GPUs in the same class.
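The post does not describe the timing loop itself, so the following is only a generic sketch of how a per-batch time in milliseconds could be collected. The names `ms_per_batch`, `run_batch` and `fake_batch`, the warm-up count and the use of the median are all assumptions for illustration. On a real GPU, `run_batch` would have to return only after the device has finished its work, otherwise only the kernel launch time is measured.

```python
import statistics
import time

def ms_per_batch(run_batch, warmup=5, iters=20):
    """Median wall-clock time of run_batch() in milliseconds.

    run_batch is a hypothetical callable that processes one batch and
    returns only after the device has finished (i.e. it synchronizes).
    """
    for _ in range(warmup):  # untimed runs so caches and clocks settle
        run_batch()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_batch()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

def fake_batch():
    # Stand-in CPU workload so the sketch runs without a GPU.
    sum(i * i for i in range(100_000))

print(f"{ms_per_batch(fake_batch):.2f} ms per batch")
```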

Training Times

Measured in ms per batch, lower is better.

Framework   GPU      alexnet   resnet18   resnet50   vgg16     mobilenet
dlprim      6600xt   83.73     231.2      716.2      1157.2    414.35
dlprim      1080     93.03     262.1      926.6^     1348.9    614.02
dlprim      2060s    116.41    252.3      705.2^     1681.3    355.21
keras/tf2   1080     70.56     200.6      684.4^     633.1     437.84
keras/tf2   2060s    70.00     172.2      520.0^     553.1     344.55
pytorch     1080     62.37     151.4      518.0      780.9     229.20
pytorch     2060s    41.11     121.2      377.8      621.1^    143.23

^) Using a half batch of 32 twice, due to GPU memory limits.

Observations:

  1. DLPrimitives has 67% of TensorFlow performance on NVidia GPUs. The biggest difference is in VGG; the comparison without VGG gives 75% of TF performance.
  2. TF has 77% of PyTorch performance. The biggest difference is in VGG, where TF is actually faster than PyTorch; without VGG, TF drops to 67% of PyTorch performance.
  3. DLPrimitives runs 24% faster on the AMD RX 6600 XT than on the GTX 1080, although the raw GFlops power differs by 10-17%, depending on the measurement strategy.
  4. DLPrimitives runs 15% faster on the AMD RX 6600 XT than on the RTX 2060S. Notably, the major drop happens on mobilenet, which is highly dependent on memory bandwidth because of its depthwise separable convolutions.
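The framework percentages above can be approximately reproduced from the training table. The averaging method is not stated in the post; the sketch below assumes an arithmetic mean of per-network time ratios over both NVidia GPUs, which lands within about one percentage point of the quoted figures. The helper name `relative_performance` is made up for illustration.

```python
from statistics import mean

NETS = ["alexnet", "resnet18", "resnet50", "vgg16", "mobilenet"]

# Training times in ms per batch, copied from the table above.
TRAIN_MS = {
    ("dlprim", "1080"): [93.03, 262.1, 926.6, 1348.9, 614.02],
    ("dlprim", "2060s"): [116.41, 252.3, 705.2, 1681.3, 355.21],
    ("keras/tf2", "1080"): [70.56, 200.6, 684.4, 633.1, 437.84],
    ("keras/tf2", "2060s"): [70.00, 172.2, 520.0, 553.1, 344.55],
    ("pytorch", "1080"): [62.37, 151.4, 518.0, 780.9, 229.20],
    ("pytorch", "2060s"): [41.11, 121.2, 377.8, 621.1, 143.23],
}

def relative_performance(times, framework, baseline,
                         gpus=("1080", "2060s"), skip=()):
    """Mean of time(baseline) / time(framework) over GPUs and networks.

    A value of 0.75 means `framework` reaches 75% of `baseline` speed.
    Averaging per-network ratios is an assumption, not the author's
    documented method.
    """
    ratios = [
        t_base / t_fw
        for gpu in gpus
        for net, t_fw, t_base in zip(NETS, times[(framework, gpu)],
                                     times[(baseline, gpu)])
        if net not in skip
    ]
    return mean(ratios)

for fw, base in (("dlprim", "keras/tf2"), ("keras/tf2", "pytorch")):
    all_nets = relative_performance(TRAIN_MS, fw, base)
    no_vgg = relative_performance(TRAIN_MS, fw, base, skip=("vgg16",))
    print(f"{fw} vs {base}: {all_nets:.1%} ({no_vgg:.1%} without vgg16)")
```

Because the ratios are averaged without weighting, a single outlier such as vgg16 moves the result noticeably, which is why the figures with and without VGG differ so much.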

Inference Times

Measured in ms per batch, lower is better.

Framework   GPU      alexnet   resnet18   resnet50   vgg16     mobilenet
dlprim      6600xt   34.28     63.57      185.72     277.97    102.84
dlprim      1080     28.03     63.57      274.27     309.28    131.74
dlprim      2060s    47.52     81.09      210.97     428.34    97.80
keras/tf2   1080     40.55     80.64      199.38     189.07    109.85
keras/tf2   2060s    47.95     75.73      165.31     174.27    93.01
pytorch     1080     16.36     43.17      144.88     226.40    60.13
pytorch     2060s    9.65      33.27      107.56     172.47    35.55

Observations:

  1. DLPrimitives has 90% of TensorFlow performance on NVidia GPUs. The biggest difference is in VGG; the comparison without VGG gives 99% of TF performance.
  2. TF has 61% of PyTorch performance. The biggest difference is in VGG; without VGG, TF drops to 49% of PyTorch performance.
  3. DLPrimitives runs 14% faster on the AMD RX 6600 XT than on the GTX 1080, and 26% faster than on the RTX 2060S. This is somewhat different from the training results.
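To put the GPU comparison next to the raw compute gap, the sketch below computes the mean dlprimitives inference speedup of the 6600 XT over each NVidia card and prints it beside the ratio of measured GFlops from the baseline table. As before, the unweighted mean of per-network ratios is an assumption about how the percentages were derived, and `mean_speedup` is an illustrative name.

```python
from statistics import mean

# dlprimitives inference times in ms per batch, copied from the table above
# (alexnet, resnet18, resnet50, vgg16, mobilenet).
DLPRIM_INFERENCE_MS = {
    "6600xt": [34.28, 63.57, 185.72, 277.97, 102.84],
    "1080": [28.03, 63.57, 274.27, 309.28, 131.74],
    "2060s": [47.52, 81.09, 210.97, 428.34, 97.80],
}

# Measured GFlops from the baseline table.
MEASURED_GFLOPS = {"6600xt": 9937, "1080": 8970, "2060s": 8263}

def mean_speedup(times, gpu, reference):
    """Mean over networks of time(reference) / time(gpu).

    A value above 1 means `gpu` is faster than `reference` on average.
    """
    return mean(t_ref / t_gpu for t_gpu, t_ref in zip(times[gpu], times[reference]))

for other in ("1080", "2060s"):
    speedup = mean_speedup(DLPRIM_INFERENCE_MS, "6600xt", other)
    flops_ratio = MEASURED_GFLOPS["6600xt"] / MEASURED_GFLOPS[other]
    print(f"6600xt vs {other}: {speedup - 1:.0%} faster, "
          f"{flops_ratio - 1:.0%} more measured GFlops")
```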

Summary and Conclusions

  1. There is a huge difference between DL frameworks. PyTorch is faster than TensorFlow by a large margin.
  2. DLPrimitives provides decent performance that is comparable to TF (losing ~25% of performance in training and 10% in inference).
  3. It seems that the 6600 XT gives decent performance for dlprimitives, comparable to that of the NVidia 1080/2060s, with a performance gap that is comparable to the gap in GFlops.
