Posts in category ‘Benchmarks’.

Pytorch Training Benchmarks

Tuesday, October 26, 2021, by artyom ; Posted in: Benchmarks; 0 comments

I managed to train several networks in pytorch with the opencl/dlprimitives backend, and the results are promising.

Below are the results for several NVidia GPUs and a comparison to the tf2/keras baseline. Unlike in previous benchmarks, I now include the time of the Adam optimizer, which was previously missing and turns out to be significant (it isn't very efficient in pytorch).

I also added times for the AMD RX 6600 XT; unfortunately there is no baseline I can compare it to, since AMD hasn't released ROCm for RDNA yet.
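
For reference, a per-batch time here covers the forward pass, the backward pass and the Adam update. Below is a minimal sketch of how such a measurement can be done; it is shown with the CUDA device and torch.cuda.synchronize, while device selection and synchronization for the OpenCL backend depend on how it is installed, so treat those details as assumptions rather than the exact benchmark code.

```python
import time
import torch
import torchvision

device = torch.device("cuda:0")   # swap in the OpenCL device your backend exposes
model = torchvision.models.resnet18(num_classes=1000).to(device)
model.train()
opt = torch.optim.Adam(model.parameters())
loss_fn = torch.nn.CrossEntropyLoss()

x = torch.randn(16, 3, 224, 224, device=device)   # batch of 16 images, 224x224
y = torch.randint(0, 1000, (16,), device=device)  # random labels

def step():
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()                                     # Adam update included in the timing

for _ in range(5):                                 # warm-up iterations
    step()
torch.cuda.synchronize()                           # flush queued GPU work (CUDA only)

iters = 20
start = time.time()
for _ in range(iters):
    step()
torch.cuda.synchronize()
print("ms per batch:", (time.time() - start) / iters * 1000)
```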

Absolute Performance

Batch size: 16 images of 224x224; time in ms, lower is better.

| Framework | Vendor | GPU | alexnet | resnet18 | resnet50 | vgg16 | mobilenet |
|---|---|---|---|---|---|---|---|
| pytorch/opencl | AMD | rx 6600xt | 56.846 | 109.850 | 258.973 | 365.305 | 163.732 |
| dlprimitives | AMD | rx 6600xt | 36.954 | 65.241 | 194.398 | 308.763 | 99.862 |
| pytorch/cuda | Nvidia | rtx 2060s | 27.592 | 38.624 | 114.074 | 179.580 | 49.624 |
| pytorch/opencl | Nvidia | rtx 2060s | 50.108 | 82.021 | 223.651 | 462.964 | 129.145 |
| dlprimitives | Nvidia | rtx 2060s | 39.829 | 67.960 | 187.398 | 439.053 | 90.229 |
| tf2/cuda | Nvidia | rtx 2060s | 29.165 | 55.523 | 147.999 | 156.714 | 102.596 |
| pytorch/cuda | Nvidia | gtx 1080 | 38.310 | 44.382 | 137.754 | 232.824 | 63.324 |
| pytorch/opencl | Nvidia | gtx 1080 | 54.828 | 85.016 | 301.898 | 411.928 | 173.885 |
| dlprimitives | Nvidia | gtx 1080 | 38.804 | 71.147 | 264.286 | 374.168 | 134.650 |
| tf2/cuda | Nvidia | gtx 1080 | 35.592 | 69.071 | 189.994 | 197.333 | 128.526 |

Relative Performance

Comparison of TF/Cuda with pytorch + opencl/dlprimitives and dlprimitives alone:

| Baseline | Tested | GPU | alexnet | resnet18 | resnet50 | vgg16 | mobilenet |
|---|---|---|---|---|---|---|---|
| tf2/cuda | dlprimitives | gtx 1080 | 92% | 97% | 72% | 53% | 95% |
| tf2/cuda | pt/opencl | gtx 1080 | 65% | 81% | 63% | 48% | 74% |
| tf2/cuda | dlprimitives | rtx 2060s | 73% | 82% | 79% | 36% | 114% |
| tf2/cuda | pt/opencl | rtx 2060s | 58% | 68% | 66% | 34% | 79% |

Summary

Apart from VGG, most of the results are very reassuring.

Notes

Why do I compare to TF2/cuda as the baseline? Pytorch is the faster framework, but since TF is good enough for most users, I want to show that I get performance that is close enough to it.

Comparing Green and Red Apples

Friday, September 10, 2021, by artyom ; Posted in: Benchmarks; 0 comments

TL;DR

  • The OpenCL based DLPrimitives is almost as fast as cuDNN based TF in inference and close enough in training.
  • Framework matters - TF is much slower than pytorch.
  • The AMD RX 6600 XT is faster than the NVidia GTX 1080 and RTX 2060S by a margin similar to the difference in GFlops between these cards.

Even though dlprimitives isn't as fast as the best cuDNN based solution - pytorch - its performance makes it more than useful for platform independent deep learning.

How to Compare Different GPUs

Comparing deep learning software performance on NVidia and AMD GPUs isn't as simple as you might think.

There are two major factors:

  1. No two of these GPUs have identical specs. The major parameters are GFlops and memory bandwidth, since most DL algorithms are either compute limited (like dense and convolution layers) or bandwidth limited (like batch normalization or activations); see the sketch after this list.
  2. Both companies provide libraries optimized for their GPUs: MIOpen and cuDNN. While both are highly optimized and provide similar functionality, they don't necessarily have similar performance.
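
To make the compute vs. bandwidth distinction concrete, here is a rough back-of-the-envelope sketch; the layer shapes and GPU numbers are illustrative assumptions, not values from the benchmark. A layer whose arithmetic intensity (FLOPs per byte moved) is far above the GPU's GFlops-to-GB/s ratio is compute limited, and one far below it is bandwidth limited.

```python
def conv_intensity(h, w, cin, cout, k=3, dtype_bytes=4):
    """Arithmetic intensity (FLOPs per byte) of a KxK convolution layer."""
    flops = 2 * k * k * cin * cout * h * w            # 2 ops per multiply-add
    # bytes moved: read input and weights, write output (ignoring cache reuse)
    bytes_moved = dtype_bytes * (h * w * cin + k * k * cin * cout + h * w * cout)
    return flops / bytes_moved

def batchnorm_intensity(dtype_bytes=4):
    """Batch norm does only a few FLOPs per element it reads and writes."""
    return 4 / (2 * dtype_bytes)

# roughly GFlops / GB/s for cards in this class => the break-even intensity
gpu_ratio = 9000 / 250

print("conv 56x56, 64->64 channels:", round(conv_intensity(56, 56, 64, 64), 1), "FLOP/byte")
print("batch norm:                 ", round(batchnorm_intensity(), 2), "FLOP/byte")
print("GPU compute/bandwidth ratio:", round(gpu_ratio, 1), "FLOP/byte")
# Intensity far above the GPU ratio -> compute limited (the convolution);
# far below it -> bandwidth limited (batch norm, activations).
```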

The situation becomes even more complex when it comes to the RDNA architecture: AMD hasn't released support for these GPUs in their DL stack for more than two years.

Nevertheless, I decided to check it using dlprimitives.

Baseline

Note that we compare 3 different GPUs that have similar performance within reasonable margins:

AMD RX 6600 XT, NVidia GTX 1080, NVidia RTX 2060 Super.

The basic flops performance was measured using a custom kernel.

| GPU | GFlops | GB/s |
|---|---|---|
| 6600xt | 9,937 | 216 |
| 1080 | 8,970 | 242 |
| 2060s | 8,263 | 396 |

The flops performance of a modern GPU can be calculated as clock * cores * 2. However, the clock depends on the specific model and its thermal behavior, so the manually measured values above serve as the baseline, while the theoretical expected flops below are calculated using the median clock observed during the benchmarks.

| GPU | Cores | Clock (MHz) | Exp GFlops | Exp GB/s |
|---|---|---|---|---|
| 6600xt | 2048 | 2655 | 10,875 | 256 |
| 1080 | 2560 | 1809 | 9,262 | 320 |
| 2060s | 2176 | 1905 | 8,290 | 448 |
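
A short sketch reproducing the "Exp GFlops" column from the core count and the observed median clock (the factor of 2 counts a fused multiply-add as two operations per core per cycle):

```python
gpus = {
    "6600xt": (2048, 2655),   # cores, median clock in MHz
    "1080":   (2560, 1809),
    "2060s":  (2176, 1905),
}
for name, (cores, clock_mhz) in gpus.items():
    gflops = cores * clock_mhz * 2 / 1000             # MHz -> GFlops
    print(f"{name}: {gflops:,.0f} GFlops")
# 6600xt: 10,875   1080: 9,262   2060s: ~8,290 - matching the table within rounding
```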

So the GPUs' performance varies: while the 2060S has 17-24% less GFlops than the 6600 XT, it has much higher memory throughput, which helps in bandwidth limited algorithms like batch normalization or the depthwise separable convolutions of mobilenet. The 1080 has 10-15% lower GFlops but 12% more bandwidth.

Testing Methodology

Three frameworks were tested using a batch of 64 images:

  1. pytorch/1.8 using cuda+cudnn
  2. keras/tensorflow 2.5 using cuda+cudnn
  3. OpenCL based solution dlprimitives.

Since there is no ROCm version of TF or pytorch that supports AMD's RDNA GPUs, only dlprimitives was tested on the 6600 XT, with the expectation of getting results similar to the other GPUs in the same class.

Training Times

Measured in ms per batch, lower is better.

| Framework | GPU | alexnet | resnet18 | resnet50 | vgg16 | mobilenet |
|---|---|---|---|---|---|---|
| dlprim | 6600xt | 83.73 | 231.2 | 716.2 | 1157.2 | 414.35 |
| dlprim | 1080 | 93.03 | 262.1 | 926.6^ | 1348.9 | 614.02 |
| dlprim | 2060s | 116.41 | 252.3 | 705.2^ | 1681.3 | 355.21 |
| keras/tf2 | 1080 | 70.56 | 200.6 | 684.4^ | 633.1 | 437.84 |
| keras/tf2 | 2060s | 70.00 | 172.2 | 520.0^ | 553.1 | 344.55 |
| pytorch | 1080 | 62.37 | 151.4 | 518.0 | 780.9 | 229.20 |
| pytorch | 2060s | 41.11 | 121.2 | 377.8 | 621.1^ | 143.23 |

^) Run twice with a half batch of 32 images, due to GPU memory limits

Observations:

  1. DLPrimitives has 67% of TensorFlow's performance on NVidia GPUs; the biggest difference was in VGG. Without VGG the comparison gives 75% of TF performance.
  2. TF has 77% of pytorch's performance, again with the biggest difference in VGG. Without VGG the gap grows and TF reaches only 67%.
  3. DLPrimitives runs 24% faster on the AMD RX 6600 XT than on the GTX 1080, while the raw GFlops power differs by only 10-17% depending on the measurement strategy.
  4. DLPrimitives runs 15% faster on the AMD RX 6600 XT than on the RTX 2060S. Notably, the major drop happens on mobilenet, which is highly dependent on memory bandwidth because of its depthwise separable convolutions.

Inference Times

Measured in ms per batch, lower is better.

| Framework | GPU | alexnet | resnet18 | resnet50 | vgg16 | mobilenet |
|---|---|---|---|---|---|---|
| dlprim | 6600xt | 34.28 | 63.57 | 185.72 | 277.97 | 102.84 |
| dlprim | 1080 | 28.03 | 63.57 | 274.27 | 309.28 | 131.74 |
| dlprim | 2060s | 47.52 | 81.09 | 210.97 | 428.34 | 97.80 |
| keras/tf2 | 1080 | 40.55 | 80.64 | 199.38 | 189.07 | 109.85 |
| keras/tf2 | 2060s | 47.95 | 75.73 | 165.31 | 174.27 | 93.01 |
| pytorch | 1080 | 16.36 | 43.17 | 144.88 | 226.40 | 60.13 |
| pytorch | 2060s | 9.65 | 33.27 | 107.56 | 172.47 | 35.55 |

Observations:

  1. DLPrimitives has 90% of TensorFlow's performance on NVidia GPUs; the biggest difference was in VGG. Without VGG the comparison gives 99% of TF performance.
  2. TF has 61% of pytorch's performance, again with the biggest difference in VGG. Without VGG the gap grows and TF reaches only 49%.
  3. DLPrimitives runs 14% faster on the AMD RX 6600 XT than on the GTX 1080, and 26% faster than on the RTX 2060S. This differs somewhat from the training results.

Summary and Conclusions

  1. There is a huge difference between DL frameworks: pytorch is faster than TensorFlow by large margins.
  2. DLPrimitives provides decent performance comparable to TF (losing ~25% in training and ~10% in inference).
  3. The 6600 XT gives decent performance with dlprimitives, comparable to the NVidia 1080/2060S, with an advantage that roughly matches the difference in GFlops.
