<h1>Comparing Green and Red Apples</h1>
<p><a href="http://blog.dlprimitives.org/post/1">http://blog.dlprimitives.org/post/1</a></p>

<h2>TL;DR</h2>
<ul>
<li>OpenCL based DLPrimitives is almost as fast as cuDNN based TF in inference and close enough in training.</li>
<li>The framework matters: TF is much slower than pytorch.</li>
<li>The AMD RX 6600 XT is faster than the NVidia GTX 1080 and RTX 2060S by a margin similar to the difference in GFlops between these cards.</li>
</ul>
<p>While dlprimitives isn't as fast as the best cuDNN based solution (pytorch), its performance makes it more than useful for platform independent deep learning.</p>

<h2>How to Compare Different GPUs</h2>
<p>Comparing deep learning software performance on NVidia and AMD GPUs isn't as simple as you may think.</p>
<p>There are two major factors:</p>
<ol>
<li>No two GPUs have identical specs. The major parameters are GFlops and memory bandwidth, as most DL algorithms are either compute limited (like dense and convolution layers) or bandwidth limited (like batch normalization or activations).</li>
<li>Both companies provide libraries optimized for their GPUs: MIOpen and cuDNN. While both are highly optimized and provide similar functionality, they don't have similar performance.</li>
</ol>
<p>The situation becomes even more complex when it comes to the RDNA architecture: <a href="https://github.com/RadeonOpenCompute/ROCm/issues/819">AMD hasn't released support for these GPUs in their DL stack</a> for more than two years.</p>
<p>Even so, I decided to check it using dlprimitives.</p>

<h2>Base Line</h2>
<p>Note that we compare 3 different GPUs that have similar performance within reasonable margins: AMD RX 6600 XT, NVidia GTX 1080 and NVidia RTX 2060 Super.</p>
<p>The basic flops performance was measured using a custom kernel:</p>
<table>
<thead>
<tr><th>gpu</th><th>GFlops</th><th>GB/s</th></tr>
</thead>
<tbody>
<tr><td>6600xt</td><td>9,937</td><td>216</td></tr>
<tr><td>1080</td><td>8,970</td><td>242</td></tr>
<tr><td>2060s</td><td>8,263</td><td>396</td></tr>
</tbody>
</table>
<p>The flops performance of a modern GPU can be calculated as clock * cores * 2. However, the clock depends on the specific model and its thermal behavior, so both the manual measurements above were used as a baseline, and theoretical expected flops were calculated from the median clock observed during the benchmarks (see the sketch at the end of this section):</p>
<table>
<thead>
<tr><th>gpu</th><th>Cores</th><th>Clock MHz</th><th>Exp GFlops</th><th>Exp GB/s</th></tr>
</thead>
<tbody>
<tr><td>6600xt</td><td>2048</td><td>2655</td><td>10,875</td><td>256</td></tr>
<tr><td>1080</td><td>2560</td><td>1809</td><td>9,262</td><td>320</td></tr>
<tr><td>2060s</td><td>2176</td><td>1905</td><td>8,290</td><td>448</td></tr>
</tbody>
</table>
<p>So GPU performance varies. While the 2060S has 17-24% less GFlops than the 6600 XT, it has much higher memory throughput, which helps in bandwidth limited algorithms like batch normalization or the depthwise separable convolutions of mobilenet. The 1080 has 10-15% lower GFlops but 12% more bandwidth.</p>
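<p>To make the estimate concrete, here is a small Python sketch (my own illustration, not part of the benchmark code; core counts and median clocks are copied from the table above) that closely reproduces the "Exp GFlops" column, counting one fused multiply-add as two floating point operations:</p>
<pre><code># Expected GFlops = cores * clock[GHz] * 2 (one FMA per core per cycle)
gpus = {
    # name: (cores, median observed clock in MHz)
    "6600xt": (2048, 2655),
    "1080":   (2560, 1809),
    "2060s":  (2176, 1905),
}

for name, (cores, clock_mhz) in gpus.items():
    gflops = cores * (clock_mhz / 1000.0) * 2
    print(f"{name:7s} {gflops:,.0f} expected GFlops")
</code></pre>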
<h2>Testing Methodology</h2>
<p>Three frameworks were tested using a 64 image batch (a sketch of the timing loop appears after the observations below):</p>
<ol>
<li>pytorch 1.8 using cuda+cudnn</li>
<li>keras/tensorflow 2.5 using cuda+cudnn</li>
<li>dlprimitives, the OpenCL based solution</li>
</ol>
<p>Since there is no ROCm version of TF or pytorch that supports AMD's RDNA GPUs, only dlprimitives was tested on the 6600 XT, with the expectation of results similar to the other GPUs in its class.</p>

<h2>Training Times</h2>
<p>Measured in ms per batch, lower is better.</p>
<table>
<thead>
<tr><th>Framework</th><th>gpu</th><th>alexnet</th><th>resnet18</th><th>resnet50</th><th>vgg16</th><th>mobilenet</th></tr>
</thead>
<tbody>
<tr><td>dlprim</td><td>6600xt</td><td>83.73</td><td>231.2</td><td>716.2</td><td>1157.2</td><td>414.35</td></tr>
<tr><td>dlprim</td><td>1080</td><td>93.03</td><td>262.1</td><td>926.6^</td><td>1348.9</td><td>614.02</td></tr>
<tr><td>dlprim</td><td>2060s</td><td>116.41</td><td>252.3</td><td>705.2^</td><td>1681.3</td><td>355.21</td></tr>
<tr><td>keras/tf2</td><td>1080</td><td>70.56</td><td>200.6</td><td>684.4^</td><td>633.1</td><td>437.84</td></tr>
<tr><td>keras/tf2</td><td>2060s</td><td>70.00</td><td>172.2</td><td>520.0^</td><td>553.1</td><td>344.55</td></tr>
<tr><td>pytorch</td><td>1080</td><td>62.37</td><td>151.4</td><td>518.0</td><td>780.9</td><td>229.20</td></tr>
<tr><td>pytorch</td><td>2060s</td><td>41.11</td><td>121.2</td><td>377.8</td><td>621.1^</td><td>143.23</td></tr>
</tbody>
</table>
<p>^) Run twice with half batch (32), due to GPU memory limits.</p>
<p>Observations:</p>
<ol>
<li>DLPrimitives reaches about 67% of TensorFlow performance on NVidia GPUs. The biggest difference is in VGG; excluding VGG it reaches 75% of TF performance.</li>
<li>TF reaches about 77% of pytorch performance. The biggest difference is again in VGG; excluding VGG the gap widens and TF reaches only 67%.</li>
<li>DLPrimitives runs 24% faster on the AMD RX 6600 XT than on the GTX 1080, while the raw GFlops differ by only 10-17% depending on the measurement strategy.</li>
<li>DLPrimitives runs 15% faster on the AMD RX 6600 XT than on the RTX 2060S. The most noticeable drop is on mobilenet, which is highly dependent on memory bandwidth due to its depthwise separable convolutions.</li>
</ol>
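<p>For reference, a per-batch training time like the ones above is essentially the wall-clock time of a full forward/backward/update step. A minimal pytorch sketch of such a timing loop (my own illustration, not the exact benchmark script; the model choice, warmup count and synchronization are assumptions):</p>
<pre><code>import time
import torch
import torchvision.models as models

model = models.resnet18().cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

# 64 image batch with ImageNet-shaped inputs, as in the post
x = torch.randn(64, 3, 224, 224, device="cuda")
y = torch.randint(0, 1000, (64,), device="cuda")

for _ in range(5):  # warmup: let cuDNN pick algorithms, stabilize clocks
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

torch.cuda.synchronize()
start = time.time()
iters = 20
for _ in range(iters):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
print(f"{1000 * (time.time() - start) / iters:.1f} ms/batch")
</code></pre>
<p>The synchronize() calls matter: CUDA kernels launch asynchronously, so stopping the clock without synchronizing would mostly measure launch overhead rather than GPU time.</p>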
<h2>Inference Times</h2>
<p>Measured in ms per batch, lower is better.</p>
<table>
<thead>
<tr><th>Framework</th><th>gpu</th><th>alexnet</th><th>resnet18</th><th>resnet50</th><th>vgg16</th><th>mobilenet</th></tr>
</thead>
<tbody>
<tr><td>dlprim</td><td>6600xt</td><td>34.28</td><td>63.57</td><td>185.72</td><td>277.97</td><td>102.84</td></tr>
<tr><td>dlprim</td><td>1080</td><td>28.03</td><td>63.57</td><td>274.27</td><td>309.28</td><td>131.74</td></tr>
<tr><td>dlprim</td><td>2060s</td><td>47.52</td><td>81.09</td><td>210.97</td><td>428.34</td><td>97.80</td></tr>
<tr><td>keras/tf2</td><td>1080</td><td>40.55</td><td>80.64</td><td>199.38</td><td>189.07</td><td>109.85</td></tr>
<tr><td>keras/tf2</td><td>2060s</td><td>47.95</td><td>75.73</td><td>165.31</td><td>174.27</td><td>93.01</td></tr>
<tr><td>pytorch</td><td>1080</td><td>16.36</td><td>43.17</td><td>144.88</td><td>226.40</td><td>60.13</td></tr>
<tr><td>pytorch</td><td>2060s</td><td>9.65</td><td>33.27</td><td>107.56</td><td>172.47</td><td>35.55</td></tr>
</tbody>
</table>
<p>Observations:</p>
<ol>
<li>DLPrimitives reaches about 90% of TensorFlow performance on NVidia GPUs. The biggest difference is in VGG; excluding VGG it reaches 99% of TF performance (a sketch of this kind of computation appears at the end of the post).</li>
<li>TF reaches about 61% of pytorch performance. The biggest difference is again in VGG; excluding VGG the gap widens and TF reaches only 49%.</li>
<li>DLPrimitives runs 14% faster on the AMD RX 6600 XT than on the GTX 1080, and 26% faster than on the RTX 2060S. This differs somewhat from the training results.</li>
</ol>

<h2>Summary and Conclusions</h2>
<ol>
<li>There is a huge difference between DL frameworks: pytorch is faster than TensorFlow by large margins.</li>
<li>DLPrimitives provides decent performance comparable to TF, losing ~25% of performance in training and ~10% in inference.</li>
<li>The 6600 XT gives dlprimitives performance comparable to the NVidia 1080/2060S, with an advantage roughly matching the difference in GFlops between the cards.</li>
</ol>
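<p>For completeness, per-network framework ratios like the percentages quoted above can be computed directly from the tables. A small Python sketch using the GTX 1080 inference row; the exact averaging behind the headline numbers is not stated in the post, so the simple mean here is an assumption and may differ slightly:</p>
<pre><code># Hypothetical reconstruction: dlprim vs. keras/tf2 inference times
# (ms/batch) on the GTX 1080, copied from the table above.
nets   = ["alexnet", "resnet18", "resnet50", "vgg16", "mobilenet"]
dlprim = [28.03, 63.57, 274.27, 309.28, 131.74]
tf2    = [40.55, 80.64, 199.38, 189.07, 109.85]

# Relative speed of dlprim vs. TF per network: &gt;1 means dlprim is faster.
ratios = [t / d for d, t in zip(dlprim, tf2)]
for net, r in zip(nets, ratios):
    print(f"{net:10s} dlprim/TF relative speed: {r:.2f}")

# Simple mean, with and without the VGG outlier (averaging method assumed).
mean_all = sum(ratios) / len(ratios)
no_vgg = [r for net, r in zip(nets, ratios) if net != "vgg16"]
print(f"mean: {mean_all:.2f}, without vgg16: {sum(no_vgg) / len(no_vgg):.2f}")
</code></pre>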