About

The DLPrimitives project aims to make deep learning platform independent and truly open source.

It does so by implementing efficient, optimized operators for GPU computing on the OpenCL platform.

Comparing Green and Red Apples

9/10/21, by artyom; Posted in: Benchmarks

TL;DR

OpenCL-based DLPrimitives is almost as fast as cuDNN-based TF in inference and close enough in training.
Framework matters: TF is much slower than pytorch.
The AMD 6600 XT is faster than the NVidia 1080 and 2060S by a margin similar to the difference in GFlops between these cards.

Although dlprimitives isn't as fast as the best cuDNN-based solution, pytorch, its performance makes it more than useful for platform-independent deep learning.

How to Compare Different GPUs

Comparing deep learning software performance on NVidia and AMD GPUs isn't as simple as you might think.

There are too many factors:

No two GPUs have identical specs. The major parameters are GFlops and memory bandwidth, as most DL algorithms are either compute limited (like dense and conv layers) or bandwidth limited (like batch normalization or activation).
Both companies provide libraries optimized for their GPUs: MIOpen and cuDNN. While both are highly optimized and provide similar functionality, they don't have similar performance.
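
To make the compute-limited versus bandwidth-limited distinction concrete, here is a rough sketch that compares the time an operation would need at peak GFlops with the time it would need at peak memory bandwidth. This is a simple roofline-style estimate, not something dlprimitives does; the two workloads (a 1024x1024 float32 matrix multiply standing in for a dense layer, and an element-wise activation over a 64x64x56x56 tensor) are hypothetical, and the peak figures are the measured values from the Base Line table in this post.

```python
# Measured peaks from the Base Line table in this post: (GFlops, GB/s).
GPUS = {
    "6600xt": (9937.0, 216.0),
    "1080": (8970.0, 242.0),
    "2060s": (8263.0, 396.0),
}


def limiting_factor(flops, bytes_moved, peak_gflops, peak_gbs):
    """Return 'compute' or 'bandwidth', whichever lower-bound time is larger."""
    compute_time = flops / (peak_gflops * 1e9)    # seconds at peak arithmetic rate
    memory_time = bytes_moved / (peak_gbs * 1e9)  # seconds at peak memory rate
    return "compute" if compute_time >= memory_time else "bandwidth"


# Hypothetical workloads in float32 (4 bytes per value).
n = 1024
matmul_flops = 2 * n ** 3        # one multiply and one add per inner-product term
matmul_bytes = 4 * 3 * n * n     # read two matrices, write one

elems = 64 * 64 * 56 * 56        # batch x channels x height x width
relu_flops = elems               # roughly one operation per element
relu_bytes = 4 * 2 * elems       # read each element once, write it once

for name, (gflops, gbs) in GPUS.items():
    print(
        name,
        "matmul:", limiting_factor(matmul_flops, matmul_bytes, gflops, gbs),
        "activation:", limiting_factor(relu_flops, relu_bytes, gflops, gbs),
    )
```

Under these assumptions the matrix multiply comes out compute limited and the activation bandwidth limited on all three cards. The estimate ignores caches and kernel efficiency, so it only indicates which resource dominates.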

The situation becomes even more complex when it comes to the RDNA architecture. AMD has gone more than two years without releasing DL stack support for these GPUs.

Even so, I decided to check it using dlprimitives.

Base Line

Note that we compare 3 different GPUs that have similar performance within reasonable margins.

AMD RX 6600 XT, NVidia GTX 1080, NVidia RTX 2060 Super.

The basic flops performance was measured using a custom kernel.

gpu	GFlops	GB/s
6600xt	9,937	216
1080	8,970	242
2060s	8,263	396

The flops performance of a modern GPU can be calculated as clock * cores * 2. However, the clock depends on the specific model and its thermal performance, so two baselines are used: the manual measurements above, and the theoretical expected flops calculated from the median clock observed during the benchmarks.

gpu	Cores	Clock MHz	Exp GFlops	Exp GB/s
6600xt	2048	2655	10,875	256
1080	2560	1809	9,262	320
2060s	2176	1905	8,290	448
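
As a quick check of the clock * cores * 2 formula, the sketch below recomputes the expected GFlops column from the cores and median clock listed in the table. The function name is made up for illustration, and the reading of the factor 2 as one multiply-add per core per cycle is an assumption; the post only gives the formula.

```python
def expected_gflops(cores, clock_mhz):
    """Theoretical GFlops as clock * cores * 2.

    The factor 2 is assumed to count a multiply-add as two operations.
    MHz * cores * 2 gives MFlops, so divide by 1000 to get GFlops.
    """
    return cores * clock_mhz * 2 / 1000.0


# Cores and median clock (MHz) as listed in the table above.
SPECS = {
    "6600xt": (2048, 2655),
    "1080": (2560, 1809),
    "2060s": (2176, 1905),
}

for gpu, (cores, clock_mhz) in SPECS.items():
    print(f"{gpu}: {expected_gflops(cores, clock_mhz):.1f} GFlops")
```

The results agree with the Exp GFlops column to within 1 GFlops.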

So GPU performance varies. Although the 2060s has 17-24% fewer GFlops than the 6600xt, it has much higher memory throughput, which helps in bandwidth-limited algorithms like batch normalization or the depthwise separable convolutions of mobilenet. The 1080 has 10-15% fewer GFlops than the 6600xt but 12% more measured bandwidth.
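
The percentage ranges above follow from the two tables: the lower end of each range comes from the measured figures and the upper end from the expected ones. A small sketch of that arithmetic, with illustrative helper names:

```python
# GFlops from the two tables above: measured by kernel, and expected from clocks.
MEASURED_GFLOPS = {"6600xt": 9937, "1080": 8970, "2060s": 8263}
EXPECTED_GFLOPS = {"6600xt": 10875, "1080": 9262, "2060s": 8290}
# Measured memory bandwidth in GB/s.
MEASURED_GBS = {"6600xt": 216, "1080": 242, "2060s": 396}


def percent_less(value, reference):
    """How many percent `value` falls below `reference`."""
    return (1 - value / reference) * 100


def percent_more(value, reference):
    """How many percent `value` exceeds `reference`."""
    return (value / reference - 1) * 100


for gpu in ("2060s", "1080"):
    measured = percent_less(MEASURED_GFLOPS[gpu], MEASURED_GFLOPS["6600xt"])
    expected = percent_less(EXPECTED_GFLOPS[gpu], EXPECTED_GFLOPS["6600xt"])
    bandwidth = percent_more(MEASURED_GBS[gpu], MEASURED_GBS["6600xt"])
    print(
        f"{gpu}: {measured:.1f}% to {expected:.1f}% fewer GFlops, "
        f"{bandwidth:.1f}% more measured bandwidth than the 6600xt"
    )
```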

Testing Methodology

Three frameworks were tested with a batch of 64 images:

pytorch/1.8 using cuda+cudnn
keras/tensorflow 2.5 using cuda+cudnn
OpenCL based solution dlprimitives.

Since there is no ROCm version of TF or pytorch that supports AMD's RDNA GPUs, only dlprimitives was tested on the 6600 XT, with the expectation of getting results similar to other GPUs in the same class.
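
The post does not show its benchmark harness, so the following is only a generic sketch of how a "ms per batch" figure can be measured: run a few warm-up steps, then average the wall-clock time over several timed steps. The names `ms_per_batch` and `fake_step` are hypothetical, and `fake_step` is a CPU stand-in for one pass over a batch of 64 images. GPU frameworks usually queue work asynchronously, so a real measurement would also have to wait for the device to finish before reading the clock.

```python
import time


def ms_per_batch(step, warmup=5, iters=20):
    """Average wall-clock milliseconds per call of step()."""
    # Warm-up calls are not timed; they absorb one-time setup costs.
    for _ in range(warmup):
        step()
    start = time.perf_counter()
    for _ in range(iters):
        step()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / iters


def fake_step():
    """Hypothetical stand-in for processing one batch of 64 images."""
    return sum(i * i for i in range(10000))


print(f"{ms_per_batch(fake_step):.3f} ms per batch")
```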

Training Times

Measured in ms per batch, lower is better.

Framework	gpu	alexnet	resnet18	resnet50	vgg16	mobilenet
dlprim	6600xt	83.73	231.2	716.2	1157.2	414.35
dlprim	1080	93.03	262.1	926.6^	1348.9	614.02
dlprim	2060s	116.41	252.3	705.2^	1681.3	355.21
keras/tf2	1080	70.56	200.6	684.4^	633.1	437.84
keras/tf2	2060s	70.00	172.2	520.0^	553.1	344.55
pytorch	1080	62.37	151.4	518.0	780.9	229.20
pytorch	2060s	41.11	121.2	377.8	621.1^	143.23

^) Run as two half batches of 32, due to GPU memory limits

Observations:

DLPrimitives has 67% of Tensorflow performance on NVidia GPUs. The biggest difference is in VGG; without VGG, it reaches 75% of TF performance.
TF has 77% of pytorch performance. VGG is the outlier: it is the only network where TF is faster than pytorch. Without VGG, TF drops to 67% of pytorch performance.
DLPrimitives runs 24% faster on the AMD RX 6600 XT than on the GTX 1080, although the raw GFlops power differs by only 10-17%, depending on the measurement strategy.
DLPrimitives runs 15% faster on the AMD RX 6600 XT than on the RTX 2060S. Notably, the major drop happens on mobilenet, which is highly dependent on memory bandwidth because of its depthwise separable convolutions.
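
The post does not say how these percentages were averaged. As an assumption, the sketch below takes the plain arithmetic mean of the per-network time ratios from the training table; that lands within about one percentage point of each figure quoted above, but it may not be the exact method used.

```python
from statistics import mean

NETS = ["alexnet", "resnet18", "resnet50", "vgg16", "mobilenet"]

# Training times in ms per batch, copied from the table above.
TIMES = {
    ("dlprim", "6600xt"): [83.73, 231.2, 716.2, 1157.2, 414.35],
    ("dlprim", "1080"): [93.03, 262.1, 926.6, 1348.9, 614.02],
    ("dlprim", "2060s"): [116.41, 252.3, 705.2, 1681.3, 355.21],
    ("keras/tf2", "1080"): [70.56, 200.6, 684.4, 633.1, 437.84],
    ("keras/tf2", "2060s"): [70.00, 172.2, 520.0, 553.1, 344.55],
    ("pytorch", "1080"): [62.37, 151.4, 518.0, 780.9, 229.20],
    ("pytorch", "2060s"): [41.11, 121.2, 377.8, 621.1, 143.23],
}


def mean_ratio(pairs, skip=()):
    """Arithmetic mean of time[num] / time[den] over all pairs and networks."""
    ratios = [
        TIMES[num][i] / TIMES[den][i]
        for num, den in pairs
        for i, net in enumerate(NETS)
        if net not in skip
    ]
    return mean(ratios)


NVIDIA = ["1080", "2060s"]
# Relative performance = time of the faster reference / time of the candidate.
dlprim_vs_tf = [(("keras/tf2", g), ("dlprim", g)) for g in NVIDIA]
tf_vs_pytorch = [(("pytorch", g), ("keras/tf2", g)) for g in NVIDIA]

print(f"dlprim vs TF:          {mean_ratio(dlprim_vs_tf):.1%}")
print(f"dlprim vs TF, no VGG:  {mean_ratio(dlprim_vs_tf, skip=('vgg16',)):.1%}")
print(f"TF vs pytorch:         {mean_ratio(tf_vs_pytorch):.1%}")
print(f"TF vs pytorch, no VGG: {mean_ratio(tf_vs_pytorch, skip=('vgg16',)):.1%}")

# Speedup of the 6600 XT = time on the other card / time on the 6600 XT.
for other in NVIDIA:
    speedup = mean_ratio([(("dlprim", other), ("dlprim", "6600xt"))])
    print(f"6600xt vs {other}: {speedup - 1:.1%} faster")
```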

Inference Times