Hello Pytorch OpenCL

Friday, October 8, 2021, by artyom ; Posted in: Internals; 0 comments

TL;DR: I managed to run inference of alexnet using an OpenCL/DLPrimitives based pytorch backend!


I started from this tutorial to implement an out-of-source backend for pytorch. It wasn't that simple, and I had to make small changes in the original pytorch source code, but finally something is working.

So far I have implemented only a handful of ops, mostly for forward computations: github backend code. However, I managed to run the forward computation and get correct results on a pretrained alexnet.

$ python --model alexnet --device cuda *.ppm
dog.ppm,207,golden retriever
parrot.ppm,87,African grey
$ python --model alexnet --device opencl:1 *.ppm 
Accessing device #1:GeForce GTX 960 on NVIDIA CUDA
dog.ppm,207,golden retriever
parrot.ppm,87,African grey

Performance for this tiny task isn't brilliant, but not horrible either. GTX 960, alexnet, batch of 16 images of 224x224:

  • Pytorch Cuda/CUDNN: 15.317 ms - updated 2021-10-10
  • Pytorch OpenCL/DLPrimitives: 22.932 ms - updated 2021-10-10
  • DLPrim - microframework: 22.401 ms
  • Caffe/CuDNN: 16.1812 ms
  • Caffe/OpenCL: 41.072 ms
  • Caffe/OpenCL+DLPrimitives: 28.618 ms
  • Keras/CuDNN: 23.341 ms
  • Keras/PlaidML: 44.041 ms

Now, one of the issues that I currently have is synchronous execution, which gives a significant penalty for every operation. I need to understand asynchronous execution and memory management before I continue. The penalty for the NVidia OpenCL backend isn't horrible, but it is devastating for the AMD OpenCL driver. I need to dive in.

Stay tuned.

Edit, Oct 10, 2021

I found a way to implement asynchronous execution + initial GPU memory caching. That brought the performance of pytorch OpenCL to the same level as vanilla dlprimitives. This also solved the performance issues I had with the AMD GPU.
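To illustrate the idea of GPU memory caching, here is a minimal sketch in plain Python. It is not the backend's actual code: `CachingAllocator`, `raw_alloc` and the exact-size pooling policy are all assumptions made for illustration, and a `bytearray` stands in for a device buffer. The point is only that released buffers are kept in a pool and reused, so the expensive driver allocation happens only on a cache miss.

```python
from collections import defaultdict


class CachingAllocator:
    """Toy caching allocator: reuses released buffers of the same size.

    Hypothetical illustration only; a real backend would wrap OpenCL
    buffer objects instead of bytearrays.
    """

    def __init__(self, raw_alloc=bytearray):
        self._raw_alloc = raw_alloc      # stands in for the expensive driver call
        self._free = defaultdict(list)   # size -> list of cached buffers
        self.raw_calls = 0               # how many times the "driver" was hit

    def allocate(self, size):
        if self._free[size]:
            return self._free[size].pop()    # cache hit: no driver call
        self.raw_calls += 1
        return self._raw_alloc(size)

    def release(self, buf):
        # Keep the buffer for later reuse instead of returning it to the driver.
        self._free[len(buf)].append(buf)


allocator = CachingAllocator()
for _ in range(100):                     # e.g. 100 forward passes
    activations = allocator.allocate(1024)
    allocator.release(activations)
print("raw allocations:", allocator.raw_calls)
```

Because every iteration requests the same size, only the first request reaches the underlying allocator; the remaining ones are served from the pool.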

Additionally, I found that I didn't take host-to-device transfer into account in the pytorch benchmarks, so the original CUDA run time for pytorch increased.


Sunday, September 19, 2021, by artyom ; 0 comments

DLPrimitives already gives promising results... But I'm really wondering what to prioritize:

  1. Add more useful operators (dropout, upscale, lstm, prelu, mse-loss etc) to make DLPrimitives fully featured?
  2. Try to improve existing OpenCL frameworks like Caffe (or PlaidML) by using DLPrimitives core operations?
  3. Start working on a pytorch OpenCL backend, which is a huge undertaking?
  4. Work on support of float16/bfloat16?
  5. Continue improving performance by integrating with open source implementations for Arm-Mali, Intel?

Every task is important.

It is logical to add more operators so that DLPrimitives as a DL framework can be useful for real world tasks. It can be done relatively fast since most operators aren't that complex.

But in order to make it really useful (and not niche) it needs to be integrated into at least one of the popular frameworks like Pytorch, TF or Mxnet. On the other hand, implementing a pytorch backend is a huge task that will take lots of time, but it is actually the true goal.

I can go with improving Caffe-OpenCL, where I mostly need to fix several performance critical layers by using dlprimitives... ahhh, and fix Caffe memory management, since Keras/PT use 1/4 of the memory Caffe uses. It can be a good POC, but Caffe is actually dead, and I already have a working POC.

Hard to decide.

Documentation is Online

Thursday, September 16, 2021, by artyom ; Posted in: Releases; 0 comments

I published recent documentation online:

Why do we need OpenCL based deep learning?

Sunday, September 12, 2021, by artyom ; Posted in: Internals; 0 comments

Why do we need OpenCL based solution for deep learning?

NVidia provides high performance tools like cuDNN and TensorRT that power the AI industry running on the CUDA API. AMD does the same for their compute cards with ROCm/MIOpen, using their own CUDA clone called "hip".

Why should we care? We have a working solution, so why do we need to reimplement something that already exists with something that is likely going to have lower performance?

The problem is somewhat deeper than that. I'll talk about 3 points.

  1. We need open source high performance low level algorithms especially for research purposes.
  2. We need to use a unified GPU API that is standard, open and vendor independent.
  3. We need worthy competition that would lead to better and more affordable products.

Open source algorithms are mandatory

A good example is the convolution algorithm. At the beginning of the DL boom the most common approach was to run im2col combined with GEMM, which gave decent performance thanks to the very efficient cuBLAS library.

However, it was found that it is much better to merge the two GPU kernels, one that converts the image to a matrix and another that computes the matrix multiplication, into a single kernel. But if you want to implement this technique and check its efficiency, you need to have an efficient implementation of the core matrix multiplication algorithm in the first place. Although matrix multiplication seems like a simple task, it is in fact one of the most complicated algorithms to implement on a GPU.
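For readers unfamiliar with the im2col + GEMM approach, here is a minimal pure-Python sketch for a single input channel. It is an illustration only: the function names are made up for this example, there is no padding or stride, and a real implementation runs both steps as GPU kernels (or as one merged kernel, as described above).

```python
def im2col(image, k):
    """Turn every k x k patch of a 2-D image into one row of a matrix."""
    h, w = len(image), len(image[0])
    return [[image[y + dy][x + dx] for dy in range(k) for dx in range(k)]
            for y in range(h - k + 1) for x in range(w - k + 1)]


def matmul(a, b):
    """Naive GEMM: (m x n) times (n x p)."""
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*b)]
            for row in a]


def conv2d_im2col(image, kernels):
    """'Valid' convolution (cross-correlation, as in DL frameworks) via GEMM."""
    k = len(kernels[0])
    patches = im2col(image, k)                  # (out_h * out_w) x (k * k)
    # Each kernel becomes one column of the weight matrix: (k * k) x channels
    weights = [[kern[dy][dx] for kern in kernels]
               for dy in range(k) for dx in range(k)]
    return matmul(patches, weights)             # (out_h * out_w) x channels


def conv2d_direct(image, kernel):
    """Reference implementation: slide the kernel over the image."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    return [sum(image[y + dy][x + dx] * kernel[dy][dx]
                for dy in range(k) for dx in range(k))
            for y in range(h - k + 1) for x in range(w - k + 1)]


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernels = [[[1, 0], [0, 1]],    # output channel 0
           [[0, 1], [1, 0]]]    # output channel 1

result = conv2d_im2col(image, kernels)
for c, kern in enumerate(kernels):
    # Column c of the GEMM result equals the direct convolution with kernel c.
    assert [row[c] for row in result] == conv2d_direct(image, kern)
```

The `patches` matrix duplicates every pixel up to k * k times, which is exactly the extra memory traffic that the merged kernel avoids.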

If a researcher wants to add a new method of convolution, let's say dilation with variable steps or any other kind of modification, it wouldn't be possible to do this efficiently without easily available source code. On the other hand, if a researcher wants to improve some performance aspect of the Conv+GEMM algorithm, they need to reimplement what NVidia engineers did in assembly to be able to compare the new implementation to what is done in cuDNN.

I would say a good example of such a case is Winograd convolution. There is a widely cited paper that describes the algorithm. In reality, it gives virtually no details about a specific implementation. How do you load/store data to shared memory without bank conflicts? How do you perform the computations? How do you store and convert results between shared memory and main memory? And so on. I personally attempted to implement it several times and couldn't even reach the performance of GEMM based convolution.

Only this paper from 2020 cleared up many low level details and allowed me to write a relatively efficient kernel. It still does not reach the performance of the cuDNN implementation, but it is way better than GEMM based convolution.
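To show what the algorithm itself looks like, as opposed to the GPU-level details discussed above, here is a sketch of the textbook 1-D Winograd minimal filtering step F(2,3): two outputs of a 3-tap filter computed with 4 multiplications instead of 6. The function names are illustrative, and this says nothing about shared memory layout or bank conflicts, which is where the real difficulty lies.

```python
def winograd_f23(d, g):
    """Two outputs of a 3-tap filter g over 4 inputs d, using 4 multiplications."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform (can be precomputed once per filter)
    u0 = g0
    u1 = (g0 + g1 + g2) / 2
    u2 = (g0 - g1 + g2) / 2
    u3 = g2
    # Input transform (additions only)
    v0 = d0 - d2
    v1 = d1 + d2
    v2 = d2 - d1
    v3 = d1 - d3
    # Element-wise products: the only multiplications involving the data
    m0, m1, m2, m3 = u0 * v0, u1 * v1, u2 * v2, u3 * v3
    # Output transform (additions only)
    return [m0 + m1 + m2, m1 - m2 - m3]


def direct_f23(d, g):
    """Reference: sliding dot product, 6 multiplications."""
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]


data = [1.0, 2.0, 3.0, 4.0]
taps = [0.5, -1.0, 2.0]
fast = winograd_f23(data, taps)
slow = direct_f23(data, taps)
assert all(abs(a - b) < 1e-9 for a, b in zip(fast, slow))
```

The 2-D variants used for 3x3 convolutions nest this transform in both dimensions; the saving in multiplications is paid for with extra additions and more complicated data movement.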

Unfortunately, this is a typical case. If tomorrow somebody wants to implement an additional variant of convolution or any other algorithm and prove its efficiency, they first need to beat the performance of unknown, highly optimised code.

Thus having open source low level algorithms is critical for research and for the field itself. It is very clear to me why NVidia does not want to release the source code: it contradicts their interest as a monopoly in the field.

However, the community that works in the field is hurt by this policy. We need faster and more efficient software to solve real world problems. Without the source code of core components, progress in the field may be much slower.

Open and Standardised API

I recently discovered that I can't run code that was compiled with cuda10 and cudnn7 on the RTX 30 series of GPUs. I didn't want any of the new GPU features, I just wanted to run the program. I couldn't. I needed to rebuild the software from scratch with a new API.

What do you think would happen in the gaming industry if NVidia told the game developers who released their games a year ago to rebuild them from scratch to be able to run on an RTX 3060?

It isn't only user unfriendly, it is also sometimes impossible because the team isn't actively developing the product any more and has moved to a new one.

Of course this does not happen in the gaming world. There are standard open or proprietary APIs: OpenGL, Vulkan and Direct3D. To use a new card you need to install a driver. The rest is the same. Today you can run a classic old game like Jane's IAF: Israeli Air Force on Windows 10 on modern hardware.

There is a standard GPU compute API: OpenCL. It works very well, and it works on virtually every modern GPU from the day of its release: NVidia, AMD, Intel. It even runs on your Android phone's GPU.

Kernels that are written for OpenCL or CUDA can be mechanically converted between the platforms almost as is. Virtually all concepts of OpenCL and CUDA are identical, and the code is similar.
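As a rough illustration of how mechanical such a conversion can be, the sketch below rewrites a trivial OpenCL kernel into CUDA with a handful of textual substitutions. The substitution table is a small illustrative subset chosen for this example (1-D indexing only, no global work offset); it is not a complete or production-ready converter, and `opencl_to_cuda` is a made-up helper name.

```python
import re

# Textual OpenCL -> CUDA substitutions. Only a small, illustrative subset.
OPENCL_TO_CUDA = [
    (r"\b__kernel\b", 'extern "C" __global__'),
    (r"\b__global\s+", ""),              # CUDA pointers need no address-space qualifier
    (r"\b__local\b", "__shared__"),
    (r"\bget_global_id\(0\)", "(blockIdx.x * blockDim.x + threadIdx.x)"),
    (r"\bget_local_id\(0\)", "threadIdx.x"),
    (r"\bbarrier\(CLK_LOCAL_MEM_FENCE\)", "__syncthreads()"),
]

OPENCL_RELU = """
__kernel void relu(__global const float *x, __global float *y, int n)
{
    int i = get_global_id(0);
    if (i < n)
        y[i] = x[i] > 0.0f ? x[i] : 0.0f;
}
"""


def opencl_to_cuda(source):
    """Apply the substitution table in order and return the CUDA source."""
    for pattern, replacement in OPENCL_TO_CUDA:
        source = re.sub(pattern, replacement, source)
    return source


cuda_src = opencl_to_cuda(OPENCL_RELU)
print(cuda_src)
```

The kernel body is untouched; only the qualifiers and the index computation change, which is the sense in which the concepts of the two APIs map onto each other.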

This is highly critical for software maintainability and long-term support.

Of course, there are many cases where you can't have the same kernel running equally efficiently on AMD, NVidia, Intel and Mali GPUs. It is true for GEMM and convolution kernels, but it is true to a much lesser degree for the vast majority of other kernels that do basic but important stuff. Kernels for activation, normalization, element-wise operators and many, many others can run the same code on different platforms with only minute tweaks.

So as long as you have optimised kernels for the critical parts, the rest can be shared across platforms. It can be compared to writing code in C or C++ for all platforms like Intel, ARM and MIPS, where only small computationally intensive parts are implemented using platform specific assembly or intrinsics.

Finally, if you look at the gaming industry, which deals with GPUs on a daily basis, you clearly understand how ridiculous the requirement of rebuilding your entire code base upon the arrival of a new GPU generation is. But today it is the sad reality.

Open standards are always better than the best proprietary ones. Even if you lose something now, you win in the long run.

Monopoly Issue

It isn't a secret that NVidia controls virtually all of the deep-learning and GPU-compute market. It goes to such a level that AMD created "hip", virtually a reimplementation of the cuda API with a quick s/cuda/hip/ replacement, just to be able to run NVidia's code. As long as a vendor specific API is used, you will remain under the control of this vendor.

It isn't the case in gaming. AMD, Intel and NVidia always try to challenge each other to one degree or another, while we as customers enjoy better and more affordable products and new features like ray tracing.

It isn't the case for deep learning. NVidia can charge a premium for their "professional" grade devices that provide virtually the same silicon. Why? Because they really can. If I were NVidia's CEO I would do the same.

Just to make it clear, NVidia is currently a superior DL platform to AMD and Intel, mostly due to the absolute neglect of software by AMD and the lack of any kind of powerful GPUs from Intel.

But even if you rightfully use NVidia, relying on an open API and open source libraries will allow you both to challenge NVidia, which does not take competition seriously, and to have much less of a headache when NVidia releases a new GPU.

And, more importantly, you are empowered to choose whatever product you or your customer wants or has, and not be limited to a single vendor.

Additional Points

Why not SYCL?

The answer is actually very simple. There are several points that make OpenCL superior:

  1. The fact that the kernel source code is separate from the C++ code actually makes it much more portable. Mixing C++ and GPU code the way it is done in CUDA is nice to start with, but brings lots of issues in the long term, when you need to compile for numerous platforms and combinations.
  2. Many SYCL implementations target one platform only instead of being cross platform.
  3. There is still no serious working open source SYCL implementation, while OpenCL is working today on virtually every device.

Why wouldn't AMD or Intel create an alternative?

Good question. And this should be addressed to them.

AMD created hip and MIOpen, which aren't even compatible with their own RDNA hardware, started dropping support for older GCN cards (like the rx580), and are limited to Linux only. Their investment in deep learning is minimal to none.

Intel played along and actually created OpenCL implementations. However, their product oneDNN is optimised for Intel and wraps cuDNN for NVidia.

Bottom line, each vendor cares about its own interests. Probably they decided that this is a lost battle.

However, the good thing is that both AMD and Intel published almost everything as open source, which makes it easier to use their work in critical training paths in the future.

Comparing Green and Red Apples

Friday, September 10, 2021, by artyom ; Posted in: Benchmarks; 0 comments


  • OpenCL based DLPrimitives is almost as fast as TF based on cuDNN in inference and close enough in training.
  • Framework Matters - TF is much slower than pytorch.
  • AMD 6600 XT is faster than NVidia 1080 and 2060S by a margin that is similar to the difference in GFlops of these cards.

Although dlprimitives isn't as fast as the best cuDNN based solution, pytorch, its performance makes it more than useful for platform independent deep learning.

How to Compare Different GPUs

Comparing deep learning software performance on NVidia and AMD GPUs isn't as simple as you may think.

There are too many factors:

  1. No two GPUs have identical specs. The major parameters are GFlops and memory bandwidth, as most DL algorithms are either compute limited (like dense and conv layers) or bandwidth limited (like batch normalization or activation).
  2. Both companies provide libraries optimized for their GPUs: MIOpen and cuDNN. While they are highly optimized and provide similar functionality, they don't have similar performance.
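The compute limited versus bandwidth limited distinction can be sketched with a simple roofline-style estimate. The code below is an idealised model, not a measurement: it assumes every tensor is read or written exactly once, the layer shapes are hypothetical, and the peak figures are the measured 6600xt numbers quoted in the Base Line section of this post.

```python
def roofline_time_ms(flops, bytes_moved, peak_gflops, peak_gbs):
    """Lower bound on run time and the resource that sets it."""
    compute_ms = flops / (peak_gflops * 1e9) * 1e3
    memory_ms = bytes_moved / (peak_gbs * 1e9) * 1e3
    bound = "compute" if compute_ms >= memory_ms else "bandwidth"
    return bound, max(compute_ms, memory_ms)


PEAK_GFLOPS, PEAK_GBS = 9937, 216      # measured 6600xt figures from this post
FLOAT_BYTES = 4

# Hypothetical activation tensor: batch 64, 64 channels, 56x56 pixels
batch, channels, height, width = 64, 64, 56, 56
elements = batch * channels * height * width

# ReLU: one operation per element, one read and one write per element
relu = roofline_time_ms(elements,
                        2 * elements * FLOAT_BYTES,
                        PEAK_GFLOPS, PEAK_GBS)

# 3x3 convolution, 64 -> 64 channels: 2 flops (multiply + add) per tap
conv_flops = 2 * elements * channels * 3 * 3
conv_bytes = (2 * elements + channels * channels * 3 * 3) * FLOAT_BYTES
conv = roofline_time_ms(conv_flops, conv_bytes, PEAK_GFLOPS, PEAK_GBS)

print("relu:", relu)
print("conv:", conv)
```

Under these assumptions the activation is limited by memory bandwidth and the convolution by arithmetic throughput, which is why both GFlops and GB/s matter when comparing cards.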

Now the situation becomes even more complex when it comes to RDNA architecture. AMD hasn't released support of their DL stack for these GPUs for more than two years.

Nevertheless, I decided to check it using dlprimitives.

Base Line

Note that we compare 3 different GPUs that have similar performance within reasonable margins.

AMD RX 6600 XT, NVidia GTX 1080, NVidia RTX 2060 Super.

The basic flops performance was measured using a custom kernel.

gpu    | GFlops | GB/s
-------|--------|-----
6600xt | 9,937  | 216
1080   | 8,970  | 242
2060s  | 8,263  | 396

Flops performance of modern GPUs can be calculated as clock * cores * 2. However, the clock depends on the specific model and its thermal performance, so two base lines are used: the manual measurements, and the theoretical expected flops calculated using the median clock observed during the benchmarks.

gpu    | Cores | Clock MHz | Exp GFlops | Exp GB/s
-------|-------|-----------|------------|---------
6600xt | 2048  | 2655      | 10,875     | 256
1080   | 2560  | 1809      | 9,262      | 320
2060s  | 2176  | 1905      | 8,290      | 448
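The expected GFlops column follows directly from the clock * cores * 2 formula, where the factor 2 counts a fused multiply-add as two floating point operations. The short sketch below recomputes it from the cores and clock values in the table and compares it with the measured figures; the function and variable names are only for illustration.

```python
def expected_gflops(cores, clock_mhz):
    """Theoretical peak: 2 flops (one multiply-add) per core per cycle."""
    return cores * clock_mhz * 2 / 1000.0


# (cores, median clock in MHz) and measured GFlops, copied from the tables above
specs = {"6600xt": (2048, 2655), "1080": (2560, 1809), "2060s": (2176, 1905)}
measured = {"6600xt": 9937, "1080": 8970, "2060s": 8263}

for gpu, (cores, clock) in specs.items():
    theoretical = expected_gflops(cores, clock)
    efficiency = measured[gpu] / theoretical
    print(f"{gpu}: expected {theoretical:.0f} GFlops, "
          f"measured {efficiency:.0%} of expected")
```

The measured values sit somewhat below the theoretical ones for every card, which is why both are used as base lines.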

So GPU performance varies. Although the 2060s has 17-24% fewer GFlops than the 6600xt, it has much higher memory throughput, which helps in bandwidth limited algorithms like batch normalization or the depthwise separable convolutions of mobilenet. The 1080 has 10-15% lower GFlops but 12% more bandwidth.

Testing Methodology

Three frameworks were tested using a batch of 64 images:

  1. pytorch/1.8 using cuda+cudnn
  2. keras/tensorflow 2.5 using cuda+cudnn
  3. OpenCL based solution dlprimitives.

Since there is no ROCm version of TF or Pytorch that supports AMD's RDNA GPUs, only dlprimitives was tested on the 6600 XT, expecting to get similar results to other GPUs in the same class.

Training Times

Measured in ms per batch, lower is better.

Framework | gpu    | alexnet | resnet18 | resnet50 | vgg16   | mobilenet
----------|--------|---------|----------|----------|---------|----------
dlprim    | 6600xt | 83.73   | 231.2    | 716.2    | 1157.2  | 414.35
dlprim    | 1080   | 93.03   | 262.1    | 926.6^   | 1348.9  | 614.02
dlprim    | 2060s  | 116.41  | 252.3    | 705.2^   | 1681.3  | 355.21
keras/tf2 | 1080   | 70.56   | 200.6    | 684.4^   | 633.1   | 437.84
keras/tf2 | 2060s  | 70.00   | 172.2    | 520.0^   | 553.1   | 344.55
pytorch   | 1080   | 62.37   | 151.4    | 518.0    | 780.9   | 229.20
pytorch   | 2060s  | 41.11   | 121.2    | 377.8    | 621.1^  | 143.23

^) Using half batch x32 twice, due to GPU memory limits


  1. DLPrimitives has 67% of Tensorflow performance on NVidia GPUs. The biggest difference was in VGG. Comparison without VGG gives 75% of TF performance.
  2. TF has 77% of pytorch performance. The biggest difference is in VGG. Without VGG the gap widens: TF has 67% of pytorch performance.
  3. DLPrimitives runs 24% faster on the AMD RX 6600 XT than on the GTX 1080, while the raw GFlops power differs by 10-17% depending on the measurement strategy.
  4. DLPrimitives runs 15% faster on the AMD RX 6600 XT than on the RTX 2060S. It is noticeable that the major drop happens on mobilenet, which is highly dependent on memory bandwidth because of its depth-wise separable convolutions.
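The post does not say how these percentages were aggregated. One reading that lands close to the quoted figures is the arithmetic mean of per-network time ratios over both NVidia GPUs; the sketch below recomputes the first two items under that assumption, using the training times from the table above. The averaging method and the helper name `relative_performance` are assumptions, not something stated by the benchmark itself.

```python
from statistics import mean

NETS = ["alexnet", "resnet18", "resnet50", "vgg16", "mobilenet"]

# Training times in ms per batch on NVidia GPUs, copied from the table above.
TRAIN_MS = {
    ("dlprim", "1080"): [93.03, 262.1, 926.6, 1348.9, 614.02],
    ("dlprim", "2060s"): [116.41, 252.3, 705.2, 1681.3, 355.21],
    ("keras/tf2", "1080"): [70.56, 200.6, 684.4, 633.1, 437.84],
    ("keras/tf2", "2060s"): [70.00, 172.2, 520.0, 553.1, 344.55],
    ("pytorch", "1080"): [62.37, 151.4, 518.0, 780.9, 229.20],
    ("pytorch", "2060s"): [41.11, 121.2, 377.8, 621.1, 143.23],
}


def relative_performance(framework, reference, skip=()):
    """Mean over GPUs and networks of reference_time / framework_time."""
    ratios = []
    for gpu in ("1080", "2060s"):
        rows = zip(NETS, TRAIN_MS[(framework, gpu)], TRAIN_MS[(reference, gpu)])
        ratios += [t_ref / t_fw for net, t_fw, t_ref in rows if net not in skip]
    return mean(ratios)


dlprim_vs_tf = relative_performance("dlprim", "keras/tf2")
dlprim_vs_tf_no_vgg = relative_performance("dlprim", "keras/tf2", skip=("vgg16",))
tf_vs_pytorch = relative_performance("keras/tf2", "pytorch")
tf_vs_pytorch_no_vgg = relative_performance("keras/tf2", "pytorch", skip=("vgg16",))

print(f"dlprim vs TF:      {dlprim_vs_tf:.1%} (without VGG {dlprim_vs_tf_no_vgg:.1%})")
print(f"TF vs pytorch:     {tf_vs_pytorch:.1%} (without VGG {tf_vs_pytorch_no_vgg:.1%})")
```

With this averaging the results agree with the quoted 67%/75% and 77%/67% to within about one percentage point, so the method is at least consistent with the table.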

Inference Times

Measured in ms per batch, lower is better.

Framework | gpu    | alexnet | resnet18 | resnet50 | vgg16  | mobilenet
----------|--------|---------|----------|----------|--------|----------
dlprim    | 6600xt | 34.28   | 63.57    | 185.72   | 277.97 | 102.84
dlprim    | 1080   | 28.03   | 63.57    | 274.27   | 309.28 | 131.74
dlprim    | 2060s  | 47.52   | 81.09    | 210.97   | 428.34 | 97.80
keras/tf2 | 1080   | 40.55   | 80.64    | 199.38   | 189.07 | 109.85
keras/tf2 | 2060s  | 47.95   | 75.73    | 165.31   | 174.27 | 93.01
pytorch   | 1080   | 16.36   | 43.17    | 144.88   | 226.40 | 60.13
pytorch   | 2060s  | 9.65    | 33.27    | 107.56   | 172.47 | 35.55


  1. DLPrimitives has 90% of Tensorflow performance on NVidia GPUs. The biggest difference was in VGG. Comparison without VGG gives 99% of TF performance.
  2. TF has 61% of pytorch performance. The biggest difference is in VGG. Without VGG the gap widens: TF has 49% of pytorch performance.
  3. DLPrimitives runs 14% faster on the AMD RX 6600 XT than on the GTX 1080, and 26% faster than on the RTX 2060S. This is somewhat different from the training results.

Summary and Conclusions

  1. There is a huge difference between DL frameworks. Pytorch is faster than TensorFlow by a large margin.
  2. DLPrimitives provides decent performance that is comparable to TF (losing ~25% of performance in training and 10% in inference).
  3. It seems that the 6600XT gives decent performance for dlprimitives, comparable to that of the NVidia 1080/2060s, with a performance improvement that is comparable to the difference in GFlops.
