Attempt to integrate with OneDNN

Sunday, November 21, 2021, by artyom ; Posted in: Internals; 0 comments

Intel's OneDNN is a great project that provides cuDNN-like inference and training tools for Intel's GPUs.

Although it is called OneDNN, it should really be called IntelDNN, since it supports only Intel GPUs and CPUs.

Bottom line: I tried to add OneDNN-based convolutions for Intel GPUs, only to discover that my simple GEMM-based convolution works better. Why? Apparently Intel's implementation is optimized for the channels last format only.

A simple convolution with a 3x3 kernel, 64 input and 64 output channels, and an image dimension of 56, run on an Intel HD 530 with a capacity of 400 GFlops, gives:

  • 295.6 GFlops for OneDNN's channels last format
  • 144.7 GFlops for dlprimitives' channels first format
  • 33.4(!) GFlops for OneDNN's channels first format.
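
To put these figures in perspective, the sketch below counts the floating-point work of the benchmarked layer and expresses each measured rate as a fraction of the 400 GFlops peak. It assumes the usual convention of two operations per multiply-accumulate and an output that stays 56x56; neither is stated above, and the function names are only illustrative.

```python
PEAK_GFLOPS = 400.0  # Intel HD 530 capacity quoted above

# Measured rates quoted above
MEASURED_GFLOPS = {
    "OneDNN channels last": 295.6,
    "dlprimitives channels first": 144.7,
    "OneDNN channels first": 33.4,
}


def conv_flops(batch, in_ch, out_ch, kernel, out_h, out_w):
    """Forward FLOPs of a dense 2D convolution (assumes 2 ops per multiply-add)."""
    return 2 * batch * out_ch * in_ch * kernel * kernel * out_h * out_w


def utilization(measured_gflops, peak_gflops):
    """Fraction of the peak rate that was actually achieved."""
    return measured_gflops / peak_gflops


flops = conv_flops(1, 64, 64, 3, 56, 56)
print(f"{flops / 1e9:.3f} GFLOP per image")
for name, gflops in MEASURED_GFLOPS.items():
    print(f"{name}: {100 * utilization(gflops, PEAK_GFLOPS):.1f}% of peak")
```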

The problem is that channels first is the most common format, used by pytorch, mxnet, caffe and many other tools (including dlprimitives).

OK... I'll check it again later when one of two things happens:

  1. They fix channels first performance
  2. I support the channels last format internally
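
The two layouts differ only in how a logical element (n, c, h, w) maps to a flat memory offset. A minimal sketch of that mapping, with illustrative function names that are not part of OneDNN or dlprimitives:

```python
def offset_channels_first(n, c, h, w, C, H, W):
    """NCHW: for a fixed channel, neighbouring pixels are contiguous."""
    return ((n * C + c) * H + h) * W + w


def offset_channels_last(n, c, h, w, C, H, W):
    """NHWC: for a fixed pixel, all channels are contiguous."""
    return ((n * H + h) * W + w) * C + c


C, H, W = 64, 56, 56

# Distance in elements between channel 0 and channel 1 of the same pixel
stride_first = (offset_channels_first(0, 1, 0, 0, C, H, W)
                - offset_channels_first(0, 0, 0, 0, C, H, W))
stride_last = (offset_channels_last(0, 1, 0, 0, C, H, W)
               - offset_channels_last(0, 0, 0, 0, C, H, W))
print(stride_first, stride_last)
```

With channels first, moving to the next channel of the same pixel jumps over a whole H x W plane; with channels last it is the adjacent element, so a kernel tuned for one access pattern does not automatically perform well on the other.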

Pytorch Updates

Tuesday, November 16, 2021, by artyom ; Posted in: Internals; 0 comments

To make faster progress, I started validating all pretrained torchvision models one by one. I found several features I needed to implement and, more importantly, several critical bugs that I could fix.

At this point the following networks are validated against the CPU version in both forward and backward propagation:

  • alexnet
  • resnet18
  • resnet50
  • vgg16
  • densenet161
  • googlenet
  • squeezenet1_0
  • inception_v3 (fwd only - backward fails on cuda/cpu)
  • shufflenet_v2_x1_0
  • mobilenet_v2
  • mobilenet_v3_large
  • mobilenet_v3_small (fwd only - same failure on bwd on cuda)
  • resnext50_32x4d
  • wide_resnet50_2
  • mnasnet1_0
  • efficientnet_b0
  • efficientnet_b4
  • regnet_y_400mf

To be continued...

Update Nov 17, 2021: I implemented the ceil rounding pooling mode, so googlenet and squeezenet1_0 now pass validation.
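
The difference between floor and ceil rounding only shows up in the output size of a pooling layer. The sketch below follows the output-size rule that PyTorch documents for its pooling layers (without dilation); it is not taken from the dlprimitives source, and the function name is hypothetical.

```python
def pool_output_size(size, kernel, stride, pad=0, ceil_mode=False):
    """Output length of a pooling window along one dimension."""
    span = size + 2 * pad - kernel
    if ceil_mode:
        out = -(-span // stride) + 1  # ceiling division
        # The last window must start inside the input or the left padding
        if (out - 1) * stride >= size + pad:
            out -= 1
    else:
        out = span // stride + 1  # floor division
    return out


# 3x3 window with stride 2 over an input of size 112
print(pool_output_size(112, 3, 2, ceil_mode=False))  # 55
print(pool_output_size(112, 3, 2, ceil_mode=True))   # 56
```

A backend that supports only floor rounding produces a tensor that is one element smaller in such a layer, so every following layer sees a different shape.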

Pointwise Broadcast Reduce

Tuesday, November 16, 2021, by artyom ; Posted in: Internals; 0 comments

Lots of deep learning operations can be implemented as simple element-by-element operations over different tensors with numpy broadcasting and reduction afterwards. For example:

Adding a bias of shape [C] to a [B,C,H,W] image can be written in numpy as:

 x + bias.reshape((C,1,1))

The gradient of the bias can be calculated from the output gradient dy as:

 np.sum(dy, axis=(0, 2, 3))

These are simple reduction operations. Calculating the mean and variance in batch normalisation requires the sums of x and x*x over all dims but C.

Observing this, I implemented a broadcast/reduce template API to simplify development.

The idea is as follows:

  • You provide input tensors and scalar parameters
  • You define the operation that needs to be performed on each operand
  • You provide the reduction operation

The OpenCL kernel code is auto-generated for you. For example, calculating the sums of x and x*x over all dims but channels would look like:

    auto op = dlprim::core::PointwiseOperationBroadcastReduce::create(
                "y0=x0; y1=x0*x0;", // operations
                "reduce_y0 = 0; reduce_y1 = 0", // reduce init
                "reduce_y0 += y0; reduce_y1 += y1"

So the first output is the sum of x and the second is the sum of x*x. If you provide X in the shape [B,C,H,W] and Xsum, X2sum in the shape [C,1,1], which is broadcastable to X, you get the sums you need without writing custom reduction code or writing kernels manually.

This vastly simplified writing multiple operators, especially those that are expected to support numpy-style broadcasting in pytorch.
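
For reference, here is what the same computation looks like in plain numpy. It only mirrors the semantics described above (element-wise stage, then reduction over every axis in which the output is broadcast); the variable names and the small shapes are illustrative, and this is not how the generated OpenCL kernel is organised.

```python
import numpy as np

B, C, H, W = 2, 3, 4, 5
rng = np.random.default_rng(0)
x = rng.standard_normal((B, C, H, W))

# Element-wise stage, the counterpart of "y0=x0; y1=x0*x0;"
y0 = x
y1 = x * x

# Reduction stage: sum over all axes but C, keeping the broadcastable shape [C,1,1]
x_sum = y0.sum(axis=(0, 2, 3)).reshape(C, 1, 1)
x2_sum = y1.sum(axis=(0, 2, 3)).reshape(C, 1, 1)

# The two sums are enough to obtain the batch normalisation statistics
n = B * H * W
mean = x_sum / n
var = x2_sum / n - mean * mean
print(mean.shape, var.shape)
```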

Pytorch Training Benchmarks

Tuesday, October 26, 2021, by artyom ; Posted in: Benchmarks; 0 comments

I managed to train some networks in pytorch with the opencl/dlprimitives backend, and the results are promising.

Below are the results for several Nvidia GPUs and a comparison to a tf2/keras baseline. Unlike in previous benchmarks, I now include the previously missing time of the Adam optimiser, which apparently was significant (it isn't very efficient in pytorch).

I also added times for the AMD 6600xt; unfortunately there is no baseline I can use, since AMD hasn't released ROCm for RDNA yet.

Absolute Performance

Batch size: 16 images 224x224, time in ms. Lower is better.

Framework       Vendor  GPU           alexnet  resnet18  resnet50     vgg16 mobilenet
pytorch/opencl  AMD     rx 6600xt      56.846   109.850   258.973   365.305   163.732
dlprimitives    AMD     rx 6600xt      36.954    65.241   194.398   308.763    99.862
pytorch/cuda    Nvidia  rtx 2060s      27.592    38.624   114.074   179.580    49.624
pytorch/opencl  Nvidia  rtx 2060s      50.108    82.021   223.651   462.964   129.145
dlprimitives    Nvidia  rtx 2060s      39.829    67.960   187.398   439.053    90.229
tf2/cuda        Nvidia  rtx 2060s      29.165    55.523   147.999   156.714   102.596
pytorch/cuda    Nvidia  gtx 1080       38.310    44.382   137.754   232.824    63.324
pytorch/opencl  Nvidia  gtx 1080       54.828    85.016   301.898   411.928   173.885
dlprimitives    Nvidia  gtx 1080       38.804    71.147   264.286   374.168   134.650
tf2/cuda        Nvidia  gtx 1080       35.592    69.071   189.994   197.333   128.526

Relative Performance

Comparison of TF/Cuda with pytorch + opencl/dlprimitives and dlprimitives alone:

Baseline  Tested        GPU           alexnet  resnet18  resnet50     vgg16 mobilenet
tf2/cuda  dlprimitives  gtx 1080          92%       97%       72%       53%       95%
tf2/cuda  pt/opencl     gtx 1080          65%       81%       63%       48%       74%
tf2/cuda  dlprimitives  rtx 2060s         73%       82%       79%       36%      114%
tf2/cuda  pt/opencl     rtx 2060s         58%       68%       66%       34%       79%
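
The percentages are consistent with dividing the baseline time by the tested time, so a value above 100% means the tested backend is faster than tf2/cuda. A short sketch that reproduces the first row from the absolute timings (the helper name is illustrative):

```python
NETWORKS = ["alexnet", "resnet18", "resnet50", "vgg16", "mobilenet"]

# Times in ms on the gtx 1080, copied from the absolute performance table
TF2_CUDA_MS = [35.592, 69.071, 189.994, 197.333, 128.526]
DLPRIMITIVES_MS = [38.804, 71.147, 264.286, 374.168, 134.650]


def relative_performance(baseline_ms, tested_ms):
    """Speed of the tested backend as a fraction of the baseline speed."""
    return baseline_ms / tested_ms


percent = [round(100 * relative_performance(b, t))
           for b, t in zip(TF2_CUDA_MS, DLPRIMITIVES_MS)]
for name, value in zip(NETWORKS, percent):
    print(f"{name}: {value}%")
```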


Apart from VGG, most of the results are very reassuring.


Why do I compare to TF2/cuda as the baseline? Pytorch is the faster framework. However, since TF is good enough for most users, I want to show that I get performance that is close enough.

Hello Pytorch OpenCL

Friday, October 8, 2021, by artyom ; Posted in: Internals; 0 comments

TL;DR: I managed to run an inference of alexnet using OpenCL/DLPrimitives based pytorch backend!


I started from this tutorial to implement an out-of-source backend for pytorch. It wasn't that simple, and I had to make small changes in the original pytorch source code, but finally something is working:

So far I have implemented only a handful of ops, mostly for forward computations: github backend code. However, I managed to do forward computations and get correct results on a pretrained alexnet.

$ python --model alexnet --device cuda *.ppm
dog.ppm,207,golden retriever
parrot.ppm,87,African grey
$ python --model alexnet --device opencl:1 *.ppm 
Accessing device #1:GeForce GTX 960 on NVIDIA CUDA
dog.ppm,207,golden retriever
parrot.ppm,87,African grey

Performance for this tiny task isn't brilliant, but not horrible either. GTX 960, alexnet, batch size of 16 images 224x224:

  • Pytorch Cuda/CUDNN: 15.317 ms - updated 2021-10-10
  • Pytorch OpenCL/DLPrimitives: 22.932 ms - updated 2021-10-10
  • DLPrim - microframework: 22.401 ms
  • Caffe/CuDNN: 16.1812 ms
  • Caffe/OpenCL: 41.072 ms
  • Caffe/OpenCL+DLPrimitives: 28.618 ms
  • Keras/CuDNN: 23.341 ms
  • Keras/PlaidML: 44.041 ms

Now, one of the issues I currently have is synchronous execution, which adds a significant penalty to every operation. I need to understand asynchronous execution and memory management before I continue. The penalty for the Nvidia OpenCL backend isn't horrible, but it is devastating for the AMD OpenCL driver. I need to dive in.

Stay tuned.

Edit, Oct 10, 2021

I found a way to implement asynchronous execution plus initial GPU memory caching. That brought the performance of pytorch OpenCL to the same level as vanilla dlprimitives. It also solved the performance issues I had with the AMD GPU.

Additionally, I found that I hadn't taken host-to-device transfer into account in the pytorch benchmarks, so the original CUDA run time for pytorch increased.
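
The post does not describe how the caching works. One common approach, sketched below purely as an assumption, is to keep released buffers in per-size free lists and reuse them instead of going back to the driver on every allocation. The class name is hypothetical and a bytearray stands in for a device buffer.

```python
from collections import defaultdict


class CachingAllocator:
    """Hypothetical sketch: reuse released buffers instead of re-allocating them."""

    def __init__(self, raw_alloc):
        self._raw_alloc = raw_alloc      # expensive call, e.g. a driver allocation
        self._free = defaultdict(list)   # size -> list of released buffers
        self.raw_allocations = 0

    def allocate(self, size):
        if self._free[size]:
            return self._free[size].pop()  # cache hit, no driver call
        self.raw_allocations += 1
        return self._raw_alloc(size)

    def release(self, buf, size):
        self._free[size].append(buf)       # keep the buffer for later reuse


alloc = CachingAllocator(bytearray)
for _ in range(100):
    buf = alloc.allocate(4096)
    alloc.release(buf, 4096)
print(alloc.raw_allocations)  # 1
```

In a loop that repeatedly needs tensors of the same size, only the first iteration pays for a real allocation.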
