Pytorch Updates

Tuesday, November 16, 2021, by artyom; Posted in: Internals

To push progress forward, I started validating all pretrained torchvision models one by one. I found several features I needed to implement, and, more importantly, several critical bugs I could fix.

https://pytorch.org/vision/stable/models.html#classification

At this point, the following networks are validated against the CPU version in both forward and backward propagation (a sketch of what the validation checks is shown below the list):

  • alexnet
  • resnet18
  • resnet50
  • vgg16
  • densenet161
  • googlenet
  • squeezenet1_0
  • inception_v3 (fwd only - backward fails on cuda/cpu)
  • shufflenet_v2_x1_0
  • mobilenet_v2
  • mobilenet_v3_large
  • mobilenet_v3_small (fwd only - same failure on bwd on cuda)
  • resnext50_32x4d
  • wide_resnet50_2
  • mnasnet1_0
  • efficientnet_b0
  • efficientnet_b4
  • regnet_y_400mf

To be continued...
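
To give an idea of what "validated" means here, a minimal sketch of the approach (hypothetical code, not the actual validation script; the model name, tolerance and opencl device string are assumptions):

    # Hypothetical sketch: compare a torchvision model on the OpenCL
    # backend against the CPU reference, forward and backward.
    import torch
    import torchvision.models as models

    def validate(name, device="opencl:0", tol=1e-3):
        m_cpu = models.__dict__[name](pretrained=True).eval()
        m_dev = models.__dict__[name](pretrained=True).eval().to(device)

        x_cpu = torch.randn(4, 3, 224, 224, requires_grad=True)
        x_dev = x_cpu.detach().clone().to(device).requires_grad_(True)

        # Forward: outputs must match the CPU reference
        y_cpu, y_dev = m_cpu(x_cpu), m_dev(x_dev)
        assert torch.allclose(y_cpu, y_dev.cpu(), atol=tol), "forward mismatch"

        # Backward: input gradients must match as well
        y_cpu.sum().backward()
        y_dev.sum().backward()
        assert torch.allclose(x_cpu.grad, x_dev.grad.cpu(), atol=tol), "backward mismatch"

    validate("resnet18")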

Update Nov 17, 2021: I implemented the ceil rounding pooling mode, so googlenet and squeezenet1_0 now pass validation.
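
For reference, this is what ceil rounding changes (a small illustrative snippet; googlenet and squeezenet1_0 both contain max-pooling layers with ceil_mode=True):

    import torch
    import torch.nn as nn

    x = torch.randn(1, 1, 6, 6)
    # floor rounding: output size = floor((6 - 3) / 2) + 1 = 2
    print(nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=False)(x).shape)  # [1, 1, 2, 2]
    # ceil rounding:  output size = ceil((6 - 3) / 2) + 1 = 3
    print(nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True)(x).shape)   # [1, 1, 3, 3]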

Pointwise Broadcast Reduce

Tuesday, November 16, 2021, by artyom; Posted in: Internals

Many deep learning operations can be implemented as simple element-by-element operations over several tensors with numpy-style broadcasting, followed by a reduction. For example:

Adding a bias of shape [C] to a [B,C,H,W] image can be written in numpy as:

 x + bias.reshape((C,1,1))

The gradient of the bias can be calculated as:

 np.sum(dy, axis=(0,2,3))

That is a simple reduction. Similarly, calculating the mean and variance in batch normalisation requires sums of x and x*x over all dims but C.
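
Putting these pieces together in plain numpy (a runnable illustration of the broadcast-then-reduce pattern):

    import numpy as np

    B, C, H, W = 16, 8, 4, 4
    x = np.random.randn(B, C, H, W).astype(np.float32)
    bias = np.random.randn(C).astype(np.float32)
    dy = np.random.randn(B, C, H, W).astype(np.float32)

    y = x + bias.reshape((C, 1, 1))      # broadcast bias over B, H, W
    dbias = np.sum(dy, axis=(0, 2, 3))   # reduce over all dims but C

    # Batch-norm statistics from the same kind of reduction
    xsum = np.sum(x, axis=(0, 2, 3))
    x2sum = np.sum(x * x, axis=(0, 2, 3))
    mean = xsum / (B * H * W)
    var = x2sum / (B * H * W) - mean ** 2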

Observing this, I implemented a broadcast/reduce template API to simplify development: http://dlprimitives.org/docs/pointwise_8hpp_source.html

The idea is the following:

  • You provide input tensors and scalar parameters
  • You define the operation to be performed on each operand
  • You provide the reduction operation

The OpenCL kernel code is auto-generated for you. For example, calculating the sums of x and x*x over all dims but channels looks like this:

    auto op = dlprim::core::PointwiseOperationBroadcastReduce::create(
                ctx,
                {X.specs()},{Xsum.specs(),X2sum.specs()}, // input specs, output specs
                0,dlprim::float_data,                     // no extra scalar parameters
                "y0=x0; y1=x0*x0;",                       // per-element operations
                "reduce_y0 = 0; reduce_y1 = 0",           // reduction initialization
                "reduce_y0 += y0; reduce_y1 += y1"        // reduction step
               );
    // inputs, outputs, workspace s, scalar params (none),
    // per-output scales {1,1}, accumulation betas {0,0}, queue q
    op->enqueue({X},{Xsum,X2sum},s,{},{1,1},{0,0},q);

So the first output is the sum of x and the second is the sum of x*x. If you provide X in shape [B,C,H,W] and Xsum, X2sum in shape [C,1,1], which is broadcastable to X, you get the sums you need without writing custom reduction code or manually writing kernels.

This vastly simplified writing many operators, especially ones that are expected to support numpy-style broadcasting in pytorch.

Pytorch Training Benchmarks

Tuesday, October 26, 2021, by artyom; Posted in: Benchmarks

I managed to train several networks in pytorch with the opencl/dlprimitives backend, and the results are promising.

Below are results for several nVidia GPUs and a comparison to a tf2/keras baseline. Unlike in previous benchmarks, I included the previously missing time of the Adam optimiser, which turned out to be significant (it isn't very efficient in pytorch).

I also added times for the AMD rx 6600xt; unfortunately, there is no baseline I can use, since AMD hasn't released ROCm for RDNA yet.

Absolute Performance

Batch size: 16 images 224x224, time in ms. Lower is better.

Framework       Vendor  GPU        alexnet  resnet18  resnet50  vgg16    mobilenet
pytorch/opencl  AMD     rx 6600xt  56.846   109.850   258.973   365.305  163.732
dlprimitives    AMD     rx 6600xt  36.954   65.241    194.398   308.763  99.862
pytorch/cuda    Nvidia  rtx 2060s  27.592   38.624    114.074   179.580  49.624
pytorch/opencl  Nvidia  rtx 2060s  50.108   82.021    223.651   462.964  129.145
dlprimitives    Nvidia  rtx 2060s  39.829   67.960    187.398   439.053  90.229
tf2/cuda        Nvidia  rtx 2060s  29.165   55.523    147.999   156.714  102.596
pytorch/cuda    Nvidia  gtx 1080   38.310   44.382    137.754   232.824  63.324
pytorch/opencl  Nvidia  gtx 1080   54.828   85.016    301.898   411.928  173.885
dlprimitives    Nvidia  gtx 1080   38.804   71.147    264.286   374.168  134.650
tf2/cuda        Nvidia  gtx 1080   35.592   69.071    189.994   197.333  128.526
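
For context, numbers of this kind can be collected with a loop roughly like the one below (a hypothetical sketch, not the actual benchmark script; it deliberately times both the Adam step and the host-to-device copy mentioned above):

    import time
    import torch
    import torchvision.models as models

    def bench(name, device, batch=16, iters=50, warmup=5):
        model = models.__dict__[name]().to(device)
        opt = torch.optim.Adam(model.parameters())
        loss_fn = torch.nn.CrossEntropyLoss()
        x_host = torch.randn(batch, 3, 224, 224)
        labels = torch.randint(0, 1000, (batch,), device=device)

        for i in range(warmup + iters):
            if i == warmup:                 # start timing after warm-up
                if device.startswith("cuda"):
                    torch.cuda.synchronize()
                start = time.time()
            x = x_host.to(device)           # host-to-device copy is timed
            opt.zero_grad()
            loss_fn(model(x), labels).backward()
            opt.step()                      # optimizer step is timed
        if device.startswith("cuda"):
            torch.cuda.synchronize()        # wait for queued kernels
        print(f"{name}: {(time.time() - start) / iters * 1000:.3f} ms/batch")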

Relative Performance

Performance of pytorch + opencl/dlprimitives and of standalone dlprimitives, relative to the tf2/cuda baseline:

Baseline  Tested        GPU        alexnet  resnet18  resnet50  vgg16  mobilenet
tf2/cuda  dlprimitives  gtx 1080   92%      97%       72%       53%    95%
tf2/cuda  pt/opencl     gtx 1080   65%      81%       63%       48%    74%
tf2/cuda  dlprimitives  rtx 2060s  73%      82%       79%       36%    114%
tf2/cuda  pt/opencl     rtx 2060s  58%      68%       66%       34%    79%

Summary

Aside from VGG, most of the results are quite reassuring.

Notes

Why do I compare against tf2/cuda as the baseline? Pytorch is the faster framework; however, since TF is good enough for most users, I want to show that I get performance close enough to it.

Hello Pytorch OpenCL

Friday, October 8, 2021, by artyom; Posted in: Internals

TL;DR: I managed to run inference of alexnet using the OpenCL/DLPrimitives-based pytorch backend!

Details

I started from this tutorial to implement an out-of-tree backend for pytorch. It wasn't that simple, and I had to make small changes to the original pytorch source code, but finally something is working:

So far I have implemented only a handful of ops, mostly for forward computations: github backend code. Still, I managed to run the forward pass and get correct results on pretrained alexnet.

$ python validate_network.py --model alexnet --device cuda *.ppm
cat.ppm,281,tabby
dog.ppm,207,golden retriever
parrot.ppm,87,African grey
$ python validate_network.py --model alexnet --device opencl:1 *.ppm 
Accessing device #1:GeForce GTX 960 on NVIDIA CUDA
cat.ppm,281,tabby
dog.ppm,207,golden retriever
parrot.ppm,87,African grey

Performance for this small task isn't brilliant, but it isn't horrible either. GTX 960, alexnet, batch size of 16 images 224x224:

  • Pytorch Cuda/CUDNN: 15.317 ms - updated 2021-10-10
  • Pytorch OpenCL/DLPrimitives: 22.932 ms - updated 2021-10-10
  • DLPrim (standalone micro-framework): 22.401 ms
  • Caffe/CuDNN: 16.1812 ms
  • Caffe/OpenCL: 41.072 ms
  • Caffe/OpenCL+DLPrimitives: 28.618 ms
  • Keras/CuDNN: 23.341 ms
  • Keras/PlaidML: 44.041 ms

One of the issues I currently have is synchronous execution, which adds a significant penalty to every operation. I need to understand asynchronous execution and memory management before I continue. The penalty with NVidia's OpenCL backend isn't horrible, but it is devastating with the AMD OpenCL driver. Need to dive in.

Keep updated.

Edit, Oct 10, 2021

I found a way to implement asynchronous execution plus initial GPU memory caching. That brought the performance of pytorch OpenCL to the same level as vanilla dlprimitives. It also solved the performance issues I had with the AMD GPU.

Additionally, I found that I hadn't taken host-to-device transfer into account in the pytorch benchmarks, so the original CUDA run times for pytorch increased.

Priorities?

Sunday, September 19, 2021, by artyom

DLPrimitives already gives promising results... But I'm really wondering what to prioritize:

  1. Add more useful operators (dropout, upscale, lstm, prelu, mse-loss etc) to make DLPrimitives fully featured?
  2. Try to improve existing OpenCL frameworks like Caffe (or PlaidML) by using DLPrimitives core operations?
  3. Start working on a pytorch OpenCL backend - a huge undertaking?
  4. Work on support of float16/bfloat16?
  5. Continue improving performance by integrating with open source implementations for Arm-Mali, Intel?

Every task is important.

It is logical to add more operators so that DLPrimitives as a DL framework can be useful for real-world tasks - this can be done relatively fast, since most operators aren't that complex.

But to make it really useful (and not a niche tool), it needs to be integrated into at least one of the popular frameworks like Pytorch, TF or MXNet. On the other hand, implementing a pytorch backend is a huge task that will take lots of time - but it is actually the true goal.

I could go with improving Caffe-OpenCL, where I mostly need to fix several performance-critical layers by using dlprimitives... ahhh, and fix Caffe memory management, since Keras/PT use 1/4 of the memory Caffe uses. It could be a good POC, but Caffe is effectively dead - and I already have a working POC.

Hard to decide.
