Hello Pytorch OpenCL

10/8/21, by artyom; Posted in: Internals

TL;DR: I managed to run inference of alexnet using an OpenCL/DLPrimitives-based pytorch backend!

Details

I started from this tutorial to implement an out-of-tree backend for pytorch. It wasn't that simple and I had to make small changes to the original pytorch source code, but finally something is working:

So far I have implemented only a handful of ops, mostly for forward computations: github backend code. Still, I managed to run the forward pass and get correct results on a pretrained alexnet.

$ python validate_network.py --model alexnet --device cuda *.ppm
cat.ppm,281,tabby
dog.ppm,207,golden retriever
parrot.ppm,87,African grey
$ python validate_network.py --model alexnet --device opencl:1 *.ppm 
Accessing device #1:GeForce GTX 960 on NVIDIA CUDA
cat.ppm,281,tabby
dog.ppm,207,golden retriever
parrot.ppm,87,African grey
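The `opencl:1` device string follows pytorch's usual `backend:index` convention. A minimal sketch of splitting such a string into a backend name and device ordinal (the helper name is hypothetical, not part of the actual backend):

```python
def parse_device(spec: str):
    """Split a device string like "opencl:1" into (backend, index).

    A bare backend name such as "opencl" defaults to index 0,
    mirroring how pytorch treats "cuda" vs "cuda:1".
    """
    backend, _, index = spec.partition(":")
    return backend, int(index) if index else 0

print(parse_device("opencl:1"))   # ('opencl', 1)
print(parse_device("cuda"))       # ('cuda', 0)
```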

Performance for this tiny task isn't brilliant, but isn't horrible either. GTX 960, alexnet, batch size of 16 images at 224x224:

  • Pytorch Cuda/CUDNN: 15.317 ms - updated 2021-10-10
  • Pytorch OpenCL/DLPrimitives: 22.932 ms - updated 2021-10-10
  • DLPrim - microframework: 22.401 ms
  • Caffe/CuDNN: 16.1812 ms
  • Caffe/OpenCL: 41.072 ms
  • Caffe/OpenCL+DLPrimitives: 28.618 ms
  • Keras/CuDNN: 23.341 ms
  • Keras/PlaidML: 44.041 ms
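To put these numbers in perspective, the relative slowdown of each setup versus the fastest one can be computed directly (timings copied from the list above):

```python
# Batch times in ms for alexnet, batch of 16, GTX 960 (from the list above).
timings = {
    "pytorch cuda/cudnn": 15.317,
    "pytorch opencl/dlprimitives": 22.932,
    "dlprim microframework": 22.401,
    "caffe/cudnn": 16.1812,
    "caffe/opencl": 41.072,
    "caffe/opencl+dlprimitives": 28.618,
    "keras/cudnn": 23.341,
    "keras/plaidml": 44.041,
}
best = min(timings.values())
for name, ms in sorted(timings.items(), key=lambda kv: kv[1]):
    # e.g. the OpenCL pytorch backend comes out roughly 1.5x the CUDA time
    print(f"{name:30s} {ms:8.3f} ms  {ms / best:.2f}x")
```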

Now, one of the issues I currently have is synchronous execution, which adds a significant penalty to every operation. I need to understand the asynchronous execution and memory management machinery before I continue. The per-op penalty on the NVidia OpenCL backend isn't horrible, but it is devastating with the AMD OpenCL driver. Need to dive in.
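The cost of per-op synchronization can be illustrated with simple arithmetic: if every op waits for the command queue to drain, a fixed sync latency is paid once per op instead of once per network. The numbers below are made up for illustration only:

```python
def batch_time_ms(n_ops, kernel_ms, sync_ms, sync_every_op):
    """Total time when sync latency is paid per op vs. once at the end."""
    syncs = n_ops if sync_every_op else 1
    return n_ops * kernel_ms + syncs * sync_ms

# Hypothetical numbers: 20 ops, 1 ms of GPU work each, 0.5 ms sync latency.
print(batch_time_ms(20, 1.0, 0.5, sync_every_op=True))   # 30.0
print(batch_time_ms(20, 1.0, 0.5, sync_every_op=False))  # 20.5
```

With a driver whose sync latency is much higher (as on some AMD OpenCL stacks), the per-op term dominates, which matches the devastating penalty described above.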

Stay tuned.

Edit, Oct 10, 2021

I found a way to implement asynchronous execution plus initial GPU memory caching. That brought the performance of the pytorch OpenCL backend to the same level as vanilla dlprimitives. It also solved the performance issues I had with the AMD GPU.
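The GPU memory caching idea can be sketched as a free list keyed by buffer size: instead of asking the driver for a fresh allocation for every tensor, a previously released buffer of the same size is reused. This is a pure-Python sketch, with `allocate` standing in for the real OpenCL buffer allocation:

```python
from collections import defaultdict

class CachingAllocator:
    """Reuse freed buffers of the same size instead of reallocating.

    Driver allocations are expensive and may force synchronization;
    handing back a cached buffer avoids both.
    """
    def __init__(self, allocate):
        self._allocate = allocate       # backend allocation call
        self._free = defaultdict(list)  # size -> released buffers
        self.driver_allocs = 0

    def alloc(self, size):
        cached = self._free[size]
        if cached:
            return cached.pop()         # reuse, no driver round-trip
        self.driver_allocs += 1
        return self._allocate(size)

    def release(self, size, buf):
        self._free[size].append(buf)    # keep for later reuse

alloc = CachingAllocator(allocate=lambda size: bytearray(size))
a = alloc.alloc(1024)
alloc.release(1024, a)
b = alloc.alloc(1024)       # same buffer comes back from the cache
print(alloc.driver_allocs)  # 1
```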

Additionally, I found that I hadn't taken host-to-device transfer into account in the pytorch benchmarks, so the original CUDA run time for pytorch increased.
