DLPrimitives Blog http://blog.dlprimitives.org/ Development Blog First release 0.1.0 of dlprimitives pytorch backend is out http://blog.dlprimitives.org/post/13 http://blog.dlprimitives.org/post/13 <div style="direction:ltr"> <p>And now with binaries for easy installation.</p> <p>This release supports pytorch 2.4 and introduces a better way to use OpenCL pytorch.</p> <p>This time I provided binary distributions of the backend:</p> <ul> <li>Linux for python 3.8 through 3.12, for torch=2.4</li> <li>Windows for python 3.11 and 3.12, for torch=2.4</li> </ul> <p>To install: install the CPU version of pytorch in a virtual environment, download the whl file from the release, and make sure the torch version, python version and architecture match your environment.</p> <p>For example, for python 3.10 and torch 2.4 on Linux it is:</p> <p><code>pip install pytorch_ocl-0.1.0+torch2.4-cp310-none-linux_x86_64.whl</code></p> <p>To use it: <code>import pytorch_ocl</code></p> </div>
Some important updates http://blog.dlprimitives.org/post/12 http://blog.dlprimitives.org/post/12 <div style="direction:ltr"> <p>Lots of time has passed since the latest updates... So:</p> <ol> <li>Now it works with pytorch 2.4 - in fact it is a requirement: either 1.13 or >=2.4</li> <li>I created a much easier interface to use - all you need is to import the <code>pytorch_ocl</code> module and you'll get all the goodies on Linux and Windows.</li> <li>With the python module you can use <code>torch.ocl.synchronize()</code> and <code>torch.ocl.empty_cache()</code> as with CUDA</li> <li>I ordered an Intel Arc GPU (A380) - so hopefully I'm going to be able to check/optimise for a new platform</li> <li>Implemented other things needed, like <code>manual_seed_all</code> - as required for a backend.</li> </ol> <p>Known issues: currently there is an issue with loading back a saved state dictionary if it was saved from an ocl device. It crashes for some reason (outside of the ocl backend).</p> <p><em>Workaround:</em> when you save/restore the model, move it to CPU and then back to the ocl device.</p> </div>
Pytorch OpenCL backend - simplified http://blog.dlprimitives.org/post/11 http://blog.dlprimitives.org/post/11 <div style="direction:ltr"> <p>Now installation of the opencl backend for pytorch is really simple.</p> <ol> <li>Install the nightly version of pytorch for CPU in a virtual environment</li> <li>Clone the <code>dlrpim_backend</code> repository and check out the <code>true_out_of_tree_support</code> branch</li> <li>Update submodules</li> <li><p>Run a few commands inside the repo:</p> <pre><code>mkdir build
cd build
cmake -DCMAKE_PREFIX_PATH=$VIRTUAL_ENV/lib/python3.8/site-packages/torch/share/cmake/Torch ..
make
cd ..
</code></pre></li> <li><p>Run mnist training:</p> <pre><code>python mnist.py --device=ocl:0
</code></pre></li> </ol>
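<p>For reference, here is a minimal sketch of what the <code>--device=ocl:0</code> flag boils down to inside such a training script. This is not the repository's <code>mnist.py</code>: the model is a toy stand-in, and whatever code loads the backend library is omitted. It only illustrates that the ocl device is used like any other torch device:</p> <pre><code>import torch

device = torch.device("ocl:0")                # OpenCL device 0 via the backend
model = torch.nn.Sequential(                  # toy stand-in for a real network
    torch.nn.Flatten(),
    torch.nn.Linear(28 * 28, 10),
).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(16, 1, 28, 28).to(device)     # fake MNIST-sized batch
y = torch.randint(0, 10, (16,)).to(device)

opt.zero_grad()
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()                               # backward also runs on the ocl device
opt.step()
print(loss.item())
</code></pre>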
<p>That's it.</p> <p>Keep updated</p> </div>
Inference of ONNX Models using DLPrimitives http://blog.dlprimitives.org/post/10 http://blog.dlprimitives.org/post/10 <div style="direction:ltr"> <p>I worked on integration of inference of ONNX models using DLPrimitives. It isn't a simple task, since the ONNX operator set is very rich and many things can be implemented in different ways.</p> <p>After many revisions and improvements I managed to validate multiple imagenet-pretrained networks from pytorch, mxnet and a few based on TensorFlow (see the issues with TF below).</p> <p>How do you create a dlprimitives network from an ONNX model?</p> <pre><code>// load and parse ONNX Model
dp::ONNXModel model;
model.load(onnx_path);

// create network
dp::Context ctx(device_id);
dp::Net net(ctx);

// load parameters
net.load_model(model);
</code></pre> <p>And you are ready to go.</p> <p>I validated the following networks and frameworks:</p> <ul> <li>Pytorch, op-sets 9, 11, 13, nets <code>alexnet</code>, <code>vgg16</code>, <code>resnet18</code>, <code>resnext50_32x4d</code>, <code>wide_resnet50_2</code>, <code>efficientnet_b0</code>, <code>efficientnet_b4</code>, <code>regnet_y_400mf</code>, <code>squeezenet1_0</code>, <code>mobilenet_v2</code>, <code>densenet121</code></li> <li>MXNet: <code>vgg11_bn</code>, <code>alexnet</code>, <code>mobilenetv2_0.25</code>, <code>mobilenet0.25</code>, <code>densenet121</code>, <code>resnet18_v1</code>, <code>squeezenet1.0</code></li> <li>Tensorflow: op-sets 9 and 11, limited initial support, channel-first format: <code>resnet50</code>, <code>densenet121</code></li> </ul> <p>Some networks on pytorch don't pass due to the lack of some operators. The situation with TensorFlow is more complicated and only a few networks worked OK.</p> <h2>TensorFlow</h2> <p>When I started validating pretrained Keras networks I discovered a very surprising thing: TensorFlow uses asymmetrical padding in some cases. Since in TF/Keras you don't explicitly provide the padding but rather give a vague definition of "same" or "valid", in some cases the padding may differ at the start and the end of the image.</p>
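<p>To illustrate why "same" padding ends up asymmetric, here is a small sketch (not dlprimitives code) of how TF computes it for one dimension - the total padding is split with the extra pixel going to the end:</p> <pre><code>import math

def same_padding_1d(in_size, kernel, stride):
    out_size = math.ceil(in_size / stride)
    total = max((out_size - 1) * stride + kernel - in_size, 0)
    return total // 2, total - total // 2   # (pad_begin, pad_end)

print(same_padding_1d(224, 3, 2))  # (0, 1) - the extra pixel goes only to the end
print(same_padding_1d(224, 3, 1))  # (1, 1) - symmetric in this case
</code></pre>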
<p>Interestingly, cuDNN does not even provide an asymmetrical padding option for convolutions. Looking into the code, TF does the padding manually in such cases (which is actually a huge waste of memory and memory bandwidth).</p> <p>So implementing these convolutions will require implementing a new simple padding layer, just to make sure we can use dlprimitives for inference of TF models.</p> <p>To be continued...</p> </div>
Attempt to integrate with OneDNN http://blog.dlprimitives.org/post/9 http://blog.dlprimitives.org/post/9 <div style="direction:ltr"> <p>Intel's OneDNN is a great project that provides cuDNN-like inference/training tools for Intel's GPUs.</p> <p>Also, it is called OneDNN... it should be called IntelDNN, since it supports only Intel GPUs and CPUs.</p> <p>Bottom line: I tried to add <a href="https://github.com/artyom-beilis/dlprimitives/tree/onednn_integration">OneDNN based convolutions for Intel GPU</a> just to discover that my simple GEMM-based convolution works better. Why? Apparently Intel's implementation is optimized for the channels-last format only.</p> <p><a href="https://github.com/oneapi-src/oneDNN/issues/1194">https://github.com/oneapi-src/oneDNN/issues/1194</a></p> <p>A simple convolution with a 3x3 kernel, 64 input and output channels and an image dimension of 56, on an Intel HD 530 with 400 GFlops capacity, gives:</p> <ul> <li>295.6 GFlops for OneDNN's channels-last format</li> <li>144.7 GFlops for dlprimitive's channels-first format</li> <li>33.4(!) GFlops for OneDNN's channels-first format</li> </ul> <p>The problem is that channels-first is the most common format, used by pytorch, mxnet, caffe and many other tools (including dlprimitives).</p> <p>Ok... I'll check it later, when one of two things happens:</p> <ol> <li>They fix channels-first performance</li> <li>I support the channels-last format internally</li> </ol> </div>
Pytorch Updates http://blog.dlprimitives.org/post/8 http://blog.dlprimitives.org/post/8 <div style="direction:ltr"> <p>In order to improve progress I started validating all pretrained torchvision models one by one. I found several features I needed to implement but, more importantly, I found several critical bugs I could fix.</p> <p><a href="https://pytorch.org/vision/stable/models.html#classification">https://pytorch.org/vision/stable/models.html#classification</a></p> <p>At this point the following networks are validated against the CPU version in both forward and backward propagation:</p> <ul> <li><code>alexnet</code></li> <li><code>resnet18</code></li> <li><code>resnet50</code></li> <li><code>vgg16</code></li> <li><code>densenet161</code></li> <li><code>googlenet</code></li> <li><code>squeezenet1_0</code></li> <li><code>inception_v3</code> (fwd only - backward fails on cuda/cpu)</li> <li><code>shufflenet_v2_x1_0</code></li> <li><code>mobilenet_v2</code></li> <li><code>mobilenet_v3_large</code></li> <li><code>mobilenet_v3_small</code> (fwd only - same failure on bwd on cuda)</li> <li><code>resnext50_32x4d</code></li> <li><code>wide_resnet50_2</code></li> <li><code>mnasnet1_0</code></li> <li><code>efficientnet_b0</code></li> <li><code>efficientnet_b4</code></li> <li><code>regnet_y_400mf</code></li> </ul> <p>To be continued...</p> <p><strong>Update Nov 17, 2021:</strong> I implemented the ceil-rounding pooling mode, thus <code>googlenet</code> and <code>squeezenet1_0</code> now pass validation.</p> </div>
Pointwise Broadcast Reduce http://blog.dlprimitives.org/post/7 http://blog.dlprimitives.org/post/7 <div style="direction:ltr"> <p>Lots of deep learning operations can be implemented as simple element-by-element operations over different tensors, with numpy broadcasting and a reduction afterwards. For example:</p> <p>Adding a bias <code>[C]</code> to a <code>[B,C,H,W]</code> image can be seen in numpy as:</p> <pre><code>x + bias.reshape((C,1,1))
</code></pre> <p>The gradient of the bias can be calculated as:</p> <pre><code>np.sum(dy, axis=(0,2,3))
</code></pre> <p>That is a simple reduction operation. Calculation of the mean and variance in batch normalisation requires calculating sums of <code>x</code> and <code>x*x</code> over all dims but <code>C</code>.</p>
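<p>In numpy terms, the batch-norm example above (an illustration, not dlprimitives code) is just two reductions over all dims but the channel one:</p> <pre><code>import numpy as np

B, C, H, W = 8, 16, 32, 32
x = np.random.randn(B, C, H, W).astype(np.float32)

xsum  = x.sum(axis=(0, 2, 3))          # shape [C]
x2sum = (x * x).sum(axis=(0, 2, 3))    # shape [C]

n = B * H * W
mean = xsum / n
var  = x2sum / n - mean ** 2           # E[x^2] - (E[x])^2
</code></pre>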
<p>Observing this, I implemented a <code>broadcast</code>/<code>reduce</code> template API to simplify development: <a href="http://dlprimitives.org/docs/pointwise_8hpp_source.html">http://dlprimitives.org/docs/pointwise_8hpp_source.html</a></p> <p>The idea is the following:</p> <ul> <li>You provide input tensors and scalar parameters</li> <li>You define the operation that needs to be performed on each operand</li> <li>You provide a reduction operation</li> </ul> <p>The OpenCL kernel code is auto-generated for you. For example, calculating the sums of x and x*x over all dims but channels would look like:</p> <pre><code>auto op = dlprim::core::PointwiseOperationBroadcastReduce::create(
        ctx,
        {X.specs()},{Xsum.specs(),X2sum.specs()},
        0,dlprim::float_data,
        "y0=x0; y1=x0*x0;",                  // operations
        "reduce_y0 = 0; reduce_y1 = 0",      // reduce init
        "reduce_y0 += y0; reduce_y1 += y1"
        );
op-&gt;enqueue({X},{Xsum,X2sum},s,{},{1,1},{0,0},q);
</code></pre> <p>So the 1st output is the sum of x and the second is the sum of <code>x*x</code>. If you provide X with shape <code>[B,C,H,W]</code> and Xsum, X2sum with shape <code>[C,1,1]</code>, which is broadcast-able to X, you'll get the sums you need without writing custom reduction code or manually writing kernels.</p> <p>This vastly simplified writing multiple operators, especially ones that are expected to support numpy-style broadcasting in pytorch.</p> </div>
Pytorch Training Benchmarks http://blog.dlprimitives.org/post/6 http://blog.dlprimitives.org/post/6 <div style="direction:ltr"> <p>I managed to train some networks in pytorch with the opencl/dlprimitives backend and the results are promising.</p> <p>Below are the results for several nVidia GPUs and a comparison to the tf2/keras baseline. Unlike in previous benchmarks, I fixed the missing time of the Adam optimiser, which apparently was significant (it isn't very efficient in pytorch).</p> <p>I also added times for the AMD 6600xt; unfortunately there is no baseline I can use, since AMD hasn't released ROCm for RDNA yet.</p> <h2>Absolute Performance</h2> <p>Batch size: 16 images 224x224, time in ms. Lower is better.</p> <table> <thead> <tr> <th> Framework </th> <th> Vendor </th> <th> GPU </th> <th> alexnet </th> <th> resnet18 </th> <th> resnet50 </th> <th> vgg16 </th> <th> mobilenet </th> </tr> </thead> <tbody>
<tr> <td>pytorch/opencl </td> <td> AMD </td> <td> rx 6600xt </td> <td> 56.846 </td> <td> 109.850 </td> <td> 258.973 </td> <td> 365.305 </td> <td> 163.732 </td> </tr>
<tr> <td>dlprimitives </td> <td> AMD </td> <td> rx 6600xt </td> <td> 36.954 </td> <td> 65.241 </td> <td> 194.398 </td> <td> 308.763 </td> <td> 99.862 </td> </tr>
<tr> <td>pytorch/cuda </td> <td> Nvidia </td> <td> rtx 2060s </td> <td> 27.592 </td> <td> 38.624 </td> <td> 114.074 </td> <td> 179.580 </td> <td> 49.624 </td> </tr>
<tr> <td>pytorch/opencl </td> <td> Nvidia </td> <td> rtx 2060s </td> <td> 50.108 </td> <td> 82.021 </td> <td> 223.651 </td> <td> 462.964 </td> <td> 129.145 </td> </tr>
<tr> <td>dlprimitives </td> <td> Nvidia </td> <td> rtx 2060s </td> <td> 39.829 </td> <td> 67.960 </td> <td> 187.398 </td> <td> 439.053 </td> <td> 90.229 </td> </tr>
<tr> <td>tf2/cuda </td> <td> Nvidia </td> <td> rtx 2060s </td> <td> 29.165 </td> <td> 55.523 </td> <td> 147.999 </td> <td> 156.714 </td> <td> 102.596 </td> </tr>
<tr> <td>pytorch/cuda </td> <td> Nvidia </td> <td> gtx 1080 </td> <td> 38.310 </td> <td> 44.382 </td> <td> 137.754 </td> <td> 232.824 </td> <td> 63.324 </td> </tr>
<tr> <td>pytorch/opencl </td> <td> Nvidia </td> <td> gtx 1080 </td> <td> 54.828 </td> <td> 85.016 </td> <td> 301.898 </td> <td> 411.928 </td> <td> 173.885 </td> </tr>
<tr> <td>dlprimitives </td> <td> Nvidia </td> <td> gtx 1080 </td> <td> 38.804 </td> <td> 71.147 </td> <td> 264.286 </td> <td> 374.168 </td> <td> 134.650 </td> </tr>
<tr> <td>tf2/cuda </td> <td> Nvidia </td> <td> gtx 1080 </td> <td> 35.592 </td> <td> 69.071 </td> <td> 189.994 </td> <td> 197.333 </td> <td> 128.526 </td> </tr>
</tbody> </table> <h2>Relative Performance</h2> <p>Comparison of TF/Cuda with pytorch + opencl/dlprimitives and with dlprimitives alone (the percentage is the baseline time divided by the tested time, so higher is better):</p> <table>
<thead> <tr> <th>Baseline </th> <th> Tested </th> <th> GPU </th> <th> alexnet </th> <th> resnet18 </th> <th> resnet50 </th> <th> vgg16 </th> <th> mobilenet </th> </tr> </thead> <tbody>
<tr> <td>tf2/cuda </td> <td> dlprimitives </td> <td> gtx 1080 </td> <td> 92% </td> <td> 97% </td> <td> 72% </td> <td> 53% </td> <td> 95% </td> </tr>
<tr> <td>tf2/cuda </td> <td> pt/opencl </td> <td> gtx 1080 </td> <td> 65% </td> <td> 81% </td> <td> 63% </td> <td> 48% </td> <td> 74% </td> </tr>
<tr> <td>tf2/cuda </td> <td> dlprimitives </td> <td> rtx 2060s </td> <td> 73% </td> <td> 82% </td> <td> 79% </td> <td> 36% </td> <td> 114% </td> </tr>
<tr> <td>tf2/cuda </td> <td> pt/opencl </td> <td> rtx 2060s </td> <td> 58% </td> <td> 68% </td> <td> 66% </td> <td> 34% </td> <td> 79% </td> </tr>
</tbody> </table> <h2>Summary</h2> <p>Besides VGG, most of the results are very reassuring.</p> <h2>Notes</h2> <p>Why do I compare to TF2/cuda as the baseline? Pytorch is the faster framework. However, since TF is good enough for most users, I want to show that I get performance that is close enough.</p> </div>
Hello Pytorch OpenCL http://blog.dlprimitives.org/post/5 http://blog.dlprimitives.org/post/5 <div style="direction:ltr"> <p>TL;DR: I managed to run inference of alexnet using an OpenCL/DLPrimitives-based pytorch backend!</p> <h2>Details</h2> <p>I started from this <a href="https://pytorch.org/tutorials/advanced/extend_dispatcher.html">tutorial</a> to implement an out-of-source backend for pytorch. It wasn't that simple and I had to make <a href="https://github.com/artyom-beilis/pytorch/commit/eb74af18af6e90ae47f24997af8468bf7b9deb72">small changes</a> in the original pytorch source code, but finally something is working:</p> <p>So far I have implemented only a handful of ops, and mostly for forward computations: <a href="https://github.com/artyom-beilis/pytorch_dlprim">github backend code</a>. However, I managed to do forward computations and get correct results on a pretrained alexnet.</p> <pre><code>$ python validate_network.py --model alexnet --device cuda *.ppm
cat.ppm,281,tabby
dog.ppm,207,golden retriever
parrot.ppm,87,African grey
$ python validate_network.py --model alexnet --device opencl:1 *.ppm
Accessing device #1:GeForce GTX 960 on NVIDIA CUDA
cat.ppm,281,tabby
dog.ppm,207,golden retriever
parrot.ppm,87,African grey
</code></pre> <p>Performance for this tiny task isn't brilliant, but it isn't horrible either. GTX 960, alexnet, batch size of 16 images 224x224:</p> <ul> <li>Pytorch Cuda/CUDNN: 15.317 ms - updated 2021-10-10</li> <li>Pytorch OpenCL/DLPrimitives: 22.932 ms - updated 2021-10-10</li> <li>DLPrim - microframework: 22.401 ms</li> <li>Caffe/CuDNN: 16.1812 ms</li> <li>Caffe/OpenCL: 41.072 ms</li> <li>Caffe/OpenCL+DLPrimitives: 28.618 ms</li> <li>Keras/CuDNN: 23.341 ms</li> <li>Keras/PlaidML: 44.041 ms</li> </ul>
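<p>A minimal sketch of how such a per-batch forward time could be measured is below. This is not the actual benchmark script; it assumes the backend registers the <code>opencl</code> device type as in the validation run above, and it copies the output to CPU to force the command queue to finish before the clock is read:</p> <pre><code>import time
import torch
import torchvision

device = "opencl:1"          # device string as used by validate_network.py above
model = torchvision.models.alexnet(pretrained=True).eval().to(device)
x = torch.randn(16, 3, 224, 224).to(device)

with torch.no_grad():
    for _ in range(5):       # warm-up iterations
        model(x).cpu()       # .cpu() forces synchronization
    start = time.time()
    iters = 20
    for _ in range(iters):
        model(x).cpu()
    print("time per batch: %.3f ms" % ((time.time() - start) / iters * 1000))
</code></pre>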
<p>Now, one of the issues that I currently have is synchronous execution, which gives a significant penalty for every operation. I need to understand asynchronous execution and the memory management stuff before I continue. The penalty with the NVidia OpenCL backend isn't horrible, but it is devastating with the AMD OpenCL driver. Need to dive in.</p> <p>Keep updated.</p> <h2>Edit, Oct 10, 2021</h2> <p>I found a way to implement asynchronous execution + initial GPU memory caching. That allowed bringing the performance of pytorch OpenCL to the same level as vanilla dlprimitives. This also solved the performance issues I had with the AMD GPU.</p> <p>Additionally, I found that I didn't take the host-to-device transfer into account in the pytorch benchmarks - so the original CUDA run time for pytorch increased.</p> </div>
Priorities? http://blog.dlprimitives.org/post/4 http://blog.dlprimitives.org/post/4 <div style="direction:ltr"> <p>DLPrimitives already gives promising results... But I'm really wondering what to prioritize:</p> <ol> <li>Add more useful operators (dropout, upscale, lstm, prelu, mse-loss etc.) to make DLPrimitives fully featured?</li> <li>Try to improve existing OpenCL frameworks like Caffe (or PlaidML) by using DLPrimitives core operations?</li> <li>Start working on a pytorch OpenCL backend - a huge undertaking?</li> <li>Work on support for float16/bfloat16?</li> <li>Continue improving performance by integrating with open source implementations for Arm Mali and Intel?</li> </ol> <p>Every task is important.</p> <p>It is logical to add more operators so the DLPrimitives DL framework can be useful for real-world tasks - it can be done relatively fast, since most of the operators aren't that complex.</p> <p>But in order to make it really useful (and not niche) it needs to be integrated into at least one of the popular frameworks like Pytorch, TF or MXNet. On the other hand, implementing a pytorch backend is a huge task that will take lots of time - but it is actually the true goal.</p> <p>I could go with improving Caffe-OpenCL, where I mostly need to fix several performance-critical layers by using dlprimitives... ahhh, and fix Caffe memory management, since Keras/PT uses 1/4 of the memory Caffe uses. It could be a good POC, but Caffe is actually dead - and I already have a working POC.</p> <p>Hard to decide.</p> </div>