DLPrimitives Blog
Development Blog
First release 0.1.0 of dlprimitives pytorch backend is out
And now with binaries for easy installation
This release supports pytorch 2.4 and introduces a much better way to use pytorch with OpenCL.
This time I provided binary distributions of the backend:
- Linux: Python 3.8 through 3.12, for torch 2.4
- Windows: Python 3.11 and 3.12, for torch 2.4
To install: install the CPU version of pytorch in a virtual environment, download the whl file from the release page, and make sure the torch version, Python version and architecture match your environment.
For example, for Python 3.10 and torch 2.4 on Linux it is:
pip install pytorch_ocl-0.1.0+torch2.4-cp310-none-linux_x86_64.whl
To use it, just import pytorch_ocl.
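A minimal smoke test (the tensor sizes are arbitrary; ocl:0 is the device name the backend registers):

import torch
import pytorch_ocl  # importing the module registers the "ocl" device type

# move a tensor to the first OpenCL device and compute there
x = torch.randn(64, 64, device="ocl:0")
y = x * 2 + 1
print(y.cpu())  # copy back to the CPU to inspect the result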
Some important updates
A lot has happened since the latest updates... So:
- Now it works with pytorch 2.4 - in fact, it is a requirement: either 1.13 or >=2.4
- I created a much easier interface to use - all you need is to import the pytorch_ocl module and you'll get all the goodies on Linux and Windows
- With the Python module you can use torch.ocl.synchronize() and torch.ocl.empty_cache() as with CUDA (see the sketch after this list)
- I ordered an Intel Arc GPU (A380) - so hopefully I'm going to be able to check/optimize for a new platform
- Implemented other things the backend requires, like manual_seed_all
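A short sketch of these calls; they mirror their CUDA counterparts, and I'm assuming manual_seed_all is exposed under torch.ocl like the rest:

import torch
import pytorch_ocl

torch.ocl.manual_seed_all(42)   # seed the RNGs of all OpenCL devices
x = torch.randn(1024, 1024, device="ocl:0")
y = x @ x                       # kernels are queued asynchronously
torch.ocl.synchronize()         # wait for queued kernels, as with torch.cuda.synchronize()
del x, y
torch.ocl.empty_cache()         # release cached device memory, as with torch.cuda.empty_cache()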
Known issues: currently there is an issue with loading back a saved state dictionary if it was saved from the ocl device. It crashes for some reason (outside of the ocl backend).
Workaround: when you save/restore the model, move it to the CPU and then back to the ocl device, as in the sketch below.
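A sketch of the workaround (Linear is just a stand-in for whatever model you actually have):

import torch
import pytorch_ocl

model = torch.nn.Linear(10, 10).to("ocl:0")

# save: move the model to the CPU first, then back to the device
model.cpu()
torch.save(model.state_dict(), "model.pt")
model.to("ocl:0")

# restore: load the state dict on the CPU, then move the model to the device
state = torch.load("model.pt", map_location="cpu")
model.load_state_dict(state)
model.to("ocl:0")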
Pytorch OpenCL backend - simplified
Now installation of the OpenCL backend for pytorch is really simple.
- Install the nightly version of pytorch for CPU in a virtual environment
- Clone the dlprim_backend repository and checkout the true_out_of_tree_support branch
- Update submodules
- Run a few commands inside the repo:
mkdir build
cd build
cmake -DCMAKE_PREFIX_PATH=$VIRTUAL_ENV/lib/python3.8/site-packages/torch/share/cmake/Torch ..
make
cd ..
Run mnist training:
python mnist.py --device=ocl:0
That's it.
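If you want a quick sanity check besides mnist.py, something along these lines should work. The shared-object path below is my assumption about the default build output; adjust it to whatever your build produced (mnist.py contains the exact loading code):

import torch

# load the backend built above; the library name/path is an assumption
torch.ops.load_library("build/libpt_ocl.so")

x = torch.randn(4, 4, device="ocl:0")
print((x + x).cpu())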
Keep updated
Inference of ONNX Models using DLPrimitives
I worked on integration of inference of ONNX models using DLPrimitives. It isn't a simple task, since the ONNX operator set is very rich and many things can be implemented in different ways.
After many revisions and improvements I managed to validate multiple imagenet-pretrained networks from pytorch, mxnet and a few based on TensorFlow (see the issues with TF below).
How do you create a dlprimitives network from an ONNX model?
// load and parse ONNX Model
dp::ONNXModel model;
model.load(onnx_path);
// create network
dp::Context ctx(device_id);
dp::Net net(ctx);
// load parameters
net.load_model(model);
And you are ready to go.
I validated the following networks and frameworks:
- Pytorch: op-sets 9, 11, 13; nets: alexnet, vgg16, resnet18, resnext50_32x4d, wide_resnet50_2, efficientnet_b0, efficientnet_b4, regnet_y_400mf, squeezenet1_0, mobilenet_v2, densenet121
- MXNet: vgg11_bn, alexnet, mobilenetv2_0.25, mobilenet0.25, densenet121, resnet18_v1, squeezenet1.0
- TensorFlow: op-sets 9 and 11, limited initial support, channels-first format only: resnet50, densenet121
Some networks on pytorch don't pass due to the lack of some operators. The situation with TensorFlow is more complicated, and only a few networks worked OK.
TensorFlow
When I started validating pretrained Keras networks I discovered a very surprising thing. TensorFlow uses asymmetric padding in some cases: since in TF/Keras you don't explicitly provide the padding but rather give a vague definition of "same" or "valid", in some cases the padding may differ between the start and the end of the image.
Interestingly, cuDNN does not even provide an asymmetric padding option for convolutions. Looking into the code, TF does the padding manually in such cases (which is actually a huge waste of memory and memory bandwidth).
So implementing these convolutions will require a new simple padding layer, just to make sure we can use dlprimitives for inference of TF models.
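To make the asymmetry concrete, here is a small Python sketch of the "same" padding rule; the helper is mine, but the formula follows TF's documented behavior:

import math

def same_padding_1d(in_size, kernel, stride):
    # TF "same" padding along one dimension: the total padding is split
    # between begin and end, and the extra pixel goes to the end
    out_size = math.ceil(in_size / stride)
    pad_total = max((out_size - 1) * stride + kernel - in_size, 0)
    pad_begin = pad_total // 2
    return pad_begin, pad_total - pad_begin

# 224x224 input, 7x7 kernel, stride 2 - ResNet50's first convolution:
print(same_padding_1d(224, 7, 2))   # (2, 3): one extra pixel at the end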
To be continued...
Attempt to integrate with OneDNN
Intel's OneDNN is a great project that provides cuDNN-like inference/training tools for Intel's GPUs.
Also, it is called OneDNN... it should be called IntelDNN, since it supports only Intel GPUs and CPUs.
Bottom line: I tried to add OneDNN-based convolutions for Intel GPUs, just to discover that my simple GEMM-based convolution works better. Why? Apparently, Intel's implementation is optimized for the channels-last format only.
https://github.com/oneapi-src/oneDNN/issues/1194
A simple convolution with a 3x3 kernel, 64 input and output channels, and an image dimension of 56, on an Intel HD 530 with 400 GFlops capacity, gives:
- 295.6 GFlops for OneDNN's channels-last format
- 144.7 GFlops for dlprimitives' channels-first format
- 33.4(!) GFlops for OneDNN's channels-first format
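For reference, a quick back-of-the-envelope check of what this convolution costs per image (my own arithmetic, not part of the benchmark):

# 3x3 kernel, 64 -> 64 channels, 56x56 output; a multiply-add counts as 2 FLOPs
flops = 2 * 64 * 64 * 3 * 3 * 56 * 56
print(flops / 1e6, "MFLOP")   # ~231.2 MFLOP per image
# at 295.6 GFlops that is ~0.78 ms; at 33.4 GFlops, ~6.9 ms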
The problem is that channels-first is the most common format, used by pytorch, mxnet, caffe and many other tools (including dlprimitives).
Ok... I'll check it again later, when one of two things happens:
- They fix channels-first performance
- I'll support the channels-last format internally