DLPrimitives Blog :: Benchmarks http://blog.dlprimitives.org/ Development Blog Pytorch Training Benchmarks http://blog.dlprimitives.org/post/6 <div style="direction:ltr"> <p>I managed to train several networks in pytorch with the opencl/dlprimitives backend, and the results are promising.</p> <p>Below are the results for several nVidia GPUs, compared to a tf2/keras baseline. Unlike previous benchmarks, these include the time spent in the Adam optimiser, which was previously missing from the measurements and turned out to be significant (it isn't implemented very efficiently in pytorch).</p> <p>I also added times for the AMD rx 6600xt; unfortunately there is no baseline I can use for it, since AMD hasn't released ROCm for RDNA yet.</p> <h2>Absolute Performance</h2> <p>Batch size: 16 images 224x224; times in ms. Lower is better.</p> <table> <thead> <tr> <th> Framework </th> <th> Vendor </th> <th> GPU </th> <th> alexnet </th> <th> resnet18 </th> <th> resnet50 </th> <th> vgg16 </th> <th> mobilenet </th> </tr> </thead> <tbody> <tr> <td>pytorch/opencl </td> <td> AMD </td> <td> rx 6600xt </td> <td> 56.846 </td> <td> 109.850 </td> <td> 258.973 </td> <td> 365.305 </td> <td> 163.732 </td> </tr> <tr> <td>dlprimitives </td> <td> AMD </td> <td> rx 6600xt </td> <td> 36.954 </td> <td> 65.241 </td> <td> 194.398 </td> <td> 308.763 </td> <td> 99.862 </td> </tr> <tr> <td>pytorch/cuda </td> <td> Nvidia </td> <td> rtx 2060s </td> <td> 27.592 </td> <td> 38.624 </td> <td> 114.074 </td> <td> 179.580 </td> <td> 49.624 </td> </tr> <tr> <td>pytorch/opencl </td> <td> Nvidia </td> <td> rtx 2060s </td> <td> 50.108 </td> <td> 82.021 </td> <td> 223.651 </td> <td> 462.964 </td> <td> 129.145 </td> </tr> <tr> <td>dlprimitives </td> <td> Nvidia </td> <td> rtx 2060s </td> <td> 39.829 </td> <td> 67.960 </td> <td> 187.398 </td> <td> 439.053 </td> <td> 90.229 </td> </tr> <tr> <td>tf2/cuda </td> <td> Nvidia </td> <td> rtx 2060s </td> <td> 29.165 </td> <td> 55.523 </td> <td> 147.999 </td> <td> 156.714 </td> <td> 102.596 </td> </tr> <tr> <td>pytorch/cuda </td> <td> Nvidia 
</td> <td> gtx 1080 </td> <td> 38.310 </td> <td> 44.382 </td> <td> 137.754 </td> <td> 232.824 </td> <td> 63.324 </td> </tr> <tr> <td>pytorch/opencl </td> <td> Nvidia </td> <td> gtx 1080 </td> <td> 54.828 </td> <td> 85.016 </td> <td> 301.898 </td> <td> 411.928 </td> <td> 173.885 </td> </tr> <tr> <td>dlprimitives </td> <td> Nvidia </td> <td> gtx 1080 </td> <td> 38.804 </td> <td> 71.147 </td> <td> 264.286 </td> <td> 374.168 </td> <td> 134.650 </td> </tr> <tr> <td>tf2/cuda </td> <td> Nvidia </td> <td> gtx 1080 </td> <td> 35.592 </td> <td> 69.071 </td> <td> 189.994 </td> <td> 197.333 </td> <td> 128.526 </td> </tr> </tbody> </table> <h2>Relative Performance</h2> <p>Comparison of TF2/cuda (baseline) with pytorch + opencl/dlprimitives and with dlprimitives alone:</p> <table> <thead> <tr> <th>Baseline </th> <th> tested </th> <th> GPU </th> <th> alexnet </th> <th> resnet18 </th> <th> resnet50 </th> <th> vgg16 </th> <th> mobilenet </th> </tr> </thead> <tbody> <tr> <td>tf2/cuda </td> <td> dlprimitives </td> <td> gtx 1080 </td> <td> 92% </td> <td> 97% </td> <td> 72% </td> <td> 53% </td> <td> 95% </td> </tr> <tr> <td>tf2/cuda </td> <td> pt/opencl </td> <td> gtx 1080 </td> <td> 65% </td> <td> 81% </td> <td> 63% </td> <td> 48% </td> <td> 74% </td> </tr> <tr> <td>tf2/cuda </td> <td> dlprimitives </td> <td> rtx 2060s </td> <td> 73% </td> <td> 82% </td> <td> 79% </td> <td> 36% </td> <td> 114% </td> </tr> <tr> <td>tf2/cuda </td> <td> pt/opencl </td> <td> rtx 2060s </td> <td> 58% </td> <td> 68% </td> <td> 66% </td> <td> 34% </td> <td> 79% </td> </tr> </tbody> </table> <h2>Summary</h2> <p>Besides VGG, most of the results are very reassuring.</p> <h2>Notes</h2> <p>Why do I compare to TF2/cuda as the baseline when pytorch is the faster framework? 
However, since TF is good enough for most users, I want to show that I can get performance close enough to it.</p> </div> Comparing Green and Red Apples http://blog.dlprimitives.org/post/1 <div style="direction:ltr"> <h2>TL;DR</h2> <ul> <li>OpenCL based DLPrimitives is almost as fast as cuDNN based TF in inference and close enough in training.</li> <li>Framework matters - TF is much slower than pytorch.</li> <li>AMD 6600 XT is faster than NVidia 1080 and 2060S by a margin similar to the difference in GFlops between these cards.</li> </ul> <p>While dlprimitives isn't as fast as the best cuDNN based solution (pytorch), its performance makes it more than useful for platform independent deep learning.</p> <h2>How to Compare Different GPUs</h2> <p>Comparing deep learning software performance on NVidia and AMD GPUs isn't as simple as you may think.</p> <p>There are two major factors:</p> <ol> <li>No two GPUs have identical specs. The major parameters are GFlops and memory bandwidth, since most DL algorithms are either compute limited (like dense and convolution layers) or bandwidth limited (like batch normalization or activations).</li> <li>Both companies provide libraries optimized for their GPUs: MIOpen and cuDNN. While both are highly optimized and provide similar functionality, their performance is not similar.</li> </ol> <p>The situation becomes even more complex when it comes to the RDNA architecture. 
<a href="https://github.com/RadeonOpenCompute/ROCm/issues/819">AMD hasn't released support of their DL stack</a> for these GPUs for more than two years.</p> <p>Even so, I decided to check them using dlprimitives.</p> <h2>Base Line</h2> <p>Note that we compare 3 different GPUs that have similar performance within reasonable margins:</p> <p>AMD RX 6600 XT, NVidia GTX 1080, NVidia RTX 2060 Super.</p> <p>The basic flops performance was measured using a custom kernel.</p> <table> <thead> <tr> <th>gpu </th> <th>GFlops </th> <th>GB/s</th> </tr> </thead> <tbody> <tr> <td>6600xt </td> <td>9,937 </td> <td>216 </td> </tr> <tr> <td>1080 </td> <td>8,970 </td> <td>242 </td> </tr> <tr> <td>2060s </td> <td>8,263 </td> <td>396 </td> </tr> </tbody> </table> <p>The flops performance of modern GPUs can be calculated as clock * cores * 2; however, the clock depends on the specific model and its thermal behaviour, so in addition to the manual measurements used as the baseline, the theoretical expected flops were calculated using the median clock observed during the benchmarks.</p> <table> <thead> <tr> <th>gpu </th> <th>Cores</th> <th>Clock Mhz</th> <th>Exp GFlops</th> <th>Exp GB/s</th> </tr> </thead> <tbody> <tr> <td>6600xt </td> <td>2048 </td> <td>2655 </td> <td>10,875 </td> <td>256 </td> </tr> <tr> <td>1080 </td> <td>2560 </td> <td>1809 </td> <td> 9,262 </td> <td>320 </td> </tr> <tr> <td>2060s </td> <td>2176 </td> <td>1905 </td> <td> 8,290 </td> <td>448 </td> </tr> </tbody> </table> <p>So GPU performance varies: while the 2060s has 17-24% fewer GFlops than the 6600xt, it has much higher memory throughput, which helps in bandwidth limited algorithms like batch normalization or the depthwise separable convolutions of mobilenet. 
The 1080 has 10-15% lower GFlops but 12% more bandwidth.</p> <h2>Testing Methodology</h2> <p>Three frameworks were tested using a 64 image batch:</p> <ol> <li>pytorch 1.8 using cuda+cudnn</li> <li>keras/tensorflow 2.5 using cuda+cudnn</li> <li>dlprimitives, the OpenCL based solution</li> </ol> <p>Since there is no ROCm version of TF or pytorch that supports AMD's RDNA GPUs, only dlprimitives was tested on the 6600xt, with the expectation of getting results similar to the other GPUs in the same class.</p> <h2>Training Times</h2> <p>Measured in ms per batch; lower is better.</p> <table> <thead> <tr> <th>Framework </th> <th>gpu </th> <th>alexnet</th> <th>resnet18 </th> <th>resnet50 </th> <th>vgg16 </th> <th>mobilenet</th> </tr> </thead> <tbody> <tr> <td>dlprim </td> <td>6600xt </td> <td> 83.73 </td> <td>231.2 </td> <td>716.2 </td> <td>1157.2 </td> <td>414.35 </td> </tr> <tr> <td>dlprim </td> <td>1080 </td> <td> 93.03 </td> <td>262.1 </td> <td>926.6^ </td> <td>1348.9 </td> <td>614.02 </td> </tr> <tr> <td>dlprim </td> <td>2060s </td> <td>116.41 </td> <td>252.3 </td> <td>705.2^ </td> <td>1681.3 </td> <td>355.21 </td> </tr> <tr> <td>keras/tf2 </td> <td>1080 </td> <td> 70.56 </td> <td>200.6 </td> <td>684.4^ </td> <td> 633.1 </td> <td>437.84 </td> </tr> <tr> <td>keras/tf2 </td> <td>2060s </td> <td> 70.00 </td> <td>172.2 </td> <td>520.0^ </td> <td> 553.1 </td> <td>344.55 </td> </tr> <tr> <td>pytorch </td> <td>1080 </td> <td> 62.37 </td> <td>151.4 </td> <td>518.0 </td> <td> 780.9 </td> <td>229.20 </td> </tr> <tr> <td>pytorch </td> <td>2060s </td> <td> 41.11 </td> <td>121.2 </td> <td>377.8 </td> <td> 621.1^</td> <td>143.23 </td> </tr> </tbody> </table> <p>^) Using half batch x32 twice, due to GPU memory limits</p> <p>Observations:</p> <ol> <li>DLPrimitives achieves 67% of TensorFlow's performance on NVidia GPUs. The biggest difference is in VGG; without VGG it reaches 75% of TF's performance.</li> <li>TF achieves 77% of pytorch's performance. The biggest difference is in VGG, where TF is actually faster than pytorch. 
Without VGG, TF's relative performance drops to 67%.</li> <li>DLPrimitives runs 24% faster on the AMD RX 6600 XT than on the GTX 1080, while the raw GFlops differ by only 10-17% depending on the measurement strategy.</li> <li>DLPrimitives runs 15% faster on the AMD RX 6600 XT than on the RTX 2060S. Notably, the major drop happens on mobilenet, which is highly dependent on memory bandwidth due to its depthwise separable convolutions.</li> </ol> <h2>Inference Times</h2> <p>Measured in ms per batch; lower is better.</p> <table> <thead> <tr> <th>Framework </th> <th>gpu </th> <th>alexnet</th> <th>resnet18 </th> <th>resnet50 </th> <th>vgg16 </th> <th>mobilenet</th> </tr> </thead> <tbody> <tr> <td>dlprim </td> <td>6600xt </td> <td>34.28 </td> <td>63.57 </td> <td>185.72 </td> <td>277.97 </td> <td>102.84 </td> </tr> <tr> <td>dlprim </td> <td>1080 </td> <td>28.03 </td> <td>63.57 </td> <td>274.27 </td> <td>309.28 </td> <td>131.74 </td> </tr> <tr> <td>dlprim </td> <td>2060s </td> <td>47.52 </td> <td>81.09 </td> <td>210.97 </td> <td>428.34 </td> <td> 97.80 </td> </tr> <tr> <td>keras/tf2 </td> <td>1080 </td> <td>40.55 </td> <td>80.64 </td> <td>199.38 </td> <td>189.07 </td> <td>109.85 </td> </tr> <tr> <td>keras/tf2 </td> <td>2060s </td> <td>47.95 </td> <td>75.73 </td> <td>165.31 </td> <td>174.27 </td> <td> 93.01 </td> </tr> <tr> <td>pytorch </td> <td>1080 </td> <td>16.36 </td> <td>43.17 </td> <td>144.88 </td> <td>226.40 </td> <td> 60.13 </td> </tr> <tr> <td>pytorch </td> <td>2060s </td> <td> 9.65 </td> <td>33.27 </td> <td>107.56 </td> <td>172.47 </td> <td> 35.55 </td> </tr> </tbody> </table> <p>Observations:</p> <ol> <li>DLPrimitives achieves 90% of TensorFlow's performance on NVidia GPUs. The biggest difference is in VGG; without VGG it reaches 99% of TF's performance.</li> <li>TF achieves 61% of pytorch's performance. The biggest difference is in VGG, where TF is on par with or faster than pytorch. 
Without VGG, TF's relative performance drops to 49%.</li> <li>DLPrimitives runs 14% faster on the AMD RX 6600 XT than on the GTX 1080, and 26% faster than on the RTX 2060S. This differs somewhat from the training results.</li> </ol> <h2>Summary and Conclusions</h2> <ol> <li>There is a huge difference between DL frameworks: pytorch is faster than TensorFlow by large margins.</li> <li>DLPrimitives provides decent performance comparable to TF (losing ~25% of performance in training and ~10% in inference).</li> <li>The 6600XT gives decent performance with dlprimitives, comparable to the nVidia 1080/2060s, with a performance gap that matches the difference in GFlops between the cards.</li> </ol> </div>
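As a side note, the "Exp GFlops" column in the baseline table above can be reproduced from the clock * cores * 2 formula the post gives. A minimal sketch (the core counts and median observed clocks are taken from the post's own table; the published figures may differ by one in the last digit due to rounding):

```python
# Expected peak FP32 throughput: cores * clock * 2, since one FMA
# counts as 2 flops per cycle. Core counts and median observed clocks
# are copied from the baseline table in the post.
gpus = {
    "6600xt": {"cores": 2048, "clock_mhz": 2655},
    "1080":   {"cores": 2560, "clock_mhz": 1809},
    "2060s":  {"cores": 2176, "clock_mhz": 1905},
}

def expected_gflops(cores, clock_mhz):
    # cores * (clock_mhz * 1e6 cycles/s) * 2 flops/cycle, in GFlops:
    # the 1e6 and 1e9 factors cancel down to a single /1000.
    return cores * clock_mhz * 2 / 1000

for name, spec in gpus.items():
    print(f"{name}: {expected_gflops(**spec):,.0f} GFlops")
```

Running this gives roughly 10,875 / 9,262 / 8,290 GFlops, matching the table within rounding.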