<h1>Comparing Green and Red Apples</h1>
<p><a href="http://blog.dlprimitives.org/post/1">http://blog.dlprimitives.org/post/1</a></p>

<h2>TL;DR</h2>
<ul>
<li>OpenCL based DLPrimitives is almost as fast as cuDNN based TF in inference and close enough in training.</li>
<li>The framework matters: TF is much slower than pytorch.</li>
<li>The AMD RX 6600 XT is faster than the NVidia GTX 1080 and RTX 2060S by a margin similar to the difference in GFlops between these cards.</li>
</ul>
<p>While dlprimitives isn't as fast as the best cuDNN based solution (pytorch), its performance makes it more than useful for platform independent deep learning.</p>

<h2>How to Compare Different GPUs</h2>
<p>Comparing deep learning software performance on NVidia and AMD GPUs isn't as simple as you may think.</p>
<p>There are two major factors:</p>
<ol>
<li>No two GPUs have identical specs. The major parameters are GFlops and memory bandwidth, as most DL algorithms are either compute limited (like dense and convolution layers) or bandwidth limited (like batch normalization or activations).</li>
<li>Both companies provide libraries optimized for their GPUs: MIOpen and cuDNN. While both are highly optimized and provide similar functionality, they don't have similar performance.</li>
</ol>
<p>The situation becomes even more complex when it comes to the RDNA architecture: <a href="https://github.com/RadeonOpenCompute/ROCm/issues/819">AMD hasn't released support for these GPUs in their DL stack</a> for more than two years.</p>
<p>Even so, I decided to check it using dlprimitives.</p>

<h2>Base Line</h2>
<p>Note that we compare 3 different GPUs that have similar performance within reasonable margins: AMD RX 6600 XT, NVidia GTX 1080 and NVidia RTX 2060 Super.</p>
<p>The basic flops performance was measured using a custom kernel:</p>
<table>
<thead>
<tr><th>gpu</th><th>GFlops</th><th>GB/s</th></tr>
</thead>
<tbody>
<tr><td>6600xt</td><td>9,937</td><td>216</td></tr>
<tr><td>1080</td><td>8,970</td><td>242</td></tr>
<tr><td>2060s</td><td>8,263</td><td>396</td></tr>
</tbody>
</table>
<p>The flops performance of a modern GPU can be calculated as clock * cores * 2. However, the clock depends on the specific model and its thermal behavior, so both the manual measurements above were used as a baseline, and theoretical expected flops were calculated from the median clock observed during the benchmarks (see the sketch at the end of this section):</p>
<table>
<thead>
<tr><th>gpu</th><th>Cores</th><th>Clock MHz</th><th>Exp GFlops</th><th>Exp GB/s</th></tr>
</thead>
<tbody>
<tr><td>6600xt</td><td>2048</td><td>2655</td><td>10,875</td><td>256</td></tr>
<tr><td>1080</td><td>2560</td><td>1809</td><td>9,262</td><td>320</td></tr>
<tr><td>2060s</td><td>2176</td><td>1905</td><td>8,290</td><td>448</td></tr>
</tbody>
</table>
<p>So GPU performance varies. While the 2060S has 17-24% less GFlops than the 6600 XT, it has much higher memory throughput, which helps in bandwidth limited algorithms like batch normalization or the depthwise separable convolutions of mobilenet. The 1080 has 10-15% lower GFlops but 12% more bandwidth.</p>
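<p>To make the estimate concrete, here is a small Python sketch (my own illustration, not part of the benchmark code; core counts and median clocks are copied from the table above) that closely reproduces the "Exp GFlops" column, counting one fused multiply-add as two floating point operations:</p>
<pre><code># Expected GFlops = cores * clock[GHz] * 2 (one FMA per core per cycle)
gpus = {
    # name: (cores, median observed clock in MHz)
    "6600xt": (2048, 2655),
    "1080":   (2560, 1809),
    "2060s":  (2176, 1905),
}

for name, (cores, clock_mhz) in gpus.items():
    gflops = cores * (clock_mhz / 1000.0) * 2
    print(f"{name:7s} {gflops:,.0f} expected GFlops")
</code></pre>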
<h2>Testing Methodology</h2>
<p>Three frameworks were tested using a 64 image batch (a sketch of the timing loop appears after the observations below):</p>
<ol>
<li>pytorch 1.8 using cuda+cudnn</li>
<li>keras/tensorflow 2.5 using cuda+cudnn</li>
<li>dlprimitives, the OpenCL based solution</li>
</ol>
<p>Since there is no ROCm version of TF or pytorch that supports AMD's RDNA GPUs, only dlprimitives was tested on the 6600 XT, with the expectation of results similar to the other GPUs in its class.</p>

<h2>Training Times</h2>
<p>Measured in ms per batch, lower is better.</p>
<table>
<thead>
<tr><th>Framework</th><th>gpu</th><th>alexnet</th><th>resnet18</th><th>resnet50</th><th>vgg16</th><th>mobilenet</th></tr>
</thead>
<tbody>
<tr><td>dlprim</td><td>6600xt</td><td>83.73</td><td>231.2</td><td>716.2</td><td>1157.2</td><td>414.35</td></tr>
<tr><td>dlprim</td><td>1080</td><td>93.03</td><td>262.1</td><td>926.6^</td><td>1348.9</td><td>614.02</td></tr>
<tr><td>dlprim</td><td>2060s</td><td>116.41</td><td>252.3</td><td>705.2^</td><td>1681.3</td><td>355.21</td></tr>
<tr><td>keras/tf2</td><td>1080</td><td>70.56</td><td>200.6</td><td>684.4^</td><td>633.1</td><td>437.84</td></tr>
<tr><td>keras/tf2</td><td>2060s</td><td>70.00</td><td>172.2</td><td>520.0^</td><td>553.1</td><td>344.55</td></tr>
<tr><td>pytorch</td><td>1080</td><td>62.37</td><td>151.4</td><td>518.0</td><td>780.9</td><td>229.20</td></tr>
<tr><td>pytorch</td><td>2060s</td><td>41.11</td><td>121.2</td><td>377.8</td><td>621.1^</td><td>143.23</td></tr>
</tbody>
</table>
<p>^) Run twice with half batch (32), due to GPU memory limits.</p>
<p>Observations:</p>
<ol>
<li>DLPrimitives reaches about 67% of TensorFlow performance on NVidia GPUs. The biggest difference is in VGG; excluding VGG it reaches 75% of TF performance.</li>
<li>TF reaches about 77% of pytorch performance. The biggest difference is again in VGG; excluding VGG the gap widens and TF reaches only 67%.</li>
<li>DLPrimitives runs 24% faster on the AMD RX 6600 XT than on the GTX 1080, while the raw GFlops differ by only 10-17% depending on the measurement strategy.</li>
<li>DLPrimitives runs 15% faster on the AMD RX 6600 XT than on the RTX 2060S. The most noticeable drop is on mobilenet, which is highly dependent on memory bandwidth due to its depthwise separable convolutions.</li>
</ol>
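<p>For reference, a per-batch training time like the ones above is essentially the wall-clock time of a full forward/backward/update step. A minimal pytorch sketch of such a timing loop (my own illustration, not the exact benchmark script; the model choice, warmup count and synchronization are assumptions):</p>
<pre><code>import time
import torch
import torchvision.models as models

model = models.resnet18().cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

# 64 image batch with ImageNet-shaped inputs, as in the post
x = torch.randn(64, 3, 224, 224, device="cuda")
y = torch.randint(0, 1000, (64,), device="cuda")

for _ in range(5):  # warmup: let cuDNN pick algorithms, stabilize clocks
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

torch.cuda.synchronize()
start = time.time()
iters = 20
for _ in range(iters):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
print(f"{1000 * (time.time() - start) / iters:.1f} ms/batch")
</code></pre>
<p>The synchronize() calls matter: CUDA kernels launch asynchronously, so stopping the clock without synchronizing would mostly measure launch overhead rather than GPU time.</p>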
<h2>Inference Times</h2>
<p>Measured in ms per batch, lower is better.</p>
<table>
<thead>
<tr><th>Framework</th><th>gpu</th><th>alexnet</th><th>resnet18</th><th>resnet50</th><th>vgg16</th><th>mobilenet</th></tr>
</thead>
<tbody>
<tr><td>dlprim</td><td>6600xt</td><td>34.28</td><td>63.57</td><td>185.72</td><td>277.97</td><td>102.84</td></tr>
<tr><td>dlprim</td><td>1080</td><td>28.03</td><td>63.57</td><td>274.27</td><td>309.28</td><td>131.74</td></tr>
<tr><td>dlprim</td><td>2060s</td><td>47.52</td><td>81.09</td><td>210.97</td><td>428.34</td><td>97.80</td></tr>
<tr><td>keras/tf2</td><td>1080</td><td>40.55</td><td>80.64</td><td>199.38</td><td>189.07</td><td>109.85</td></tr>
<tr><td>keras/tf2</td><td>2060s</td><td>47.95</td><td>75.73</td><td>165.31</td><td>174.27</td><td>93.01</td></tr>
<tr><td>pytorch</td><td>1080</td><td>16.36</td><td>43.17</td><td>144.88</td><td>226.40</td><td>60.13</td></tr>
<tr><td>pytorch</td><td>2060s</td><td>9.65</td><td>33.27</td><td>107.56</td><td>172.47</td><td>35.55</td></tr>
</tbody>
</table>
<p>Observations:</p>
<ol>
<li>DLPrimitives reaches about 90% of TensorFlow performance on NVidia GPUs. The biggest difference is in VGG; excluding VGG it reaches 99% of TF performance (a sketch of this kind of computation appears at the end of the post).</li>
<li>TF reaches about 61% of pytorch performance. The biggest difference is again in VGG; excluding VGG the gap widens and TF reaches only 49%.</li>
<li>DLPrimitives runs 14% faster on the AMD RX 6600 XT than on the GTX 1080, and 26% faster than on the RTX 2060S. This differs somewhat from the training results.</li>
</ol>

<h2>Summary and Conclusions</h2>
<ol>
<li>There is a huge difference between DL frameworks: pytorch is faster than TensorFlow by large margins.</li>
<li>DLPrimitives provides decent performance comparable to TF, losing ~25% of performance in training and ~10% in inference.</li>
<li>The 6600 XT gives dlprimitives performance comparable to the NVidia 1080/2060S, with an advantage roughly matching the difference in GFlops between the cards.</li>
</ol>
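<p>For completeness, per-network framework ratios like the percentages quoted above can be computed directly from the tables. A small Python sketch using the GTX 1080 inference row; the exact averaging behind the headline numbers is not stated in the post, so the simple mean here is an assumption and may differ slightly:</p>
<pre><code># Hypothetical reconstruction: dlprim vs. keras/tf2 inference times
# (ms/batch) on the GTX 1080, copied from the table above.
nets   = ["alexnet", "resnet18", "resnet50", "vgg16", "mobilenet"]
dlprim = [28.03, 63.57, 274.27, 309.28, 131.74]
tf2    = [40.55, 80.64, 199.38, 189.07, 109.85]

# Relative speed of dlprim vs. TF per network: &gt;1 means dlprim is faster.
ratios = [t / d for d, t in zip(dlprim, tf2)]
for net, r in zip(nets, ratios):
    print(f"{net:10s} dlprim/TF relative speed: {r:.2f}")

# Simple mean, with and without the VGG outlier (averaging method assumed).
mean_all = sum(ratios) / len(ratios)
no_vgg = [r for net, r in zip(nets, ratios) if net != "vgg16"]
print(f"mean: {mean_all:.2f}, without vgg16: {sum(no_vgg) / len(no_vgg):.2f}")
</code></pre>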