In this video we continue our discussion on parallel histogram computation, which is a commonly used parallel programming pattern.
We discuss:
1) Using per-thread registers to reduce the number of atomic updates the kernel has to perform (an application of register tiling)
2) How to make full use of registers in CUDA kernels (so that no bits in a register are wasted)
3) How to use warp-level primitives to perform a block-level reduction
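
To make the three points above concrete, here is a minimal sketch of how they might fit together. It is not necessarily the kernel shown in the video. The names `histogramKernel`, `warpReduceSum`, and `gpuHistogram`, the 8-bin layout, the `value % NUM_BINS` binning rule, and the choice to pack two 16-bit counters into each 32-bit register are all assumptions made for illustration. The block size is assumed to be a multiple of 32, and CUDA error checking is left out for brevity.

```cuda
#include <algorithm>
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int NUM_BINS = 8;      // assumption: few enough bins to keep counters in registers
constexpr int BLOCK_SIZE = 256;  // assumption: must be a multiple of the warp size (32)

// Sum v across the 32 lanes of a warp; lane 0 ends up holding the total.
__device__ unsigned int warpReduceSum(unsigned int v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}

__global__ void histogramKernel(const unsigned char* data, int n, unsigned int* hist) {
    __shared__ unsigned int blockHist[NUM_BINS];
    if (threadIdx.x < NUM_BINS) blockHist[threadIdx.x] = 0;
    __syncthreads();

    // (1) + (2): private counters, two 16-bit counts packed per 32-bit register.
    // Constant indices after unrolling let the compiler keep the array in registers.
    // Each thread must see at most 65535 elements so a packed count cannot overflow.
    unsigned int packed[NUM_BINS / 2] = {0};
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x) {
        unsigned int bin = data[i] % NUM_BINS;
        #pragma unroll
        for (unsigned int r = 0; r < NUM_BINS / 2; ++r) {
            packed[r] += (bin == 2 * r) ? 1u : 0u;              // low half: even bin
            packed[r] += (bin == 2 * r + 1) ? (1u << 16) : 0u;  // high half: odd bin
        }
    }

    // (3): unpack to full 32-bit values, reduce within each warp using shuffles,
    // then issue one shared-memory atomic per warp per bin instead of one per element.
    #pragma unroll
    for (unsigned int r = 0; r < NUM_BINS / 2; ++r) {
        unsigned int even = warpReduceSum(packed[r] & 0xffffu);
        unsigned int odd = warpReduceSum(packed[r] >> 16);
        if ((threadIdx.x & 31) == 0) {
            if (even) atomicAdd(&blockHist[2 * r], even);
            if (odd) atomicAdd(&blockHist[2 * r + 1], odd);
        }
    }
    __syncthreads();

    // One global atomic per bin per block.
    if (threadIdx.x < NUM_BINS && blockHist[threadIdx.x])
        atomicAdd(&hist[threadIdx.x], blockHist[threadIdx.x]);
}

// Hypothetical host helper: copies the input to the GPU and returns the histogram.
std::vector<unsigned int> gpuHistogram(const std::vector<unsigned char>& input) {
    const int n = static_cast<int>(input.size());
    const int grid = std::max(1, std::min(64, (n + BLOCK_SIZE - 1) / BLOCK_SIZE));
    // Guard the 16-bit packed counters: no thread may process more than 65535 elements.
    assert(static_cast<long long>(n) <= static_cast<long long>(grid) * BLOCK_SIZE * 65535LL);

    unsigned char* dData = nullptr;
    unsigned int* dHist = nullptr;
    cudaMalloc(&dData, n);
    cudaMalloc(&dHist, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(dData, input.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(dHist, 0, NUM_BINS * sizeof(unsigned int));

    histogramKernel<<<grid, BLOCK_SIZE>>>(dData, n, dHist);

    std::vector<unsigned int> hist(NUM_BINS);
    cudaMemcpy(hist.data(), dHist, NUM_BINS * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dHist);
    return hist;
}
```

In this sketch the counts are unpacked before the warp reduction because summing packed values across 32 lanes could overflow the 16-bit halves. Whether the counter array really stays in registers depends on the compiler and on `NUM_BINS`, so it is worth checking the generated code for local-memory spills.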