I found some nice optimizations for bounds testing in shaders. Basically, my motion blur was blurring over the edge of the screen, resulting either in black samples when I used texelFetch(), which gave a dark aura around the screen when in motion, or in samples clamped to the edge pixel when I used texture(), which inflated the weight of the edge pixels. Neither of these looked very good, so I decided to simply discard the samples that fell outside the screen. However, detecting these out-of-bounds samples was surprisingly expensive.
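To make the problem concrete, here is a minimal sketch of roughly what the naive sampling loop looks like. This is not the actual shader; the names colorTexture, blurVec and fragColor, the uniform motion vector and the 16-sample count are my assumptions for illustration.

```glsl
#version 330

uniform sampler2D colorTexture;  // assumed name for the scene color texture
uniform vec2 blurVec;            // assumed per-pixel motion vector, in pixels

out vec4 fragColor;

void main(){
    const int NUM_SAMPLES = 16;  // the post uses 16 samples per pixel
    vec3 color = vec3(0.0);
    for(int i = 0; i < NUM_SAMPLES; i++){
        // step along the motion vector; out-of-bounds coordinates are the problem
        ivec2 texCoords = ivec2(gl_FragCoord.xy + blurVec * float(i));
        // texelFetch() on out-of-bounds coordinates is undefined and in
        // practice reads back as black, which darkens pixels near the edges
        color += texelFetch(colorTexture, texCoords, 0).rgb;
    }
    fragColor = vec4(color / float(NUM_SAMPLES), 1.0);
}
```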
Simply using an if-statement to test whether the coordinates are inside the screen was dead slow, as it compiled to one branch per sample:
```glsl
if(texCoords.x >= 0 && texCoords.y >= 0 && texCoords.x < resolution.x && texCoords.y < resolution.y){
    samples += 1.0;
}
```
However, there's a trick you can use. By casting the resulting boolean to a float, we get 1.0 if the test passes and 0.0 if it fails. Exactly what we want!
```glsl
samples += float(texCoords.x >= 0 && texCoords.y >= 0 && texCoords.x < resolution.x && texCoords.y < resolution.y);
```
Sadly, this still compiles to the same thing as the if-statement. That's weird, since I know that GPUs have dedicated instructions for setting the value of a register based on a simple comparison. As an example, this line:
```glsl
float x = float(someValue < 1);
```
compiles to this instruction:
```
x: SETGT R0.x, 1.0f, PV0.x
```
It would seem that the boolean &&-operators are messing this up, causing the compiler to fall back to branches. Let's try casting the result of each comparison to a float and then using a simple float multiply between them to effectively AND them together:
```glsl
samples += float(texCoords.x >= 0)*float(texCoords.y >= 0)*float(texCoords.x < resolution.x)*float(texCoords.y < resolution.y);
```
Bam! The comparison compiles to 2 SETGE (set if greater-or-equal) instructions, 2 SETGT (set if greater-than) instructions and 3 multiplies. I need to do this 16 times per pixel, once for each sample, so this saves a load of work! There is one final optimization we can make to improve this code on AMD's vector GPUs. AMD's older GPUs are a bit funny in that they run each shader on 4 or 5 different cores at the same time, trying to execute as many independent instructions as possible in parallel. This code:
```glsl
float x = a + b + c + d;
```
would fit this architecture extremely badly. GLSL requires the GPU to preserve the order of the operations, so none of these additions can be run in parallel. First we do a+b, then (a+b)+c, then finally ((a+b)+c)+d, which requires 3 cycles. If we add some "unnecessary" parentheses, we can encourage vector GPUs to do these additions in parallel without affecting the performance of scalar GPUs that don't have this problem:
```glsl
float x = (a + b) + (c + d);
```
This only takes 2 cycles, as a+b and c+d can both be calculated in the first cycle, and then (a+b)+(c+d) can be calculated in the second cycle, making this chain of additions 50% faster. Doing this for the bounds testing gives this code:
```glsl
samples += (float(texCoords.x >= 0)*float(texCoords.y >= 0))*(float(texCoords.x < resolution.x)*float(texCoords.y < resolution.y));
```
Theoretical performance of the 4 versions of bounds checking done for 16 samples on a Radeon HD 6870:
1. if-statement: 2.04 pixels per second
2. float cast of the &&-ed test: 2.02 pixels per second
3. separate float casts multiplied together: 11.20 pixels per second
4. multiplies paired with parentheses: 11.79 pixels per second
All in all, that's a 5.78x improvement compared to a naive if-statement.
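To put the pieces together, here is a minimal sketch of how the final bounds test might slot into the blur loop. The accumulation into color, the division by samples at the end and the names colorTexture, blurVec and NUM_SAMPLES are my assumptions about the surrounding code, not taken from the original shader; only resolution and the bounds test itself come from the snippets above.

```glsl
#version 330

uniform sampler2D colorTexture;  // assumed name for the scene color texture
uniform vec2 blurVec;            // assumed per-pixel motion vector, in pixels
uniform vec2 resolution;         // screen resolution, in pixels

out vec4 fragColor;

void main(){
    const int NUM_SAMPLES = 16;
    vec3 color = vec3(0.0);
    float samples = 0.0;
    for(int i = 0; i < NUM_SAMPLES; i++){
        vec2 texCoords = gl_FragCoord.xy + blurVec * float(i);
        // the branch-free bounds test from above, with the multiplies paired up
        float inBounds = (float(texCoords.x >= 0.0) * float(texCoords.y >= 0.0))
                       * (float(texCoords.x < resolution.x) * float(texCoords.y < resolution.y));
        // texture() with clamp-to-edge keeps the fetch defined; the weight
        // zeroes out any sample that fell outside the screen
        color   += texture(colorTexture, texCoords / resolution).rgb * inBounds;
        samples += inBounds;
    }
    // divide by the number of valid samples; max() guards against the
    // corner case where every sample landed off-screen
    fragColor = vec4(color / max(samples, 1.0), 1.0);
}
```

Note that it's the weight, not the fetch, that gets zeroed, so the sampler still has to return something defined for off-screen coordinates; that's why this sketch fetches with texture() and clamp-to-edge rather than texelFetch().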