A few months ago I saw these slides: http://www.slideshare.net/DevCentralAMD/holy-smoke-faster-particle-rendering-using-direct-compute-by-gareth-thomas
Apparently they found that foregoing rasterization of particles and instead going with a tiled compute shader was actually faster than hardware blending. In essence they divided the screen into tiles, binned all particles to said tiles and then had a compute shader "rasterize" those particles, blending into a vec4 completely inside the shader (no read-modify-write to VRAM). They also implemented sorting of particles in the compute shader.
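To make the binning step concrete, here is a toy CPU-side sketch of the idea as I understand it from the slides: split the screen into fixed-size tiles and append each particle's index to every tile its screen-space bounding box overlaps. The tile size, function name and particle representation are all made up for illustration — the real thing runs in a compute shader.

```python
TILE_SIZE = 32  # pixels per tile side (assumed, not from the slides)

def bin_particles(particles, screen_w, screen_h):
    """particles: list of (x, y, radius) in screen space.
    Returns per-tile lists of particle indices."""
    tiles_x = (screen_w + TILE_SIZE - 1) // TILE_SIZE
    tiles_y = (screen_h + TILE_SIZE - 1) // TILE_SIZE
    bins = [[] for _ in range(tiles_x * tiles_y)]
    for idx, (x, y, r) in enumerate(particles):
        # Clamp the particle's bounding box to the tile grid.
        x0 = max(int((x - r) // TILE_SIZE), 0)
        x1 = min(int((x + r) // TILE_SIZE), tiles_x - 1)
        y0 = max(int((y - r) // TILE_SIZE), 0)
        y1 = min(int((y + r) // TILE_SIZE), tiles_y - 1)
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins[ty * tiles_x + tx].append(idx)
    return bins, tiles_x, tiles_y
```

Each compute shader workgroup would then only walk the particle list of its own tile, which is what makes the in-register blending possible.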
I took a slightly different approach. I've been experimenting with quite a few order-independent transparency algorithms over the past year or so: I have stochastic transparency, adaptive OIT, Fourier-mapped OIT (hopefully I'll get around to posting my bachelor thesis on that soon) and a simple reference renderer with CPU-side sorting.

As a first test, I tried merging all three passes of stochastic transparency into a single compute shader. Instead of writing to 8x RGBA16F render targets in the first pass, reading all those textures back in the second pass, and finally doing the weighted-average resolve in a fullscreen pass, I simply keep 8 vec4s in the compute shader, immediately do the second pass writing to local variables, and finally do the resolve and output the final RGBA of all particles blended together correctly — all in one (not so) massive shader.

I currently lack the tile binning, and a lot of calculations are still done on the CPU and need heavy optimization, but GPU performance looks very promising. In some cases, with a large number of particles covering the entire screen, the compute shader achieves almost twice the framerate of my old algorithm, and even more impressively, at the same time it reduces memory controller load from 80% to a meager 2%.

The next step would be to port adaptive OIT to a compute shader. That would be even more interesting, as it would eliminate the need for a linked list: I can just build the visibility curve as I process the particles. In theory this would allow AOIT to work on OpenGL 3.3 hardware if I emulate the compute shader with a fullscreen fragment shader pass.
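To show what "all three passes in registers" means, here is a toy per-pixel model of stochastic transparency on the CPU. The 8 local depth values stand in for my 8 render targets, and everything — stochastic depth, weighted accumulation, and the resolve — happens in local variables with a single write at the end. This is a simplified sketch, not my actual shader: the layer count, the survival test and the resolve weighting are illustrative.

```python
import random

def shade_pixel(fragments, background, num_layers=8):
    """fragments: list of (depth, rgb, alpha) covering this pixel."""
    # "Pass 1": build stochastic depth layers in registers -- each fragment
    # survives into a layer with probability equal to its alpha.
    layers = [float('inf')] * num_layers  # stand-in for the 8 RGBA16F targets
    for depth, _, alpha in fragments:
        for i in range(num_layers):
            if random.random() < alpha and depth < layers[i]:
                layers[i] = depth
    # "Pass 2": accumulate color weighted by alpha and stochastic visibility.
    sum_c = [0.0, 0.0, 0.0]
    sum_w = 0.0
    for depth, rgb, alpha in fragments:
        vis = sum(1 for l in layers if depth <= l) / num_layers
        w = alpha * vis
        sum_w += w
        for k in range(3):
            sum_c[k] += w * rgb[k]
    # "Pass 3": weighted-average resolve over the background, using the
    # exact total transmittance of all fragments.
    transmit = 1.0
    for _, _, alpha in fragments:
        transmit *= (1.0 - alpha)
    if sum_w == 0.0:
        return background
    total_alpha = 1.0 - transmit
    return tuple(sum_c[k] / sum_w * total_alpha + transmit * background[k]
                 for k in range(3))
```

The point is that nothing between the three "passes" ever touches VRAM, which is where the memory controller savings come from.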
The biggest problem with this approach is that I would need every single piece of transparent geometry available in the compute shader, and I wouldn't be able to use different shaders for different particles. However, it would be possible to use the tiled compute shader only to construct the visibility curve for AOIT, output the curve to textures, and then proceed with the second pass as usual. That would allow me to have fairly complex shaders for the particles (as long as they don't modify alpha in a complex way) while still keeping the flexibility of my old system.
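Roughly, the visibility-curve bookkeeping I have in mind looks something like this on the CPU — insert each fragment as a (depth, transmittance) node, scale everything behind it, and when the fixed node budget overflows, drop the node whose removal changes the area under the curve the least. The node budget and function name are made up for illustration, and this is my reading of the adaptive OIT idea rather than a faithful copy of it.

```python
def insert_fragment(nodes, depth, alpha, max_nodes=4):
    """nodes: depth-sorted list of (depth, transmittance) pairs defining a
    step-wise visibility curve. Returns the curve with the fragment merged in."""
    # Transmittance just in front of the new fragment.
    trans_before = 1.0
    for d, t in nodes:
        if d < depth:
            trans_before = t
    # Insert the new node and scale every node at or behind it by (1 - alpha).
    new_nodes = []
    inserted = False
    for d, t in nodes:
        if d < depth:
            new_nodes.append((d, t))
        else:
            if not inserted:
                new_nodes.append((depth, trans_before * (1.0 - alpha)))
                inserted = True
            new_nodes.append((d, t * (1.0 - alpha)))
    if not inserted:
        new_nodes.append((depth, trans_before * (1.0 - alpha)))
    # Compress back under the node budget: remove the interior node whose
    # removal changes the area under the curve the least.
    while len(new_nodes) > max_nodes:
        best_i, best_err = None, float('inf')
        for i in range(1, len(new_nodes) - 1):
            err = (new_nodes[i - 1][1] - new_nodes[i][1]) * \
                  (new_nodes[i + 1][0] - new_nodes[i][0])
            if err < best_err:
                best_i, best_err = i, err
        del new_nodes[best_i]
    return new_nodes
```

Since the curve is just a handful of nodes, the tiled pass could write it out to a couple of textures and the second pass would sample it per fragment, no linked list needed.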
I don't really have any pictures I'm proud of to show all this off yet, but hopefully I'll get some nice screenshots in the end. >___<
Several questions about your rendering:
- What did you need 8 render targets for in the first pass?
- How does your new render system work with the compute shader? Do you still output to the render targets, with each instance of the compute shader then processing the 8 vec4s across the 8 render targets?
- How is memory controller load reduced from 80% to 2%? Isn't the same data being processed by the compute shaders instead of fragment shaders now?
I've been getting very interested in how OIT works!