A few months ago I experimented with CPU particles drawn using OpenGL and tried to get the best possible performance out of them. I added all sorts of features to make it as quick as possible, for example multi-threading support, keeping particle data in "structs" using Riven's MappedObject, etc. After improving it little by little, I decided to tackle its biggest problem so far: the inflexibility of having a large buffer full of randomly dying particles.
Basically the problem is that any one particle can die at any moment due to its life time running out, leaving a gap in the particle buffer. If I want to draw it all with glDrawArrays, I'd either have to compact the buffer on the CPU or discard dead particles in a geometry shader, but in the shader approach, I'd have to send a lot more data to OpenGL each frame, not to mention the insane fragmentation that builds up in the buffer. I decided to try to keep the particle buffer compacted each frame using the CPU instead.
I'm going to simplify the example a little. Pretend I'm using a Particle[] array to store my Particle instances. I keep track of its capacity (length) and its current size (used space), similar to a Buffer instance. If the array is full and I try to add a new Particle, I double the capacity, growing it much like an ArrayList does.
Now, the real difference in how I do things is how I update my particles. Particle updating happens partly in my createParticle() method (!!!). Why? I do the updating there so I can locate a dying particle and return it instead of just adding a new one at the end. This obviously reduces fragmentation and reuses objects, but not that much. In an optimal scenario where 100 particles die per frame and 100 particles are created per frame, the array stays completely compact, but that is pretty unrealistic for explosions, bursts, etc. Just having random lifetimes pretty much guarantees that the number of dying and created particles will be different each frame.
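To make the createParticle() idea a bit more concrete, here is a minimal sketch. It is not my actual code: it uses plain Particle objects instead of MappedObject, and the names ParticlePool, beginFrame(), updated and life are made up for the example. A particle here is nothing but a remaining lifetime in frames.

```java
import java.util.Arrays;

// Hypothetical minimal particle: just a remaining lifetime in frames.
final class Particle {
    int life;

    Particle(int life) {
        this.life = life;
    }

    // Advances the particle by one frame; returns true if it is still alive.
    boolean update() {
        return --life > 0;
    }
}

// Hypothetical pool: a growable Particle[] plus an "updated so far" cursor.
final class ParticlePool {
    Particle[] particles = new Particle[4];
    int size;    // number of slots in use
    int updated; // slots [0, updated) are already updated this frame and alive

    // Call once at the start of every frame, before creating particles.
    void beginFrame() {
        updated = 0;
    }

    Particle createParticle(int life) {
        // Keep updating existing particles until one of them dies,
        // then hand that object back out instead of allocating.
        while (updated < size) {
            Particle p = particles[updated++];
            if (!p.update()) {
                p.life = life; // reuse the dead particle in place
                return p;
            }
        }
        // Nothing died: append, doubling the capacity when the array is full.
        if (size == particles.length) {
            particles = Arrays.copyOf(particles, size * 2);
        }
        Particle p = new Particle(life);
        particles[size++] = p;
        updated = size; // the new particle must not be updated again this frame
        return p;
    }
}
```

After the last createParticle() call of a frame, everything in [0, updated) is alive and compact, and everything from updated up to size still needs its update.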
So after creating all new particles, we have a half-updated Particle array, which is compact up to the last updated Particle. I have a second method (finishUpdate()) which updates the remaining particles. First it just keeps updating particles until it encounters one that dies. Then it keeps updating and counts how many particles die in a row (often just one, but it could be very many too). Then it keeps updating and counts how many particles in a row do NOT die. These still-alive particles are copied down to a lower index with System.arraycopy() to keep the array compact. I then repeat these last two steps until all Particles are updated. Haha, I guess that wasn't very clear...
TL;DR: I basically avoid the System.arraycopy on each remove that would've happened if I used ArrayList.remove(), and instead copy each alive Particle instance at most once (or not at all if its part of the array was already kept compact by newly created particles in the first step).
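Since the description above is hard to follow, here is a rough sketch of the run-based compaction. It is simplified on purpose: an int[] of remaining lifetimes stands in for the real particle data, and the class name Compactor and the from parameter (the index where the first step stopped) are assumptions for the example.

```java
// Hypothetical stand-in for finishUpdate(): particles are just remaining
// lifetimes stored in an int[].
final class Compactor {

    // Ages particle i by one frame; returns true if it is still alive.
    private static boolean update(int[] life, int i) {
        return --life[i] > 0;
    }

    // Updates life[from, size) exactly once per element and moves each run
    // of survivors down with a single arraycopy. Returns the new size.
    static int finishUpdate(int[] life, int from, int size) {
        int write = from; // end of the compacted, alive prefix
        int read = from;  // next particle that still needs its update
        while (read < size) {
            // Skip a run of dying particles.
            while (read < size && !update(life, read)) {
                read++;
            }
            if (read == size) {
                break;
            }
            // life[read] survived and is already updated: it starts a run.
            int runStart = read++;
            boolean hitDead = false;
            while (read < size) {
                if (!update(life, read)) {
                    hitDead = true;
                    break;
                }
                read++;
            }
            int runLength = read - runStart;
            if (write != runStart) {
                // Overlapping copies within one array are safe with arraycopy.
                System.arraycopy(life, runStart, life, write, runLength);
            }
            write += runLength;
            if (hitDead) {
                read++; // the dead particle that ended the run is already updated
            }
        }
        return write;
    }
}
```

For example, lifetimes {3, 1, 1, 4, 5, 1, 2} age to {2, 0, 0, 3, 4, 0, 1}, so the call returns 4 and the array starts with 2, 3, 4, 1. The run {3, 4} is moved once and the trailing 1 is moved once; nothing is copied twice.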
Obviously things are slightly more complicated compared to how I described it above. I'm using Riven's MappedObject, so I also have to keep track of a ByteBuffer and copy around the data needed by the GPU (I have a Particle instance paired with a MappedObject containing only the data relevant to the GPU as it is much faster).
My original particle test simply created a new particle every time a particle died, which kept the number of particles constant and the buffer compacted. My current test, a firework simulator, creates particles for firework trails/tracers, and also lots of particles when the firework explodes. All particles have very random life times, but I do shoot fireworks at regular intervals, so the amount of reuse is still quite high.
The performance is great, being only marginally slower than my original particle test. The firework test only runs single-threaded at the moment, so I'm comparing it to my old particle test using only one thread too.
- My old test runs very stable at 72-73 FPS with 510.000 particles.
- The firework test runs at 69-72 FPS, with particles varying between 500.000 and 525.000.
In short: the performance is excellent compared to my old static test. Oh, did I mention that the fireworks look awesome?
How would you handle a dynamic particle system? Is there a better way? Keeping everything on the graphics card and updating it using draw commands would obviously be a lot faster, but is there any other way of handling the data on the CPU? I thought about not keeping the array compacted, and instead generating a list of indices containing only the currently alive particles. This would however force me to loop over the whole array when updating, even if only a few Particles are alive, which would make a frame that renders just a handful of particles needlessly slow. I'd also have to build that index array every frame, which would be pretty slow if I have many particles. Finally I'd have to send the whole data buffer to OpenGL each frame, instead of only the used part. I'm pretty sure that would be slower...
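For comparison, the index-list alternative would boil down to something like the sketch below. I haven't implemented or measured this; the class and method names are invented, and particles are again reduced to an int[] of remaining lifetimes.

```java
// Hypothetical sketch of the non-compacting alternative: leave dead particles
// where they are and collect the indices of the alive ones every frame.
final class AliveIndexBuilder {

    // Writes the index of every alive particle (life > 0) into 'indices'
    // and returns how many were written. 'indices' needs room for 'size' ints.
    static int buildAliveIndices(int[] life, int size, int[] indices) {
        int count = 0;
        for (int i = 0; i < size; i++) { // always walks the whole array
            if (life[i] > 0) {
                indices[count++] = i;
            }
        }
        return count;
    }
}
```

The loop always runs over all size slots no matter how few particles are alive, which is exactly the cost I'm worried about. The resulting indices would then presumably be drawn with an indexed draw call such as glDrawElements instead of glDrawArrays.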
EDIT: Paint skills!
Notice how the right-most block isn't moved twice, only once.