Life's been tough on me the last few weeks, especially the last few days, so I decided to do some extra fun coding this weekend.

3-4 years ago I made some threads about an extreme scale space game with realistic Newtonian physics. The game would require simulating a huge number of objects affected by gravity, with extreme speed collision detection. I am talking 100k+ ships, each orbiting a planet at 10km/second, with accurate collision detection. The technical challenges are enormous. After some spitballing here on JGO, I ended up implementing a test program using fixed-precision values (64-bit longs) to represent positions and velocities to get a constant amount of precision regardless of distance from origin. Simple circle-based collision detection was handled by sorting the ships along the X-axis, then checking collisions only for ships that overlap along the X-axis. The whole thing was completely multi-threaded, and I even tried out Riven's mapped struct library to help with cache locality. Even sorting was multithreaded using my home-made parallel insertion sort algorithm, tailor-made for almost-sorted data sets (the order along the X-axis did not change very quickly). It scaled well with more cores, but was still very heavy for my poor i7.

I realized that the only way to get decent performance for this problem on a server would be to run the physics simulation on the GPU. With a magnitude higher performance and bandwidth, the GPU should be able to easily beat this result as long as the right algorithms are used. The physics simulation is easy enough as it's an embarrassingly parallel problem and fits perfectly for the GPU. The collision detection (sorting + neighbor check) is a whole different game. GPU sorting is NOT a fun topic, at least if you ask me. The go-to algorithm for this is a parallel GPU radix sort, but with 64-bit keys that's very expensive. Just like my parallel insertion sort algorithm took advantage of the almost-sorted nature of the sorting, I needed something like that that could run on the GPU as well. That's when I stumbled upon a simple

GPU selection sort algorithm.

The idea is simple. For each element, loop over the entire array of elements to sort. Calculate how many elements that should be in front of this element. You now know the new index of your element, so move it directly to that index. Obviously, this is O(n^2), so it doesn't scale too well. However, the raw power of the GPU can compensate for that to some extent. 45*1024 = 46 080 elements can be sorted in ~60FPS, regardless of how sorted the array is. By using shared memory as a cache, performance almost triples to 160 FPS, allowing me to sort 80*1024 = 81 920 elements at 60 FPS. Still not fast enough. Anything above 200k elements runs a big risk of causing the driver to time out and restart...

Enter block-based selection sort for almost sorted data-sets! The idea is to split the list up into blocks of 256 elements, then calculate the bounds of the values of each block. This allows us to skip entire blocks of 256 values if the block doesn't intersect with the current block we're processing. Most likely, only the blocks in the immediate vicinity of each block needs to be taken into consideration when sorting, while the rest of the blocks can be skimmed over. Obviously, this makes the data dependent, and the worst case is still the same as vanilla GPU selection sort if all blocks intersect with each other (which is essentially guaranteed for a list of completely random values). However, for almost sorted data sets this is magnitudes faster!

To simulate an almost sorted data-set, an array is filled with elements like this:

1 2 3
| for(int i = 0; i < NUM_KEYS; i++){ data.putLong(i*8, i + r.nextInt(1000)); } |

This gives us an almost sorted array with quite a lot of elements with the exact same value, to test the robustness of the sort. The block-based selection sort algorithm is able to sort a 2048*1024 = 2 097 152 element list... at 75 FPS, way over the target of 100 000. It's time to implement a real physics simulation based on this!

Let's define the test scenario. 1024*1024 = 1 048 576 ships are in perfect circular orbits around the earth. The orbit heights range from low earth orbit (International Space Station height) to geosynchronous orbit. Approximately half of the ships are orbiting clockwise, the other half counterclockwise. The size of the earth, the mass, the gravity calculations, etc are physically accurate and based on real-life measurements.

Going back to my original threaded CPU implementation, it really can't handle one million ships very well. Just the physics simulation of the ships takes 20.43ms, and sorting another 18.75ms. Collision detection then takes another 10.16ms.

The compute shader implementation is a LOT faster. Physics calculations take only 0.27ms, calculating block bounds another 0.1ms and finally sorting takes 2.07ms. I have not yet implemented the final collision detection pass, but I have no reason to expect it to be inefficient on the GPU, so I'm optimistic about the final performance of the GPU implementation.

Each ship is drawn as a point. The color depends on the current index in the list of the ship, so the perfect gradient means that the list is perfectly sorted along the X-axis. 303 FPS, with rendering taking up 0.61ms, 370 FPS without rendering.