Today, I wrote a mesh optimizer in Java to be used on the output of my Blender export plugin.
I'm not too proficient in Python and it's just so goddamn slow at everything, so instead of completely ruining the fast export times I've managed to get so far, I opted for writing a simple optimizer that takes in a model file, optimizes it, then writes it out again. This is done in 3 steps:
1. In the first step, I simply eliminate all duplicated vertices and leave just one of each, then reference the same one with an index buffer. The mesh I output from Blender is mostly unoptimized with lots of duplicated vertices, and this step reduces the vertex count by about 75% percent. This saves a lot of memory on the harddrive and reduces loadtimes a lot, but it actually does not improve performance very much. For example, in one of the submeshes of the giant robots of Robot Farm, the vertex count goes from 38544 --> 9248, a 76% reduction, but the number of vertex shader invocations only drop from 38544 --> 26902, a 30% drop. This leads to the next step...
2. In the next step, the triangles in the mesh is reordered to improve the effectiveness of the hardware vertex cache. As some people might remember, I've done a lot of experiments on the hardware vertex caches of Nvidia, AMD and Intel GPUs, and basically concluded that Nvidia's is f**king weird, AMD's is kinda weird and Intel's is the only sane one. In the end, I concluded that optimizing the meshes for each graphics vendor was too much work, so I simply went with a mainstream algorithm. I ended up implementing Tom Forsyth's simple algorithm
, and it works really well. This algorithm simply reorders the triangles in the mesh (by reordering the indices in the index buffer) to increase the chance that vertices are reused as much as possible without specializing it too much to a specific cache size. It's pretty much a general-purpose optimization that should be better than nothing. Anyway, this step further reduced the vertex shader invocation count from 26902 --> 11847, which together with step 1 was a total reduction of 70%! Now, optimally, the number of invocations should be equal to the number of vertices in the mesh, which would mean that there is perfect reuse of vertices and no vertex had to be transformed more than once, but it may not be possible to achieve this. In the end, for this particular mesh the number of invocations is 28% more than optimal, which is acceptable to me.
3. Now, the meshes are optimized for the vertex cache and need a minimal number of vertex shader invocations, but there is one last thing that can be optimized. Inspecting the actual index buffer generated after step 2 reordered the triangles, it's pretty damn random.
..., 6031, 5989, 5991, 5992, 5993, 5991, 6032, 5992, 5994, 5995, 5996, 5994, 6033, 5995, 8387, 8388, 8389, 8387, 8409, 8388, 8390, 8391, 8392, 8390...
There's a lot of good reuse happening here, but also very huge jumps in which vertex each triangle uses. It essentially reads a completely random vertex from the vertex buffer each execution. This is bad for the GPU, as the vertex data from the VBO is read in with a certain cache line size depending on the GPU. In short, if you reference vertex 1000 and 1001, there's a big chance that the memory read for the data of vertex 1000 will cause the data for vertex 1001 to also be loaded into the cache. Therefore, it is a good idea to try to minimize the "distance" between each subsequent index in the index buffer to increase the chance of the input data of the vertex shader already being in the cache. I did this by simply reordering the vertices in the data to so that they are in the order that the index buffer references them. This does not change the triangle order; it simply changes where the vertices of each triangle lies in memory. Here's an excerpt from the output index buffer:
..., 2571, 2565, 2571, 2568, 2566, 2567, 2572, 2571, 2566, 2572, 2572, 2567, 2573, 2571, 2574, 2568, 2569, 2568, 2574, 2569, 2575, 2570, 2569, 2574, 2575, 2575, 2576, 2570, 2571, 2572, ...
Much better! This works super well with the optimized triangle order from step 2, as the algorithm tries to place all uses of a given vertex in one place so that it can be reused, so there's a very low risk of a very old vertex being referenced again. To measure the effectiveness of this step, I measured the average "index distance", which is the average
abs(currentIndex - previousIndex)
over the entire index buffer. The average index distance went from 42.82251 --> 5.490958, a very significant improvement to cache coherency!
All in all, here are the final results for the mesh:
Eliminating duplicate vertices...
Optimizing for post-transform vertex cache...
Optimizing for pre-transform vertex cache...
Vertex count: 38544 --> 9248 (76.00664% reduction)
Invocations: 38544 --> 26902 --> 11847 (69.2637% reduction, 28.103378% more than optimal)
Average index distance: 42.82251 --> 5.490958
Placing 500 of those high-poly robots with no LOD or anything in a test program, my FPS went from 12 to 32 when I switched to the optimized mesh, a ~62.5% reduction in frame time, which matches very well with the ~~70% invocation reduction when you add postprocessing and everything else taking up a bit of time in the first place.