I seem to recall years and years ago getting perfectly decent performance optimisation just by the simple expedient of sorting triangles in order of their average vertex index number. And that's O(n log n) perf on average. It's not quite
as fancy but it gets like 90% of the win for bugger all effort and the best bit is that it can be done in realtime with dynamic meshes. Food for thought. (Why not try it and compare?)
Actually, the first thing I did was sort based on the lowest index number. It couldn't compare to Tom's algorithm. Hmm... I'm gonna try something similar though.
EDIT: So I tried something really simple:
1. Sort triangles based on (i0 + i1 + i2).
2. Reorder vertices in the order they are used by the triangles (= the pre-transform cache coherency optimization, step 3 in my previous long post).
3. Repeat until the result gets stable.
The result was actually pretty good, but still not as good as Tom Forsyth's algorithm.
Simple algorithm after 86 iterations: invocations: 38544 --> 26902 --> 14308 (62.878788% reduction, 54.71454% more than optimal)
Tom Forsyth's algorithm: 38544 --> 26902 --> 11904 (69.115814% reduction, 28.719727% more than optimal)
EDIT 2: According to Forsyth, his algorithm can be implemented in linear time.
EDIT 3: Note that all the invocations counts are reported by the hardware of my Nvidia GPU. The numbers may be very different on AMD and Intel GPUs.