Nah, dirty rects are almost completely useless for me - we have a lot of movement and a lot of overdraw.
The speed difference between 50 and 150 draw calls was actually fairly negligible - maybe a 10% boost. However... I finally put VBOs into the sprite engine. Whoa! What a difference. 60fps nearly all the time, just like that. Because of how VBOs work I couldn't interleave writes and draws on a state-change basis any more; it's back to writing all the sprites in one go and then rendering everything in one go. But even so - much faster.
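For anyone wondering what "write everything, then render everything" looks like, here's a rough sketch. It isn't the actual engine code: `Sprite`, `DrawRun` and `SpriteBatch` are made-up names, and a plain `FloatBuffer` stands in for the mapped VBO. Pass 1 writes every quad and notes where the texture changes; pass 2 (only described in the comments) would issue one draw call per recorded run.

```java
import java.nio.FloatBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sprite record: position, size, and the texture (atlas) it uses. */
final class Sprite {
    final float x, y, w, h;
    final int textureId;

    Sprite(float x, float y, float w, float h, int textureId) {
        this.x = x; this.y = y; this.w = w; this.h = h;
        this.textureId = textureId;
    }
}

/** One future draw call: a contiguous run of vertices sharing one texture. */
final class DrawRun {
    final int textureId;
    final int firstVertex;
    int vertexCount;

    DrawRun(int textureId, int firstVertex) {
        this.textureId = textureId;
        this.firstVertex = firstVertex;
    }
}

final class SpriteBatch {
    static final int VERTS_PER_SPRITE = 4;

    /**
     * Pass 1: write every sprite's quad into the buffer in one go, starting a
     * new run whenever the texture changes. In a real engine 'out' would be
     * the mapped VBO rather than a heap buffer (assumption for this sketch).
     */
    static List<DrawRun> write(List<Sprite> sprites, FloatBuffer out) {
        List<DrawRun> runs = new ArrayList<>();
        DrawRun current = null;
        int vertex = 0;
        for (Sprite s : sprites) {
            if (current == null || current.textureId != s.textureId) {
                current = new DrawRun(s.textureId, vertex);
                runs.add(current);
            }
            out.put(s.x).put(s.y);               // bottom-left
            out.put(s.x + s.w).put(s.y);         // bottom-right
            out.put(s.x + s.w).put(s.y + s.h);   // top-right
            out.put(s.x).put(s.y + s.h);         // top-left
            current.vertexCount += VERTS_PER_SPRITE;
            vertex += VERTS_PER_SPRITE;
        }
        // Pass 2 (not shown): unmap the buffer, then for each run bind
        // run.textureId and draw vertices [firstVertex, firstVertex + vertexCount).
        return runs;
    }
}
```

The point is that no GL state is touched while writing, so the state changes collapse into the list of runs that gets replayed afterwards.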
One thing in particular may be helping here: I create the VBO with GL_STREAM_DRAW_ARB and map it with GL_WRITE_ONLY_ARB. As far as I can tell this means the data I write to the buffer goes more or less straight to the card, probably even bypassing AGP RAM, and, most importantly, it doesn't go through the CPU caches on the way.
So: VBOs FTW! I can still implement band sorting but I'll have to write my own much faster interval tree class.
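For reference, the sort of thing I mean by an interval tree is roughly this. It's a minimal sketch, not the class I'll end up with, and it assumes band sorting boils down to asking "which sprites overlap this vertical band?". The `IntervalTree` name and the `{start, end, id}` layout are made up for the example. It's a static tree stored in a sorted array, with each node remembering the largest end value in its subtree so whole subtrees can be skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Static interval tree over closed intervals given as {start, end, id}. */
final class IntervalTree {
    private final int[][] iv;   // intervals sorted by start
    private final int[] maxEnd; // maxEnd[mid] = largest end within mid's subtree

    IntervalTree(int[][] intervals) {
        iv = intervals.clone();
        Arrays.sort(iv, Comparator.comparingInt(a -> a[0]));
        maxEnd = new int[iv.length];
        build(0, iv.length - 1);
    }

    // The subtree for the index range [lo, hi] is rooted at its midpoint.
    private int build(int lo, int hi) {
        if (lo > hi) return Integer.MIN_VALUE;
        int mid = (lo + hi) >>> 1;
        int m = Math.max(iv[mid][1],
                Math.max(build(lo, mid - 1), build(mid + 1, hi)));
        maxEnd[mid] = m;
        return m;
    }

    /** Returns the ids of all intervals overlapping [start, end], ordered by start. */
    List<Integer> query(int start, int end) {
        List<Integer> hits = new ArrayList<>();
        query(0, iv.length - 1, start, end, hits);
        return hits;
    }

    private void query(int lo, int hi, int start, int end, List<Integer> hits) {
        if (lo > hi) return;
        int mid = (lo + hi) >>> 1;
        if (maxEnd[mid] < start) return;   // nothing in this subtree reaches the query
        query(lo, mid - 1, start, end, hits);
        if (iv[mid][0] > end) return;      // mid and everything to its right start too late
        if (iv[mid][1] >= start) hits.add(iv[mid][2]);
        query(mid + 1, hi, start, end, hits);
    }
}
```

Since the sprites move every frame, a flat array that gets rebuilt each frame like this may well be cheaper than a pointer-based tree with inserts and deletes, but that's a guess until it's profiled.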
And then I'll have a look at optimising the sprite atlases depending on adjacent sprite image usage and feeding back a data file into the sprite packer.
After that it'll very likely be back to being fill-rate limited, like it was 7 years ago when I first wrote the damned thing! But this time I can fill 30x as many pixels.