Haha, you're right. I'm going to invest some time and take a look at what the current state of my engine is capable of.

Yes, it's open source, but I'm a bit ashamed of it because I mostly hack on this beast in my spare time while sitting on the bus or the train... so there are lots of hacky things in there, and it's probably hard to read for anyone other than me. Maybe I can just write about how I did things? I'd be happy to talk about some optimizations, especially regarding things where you have experience optimizing them in C++.
Regarding instancing: I'm doing regular instancing, and I was generous with the instanced data. Nearly everything I use as an object property is an instanced property: for example transformations, materials, animations (currently only up to 4 active ones per object), and bounding boxes. So I can get by with two draw calls for the main rendering. With indirect rendering, instancing or not instancing is practically the same for me, although I guess performance would differ when you have a lot of different meshes, because of bandwidth. The GPU is always the limiting factor for me here.
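To make "generous with the instanced data" concrete, here's a rough sketch of the idea (simplified, with made-up names, not the actual engine code): everything the shader needs per object gets packed into one big buffer that the shader indexes by instance ID.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Hypothetical per-instance record: a 4x4 transform (16 floats), a material
// index, up to 4 active animation indices, and an AABB (min xyz, max xyz).
data class InstanceData(
    val transform: FloatArray,      // 16 floats, column-major
    val materialIndex: Int,
    val animationIndices: IntArray, // up to 4 active animations
    val aabb: FloatArray            // 6 floats
)

const val FLOATS_PER_INSTANCE = 16 + 1 + 4 + 6

// Pack all instances into one contiguous buffer; the shader reads its slot
// via the instance ID, so one draw call covers many objects.
fun packInstances(instances: List<InstanceData>): ByteBuffer {
    val buffer = ByteBuffer
        .allocateDirect(instances.size * FLOATS_PER_INSTANCE * 4)
        .order(ByteOrder.nativeOrder())
    val floats = buffer.asFloatBuffer()
    for (i in instances) {
        floats.put(i.transform)
        floats.put(i.materialIndex.toFloat())
        // Pad unused animation slots with -1 so the layout stays fixed.
        repeat(4) { slot -> floats.put(i.animationIndices.getOrElse(slot) { -1 }.toFloat()) }
        floats.put(i.aabb)
    }
    return buffer
}
```

The fixed layout is what makes the "instancing vs. not instancing" question mostly moot with indirect rendering: either way the shader just reads its record out of the same buffer.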
Regarding latency: I have a triple-buffered renderstate construct. The rendering/GPU work runs on its own thread with a Kotlin coroutine dispatcher and always goes as fast as possible, or is limited by vsync. Then I have an update thread that's also a coroutine dispatcher. At the beginning of the update frame, I have a stage where one can schedule single-threaded execution. That means I don't have to synchronize data structures; I can use scheduling to synchronize things easily. After this, all system updates are executed on a coroutine context with max core count, in general maxing out the CPU or being limited by the GPU. After that, the triple-buffer extraction is done, single-threaded. Since I wrote my own struct library, the extraction window is pretty small, because it's close to a memcpy, and so far it has never limited me.
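The triple-buffer part looks roughly like this (a simplified sketch under the assumption of a flat float-array state, not the actual engine code): the update thread writes into one buffer and publishes it, the render thread grabs the newest published one, and neither ever blocks the other.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean
import java.util.concurrent.atomic.AtomicInteger

// Sketch of a triple-buffered render state. One buffer is owned by the
// update thread, one by the render thread, one sits in the middle as the
// last fully extracted state.
class TripleBuffer(stateSize: Int) {
    private val buffers = Array(3) { FloatArray(stateSize) }
    private var writeIndex = 0                // owned by the update thread
    private val pending = AtomicInteger(1)    // last published state
    private val fresh = AtomicBoolean(false)  // was pending updated since the last read?
    private var readIndex = 2                 // owned by the render thread

    // Update thread: the "extraction" stage is close to a memcpy, then a swap.
    fun extract(worldState: FloatArray) {
        worldState.copyInto(buffers[writeIndex])
        writeIndex = pending.getAndSet(writeIndex)
        fresh.set(true)
    }

    // Render thread: take the newest published state; without a new extract,
    // keep rendering the one we already have instead of swapping back.
    fun latest(): FloatArray {
        if (fresh.getAndSet(false)) {
            readIndex = pending.getAndSet(readIndex)
        }
        return buffers[readIndex]
    }
}
```

The `fresh` flag is the important detail: without it, a render thread that outruns the update thread would swap itself back onto an older buffer.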

This is probably one of the things C++ can do better: my struct library reduces extraction time, but it introduces a little bit of runtime overhead in general during the update.
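The trade-off looks roughly like this (a hypothetical sketch of the struct-library idea, not my actual implementation): "structs" are views into one contiguous ByteBuffer, so extracting the whole array is a single bulk copy, at the cost of every field access going through the buffer during update.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// A "struct" is just a sliding window over a backing buffer: no per-object
// allocation, but each field read/write is an indirected buffer access.
class PositionView(private val buffer: ByteBuffer) {
    var baseOffset = 0
    var x: Float
        get() = buffer.getFloat(baseOffset)
        set(v) { buffer.putFloat(baseOffset, v) }
    var y: Float
        get() = buffer.getFloat(baseOffset + 4)
        set(v) { buffer.putFloat(baseOffset + 4, v) }
}

fun main() {
    val structSize = 8
    val positions = ByteBuffer.allocateDirect(100 * structSize).order(ByteOrder.nativeOrder())
    val view = PositionView(positions)
    // Update: slide one view over the buffer instead of touching objects.
    for (i in 0 until 100) {
        view.baseOffset = i * structSize
        view.x = i.toFloat(); view.y = 0f
    }
    // Extraction: one bulk copy of the backing memory into the render state.
    val snapshot = ByteBuffer.allocateDirect(positions.capacity()).order(ByteOrder.nativeOrder())
    snapshot.put(positions.duplicate())
}
```

In C++ you'd get the same memory layout with plain structs and no accessor indirection, which is why the update-side overhead is hard to avoid on the JVM.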
Regarding latency, I would need to take a look at where I update my inputs, but I think it's just in the update loop, which runs as fast as possible, sometimes at 10k iterations per second, and the triple buffer is swapped whenever the GPU finishes a frame. So I don't know how to do that any faster, to be honest.

EDIT: Regarding culling: the given video doesn't use any culling. I have a culling system that implements two-phase occlusion culling completely on the GPU, so the GPU feeds itself. It's capable of processing clusters (my kind of "batch") as well as instances... but I don't know if it pays off, to be honest; I don't have nice numbers yet.
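For anyone unfamiliar with the technique: phase one draws whatever was visible last frame and builds a hierarchical depth (Hi-Z) pyramid from it; phase two tests everything else against that pyramid and draws what survives. The core test, shown here as a CPU reference in Kotlin (on the GPU this runs in a compute shader; all names are illustrative, not my engine's), compares an object's nearest depth against the farthest occluder depth over the texels its screen-space bounds cover:

```kotlin
import kotlin.math.max
import kotlin.math.min

// One mip level of the depth pyramid, storing the FARTHEST depth per texel
// (a reverse-Z setup would store the nearest instead).
class HiZMip(val width: Int, val height: Int, val maxDepth: FloatArray)

// Returns true if the object might be visible: its nearest depth is closer
// than the farthest occluder depth anywhere in the covered region. If it is
// farther than all of them, it is provably occluded and can be culled.
fun possiblyVisible(mip: HiZMip, minX: Int, minY: Int, maxX: Int, maxY: Int, objectMinDepth: Float): Boolean {
    var occluderMax = 0f
    for (y in max(0, minY)..min(mip.height - 1, maxY))
        for (x in max(0, minX)..min(mip.width - 1, maxX))
            occluderMax = max(occluderMax, mip.maxDepth[y * mip.width + x])
    return objectMinDepth <= occluderMax
}
```

The "GPU feeds itself" part means the surviving cluster/instance IDs are written straight into an indirect draw buffer, so the CPU never sees the visibility results at all.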