Further, my engine has never been CPU-bound

I also found that reducing the polycount of a VBO below a certain value (under 75 on my card) won't increase performance anymore. So in the end I might have to do geom-grouping anyway, whatever overhead LWJGL has. Or do that "pseudo-instancing" of yours, whatever it may be.

Under 75 polygons, you *are* CPU bound. Also, no matter the scene you're rendering, the CPU always affects the frame rate to a certain degree.
Pseudo-instancing is very simple actually. Instead of changing the modelview matrix for every object in the scene, you only load the view matrix at the beginning of the frame and then pass the model matrix individually for every object. The trick is to pass the model matrix as a bunch of "persistent" vertex attributes. Because the modelview matrix is not changing, the driver doesn't have to "touch" the shader code for every object and that's where the speed-up comes from.
Go here for details and sample code (search => pseudo).
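To make the idea a bit more concrete, here is a rough sketch of the CPU side in plain Java. It is not the code from the linked sample; the class and method names are made up for illustration, and I'm assuming the model matrix is affine so only its top three rows need to be sent. In a real LWJGL renderer each row would go out through something like GL20.glVertexAttrib4f once per object, and the dot products in toWorld would live in the vertex shader instead.

```java
// Sketch only: PseudoInstancing, modelRows and toWorld are hypothetical names,
// not part of LWJGL or of the linked sample code.
final class PseudoInstancing {

    /**
     * Splits a column-major 4x4 model matrix (OpenGL layout) into its top
     * three rows. Each row is one vec4 that would be set as a "persistent"
     * vertex attribute (e.g. GL20.glVertexAttrib4f(index, x, y, z, w))
     * once per object, right before that object's draw call.
     * Assumption: the matrix is affine, so the bottom row is (0, 0, 0, 1).
     */
    static float[][] modelRows(float[] m) {
        if (m.length != 16) {
            throw new IllegalArgumentException("expected 16 floats");
        }
        float[][] rows = new float[3][4];
        for (int r = 0; r < 3; r++) {
            for (int c = 0; c < 4; c++) {
                rows[r][c] = m[c * 4 + r]; // column-major: element (row r, column c)
            }
        }
        return rows;
    }

    /**
     * CPU emulation of what the vertex shader would do with those attributes:
     * one dot product per row, giving the vertex position in world space.
     */
    static float[] toWorld(float[][] rows, float x, float y, float z) {
        float[] world = new float[3];
        for (int r = 0; r < 3; r++) {
            world[r] = rows[r][0] * x + rows[r][1] * y + rows[r][2] * z + rows[r][3];
        }
        return world;
    }
}
```

Per frame you would load the view (and projection) matrix once, then for every object call modelRows, push the three rows as attributes, and draw. The shader multiplies the resulting world-space position by the view-projection matrix itself.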
I've also found that this can be applied to other uniform values too. So, for example, instead of defining a model-space light position as a uniform, it's faster to pass it as a vertex attribute. Btw, because of the above model matrix trick, you end up with a vertex position in *world-space* inside the shader. This is actually very convenient and saves some instructions when you've got to do shadow mapping or calculate an eye vector (for specular, parallax, etc).