Pairing? Ignoring the latest version of the CPU? Oh dear, yet again those ugly micro-optimizations for old CPUs. That was one reason why Intel didn't improve them fast enough.
I like assembler & optimising at this level.
If you know what you're doing, you can usually double the speed of any decently large function (such as vector transform & perspectivise, software renderers, etc.).
The insights from learning how to optimise can help you write C/Java code that runs at least 10% faster, just by knowing a few tricks and how the code is really translated. Once it becomes second nature it doesn't cost any more development time, and done properly it rarely produces code that would appear in this thread.
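To give a rough idea of the kind of trick I mean, here is a minimal C sketch of the "perspectivise" step. The types and function names (Vec3, Vec2, project_naive, project_recip) are made up for illustration and not taken from any real renderer. It assumes a simple pinhole projection with a focal length and non-zero z.

```c
#include <stddef.h>

/* Hypothetical types for illustration; not from any particular engine. */
typedef struct { float x, y, z; } Vec3;
typedef struct { float x, y; } Vec2;

/* Straightforward version: two divisions per vertex. */
static void project_naive(const Vec3 *in, Vec2 *out, size_t n, float focal)
{
    for (size_t i = 0; i < n; i++) {
        out[i].x = focal * in[i].x / in[i].z;
        out[i].y = focal * in[i].y / in[i].z;
    }
}

/* Same projection with one division per vertex; the rest are multiplies. */
static void project_recip(const Vec3 *in, Vec2 *out, size_t n, float focal)
{
    for (size_t i = 0; i < n; i++) {
        float s = focal / in[i].z;   /* computed once, reused for x and y */
        out[i].x = in[i].x * s;
        out[i].y = in[i].y * s;
    }
}
```

Division is typically much slower than multiplication, and a compiler will generally not make this change for you on floats unless you allow relaxed floating-point maths, because the rounding can differ slightly between the two versions. Whether it actually wins on your CPU is something to measure rather than assume.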
One clipping routine I came across had the comments:
It's not big, & its not clever