Are you sure about that? That means that integer multiplies and divides don't use the integer ALU... that seems like a waste, since an integer multiply is easier to do than a floating point multiply. It COULD be done faster if the integer ALU supported the operation. And integer operations are generally more common, since they are needed to calculate array offsets etc. It just doesn't seem to make sense to jump through all those hoops.
It makes more sense to save silicon and only have a single divide unit. Int & float divides take the same time on PII+ (39 cycles for double precision, 23 for single), and cannot be done in parallel (like they could on the plain Pentium, as that had 2 units). The int conversion is done 'transparently' in these cases, and the FPU can perform multiplications & additions while waiting for the result. However, only the last 2 cycles of the divide can overlap with integer instructions, so by trying to use integer math you would lose 36 cycles worth of computation - that's 36 multiplies!
Int multiplies are 4 cycles (PII+, 9 cycles before) and float multiplies are 3 cycles. The float instructions can also be pipelined to allow 3 multiplies to be in flight at any one time. Int*float is even worse - 6 cycles with no concurrency.
I know in many cases you can get better performance if you convert floating point ops to integer ops with fixed point math... maybe that was only if you could avoid division. For example, to divide by a constant, multiply by the reciprocal. If possible use power of two fractions so it ends up being an integer multiply followed by a shift. Sometimes with a little extra fudging you can get the exact same answer as if you did the math with floats, yet the result is computed in half the time. I know of someone who used this technique to get dramatic speedups with a video codec, and that was on at least a Pentium 3.
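To make the "multiply by the reciprocal, then shift" idea concrete, here is a minimal sketch for dividing an unsigned 32-bit value by the constant 10. The function names div10 and check_div10 are made up for illustration, and the constant 0xCCCCCCCD is ceil(2^35 / 10), i.e. 1/10 expressed as a power-of-two fraction. Whether this actually beats a hardware divide depends on the CPU and compiler, so measure before relying on it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: computes x / 10 with one multiply and one shift.
// 0xCCCCCCCD == ceil(2^35 / 10), so (x * magic) >> 35 equals x / 10
// for every 32-bit unsigned x. The 64-bit intermediate keeps the full
// product so no bits are lost before the shift.
inline std::uint32_t div10(std::uint32_t x) {
    const std::uint64_t magic = 0xCCCCCCCDull;
    return static_cast<std::uint32_t>((x * magic) >> 35);
}

// Small usage check; call it from main() or a test harness.
inline void check_div10() {
    assert(div10(0u) == 0u);
    assert(div10(9u) == 0u);
    assert(div10(10u) == 1u);
    assert(div10(12345u) == 1234u);
}
```

Most optimising compilers already do this rewrite for you when the divisor is a compile-time constant, so the manual version mainly matters when the divisor is only known at run time and can be precomputed once.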
Multiplying/dividing by simple powers of 2 is a great speed up. Not something easily achieved in Java except for small cases. In C++, we use a union between an int and a float (called a floint) that lets you do some great speed ups:
Clamp to zero:
Float:
if(f < 0.0f)
    f = 0.0f;
Looks easy enough, but in fact it takes 15-30 cycles to work out.
Instead:
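union floint { float f; int i; }; // assumes 32-bit int and 32-bit IEEE float
floint fl;
fl.f = f;
int m = fl.i>>31; // If f was negative, this is 0xffffffff, else it is 0
fl.i = fl.i & (~m); // either 'f' or '0'
f = fl.f; // read the clamped value back out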
This takes around 5 cycles to compute with no chance of a branch misprediction - 3 to 6 times faster. You can use this when doing clamping operations on colours, so it is a good thing for video codecs & software renderers.
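For anyone who wants to try the clamp as a standalone function, here is a rough sketch. It is a variation on the snippet above rather than the exact same code: it copies the bits with memcpy instead of a union (reading the inactive member of a union is not guaranteed by the C++ standard) and builds the mask from an unsigned shift, so it does not rely on how the compiler shifts negative ints. The names clamp_to_zero and check_clamp are made up, and it assumes a 32-bit IEEE-754 float.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns f if f >= 0, otherwise +0.0f, with no branch.
inline float clamp_to_zero(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);   // reinterpret the float's bits
    // Sign bit is bit 31: (bits >> 31) is 1 for negative values, 0 otherwise.
    // 0 - 1 wraps to 0xFFFFFFFF, 0 - 0 stays 0.
    std::uint32_t mask = 0u - (bits >> 31);
    bits &= ~mask;                         // negative: all bits cleared (+0.0f)
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Small usage check; call it from main() or a test harness.
inline void check_clamp() {
    assert(clamp_to_zero(-3.0f) == 0.0f);
    assert(clamp_to_zero(3.0f) == 3.0f);
}
```

Note that anything with the sign bit set is mapped to +0.0f, which includes -0.0f and negative NaNs. That is usually fine for colour clamping, but it is not identical to the if(f < 0.0f) version for those inputs.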
Avoiding division is always a good thing to aim for, e.g.:
a/b + c/d
is faster if you do:
(a*d + c*b) / (b*d)
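As a quick sketch of that rewrite in C++ (the function names are made up for illustration): the second version trades two divides for three multiplies and one divide. With floats the two forms can differ in the last bits, and b*d can overflow or underflow where the separate divides would not, so treat it as an optimisation to verify rather than a drop-in replacement.

```cpp
#include <cassert>

// Straightforward form: two divides.
inline float two_divides(float a, float b, float c, float d) {
    return a / b + c / d;
}

// Rewritten form: a/b + c/d == (a*d + c*b) / (b*d), one divide only.
inline float one_divide(float a, float b, float c, float d) {
    return (a * d + c * b) / (b * d);
}

// Small usage check with values that are exact in binary floating point.
inline void check_divides() {
    assert(two_divides(1.0f, 2.0f, 3.0f, 4.0f) == 1.25f);
    assert(one_divide(1.0f, 2.0f, 3.0f, 4.0f) == 1.25f);
}
```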
And avoiding multiplies:
a * f + b * (1-f)
is faster as:
b + (a-b)*f
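The same blend written out as C++, as a minimal sketch (blend_slow, blend_fast and check_blend are made-up names). Expanding b + (a-b)*f gives b + a*f - b*f, which is a*f + b*(1-f), so the fast form saves one multiply. With floats the two can differ by a rounding error, and the fast form is not guaranteed to return exactly a when f is 1 for arbitrary inputs.

```cpp
#include <cassert>

// Original form: two multiplies, one subtract, one add.
inline float blend_slow(float a, float b, float f) {
    return a * f + b * (1.0f - f);
}

// Rewritten form: one multiply, one subtract, one add.
inline float blend_fast(float a, float b, float f) {
    return b + (a - b) * f;
}

// Small usage check with values that are exact in binary floating point.
inline void check_blend() {
    assert(blend_fast(10.0f, 20.0f, 0.0f) == 20.0f);   // f = 0 gives b
    assert(blend_fast(10.0f, 20.0f, 1.0f) == 10.0f);   // f = 1 gives a
    assert(blend_fast(10.0f, 20.0f, 0.25f) == blend_slow(10.0f, 20.0f, 0.25f));
}
```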
If you want to know more, I recommend googling for 'Agner Fog', the godfather of Pentium optimisation.
- Dom