Riven
« League of Dukes » JGO Kernel      Posts: 5866 Medals: 255
Hand over your head.
|
 |
«
on:
2009-10-10 19:01:08 » |
|
I was trying to speedup some boundschecks on data that had a POT dimension. So I needed to clamp integers between 0..63, 0..127, 0..255, 0..511, etc. clamp(val, 0, 511) clampPOT(val, 9)1 2 3 4 5 6 7 8
| public static final int clampPOT(int val, int bits) { val = ((val >> 31) | (val - 1)) + 1;
return (((0 - (val & (-1 << bits))) >> 31) | val) & (~(-1 << bits)); } |
1 2 3 4
| public static final int clamp(int val, int min, int max) { return val < min ? min : (val > max ? max : val); } |
I had this assumption that branching was kinda slow, and fiddling with bits was ultra super fast. I was going to proudly present my bits fiddler on here... Turns out the non-branching code is slower on my Intel Code2Quad. Benchmark: clamp(val, 0, 511) ==> 19ms clampPOT(val, 9) ==> 28msIt's not a benchmark-flaw, as my code gets slower in the real algorithm too. Let this be a lesson for... all of us 
|
Hi, appreciate more people! Σ ♥ = ¾ Learn how to award medals... and work your way up the social rankings
|
|
|
pjt33
JGO Strike Force    Posts: 913 Medals: 17
|
 |
«
Reply #1 on:
2009-10-12 07:47:28 » |
|
I discovered years ago that this is true for integer abs implementations too: (x < 0 ? -x : x) was faster than (x ^ (x >> 31)) + (x >>> 31). I even tried using Jasmin to make various versions of the non-branching version and they were all slower than the branching one, even though I'm pretty sure my test set was about 50% negative, so it's worst-case for branch prediction.
|
|
|
|
|
Riven
« League of Dukes » JGO Kernel      Posts: 5866 Medals: 255
Hand over your head.
|
 |
«
Reply #2 on:
2009-10-12 08:11:47 » |
|
Makes you wonder why loop-unrolling yields so much performance gain...
Either that, or the JIT is screwing up. I should bench this code in C.
|
Hi, appreciate more people! Σ ♥ = ¾ Learn how to award medals... and work your way up the social rankings
|
|
|
Games published by our own members! Go get 'em!
|
|
princec
« League of Dukes » JGO Kernel      Posts: 8076 Medals: 91
Eh? Who? What? ... Me?
|
 |
«
Reply #3 on:
2009-10-12 08:31:55 » |
|
Branches aren't quite so expensive as they were once now that all these modernfangled CPUs have got branch prediction and parallel paths where they actually start executing the code on both sides of a conditional branch before the condition is evaluated. Cas 
|
|
|
|
Roquen
JGO Strike Force    Posts: 823 Medals: 25
|
 |
«
Reply #4 on:
2009-10-12 08:36:04 » |
|
You could reduce the number of bit operations (still slower through). This is probably not the shortest, but as an example: 1 2 3 4 5 6 7 8 9
| public static int clampBitHacko(int x, int max) { int t0 = (max-x) >> 31; int t1 = ( x) >> 31; int t3 = t0 & max; int t4 = ~(t0 | t1); return (x & t4) + t3; } |
Probably slower: trade an 'if' for a shift for the in-range values. 1 2 3 4 5 6 7 8 9
| public static int clampPOTx(int val, int bits) { int t = val >> bits; if (t == 0) return val; return (t < 0) ? 0 : 1<<bits; } |
WRT: Abs, your code is using two instead of one shift. This is faster than compares on my boxes: 1 2 3 4
| int t = x >> 31; x ^= t; x -= t; return x; |
The costs of branching keeps jumping around with different cores. Also on small code in each block, the compiler may be able to convert into conditional moves and/or masking.
|
|
|
|
|
Markus_Persson
JGO Kernel      Posts: 2092 Medals: 10
Mojang Specifications
|
 |
«
Reply #5 on:
2009-10-13 04:43:49 » |
|
What CPU is this on?
|
|
|
|
Roquen
JGO Strike Force    Posts: 823 Medals: 25
|
 |
«
Reply #6 on:
2009-10-13 05:20:48 » |
|
If you're asking me about abs, then the last P4 stepping and a Core-2 Quad. Where the difference is greater on the Core-2.
|
|
|
|
|
Markus_Persson
JGO Kernel      Posts: 2092 Medals: 10
Mojang Specifications
|
 |
«
Reply #7 on:
2009-10-13 06:04:07 » |
|
I have some vague memory of integer barrel shifting not being available on some modern cpu (I think it was intel), that's why I ask.
So a i<<15 would be 15 times slower than a i<<1
|
|
|
|
Abuse
JGO Kernel      Posts: 1866 Medals: 5
falling into the abyss of reality
|
 |
«
Reply #8 on:
2009-10-13 06:53:15 » |
|
While i'm not entirely sure how branch prediction works in modern cpus, presumably there is a finite number of branches that can be predicted simultaneously?
So, while a micro-benchmark comparing a branching abs() against a non-branching abs() might indicate the branching version is faster; if you introduce it into a more complex app. where a high density of branching code already exists, your results may vary.
|
|
|
|
|
princec
« League of Dukes » JGO Kernel      Posts: 8076 Medals: 91
Eh? Who? What? ... Me?
|
 |
«
Reply #9 on:
2009-10-13 07:08:54 » |
|
Barrel shifting was removed on P4 and beyond. Cas 
|
|
|
|
Games published by our own members! Go get 'em!
|
|
Roquen
JGO Strike Force    Posts: 823 Medals: 25
|
 |
«
Reply #10 on:
2009-10-13 08:21:23 » |
|
@Markus_Persson: fixed and variable length shifts have the same cost, regardless of number of positions shifted on intel.
@Abuse: I'm saying that the non-branching version of abs is faster on both NetBurst and Core architectures. Didn't you mean the non-branching clamp? Another big issue for clamp is the increased number of operations which are basically serial thus not allowing multiple issues to occur..so your point is valid: a non-branching version might be superior integrated in another piece of code.
As for branch predicitors, this stuff varies in a given family. I think Core has 16 and they are only 'consumed' on a taken branch, but this is from memory and most likely wrong.
|
|
|
|
|
princec
« League of Dukes » JGO Kernel      Posts: 8076 Medals: 91
Eh? Who? What? ... Me?
|
 |
«
Reply #11 on:
2009-10-13 09:19:51 » |
|
Ain't this exactly the sort of optimisation that JVMs are supposed to make based on architecture, rather than us plebian programmers? Cas 
|
|
|
|
pjt33
JGO Strike Force    Posts: 913 Medals: 17
|
 |
«
Reply #12 on:
2009-10-13 09:50:35 » |
|
Ain't this exactly the sort of optimisation that JVMs are supposed to make based on architecture, rather than us plebian programmers? Cas  I think replacing branches with bit hackery or vice versa is probably a bit much to expect from any optimiser.
|
|
|
|
|
Roquen
JGO Strike Force    Posts: 823 Medals: 25
|
 |
«
Reply #13 on:
2009-10-13 10:07:26 » |
|
Simple branches to bit-hacks isn't too bad. It's a simple pattern match like for converting constant muls to addition-chains (and other tech), reducing bit-ops chains, etc. The real trick is scheduling with a constantly moving target (a la arch changes). I'm hoping that eventually one of the LLVM based projects will make enough progress to be usable.
|
|
|
|
|
Eli Delventhal
« League of Dukes » JGO Kernel      Posts: 3573 Medals: 44
Game Engineer
|
 |
«
Reply #14 on:
2009-10-13 10:32:54 » |
|
Couldn't you use synchronized?   Anyway, I actually haven't bothered to try to write a very fast clamp method, I typically just do this: 1 2 3
| myValue = whatever; myValue = Math.min(myValue, max); myValue = Math.max(myValue, min); |
No doubt that's by far the slowest way you can do it. 3 assignments, two function calls, and who knows what's going on within Math. Oh well, it's never been my bottleneck so I haven't bothered.
|
See my work:OTC Software<br /> Currently Working On:Secret project... I edit JGO in production, because I simply don't waste time writing bugs
|
|
|
DzzD
|
 |
«
Reply #15 on:
2009-10-17 08:51:03 » |
|
I had this assumption that branching was kinda slow, and fiddling with bits was ultra super fast. I was going to proudly present my bits fiddler on here... Turns out the non-branching code is slower on my Intel Code2Quad.
and IMO, you just prove it again : 2 different cases for clamp() : v<vmin => 1 conditional v>min => 2 conditional so it performs 1 to 2 ops (conditional branches) only 1 cases for clampPLOT and it performs : >> 2 times << 2 times - 4 times & 2 times + 1 times = 1 times | 2 times ~ 1 times finally 15 ops for any input value for clampPOT() against 1 or 2, lets say 1.5 ops for clamp() ( also substraction and addition are usully a little slower then other ops ) this give a ratio of : 15/1.5 => according to your sample (28 ms / 19 ms ) and "grosso modo" a conditional branches is about 5 times slower than other ops and thats why unrolling loop always give huge improvment about zero clamp, why not : val *=1^(val >> 31);
|
|
|
|
Riven
« League of Dukes » JGO Kernel      Posts: 5866 Medals: 255
Hand over your head.
|
 |
«
Reply #16 on:
2009-10-17 09:27:53 » |
|
You're right, but I expected the difference to be much bigger, and on older CPUs, it was.
|
Hi, appreciate more people! Σ ♥ = ¾ Learn how to award medals... and work your way up the social rankings
|
|
|
Roquen
JGO Strike Force    Posts: 823 Medals: 25
|
 |
«
Reply #17 on:
2009-10-17 12:11:09 » |
|
There a typo or a C-ism:
val *=1^(val >>> 31);
or trade an 'and' for the 'mul':
val &= ~(val >> 31);
|
|
|
|
|
DzzD
|
 |
«
Reply #18 on:
2009-10-17 12:51:13 » |
|
There a typo or a C-ism:
val *=1^(val >>> 31);
or trade an 'and' for the 'mul':
val &= ~(val >> 31);
oups... this is a C-ism.... val &= ~(val >> 31); nice one, much faster than a mul
|
|
|
|
|