Java-Gaming.org Hi !
Featured games (83)
games approved by the League of Dukes
Games in Showcase (538)
Games in Android Showcase (132)
games submitted by our members
Games in WIP (600)
games currently in development
News: Read the Java Gaming Resources, or peek at the official Java tutorials
 
    Home     Help   Search   Login   Register   
Pages: [1]
  ignore  |  Print  
  clamp() within POT  (Read 4549 times)
0 Members and 1 Guest are viewing this topic.
Offline Riven
« League of Dukes »

« JGO Overlord »


Medals: 840
Projects: 4
Exp: 16 years


Hand over your head.


« Posted 2009-10-10 23:01:08 »

I was trying to speedup some boundschecks on data that had a POT dimension.

So I needed to clamp integers between 0..63, 0..127, 0..255, 0..511, etc.

clamp(val, 0, 511)
clampPOT(val, 9)


1  
2  
3  
4  
5  
6  
7  
8  
   public static final int clampPOT(int val, int bits)
   {
      // every negative value will become 0
      val = ((val >> 31) | (val - 1)) + 1;

      // every value >= (POT-1) will become (POT-1)
      return (((0 - (val & (-1 << bits))) >> 31) | val) & (~(-1 << bits));
   }


1  
2  
3  
4  
   public static final int clamp(int val, int min, int max)
   {
      return val < min ? min : (val > max ? max : val);
   }


I had this assumption that branching was kinda slow, and fiddling with bits was ultra super fast. I was going to proudly present my bits fiddler on here... Turns out the non-branching code is slower on my Intel Code2Quad.

Benchmark:
clamp(val, 0, 511) ==> 19ms
clampPOT(val, 9) ==> 28ms


It's not a benchmark-flaw, as my code gets slower in the real algorithm too.

Let this be a lesson for... all of us Roll Eyes

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social
Offline pjt33
« Reply #1 - Posted 2009-10-12 11:47:28 »

I discovered years ago that this is true for integer abs implementations too: (x < 0 ? -x : x) was faster than (x ^ (x >> 31)) + (x >>> 31). I even tried using Jasmin to make various versions of the non-branching version and they were all slower than the branching one, even though I'm pretty sure my test set was about 50% negative, so it's worst-case for branch prediction.
Offline Riven
« League of Dukes »

« JGO Overlord »


Medals: 840
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #2 - Posted 2009-10-12 12:11:47 »

Makes you wonder why loop-unrolling yields so much performance gain...

Either that, or the JIT is screwing up. I should bench this code in C.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social
Games published by our own members! Check 'em out!
Legends of Yore - The Casual Retro Roguelike
Offline princec

« JGO Spiffy Duke »


Medals: 429
Projects: 3
Exp: 16 years


Eh? Who? What? ... Me?


« Reply #3 - Posted 2009-10-12 12:31:55 »

Branches aren't quite so expensive as they were once now that all these modernfangled CPUs have got branch prediction and parallel paths where they actually start executing the code on both sides of a conditional branch before the condition is evaluated.

Cas Smiley

Online Roquen
« Reply #4 - Posted 2009-10-12 12:36:04 »


You could reduce the number of bit operations (still slower through).  This is probably not the shortest, but as an example:

1  
2  
3  
4  
5  
6  
7  
8  
9  
public static int clampBitHacko(int x, int max)
{
  int t0 = (max-x) >> 31;   // t0 = (x > max) ?  -1 : 0
  int t1 = (    x) >> 31;   // t1 = (x <   0) ?  -1 : 0
  int t3 = t0 & max;        // t3 = (x > max) ? max : 0
  int t4 = ~(t0 | t1);      // t4 = -1 if in range, otherwise 0
 
  return (x & t4) + t3;
}



Probably slower: trade an 'if' for a shift for the in-range values.

1  
2  
3  
4  
5  
6  
7  
8  
9  
public static int clampPOTx(int val, int bits)
{
  int t = val >> bits;
 
  if (t == 0)
    return val;
 
  return (t < 0) ? 0 : 1<<bits;
}


WRT: Abs, your code is using two instead of one shift.  This is faster than compares on my boxes:
1  
2  
3  
4  
    int t = x >> 31;
    x ^= t;
    x -= t;
    return x;


The costs of branching keeps jumping around with different cores.  Also on small code in each block, the compiler may be able to convert into conditional moves and/or masking.
Offline Markus_Persson

JGO Wizard


Medals: 16
Projects: 19


Mojang Specifications


« Reply #5 - Posted 2009-10-13 08:43:49 »

What CPU is this on?

Play Minecraft!
Online Roquen
« Reply #6 - Posted 2009-10-13 09:20:48 »

If you're asking me about abs, then the last P4 stepping and a Core-2 Quad.  Where the difference is greater on the Core-2.
Offline Markus_Persson

JGO Wizard


Medals: 16
Projects: 19


Mojang Specifications


« Reply #7 - Posted 2009-10-13 10:04:07 »

I have some vague memory of integer barrel shifting not being available on some modern cpu (I think it was intel), that's why I ask.

So a i<<15 would be 15 times slower than a i<<1

Play Minecraft!
Offline Abuse

JGO Knight


Medals: 14


falling into the abyss of reality


« Reply #8 - Posted 2009-10-13 10:53:15 »

While i'm not entirely sure how branch prediction works in modern cpus, presumably there is a finite number of branches that can be predicted simultaneously?

So, while a micro-benchmark comparing a branching abs() against a non-branching abs() might indicate the branching version is faster; if you introduce it into a more complex app. where a high density of branching code already exists, your results may vary.

Make Elite IV:Dangerous happen! Pledge your backing at KICKSTARTER here! https://dl.dropbox.com/u/54785909/EliteIVsmaller.png
Offline princec

« JGO Spiffy Duke »


Medals: 429
Projects: 3
Exp: 16 years


Eh? Who? What? ... Me?


« Reply #9 - Posted 2009-10-13 11:08:54 »

Barrel shifting was removed on P4 and beyond.

Cas Smiley

Games published by our own members! Check 'em out!
Legends of Yore - The Casual Retro Roguelike
Online Roquen
« Reply #10 - Posted 2009-10-13 12:21:23 »

@Markus_Persson: fixed and variable length shifts have the same cost, regardless of number of positions shifted on intel.

@Abuse: I'm saying that the non-branching version of abs is faster on both NetBurst and Core architectures. Didn't you mean the non-branching clamp?  Another big issue for clamp is the increased number of operations which are basically serial thus not allowing multiple issues to occur..so your point is valid: a non-branching version might be superior integrated in another piece of code.

As for branch predicitors, this stuff varies in a given family. I think Core has 16 and they are only 'consumed' on a taken branch, but this is from memory and most likely wrong.
Offline princec

« JGO Spiffy Duke »


Medals: 429
Projects: 3
Exp: 16 years


Eh? Who? What? ... Me?


« Reply #11 - Posted 2009-10-13 13:19:51 »

Ain't this exactly the sort of optimisation that JVMs are supposed to make based on architecture, rather than us plebian programmers?

Cas Smiley

Offline pjt33
« Reply #12 - Posted 2009-10-13 13:50:35 »

Ain't this exactly the sort of optimisation that JVMs are supposed to make based on architecture, rather than us plebian programmers?

Cas Smiley
I think replacing branches with bit hackery or vice versa is probably a bit much to expect from any optimiser.
Online Roquen
« Reply #13 - Posted 2009-10-13 14:07:26 »

Simple branches to bit-hacks isn't too bad.  It's a simple pattern match like for converting constant muls to addition-chains (and other tech), reducing bit-ops chains, etc.  The real trick is scheduling with a constantly moving target (a la arch changes).  I'm hoping that eventually one of the LLVM based projects will make enough progress to be usable.   
Offline Eli Delventhal

JGO Kernel


Medals: 42
Projects: 11
Exp: 10 years


Game Engineer


« Reply #14 - Posted 2009-10-13 14:32:54 »

Couldn't you use synchronized?  Huh Huh Huh

Tongue
Anyway, I actually haven't bothered to try to write a very fast clamp method, I typically just do this:
1  
2  
3  
myValue = whatever;
myValue = Math.min(myValue, max);
myValue = Math.max(myValue, min);

No doubt that's by far the slowest way you can do it. 3 assignments, two function calls, and who knows what's going on within Math. Oh well, it's never been my bottleneck so I haven't bothered.

See my work:
OTC Software
Offline DzzD
« Reply #15 - Posted 2009-10-17 12:51:03 »


I had this assumption that branching was kinda slow, and fiddling with bits was ultra super fast. I was going to proudly present my bits fiddler on here... Turns out the non-branching code is slower on my Intel Code2Quad.

and IMO, you just prove it again :

2 different cases for clamp() :
v<vmin => 1 conditional
v>min => 2 conditional

so it performs 1 to 2 ops (conditional branches)

only 1 cases for clampPLOT and it performs :
>> 2 times
<< 2 times
- 4 times
& 2 times
+ 1 times
= 1 times
| 2 times
~ 1 times

finally 15 ops for any input value for clampPOT() against 1 or 2, lets say 1.5 ops for clamp() ( also substraction and addition are usully a little slower then other ops )

this give a ratio of : 15/1.5  => according to your sample (28 ms / 19 ms ) and "grosso modo" a conditional branches is about 5 times slower than other ops and thats why unrolling loop always give huge improvment

about zero clamp, why not :
val *=1^(val >> 31);

Offline Riven
« League of Dukes »

« JGO Overlord »


Medals: 840
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #16 - Posted 2009-10-17 13:27:53 »

You're right, but I expected the difference to be much bigger, and on older CPUs, it was.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social
Online Roquen
« Reply #17 - Posted 2009-10-17 16:11:09 »

There a typo or a C-ism:

val *=1^(val >>> 31);

or trade an 'and' for the 'mul':

val &= ~(val >> 31);
Offline DzzD
« Reply #18 - Posted 2009-10-17 16:51:13 »

There a typo or a C-ism:

val *=1^(val >>> 31);

or trade an 'and' for the 'mul':

val &= ~(val >> 31);

oups... this is a C-ism....

Quote
val &= ~(val >> 31);

nice one, much faster than a mul

Pages: [1]
  ignore  |  Print  
 
 
You cannot reply to this message, because it is very, very old.

 

Add your game by posting it in the WIP section,
or publish it in Showcase.

The first screenshot will be displayed as a thumbnail.

rwatson462 (29 views)
2014-12-15 09:26:44

Mr.CodeIt (20 views)
2014-12-14 19:50:38

BurntPizza (40 views)
2014-12-09 22:41:13

BurntPizza (75 views)
2014-12-08 04:46:31

JscottyBieshaar (37 views)
2014-12-05 12:39:02

SHC (50 views)
2014-12-03 16:27:13

CopyableCougar4 (47 views)
2014-11-29 21:32:03

toopeicgaming1999 (113 views)
2014-11-26 15:22:04

toopeicgaming1999 (100 views)
2014-11-26 15:20:36

toopeicgaming1999 (30 views)
2014-11-26 15:20:08
Resources for WIP games
by kpars
2014-12-18 10:26:14

Understanding relations between setOrigin, setScale and setPosition in libGdx
by mbabuskov
2014-10-09 22:35:00

Definite guide to supporting multiple device resolutions on Android (2014)
by mbabuskov
2014-10-02 22:36:02

List of Learning Resources
by Longor1996
2014-08-16 10:40:00

List of Learning Resources
by SilverTiger
2014-08-05 19:33:27

Resources for WIP games
by CogWheelz
2014-08-01 16:20:17

Resources for WIP games
by CogWheelz
2014-08-01 16:19:50

List of Learning Resources
by SilverTiger
2014-07-31 16:29:50
java-gaming.org is not responsible for the content posted by its members, including references to external websites, and other references that may or may not have a relation with our primarily gaming and game production oriented community. inquiries and complaints can be sent via email to the info‑account of the company managing the website of java‑gaming.org
Powered by MySQL Powered by PHP Powered by SMF 1.1.18 | SMF © 2013, Simple Machines | Managed by Enhanced Four Valid XHTML 1.0! Valid CSS!