heisenbergman


«
Posted
20130610 09:06:04 » 

Just asking because I created this thread on TIG: Is there a known system of converting degrees to (x,y) coordinates? (which so blatantly exposes that I totally fail at trigonometry, but that's beside the point, lol ). If you'll notice somewhere in the middle of Page 1, someone drops by the thread and suggests that I could implement Cos/Sin lookup by storing the values into a table for future reference. He subsequently gets ripped to shreds by some of the members of TIG. Only later on by some serendipitous turn of fate (particularly by the timely necrobump of this thread) do I realize that he was in fact referencing Riven's sin/cos lookup thread here on JGO. That suggestion to use sin/cose lookup tables has gotten some hostile responses on that thread. Such as: It is NOT an alternative. Just forget about it, ok?
Reasoning: If you use sin/cos like this in normal code, the first level cache is usually filled with the current data. A tablebased approach would require parts of the table to be loaded to first level cache, and would flush other data from there. But the table needs to be in L1 cache to be efficient, even a L2 access would already take longer than the basic math to calculate a sin/cos. It's just a few multiplications and additions after all. Look up taylor series if you want to see how few operations exactly. And all of those happen in registers without memory access, some parts even parallelized by the CPU's OOO pipeline. What I see here is deprecated wisdom. Ten years ago a table might have saved the day. Today's CPUs are far more complex and incredibly more smart, and the burden has shifted heavily towards memory access. Now, I wonder... is this approach towards lookingup sin/cos values still relevant for performance today? Or do the posts calling this logic "deprecated wisdom" have some value?




Riven


«
Reply #1  Posted
20130610 10:23:32 » 

I don't really care about people using CAPITALS to make their point All I know is that in realworld use cases, my sin/cos lookup tables make a REAL difference in performance.

Hi, appreciate more people! Σ ♥ = ¾ Learn how to award medals... and work your way up the social



pjt33


«
Reply #2  Posted
20130610 10:53:39 » 

Now, I wonder... is this approach towards lookingup sin/cos values still relevant for performance today? There is only one way to get a useful answer to this: profile. I would start out using the standard library and only look at replacements if you see Math.sin or Math.cos actually showing up in your profiling reports.




Games published by our own members! Check 'em out!


princec


«
Reply #3  Posted
20130610 11:03:03 » 

+1 to what Riven said  we get significant performance boosts using LUTs. Many of the TIGs deriders probably forget that the intrinsic sin/cos functions have to be mathematically correct for all possible parameter values whereas we can make an optimisation knowing that we have a pretty small and finite set of values and a less stringent requirement for accuracy. Cas




Oskuro


«
Reply #4  Posted
20130610 14:02:21 » 

If you'll notice somewhere in the middle of Page 1, someone drops by the thread and suggests that I could implement Cos/Sin lookup by storing the values into a table for future reference. He subsequently gets ripped to shreds by some of the members of TIG.
That someone being me




heisenbergman


«
Reply #5  Posted
20130610 14:29:34 » 

oh hai thar Oskuro! I should have known you'd be here. I appreciate the contribution to that thread. I might not be using this approach for now but it's nice to be aware of it.




Oskuro


«
Reply #6  Posted
20130610 14:44:37 » 

You're welcome!
I essentially made the suggestion because I'm adopting the approach for my game, so I kind of had it on my mind.
Wish I had it running so I could do some profiling, though.




Roquen


«
Reply #7  Posted
20130610 15:37:00 » 

It depends on the real usage patterns, including how memory is being accessed otherwise and what you're comparing it against. My experience is no..but that doesn't mean a alot.




Cero


«
Reply #8  Posted
20130610 16:28:42 » 

I don't really care about people using CAPITALS to make their point Sry I do that too. Reason is in simple chats/irc you dont have formatting often times, and its a habit =D




Oskuro


«
Reply #9  Posted
20130610 21:00:15 » 

I'm still having trouble understanding how exactly is Riven using the SIN_MASK to filter angle values.
I kind of understand it, but not enough to tinker with it so I can clamp other value ranges.




Games published by our own members! Check 'em out!


relminator


«
Reply #10  Posted
20130611 01:40:23 » 

I'm still having trouble understanding how exactly is Riven using the SIN_MASK to filter angle values.
I kind of understand it, but not enough to tinker with it so I can clamp other value ranges.
SIN_MASK is a powof2value1. Basically you construct your lut to have a pow of two max index so that you don't use the slower mod % or an if then else to check for out of range. AND is much faster than a DIV. I would hazard a guess that riven converted his angles to Bradians (binary radians where a full revolution is a POT) Been using this trick in some of my older games and gfx demos. There's also another trick to minimize memory use by 1/8 by "octantifying" angles since sin and cos are periodic (actually 1/16 if you factor the fact that sin and cos are complimentary so you can just calculate either by adding 1/2pi ti the index). This is not useful on the pc though as we have memory to burn.




Roquen


«
Reply #11  Posted
20130611 05:26:25 » 

"memory to burn" is both true and very untrue. Memory size isn't so important...memory motion is very very important.




relminator


«
Reply #12  Posted
20130611 06:05:49 » 

"memory to burn" is both true and very untrue. Memory size isn't so important...memory motion is very very important.
I was speaking in terms of size instead of speed.




Roquen


«
Reply #13  Posted
20130611 12:10:01 » 

Well the real answer that that thread is: don't use angles...angles hate you, so hate 'em back.




Riven


«
Reply #14  Posted
20130611 13:38:08 » 

I'd like to say that 'my' lookup tables take 16KB of memory, or 4 pages. That means that they are basically always in L2 cache, and not big enough to push other data out. They can even pretty much fully stay within L1 cache, in your inner loop. 1 2 3 4
 math=202,067us full[]=7,586us half[]=12,851us taylor=26,783us math=201,610us full[]=7,701us half[]=12,878us taylor=26,448us math=202,474us full[]=7,672us half[]=12,886us taylor=26,438us math=202,321us full[]=7,592us half[]=12,864us taylor=26,306us 
I know it's microbenchmark, but it's rotating vectors, which is a pretty common use case: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192
 import java.util.Random;
public class FastMathBench { public static void main(String[] args) { System.out.println(Math.sin(0.5f)); System.out.println(sinFull(0.5f)); System.out.println(sinHalf(0.5f)); System.out.println(sinTaylor(0.5f)); System.out.println();
System.out.println(Math.sin(4.5f)); System.out.println(sinFull(4.5f)); System.out.println(sinHalf(4.5f)); System.out.println(sinTaylor(4.5f)); System.out.println();
float[] angles = new float[1024 * 1024]; float[] vectors = new float[angles.length * 2];
Random r = new Random(235); for (int i = 0; i < angles.length; i++) { angles[i] = (float) (r.nextDouble() * Math.PI * 2.0); vectors[i * 2 + 0] = r.nextFloat(); vectors[i * 2 + 1] = r.nextFloat(); }
for (int i = 0; i < 16; i++) { long t0 = System.nanoTime(); benchMathSin(vectors, angles); long t1 = System.nanoTime(); benchLookupFullSin(vectors, angles); long t2 = System.nanoTime(); benchLookupHalfSin(vectors, angles); long t3 = System.nanoTime(); benchLookupTaylorSin(vectors, angles); long t4 = System.nanoTime();
long tMath = (t1  t0) / 1000; long tLUT1 = (t2  t1) / 1000; long tLUT2 = (t3  t2) / 1000; long tTayl = (t4  t3) / 1000;
System.out.println("math=" + tMath + "us\tfull[]=" + tLUT1 + "us\thalf[]=" + tLUT2 + "us\ttaylor=" + tTayl + "us"); } }
private static void benchMathSin(float[] vectors, float[] angles) { for (int i = 0; i < angles.length; i++) { float x = vectors[i * 2 + 0]; float y = vectors[i * 2 + 1]; float a = angles[i];
float cos = (float) Math.cos(a); float sin = (float) Math.sin(a);
vectors[i * 2 + 0] = x * cos  y * sin; vectors[i * 2 + 1] = y * cos + x * sin; } }
private static void benchLookupFullSin(float[] vectors, float[] angles) { for (int i = 0; i < angles.length; i++) { float x = vectors[i * 2 + 0]; float y = vectors[i * 2 + 1]; float a = angles[i];
float cos = sinFull(a); float sin = cosFull(a);
vectors[i * 2 + 0] = x * cos  y * sin; vectors[i * 2 + 1] = y * cos + x * sin; } }
private static void benchLookupHalfSin(float[] vectors, float[] angles) { for (int i = 0; i < angles.length; i++) { float x = vectors[i * 2 + 0]; float y = vectors[i * 2 + 1]; float a = angles[i];
float cos = cosHalf(a); float sin = sinHalf(a);
vectors[i * 2 + 0] = x * cos  y * sin; vectors[i * 2 + 1] = y * cos + x * sin; } }
private static void benchLookupTaylorSin(float[] vectors, float[] angles) { for (int i = 0; i < angles.length; i++) { float x = vectors[i * 2 + 0]; float y = vectors[i * 2 + 1]; float a = angles[i];
float cos = cosTaylor(a); float sin = sinTaylor(a);
vectors[i * 2 + 0] = x * cos  y * sin; vectors[i * 2 + 1] = y * cos + x * sin; } }
private static final float RAD, DEG, SIN_TO_COS; private static final int SIN_BITS, SIN_MASK, SIN_MASK2, SIN_COUNT, SIN_COUNT2; private static final float radFull, radToIndex; private static final float degFull, degToIndex; private static final float[] sinFull, sinHalf;
static { RAD = (float) Math.PI / 180.0f; DEG = 180.0f / (float) Math.PI; SIN_TO_COS = (float) (Math.PI * 0.5f);
SIN_BITS = 12; SIN_MASK = ~(1 << SIN_BITS); SIN_MASK2 = SIN_MASK >> 1; SIN_COUNT = SIN_MASK + 1; SIN_COUNT2 = SIN_MASK2 + 1;
radFull = (float) (Math.PI * 2.0); degFull = (float) (360.0); radToIndex = SIN_COUNT / radFull; degToIndex = SIN_COUNT / degFull;
sinFull = new float[SIN_COUNT]; for (int i = 0; i < SIN_COUNT; i++) { sinFull[i] = (float) Math.sin((i + Math.min(1, i % (SIN_COUNT / 4)) * 0.5) / SIN_COUNT * radFull); }
sinHalf = new float[SIN_COUNT2]; for (int i = 0; i < SIN_COUNT2; i++) { sinHalf[i] = (float) Math.sin((i + Math.min(1, i % (SIN_COUNT / 4)) * 0.5) / SIN_COUNT * radFull); } }
public static final float sinFull(float rad) { return sinFull[(int) (rad * radToIndex) & SIN_MASK]; }
public static final float cosFull(float rad) { return sinFull(rad + SIN_TO_COS); }
public static final float sinHalf(float rad) { int index1 = (int) (rad * radToIndex) & SIN_MASK; int index2 = index1 & SIN_MASK2; int mul = ((index1 == index2) ? +1 : 1); return sinHalf[index2] * mul; }
public static final float cosHalf(float rad) { return sinHalf(rad + SIN_TO_COS); }
public static final float sinTaylor(float rad) {
double x = rad;
double x2 = x * x; double x3 = x2 * x; double x5 = x2 * x3; double x7 = x2 * x5; double x9 = x2 * x7; double x11 = x2 * x9; double x13 = x2 * x11; double x15 = x2 * x13; double x17 = x2 * x15;
double val = x; val = x3 * 0.16666666666666666666666666666667; val += x5 * 0.00833333333333333333333333333333; val = x7 * 1.984126984126984126984126984127e4; val += x9 * 2.7557319223985890652557319223986e6; val = x11 * 2.5052108385441718775052108385442e8; val += x13 * 1.6059043836821614599392377170155e10; val = x15 * 7.6471637318198164759011319857881e13; val += x17 * 2.8114572543455207631989455830103e15; return (float) val; }
public static final float cosTaylor(float rad) { return sinTaylor(rad + SIN_TO_COS); } } 

Hi, appreciate more people! Σ ♥ = ¾ Learn how to award medals... and work your way up the social



relminator


«
Reply #15  Posted
20130611 14:13:37 » 

Slightly offtopic: Could luts being faster in java be attributed to how Math.xxx behave? Ie. returning doubles instead of floats? I remember when I coded this 256 byte demo: www.pouet.net/prod.php?which=24821That st0 returns a 32bit float. I'm not sure about the newer pc architectures though.




Roquen


«
Reply #16  Posted
20130611 14:35:46 » 

@Riven: exactly. If you don't need continuous sin/cos and you're intensely calling the tablebased function then the big hit comes up front and after that table will be (on newer cards) all in L1. You have a ~5 cycle latency to get a 32bit out of L1 into a register which will likely to be hidden behind other ALU ops. So win in that situation. WRT the taylor series, you can get within a 3ulp accurate result with about 6 muls using minimax, ignoring range reduction and you're benchmark version isn't formed as Horner, so you're already doing twice as many muls as needed and have a very long dependency chain.




Riven


«
Reply #17  Posted
20130611 14:38:34 » 

I actually reduced the dependency chain, but HotSpot didn't like it, and produced slower native code.

Hi, appreciate more people! Σ ♥ = ¾ Learn how to award medals... and work your way up the social



Oskuro


«
Reply #18  Posted
20130611 14:41:36 » 

SIN_MASK is a powof2value1. Basically you construct your lut to have a pow of two max index so that you don't use the slower mod % or an if then else to check for out of range. AND is much faster than a DIV.
Actually, I understand that. My problem is understanding why the MASK works (Yes, extreme ignorance of binary arithmetics, sorry). For example, in my game I'm simplifying angles so there are only 16 possible options, so my mask is built like: 1 2 3 4 5
 int MASK = ~(1 << 4);
public static double lut_sin(int value) { return sin_lut[value & MASK]; } 
Which means that whatever the integer value I punch in, I get and index value between 0 and 15 (sorry if the code doesn't match this assumption, I don't have the actual working source at hand now). So I'm shifting 1 four bits, and then negating it. I get that. My guess is that shifting the 1 is to have the sign bit into the resulting value. What I don't understand is how that results in the angle value being parsed correctly despite it being negative or larger than the boundary. And yes, I'm experimenting and coming to a conclusion myself, but wouldn't mind some help, I love these bitarithmetic tricks




Roquen


«
Reply #19  Posted
20130611 14:47:50 » 

@Riven: interesting. Can you spew out the asm?




Riven


«
Reply #20  Posted
20130611 14:48:15 » 

WRT the taylor series, you can get within a 3ulp accurate result with about 6 muls using minimax, ignoring range reduction
This is 10% slower than my 'slow' taylor version. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
 public static final float sinTaylor2(float rad) { double x = rad; double a0 = +1.0; double a1 = 1.666666666640169148537065260055e1; double a2 = +8.333333316490113523036717102793e3; double a3 = 1.984126600659171392655484413285e4; double a4 = +2.755690114917374804474016589137e6; double a5 = 2.502845227292692953118686710787e8; double a6 = +1.538730635926417598443354215485e10; double x2 = x * x; return (float) (x * (a0 + x2 * (a1 + x2 * (a2 + x2 * (a3 + x2 * (a4 + x2 * (a5 + x2 * a6))))))); }
public static final float cosTaylor2(float rad) { return sinTaylor2(rad + SIN_TO_COS); } 

Hi, appreciate more people! Σ ♥ = ¾ Learn how to award medals... and work your way up the social



Riven


«
Reply #21  Posted
20130611 15:00:21 » 

@Riven: interesting. Can you spew out the asm?
No but this is my latest attempt at reducing the dependency chain (still 3% slower than original) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
 public static final float sinTaylor(float rad) { double x = rad;
double x2 = x * x; double x6 = x2 * x2 * x2; double x3 = x * x * x; double x5 = x * x * x3; double x7 = x6 * x; double x9 = x6 * x3; double x11 = x6 * x5; double x13 = x6 * x7; double x15 = x6 * x9; double x17 = x6 * x11;
double val1 = x  x3 * 0.16666666666666666666666666666667 + x5 * 0.00833333333333333333333333333333  x7 * 1.984126984126984126984126984127e4 + x9 * 2.7557319223985890652557319223986e6;
double val2 = x11 * 2.5052108385441718775052108385442e8 + x13 * 1.6059043836821614599392377170155e10  x15 * 7.6471637318198164759011319857881e13 + x17 * 2.8114572543455207631989455830103e15;
return (float) (val1 + val2); } 

Hi, appreciate more people! Σ ♥ = ¾ Learn how to award medals... and work your way up the social



Nate


«
Reply #22  Posted
20130611 18:27:19 » 





Riven


«
Reply #23  Posted
20130611 18:33:19 » 

As you may have noticed, I adjusted my version with the edge case for straight corners.

Hi, appreciate more people! Σ ♥ = ¾ Learn how to award medals... and work your way up the social



Nate


«
Reply #24  Posted
20130612 00:09:42 » 

Cool. Where do you keep your latest version? I use + half PI for cosine to double the accuracy with the same memory at the cost of + half PI. Since everything is jammed inside the MathUtils class, I used static classes so the tables aren't initialized unless actually used.




Riven


«
Reply #25  Posted
20130612 00:31:34 » 

I use + half PI for cosine to double the accuracy with the same memory at the cost of + half PI. If you looked at my code, you'd have seen I did the same I never needed more accuracy than 12 bits, so I just halved memory, when I ditched the cos lookup table. Since everything is jammed inside the MathUtils class, I used static classes so the tables aren't initialized unless actually used.
That's actually a nice feature, but only really useful for the atan2LUT. Every single game will need fast sin/cos, so I don't really see the point in lazy loading the sinLUT. Cool. Where do you keep your latest version? I didn't change that much... I think we went through every 'diff' by now

Hi, appreciate more people! Σ ♥ = ¾ Learn how to award medals... and work your way up the social



Nate


«
Reply #26  Posted
20130612 03:39:34 » 

That's actually a nice feature, but only really useful for the atan2LUT. Every single game will need fast sin/cos, so I don't really see the point in lazy loading the sinLUT.
I tend to use the LUT in games and libgdx itself, but other libgdx contributors don't. In fact they seem to actively change libgdx to not use the LUT. Eg, scene2d no longer uses it. I haven't investigated their reasoning, but I assume they don't feel it is accurate enough. Maybe when compounded?




Roquen


«
Reply #27  Posted
20130612 13:18:17 » 

Protip: Don't come home at 4am, decide not to go bed and instead write some code...it might look like this: http://pastebin.javagaming.org/a71aa0c1f6a. That code is totally broken...I didn't even check that the minmax method hadn't converged. But, now a few hours later I realize that it doesn't really matter. The point of throwing that together was to be able to do a sidebyside comparison of a polynomial of a given degree with table based in Riven's microbench marks as data points...so only the form and number of term really matters. All of these are formulated in Horner, so no attempt to break up dependency chains (thankfully I can't image what that might have ended up looking like). What I see using 64 bit build 1.7.0_21b11, with: X:MaxInlineSize=100 XX:CompileThreshold=1000. I set the first because it might be why Riven is seeing a slowdown when breaking dependency chains and you probably do want to use a higher value than the default in general (39 is what I think it is and the number is the count of java bytecodes). The second option is to be mostly avoided as it will cut short collecting of profiling information and cause routines that probably don't been to be compiled to be, well, compiled. bad_sin_0 is just an equivalent to Riven's original power series (minus doubles and nested) and is slightly faster than "half". bad_sin_3 is about the midpoint of full and half. This form on +/{pi, pi/2, pi/4} have relative error values of ~{.0000167044, 5.31348*10 ^{9} , 4.54084*10 ^{12}} bad_sin_5 is about the same speed as "full". This form on +/{pi, pi/2, pi/4} have relative error values of ~{.0194862, .000108162, 1.5071*10 ^{6}} Well, assuming I didn't screw up again.




