sproingie


«
Reply #30  Posted
20120429 02:17:08 » 

But the sum is useless at testing accuracy. The only thing it can tell you if it's really your approximation is really broken and nothing else.
If it summed absolute values, then it would be a reasonable indicator of cumulative error compared to the reference implementation. It's not the most useful thing in the world, but I wouldn't call it completely useless. You need something to keep the loop from being optimized out anyway, may as well have some value. Yeesh, it's not even my algorithm, don't ask me to die on the hill of defending a throwaway microbenchmark.




Roquen


«
Reply #31  Posted
20120429 06:22:09 » 

Hehe. Remember that when I'm responded that I'm not only addressing the person I'm responding to. Microbenchmarks are hard to do right. Doubly so when attempting to compare a software implementation vs. hardware. WRT: error analysis. This just isn't the way to draw any conclusions. A sum of absolute errors significantly better but still not very meaningful. Like in Cas's rant about overusage of test units, doing sum of errors is like attempt to show correctness by a bunch of black box tests when some small number of white box tests will tell you the exact information. If you don't know how to do that, then for singles you can brute force it by walking all values in the range using ulp or nextAfter.
For folks that don't get why a sum of errors is useless consider:
f(x) = if (x<0) return 0 else return 1; a(x) = return 1/2;
or
f(x) = cos(x) g(x) = sin(x)




DrZoidberg


«
Reply #32  Posted
20120429 09:37:13 » 

Sin and cos are very fast between PI/4 and +PI/4 because modern CPUs have instructions for sin and cos built in and the JVM uses that. However outside that range the CPU instructions aren't always perfectly accurate so the JVM calculates sin and cos in software instead. That gives us a very easy way to write a fastsin and fastcos method that's much better than any lookup table. 1 2 3 4 5 6 7 8 9 10
 public static double sin(double x) { while(x > PI) x = 2*PI; while(x < PI) x += 2*PI; if(x >= PI/4 && x <= PI/4) return Math.sin(x); else if(x > PI/4 && x < 3*PI/4) return Math.cos(xPI/2); else if(x >= 3*PI/4 && x <= PI) return Math.sin(PIx); else if(x < PI/4 && x > 3*PI/4) return Math.cos(x+PI/2); else if(x <= 3*PI/4 && x >= PI) return Math.sin(PI+x); else return Math.sin(x); } 




Games published by our own members! Check 'em out!


Roquen


«
Reply #33  Posted
20120429 10:18:14 » 

At the risk of sounding like a two year old: "no!" I know of no architecture where sin/cos is software or hardware based where this wouldn't be much slower.




princec


«
Reply #34  Posted
20120429 15:25:10 » 

It will be slower; but more accurate. It's true that the x86 FP code is only accurate to PI/8 radians and beyond that it tends to drift. Or at least, so I read, some years ago. So if you want fast and accurate, you've got to tailor your input by design. Cas




DrZoidberg


«
Reply #35  Posted
20120429 16:38:39 » 

At the risk of sounding like a two year old: "no!" I know of no architecture where sin/cos is software or hardware based where this wouldn't be much slower.
Did you test that theory? Here is my benchmark. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
 import static java.lang.Math.PI;
public class Trig { public static void main(String[] args) { long start = System.nanoTime(); double test = 0; for(int i=0; i<100000000; i++) { test += sin(3.15); } long end = System.nanoTime(); System.out.println("Time = "+((endstart)/1e6)+"ms"); System.out.println(test); start = System.nanoTime(); test = 0; for(int i=0; i<100000000; i++) { test += Math.sin(3.15); } end = System.nanoTime(); System.out.println("Time = "+((endstart)/1e6)+"ms"); System.out.println(test); } public static double sin(double x) { while(x > PI) x = 2*PI; while(x < PI) x += 2*PI; if(x >= PI/4 && x <= PI/4) return Math.sin(x); else if(x > PI/4 && x < 3*PI/4) return Math.cos(xPI/2); else if(x >= 3*PI/4 && x <= PI) return Math.sin(PIx); else if(x < PI/4 && x > 3*PI/4) return Math.cos(x+PI/2); else if(x <= 3*PI/4 && x >= PI) return Math.sin(PI+x); else return Math.sin(x); } } 
the output(using Java 1.7.0_03 64bit): 1 2 3 4
 Time = 81.661547ms 840724.7369070747 Time = 2811.498313ms 840724.7369070747 




Spasi


«
Reply #36  Posted
20120429 20:18:12 » 

At the risk of sounding like a two year old: "no!" I know of no architecture where sin/cos is software or hardware based where this wouldn't be much slower.
Did you test that theory? Here is my benchmark. I'm seeing the performance difference in that range too. But if it were so simple, the builtin sin/cos would do the same thing you do (ignoring numerical accuracy issues). Of course it's not that simple, your code has almost half performance with randomized input. Which is the general case that JDK methods are optimized for. Here's a minimum set of rules that microbenchmarks need to obey in order to be useful:  Do something useful with every loop result, to avoid noop code that gets optimized away.  Use randomized input, perhaps biased a bit towards the most frequent input range.  Don't use Random in the benchmarking loop, generate input ahead of time and store it in an array. Preferably the array should easily fit in the CPU caches, use % addressing if necessary for long loops.  Extract the loop code in its own method and call it enough times to warm it up, before timing anything. OSR can be suboptimal.




Roquen


«
Reply #37  Posted
20120430 06:39:19 » 

Although the benchmark wasn't properly formulated, there's still something rotten in Denmark. I'm away from home for another week and will have to look at this once I'm back. The gap in the numbers is way too large. If hotspot is using x87 and x87 sin/cos have changed to spitting out some microcodes to reduce diefootprint (bad me for not keep up if that's the case) then an SSE argument reduction should be used. Actually I rather hope that sin/cos are implemented in SSE as that allows for scheduling with surrounding code. Either case argument reduction shouldn't have this large of an impact on execution. Even assuming that hotspot has range analysis and the fast version is getting optimized into a sin/cos(x+CONST) call.




Riven


«
Reply #38  Posted
20120430 16:23:53 » 

The while loop could be replaced with some statement involving two modulos. Branching is way too expensive here.

Hi, appreciate more people! Î£ â™¥ = Â¾ Learn how to award medals... and work your way up the social rankings



ZomB
Senior Newbie


«
Reply #39  Posted
20120501 05:16:36 » 

A side side note. At least on my implementation, Sun/Oracle Java's for Windows, in rt.jar (the part of the JVM that is in pure Java) all Math.xxx methods delegate to StrictMath.xxx. So directly calling StrictMath.xxx saves one call in performance.




Games published by our own members! Check 'em out!


Roquen


«
Reply #40  Posted
20120501 05:51:02 » 

all Math.xxx methods delegate to StrictMath.xxx. So directly calling StrictMath.xxx saves one call in performance.
Man I "am" the two year ago of this thread. NO! SEE: Beware of antioptimizations.Even if this were the case, HotSpot is pretty aggressive at inline small code chunks. The while loop could be replaced with some statement involving two modulos. Branching is way too expensive here.
Or better yet, decide on a reasonable range and ignore input outside of it. I'd say that anything outside of [2pi,2pi] has to be the callers problem.




ra4king


«
Reply #41  Posted
20120501 22:17:10 » 

A side side note. At least on my implementation, Sun/Oracle Java's for Windows, in rt.jar (the part of the JVM that is in pure Java) all Math.xxx methods delegate to StrictMath.xxx. So directly calling StrictMath.xxx saves one call in performance.
StrictMath is almost never called from Math, the JVM inlines it with its own faster native implementation. EDIT: Hehe looks like Roquen's link already says that ^^




Jimmt


«
Reply #42  Posted
20120502 01:24:14 » 

Did you get the idea from rainbow tables?




ZomB
Senior Newbie


«
Reply #43  Posted
20120502 08:29:26 » 

Who got what idea? I don't see anything in this thread even remotely related to rainbow tables. A side side note. At least on my implementation, Sun/Oracle Java's for Windows, in rt.jar (the part of the JVM that is in pure Java) all Math.xxx methods delegate to StrictMath.xxx. So directly calling StrictMath.xxx saves one call in performance.
StrictMath is almost never called from Math, the JVM inlines it with its own faster native implementation. EDIT: Hehe looks like Roquen's link already says that ^^ I didn't know that. It isn't flagged as native so I never expected the JVM would generate exceptions for those. I did some quick tests, and indeed the trig functions are in the order of 80 times faster when calling Math. On the other hand, some functions like min/max are 35% faster when calling StrictMath (I haven't tested others yet). Edit: A more advanced test (using my second test type, and preventing code from being optimized away) shows now that StrictMath is about 10% faster than Math. Edit2: (StrictMath is ...) sin,cos: 50% slower tan: 10% faster atan: about the same sqrt: 9000% slower cbrt: about the same floor: about 10% slower ceil: about the same max,min: about the same And I think we're drifting too much off topic. shouldn't someone make a new topic for this?




ra4king


«
Reply #44  Posted
20120502 12:10:01 » 

Or we could stop repeating the same thing over and over....and simply use Riven's methods for fast sin/cos/tan/atan (@Riven do you have fast implementations for asin and acos?)




kappa


«
Reply #45  Posted
20120502 12:52:06 » 

Edit2: (StrictMath is ...) sin,cos: 50% slower tan: 10% faster atan: about the same sqrt: 9000% slower cbrt: about the same floor: about 10% slower ceil: about the same max,min: about the same
Always seems to bother me a bit that Math.sqrt is such a super slow operation, a quick google search seems to show that there is indeed a fast integer square root using similar techniques to the above here. Should make a nice addition to the collection of fast hacky maths methods.




Roquen


«
Reply #46  Posted
20120502 12:56:28 » 

The issue is that optimal approximation is dependent on alot of factors and most people want one version of each function that "just works". That's simply not the way to maximize performance. As I've mentioned before I've never seen a tablebased method which outperforms (on nonembedded systems) numeric methods. But that's "my" usecase YMMV. If you really want fast, you have to know your input range and what kind (abs or rel error) and the bound on that error and potentially access patterns.




Roquen


«
Reply #47  Posted
20120502 13:19:38 » 

WRT: trig & inversetrig. Remember what the man said: "The best optimization is to not need to call anything." If you're calling inverse trig functions, my question is: "do you really want the angle? I mean really, really want the angle". For example if you're storing your objects orientations as angles...therein lies a problem. Another is if you're converting to polar, manipulating the angle then converting back.
WRT: Sqrt. All roots are a pain as the terms of the Talyor series decrease very slowing, so they are very hard to approximate with a polynomial or rational of polynomials. If you almost know the answer, then they are very easy (like Newton's method).




Jimmt


«
Reply #48  Posted
20120502 15:39:43 » 

@ZomB the original poster said he used a "lookuptable." Rainbow tables are lookup tables for hash codes.




Riven


«
Reply #49  Posted
20120502 15:55:01 » 

(@Riven do you have fast implementations for asin and acos?)
Would you use them? Anyway, I don't know about their performance, but given their narrow valid range, it's trivial to fill lookup tables.

Hi, appreciate more people! Î£ â™¥ = Â¾ Learn how to award medals... and work your way up the social rankings



ra4king


«
Reply #50  Posted
20120502 18:21:30 » 

(@Riven do you have fast implementations for asin and acos?)
Would you use them? Haha that's a good question, I don't see why I would ever use acos and asin in a performance critical application




Nate


«
Reply #51  Posted
20120503 03:06:57 » 

Always seems to bother me a bit that Math.sqrt is such a super slow operation, a quick google search seems to show that there is indeed a fast integer square root using similar techniques to the above here. Should make a nice addition to the collection of fast hacky maths methods. Math.sqrt is quite fast. I once tried some of the alternatives and couldn't find anything faster.




Roquen


«
Reply #52  Posted
20120503 10:19:01 » 

Fast/slow. It's all relative. To put some numbers on it in terms of instruction latency/throughput (both in cycles) for modern(ish) Intel. Ranges are to cover a reasonable set of hardware.
SQRTSD: 2039/1439 (one complaint double sqrt) On unspecified input ranges these can drop to ~6/~10. SQRTSS: 2332/2332 (one compliant single sqrt) On unspecified input ranges these can drop to ~6/~10. RSQRTSS: 36/14 (approximate 1/sqrt for singles. We can't do this...big fat sadface)
So what I intended to say about sqrt being "hard" is that unlike lots of other functions reducing the range or precision to get a faster version is much more difficult. About the only way you can succeed is if you know that it's very near a specific value (like when renormalizing). That old SGI trick (popularized by ID) should always be slower that simply using the hardware.




ZomB
Senior Newbie


«
Reply #53  Posted
20120507 02:56:25 » 

@ZomB the original poster said he used a "lookuptable." Rainbow tables are lookup tables for hash codes.
Rainbow tables are clever structures that eliminate the need for memoryeating lookup tables (think in the range of several terabytes) (@Riven do you have fast implementations for asin and acos?)
Would you use them? Haha that's a good question, I don't see why I would ever use acos and asin in a performance critical application How about fractals / fractal art? (actually you'd need the highest possible precision here so the statement is void) Fast/slow. It's all relative. To put some numbers on it in terms of instruction latency/throughput (both in cycles) for modern(ish) Intel. Ranges are to cover a reasonable set of hardware.
SQRTSD: 2039/1439 (one complaint double sqrt) On unspecified input ranges these can drop to ~6/~10. SQRTSS: 2332/2332 (one compliant single sqrt) On unspecified input ranges these can drop to ~6/~10. RSQRTSS: 36/14 (approximate 1/sqrt for singles. We can't do this...big fat sadface)
So what I intended to say about sqrt being "hard" is that unlike lots of other functions reducing the range or precision to get a faster version is much more difficult. About the only way you can succeed is if you know that it's very near a specific value (like when renormalizing). That old SGI trick (popularized by ID) should always be slower that simply using the hardware.
I sometimes use "float d = 1f / (float)Math.sqrt(dx * dx + dy * dy);" (more often double actually), so hopefully the compiler uses RSQRTSS there.




ZomB
Senior Newbie


«
Reply #54  Posted
20120507 09:59:36 » 

Fast/slow. It's all relative. To put some numbers on it in terms of instruction latency/throughput (both in cycles) for modern(ish) Intel. Ranges are to cover a reasonable set of hardware.
SQRTSD: 2039/1439 (one complaint double sqrt) On unspecified input ranges these can drop to ~6/~10. SQRTSS: 2332/2332 (one compliant single sqrt) On unspecified input ranges these can drop to ~6/~10. RSQRTSS: 36/14 (approximate 1/sqrt for singles. We can't do this...big fat sadface)
So what I intended to say about sqrt being "hard" is that unlike lots of other functions reducing the range or precision to get a faster version is much more difficult. About the only way you can succeed is if you know that it's very near a specific value (like when renormalizing). That old SGI trick (popularized by ID) should always be slower that simply using the hardware.
After some googling I found out these are MMX instructions, and Java can't use those afaik. Normal math happens using FPU instructions. I copied this from "Intel Architecture Optimization Reference Manual": "Additionally, while the multicycle floatingpoint instructions, fdiv and fsqrt, execute in the floatingpoint unit pipe, integer instructions can be executed in parallel." So it pays off to do lots of nonfpmath in innerloops that use Math.sqrt and /. Separating (or keeping) those in different innerloops would be bad. And from "How to optimize for the Pentium family of microprocessors" I read that FSQRT uses 70 cycles and cannot overlap integer multiplication instructions (they will block the pipeline). No idea about modern processors however. They should be much more efficient in this, especially the new Core Ix and Celeron Sandy Bridge CPUs.




Roquen


«
Reply #55  Posted
20120507 10:20:51 » 

I'm pretty sure that HotSpot does not use RSQRTSS, but even if it does it won't approach that speed as it would have to perform some newton steps to increase the accuracy (since in java you can't relax things like this.) It would need to do one for single and two for double (don't quote me on this.)
HotSpot does use SSE instructions for a fair amount of FP computation...it simply uses the SISD version (*SS, *SD). All basic operations are performed in SSE.




