13.8x faster atan2 (updated)
Offline sproingie
« Reply #30 - Posted 2012-04-29 04:17:08 »

But the sum is useless for testing accuracy. The only thing it can tell you is whether your approximation is really broken, and nothing else.

If it summed absolute values, then it would be a reasonable indicator of cumulative error compared to the reference implementation. It's not the most useful thing in the world, but I wouldn't call it completely useless. You need something to keep the loop from being optimized out anyway, so it may as well have some value.

Yeesh, it's not even my algorithm; don't ask me to die on the hill of defending a throwaway microbenchmark.
Offline Roquen
« Reply #31 - Posted 2012-04-29 08:22:09 »

Hehe.  Remember that when I respond, I'm not only addressing the person I'm responding to.  Microbenchmarks are hard to do right, doubly so when comparing a software implementation against hardware.  WRT error analysis: this just isn't the way to draw any conclusions.  A sum of absolute errors is significantly better, but still not very meaningful.  As in Cas's rant about over-use of unit tests, doing a sum of errors is like attempting to show correctness with a bunch of black-box tests when a small number of white-box tests will tell you the exact information.  If you don't know how to do that, then for singles you can brute-force it by walking all values in the range using ulp or nextAfter.

For folks that don't get why a sum of errors is useless consider:

f(x) = if (x<0) return 0 else return 1;
a(x) = return 1/2;

or

f(x) = cos(x)
g(x) = sin(x)
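
A minimal sketch (my own, not Roquen's code) of that brute-force scan: walk every representable float in the target range with Math.nextUp and track the worst absolute error against the double-precision reference. The fastSin stub is a placeholder for whatever approximation you want to check.

public class ErrorScan {
    // Hypothetical approximation under test; substitute the real one.
    static float fastSin(float x) {
        return (float) Math.sin(x);
    }

    public static void main(String[] args) {
        float x = (float) -Math.PI;     // roughly 2 billion floats in [-PI, PI];
        float end = (float) Math.PI;    // narrow the range if that's too slow
        double maxAbsError = 0.0;
        float worstInput = x;
        while (x <= end) {
            double error = Math.abs(fastSin(x) - Math.sin(x));
            if (error > maxAbsError) {
                maxAbsError = error;
                worstInput = x;
            }
            x = Math.nextUp(x); // step to the next representable float
        }
        System.out.println("max abs error = " + maxAbsError + " at x = " + worstInput);
    }
}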
Offline DrZoidberg
« Reply #32 - Posted 2012-04-29 11:37:13 »

Sin and cos are very fast between -PI/4 and +PI/4 because modern CPUs have built-in instructions for sin and cos and the JVM uses them. Outside that range, however, the CPU instructions aren't always perfectly accurate, so the JVM calculates sin and cos in software instead. That gives us a very easy way to write fastsin and fastcos methods that are much better than any lookup table.

// Reduce x into [-PI, PI], then map it onto the [-PI/4, PI/4] range where
// the sin/cos intrinsics are fastest, using the standard shift identities.
public static double sin(double x) {
    while(x > PI) x -= 2*PI;
    while(x < -PI) x += 2*PI;
    if(x >= -PI/4 && x <= PI/4) return Math.sin(x);
    else if(x > PI/4 && x < 3*PI/4) return Math.cos(x-PI/2);
    else if(x >= 3*PI/4 && x <= PI) return Math.sin(PI-x);
    else if(x < -PI/4 && x > -3*PI/4) return -Math.cos(x+PI/2);
    else if(x <= -3*PI/4 && x >= -PI) return -Math.sin(PI+x);
    else return Math.sin(x);
}
Offline Roquen
« Reply #33 - Posted 2012-04-29 12:18:14 »

At the risk of sounding like a two year old: "no!"  I know of no architecture, whether sin/cos is software- or hardware-based, where this wouldn't be much slower.
Offline princec
« Reply #34 - Posted 2012-04-29 17:25:10 »

It will be slower, but more accurate. It's true that the x86 FP code is only accurate to PI/8 radians, and beyond that it tends to drift. Or at least, so I read some years ago. So if you want fast and accurate, you've got to tailor your input by design.

Cas Smiley

Offline DrZoidberg
« Reply #35 - Posted 2012-04-29 18:38:39 »

At the risk of sounding like a two year old: "no!"  I know of no architecture, whether sin/cos is software- or hardware-based, where this wouldn't be much slower.

Did you test that theory?
Here is my benchmark.

import static java.lang.Math.PI;

public class Trig {
    public static void main(String[] args) {
        long start = System.nanoTime();
        double test = 0; // to prevent the JVM from optimising the loop away
        for(int i=0; i<100000000; i++) {
            test += sin(3.15);
        }
        long end = System.nanoTime();
        System.out.println("Time = "+((end-start)/1e6)+"ms");
        System.out.println(test);
        start = System.nanoTime();
        test = 0;
        for(int i=0; i<100000000; i++) {
            test += Math.sin(3.15);
        }
        end = System.nanoTime();
        System.out.println("Time = "+((end-start)/1e6)+"ms");
        System.out.println(test);
    }

    public static double sin(double x) {
        while(x > PI) x -= 2*PI;
        while(x < -PI) x += 2*PI;
        if(x >= -PI/4 && x <= PI/4) return Math.sin(x);
        else if(x > PI/4 && x < 3*PI/4) return Math.cos(x-PI/2);
        else if(x >= 3*PI/4 && x <= PI) return Math.sin(PI-x);
        else if(x < -PI/4 && x > -3*PI/4) return -Math.cos(x+PI/2);
        else if(x <= -3*PI/4 && x >= -PI) return -Math.sin(PI+x);
        else return Math.sin(x);
    }
}


The output (using Java 1.7.0_03, 64-bit):

Time = 81.661547ms
-840724.7369070747
Time = 2811.498313ms
-840724.7369070747
Online Spasi
« Reply #36 - Posted 2012-04-29 22:18:12 »

At the risk of sounding like a two year old: "no!"  I know of no architecture, whether sin/cos is software- or hardware-based, where this wouldn't be much slower.

Did you test that theory?
Here is my benchmark.

I'm seeing the performance difference in that range too. But if it were so simple, the built-in sin/cos would do the same thing you do (ignoring numerical accuracy issues). Of course it's not that simple: your code runs at almost half the performance with randomized input, which is the general case that the JDK methods are optimized for.

Here's a minimum set of rules that micro-benchmarks need to obey in order to be useful (a sketch following them appears below):

- Do something useful with every loop result, to avoid no-op code that gets optimized away.
- Use randomized input, perhaps biased a bit towards the most frequent input range.
- Don't use Random in the benchmarking loop; generate input ahead of time and store it in an array. Preferably the array should easily fit in the CPU caches; use % addressing if necessary for long loops.
- Extract the loop code into its own method and call it enough times to warm it up before timing anything. OSR can be sub-optimal.
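
A minimal sketch (mine, not Spasi's code) of a timing loop that follows those rules; the class and method names are made up for illustration.

import java.util.Random;

public class SinBench {
    static final int INPUTS = 4096; // small enough to stay in the CPU cache
    static final double[] input = new double[INPUTS];

    // The timed loop lives in its own method so it gets fully JIT-compiled
    // after warm-up instead of relying on OSR.
    static double run(int iterations) {
        double sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += Math.sin(input[i % INPUTS]); // % addressing into the pre-generated array
        }
        return sum; // the result is consumed, so the loop can't be optimized away
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        for (int i = 0; i < INPUTS; i++) {
            input[i] = (rng.nextDouble() - 0.5) * 4 * Math.PI; // random angles in [-2*PI, 2*PI)
        }
        for (int i = 0; i < 20; i++) {
            run(1000000); // warm-up
        }
        long start = System.nanoTime();
        double result = run(100000000);
        long end = System.nanoTime();
        System.out.println("Time = " + (end - start) / 1e6 + "ms, result = " + result);
    }
}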
Offline Roquen
« Reply #37 - Posted 2012-04-30 08:39:19 »

Although the benchmark wasn't properly formulated, there's still something rotten in Denmark.  I'm away from home for another week and will have to look at this once I'm back.  The gap in the numbers is way too large.  If HotSpot is using x87, and x87 sin/cos have changed to spitting out microcode to reduce die footprint (bad me for not keeping up if that's the case), then an SSE argument reduction should be used.  Actually I rather hope that sin/cos are implemented in SSE, as that allows scheduling with the surrounding code.  In either case, argument reduction shouldn't have this large an impact on execution, even assuming that HotSpot has range analysis and the fast version is getting optimized into a sin/cos(x+CONST) call.
Offline Riven
« Reply #38 - Posted 2012-04-30 18:23:53 »

The while loop could be replaced with some statement involving two modulos. Branching is way too expensive here.
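
A minimal sketch of that suggestion (my own formulation, assuming the target range is [-PI, PI)): two floating-point remainder operations replace the while loops.

public static double wrapToPi(double x) {
    final double TWO_PI = 2.0 * Math.PI;
    // The first % brings x into (-TWO_PI, TWO_PI); adding 3*PI makes the
    // second % land in [0, TWO_PI), and subtracting PI yields [-PI, PI).
    return ((x % TWO_PI) + 3.0 * Math.PI) % TWO_PI - Math.PI;
}

For very large |x| this loses precision, just like the loop version, but there is no data-dependent branching.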

Offline Zom-B
« Reply #39 - Posted 2012-05-01 07:16:36 »

A side note. At least on my implementation, Sun/Oracle's Java for Windows, in rt.jar (the part of the JVM that is in pure Java) all Math.xxx methods delegate to StrictMath.xxx. So directly calling StrictMath.xxx saves one call in performance.
Offline Roquen
« Reply #40 - Posted 2012-05-01 07:51:02 »

all Math.xxx methods delegate to StrictMath.xxx. So directly calling StrictMath.xxx saves one call in performance.

Man, I "am" the two year old of this thread.  NO!  SEE: Beware of anti-optimizations.

Even if this were the case, HotSpot is pretty aggressive at inlining small code chunks.

Quote
The while loop could be replaced with some statement involving two modulos. Branching is way too expensive here.
Or better yet, decide on a reasonable range and ignore input outside of it.  I'd say that anything outside of [-2pi,2pi] has to be the caller's problem.
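
A minimal sketch of that contract, assuming the caller guarantees x in [-2*PI, 2*PI]:

public static double sinSmallRange(double x) {
    // Caller's problem if |x| > 2*PI: one conditional step is then enough
    // to bring x into [-PI, PI].
    if (x > Math.PI) x -= 2.0 * Math.PI;
    else if (x < -Math.PI) x += 2.0 * Math.PI;
    return Math.sin(x); // or any approximation that is valid on [-PI, PI]
}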
Offline ra4king
« Reply #41 - Posted 2012-05-02 00:17:10 »

A side side note. At least on my implementation, Sun/Oracle Java's for Windows, in rt.jar (the part of the JVM that is in pure Java) all Math.xxx methods delegate to StrictMath.xxx. So directly calling StrictMath.xxx saves one call in performance.
StrictMath is almost never called from Math; the JVM inlines it with its own faster native implementation.

EDIT: Hehe looks like Roquen's link already says that ^^ Smiley

Offline Jimmt
« Reply #42 - Posted 2012-05-02 03:24:14 »

Did you get the idea from rainbow tables?
Offline Zom-B
« Reply #43 - Posted 2012-05-02 10:29:26 »

Who got what idea? I don't see anything in this thread even remotely related to rainbow tables.



A side side note. At least on my implementation, Sun/Oracle Java's for Windows, in rt.jar (the part of the JVM that is in pure Java) all Math.xxx methods delegate to StrictMath.xxx. So directly calling StrictMath.xxx saves one call in performance.
StrictMath is almost never called from Math, the JVM inlines it with its own faster native implementation.

EDIT: Hehe looks like Roquen's link already says that ^^ Smiley

I didn't know that. It isn't flagged as native, so I never expected the JVM would make an exception for those.

I did some quick tests, and indeed the trig functions are on the order of 80 times faster when calling Math.

On the other hand, some functions like min/max are 35% faster when calling StrictMath (I haven't tested others yet).

Edit: A more advanced test (using my second test type, and preventing the code from being optimized away) now shows that StrictMath is about 10% faster than Math.

Edit2: (StrictMath is ...)
sin,cos: 50% slower
tan: 10% faster
atan: about the same
sqrt: 9000% slower
cbrt: about the same
floor: about 10% slower
ceil: about the same
max,min: about the same

And I think we're drifting too far off topic. Shouldn't someone make a new topic for this?
Offline ra4king
« Reply #44 - Posted 2012-05-02 14:10:01 »

Or we could stop repeating the same thing over and over... and simply use Riven's methods for fast sin/cos/tan/atan. (@Riven, do you have fast implementations for asin and acos?)

Offline kappa
« Reply #45 - Posted 2012-05-02 14:52:06 »

Edit2: (StrictMath is ...)
sin,cos: 50% slower
tan: 10% faster
atan: about the same
sqrt: 9000% slower
cbrt: about the same
floor: about 10% slower
ceil: about the same
max,min: about the same
Always seems to bother me a bit that Math.sqrt is such a super-slow operation; a quick Google search seems to show that there is indeed a fast integer square root using similar techniques to the above here. It should make a nice addition to the collection of fast hacky maths methods.
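
For reference, a minimal sketch of such an integer square root (the classic bit-by-bit method, not the code kappa linked); whether it actually beats the hardware Math.sqrt on a desktop JVM is exactly what the following replies question.

// Returns floor(sqrt(x)) for x >= 0, computing one result bit per iteration.
public static int isqrt(int x) {
    int result = 0;
    int bit = 1 << 30; // the highest power of four that fits in an int

    while (bit > x) bit >>= 2;

    while (bit != 0) {
        if (x >= result + bit) {
            x -= result + bit;
            result = (result >> 1) + bit;
        } else {
            result >>= 1;
        }
        bit >>= 2;
    }
    return result;
}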
Offline Roquen
« Reply #46 - Posted 2012-05-02 14:56:28 »

The issue is that the optimal approximation depends on a lot of factors, and most people want one version of each function that "just works".  That's simply not the way to maximize performance.  As I've mentioned before, I've never seen a table-based method which out-performs numeric methods (on non-embedded systems).  But that's "my" use-case; YMMV.  If you really want fast, you have to know your input range, what kind of error matters (absolute or relative), the bound on that error, and potentially the access patterns.
Offline Roquen
« Reply #47 - Posted 2012-05-02 15:19:38 »

WRT trig & inverse trig: remember what the man said: "The best optimization is to not need to call anything."  If you're calling inverse trig functions, my question is: "Do you really want the angle?  I mean really, really want the angle?"  For example, if you're storing your objects' orientations as angles...therein lies a problem.  Another is if you're converting to polar, manipulating the angle, then converting back.

WRT sqrt: all roots are a pain, as the terms of the Taylor series decrease very slowly, so they are very hard to approximate with a polynomial or a ratio of polynomials.  If you almost know the answer, then they are very easy (like Newton's method).
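
A minimal sketch (my own) of the "almost know the answer" case: a single Newton step for 1/sqrt(x) around a guess g, e.g. when re-normalizing a vector whose length is already close to 1 (so g is about 1).

// One Newton iteration for f(y) = 1/y^2 - x, which converges to 1/sqrt(x).
public static double invSqrtStep(double x, double g) {
    return g * (1.5 - 0.5 * x * g * g);
}

With g = 1.0 this re-normalizes a nearly unit-length vector without calling Math.sqrt at all; each extra step roughly doubles the number of correct digits.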
Offline Jimmt
« Reply #48 - Posted 2012-05-02 17:39:43 »

@Zom-B the original poster said he used a "lookup-table." Rainbow tables are lookup tables for hash codes.
Offline Riven
« Reply #49 - Posted 2012-05-02 17:55:01 »

(@Riven do you have fast implementations for asin and acos?)
Would you use them?

Anyway, I don't know about their performance, but given their narrow valid range, it's trivial to fill lookup tables.
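
A minimal sketch (mine, not Riven's implementation) of such a table for acos: the valid input range is only [-1, 1], so a flat table with a linear index mapping is enough, and asin(x) follows as PI/2 - acos(x).

public class FastAcos {
    private static final int SIZE = 4096;
    private static final float[] TABLE = new float[SIZE + 1];

    static {
        for (int i = 0; i <= SIZE; i++) {
            TABLE[i] = (float) Math.acos(i * 2.0 / SIZE - 1.0); // map index 0..SIZE onto [-1, 1]
        }
    }

    public static float acos(float x) {
        // Nearest-entry lookup; accuracy is limited by the table resolution.
        int index = Math.round((x + 1.0f) * (SIZE / 2.0f));
        return TABLE[Math.min(Math.max(index, 0), SIZE)];
    }
}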

Offline ra4king
« Reply #50 - Posted 2012-05-02 20:21:30 »

(@Riven do you have fast implementations for asin and acos?)
Would you use them?
Haha that's a good question, I don't see why I would ever use acos and asin in a performance critical application Grin

Offline Nate

« Reply #51 - Posted 2012-05-03 05:06:57 »

Always seems to bother me a bit that Math.sqrt is such a super-slow operation; a quick Google search seems to show that there is indeed a fast integer square root using similar techniques to the above here. It should make a nice addition to the collection of fast hacky maths methods.

Math.sqrt is quite fast. I once tried some of the alternatives and couldn't find anything faster.

Offline Roquen
« Reply #52 - Posted 2012-05-03 12:19:01 »

Fast/slow: it's all relative.  To put some numbers on it, here are instruction latency/throughput figures (both in cycles) for modern(ish) Intel.  The ranges cover a reasonable set of hardware.

SQRTSD: 20-39/14-39  (one compliant double sqrt)  On unspecified input ranges these can drop to ~6/~10.
SQRTSS: 23-32/23-32  (one compliant single sqrt)   On unspecified input ranges these can drop to ~6/~10.
RSQRTSS: 3-6/1-4       (approximate 1/sqrt for singles. We can't do this...big fat sadface)

So what I intended to say about sqrt being "hard" is that, unlike for lots of other functions, reducing the range or precision to get a faster version is much more difficult.  About the only way you can succeed is if you know that it's very near a specific value (like when renormalizing).  That old SGI trick (popularized by id) should always be slower than simply using the hardware.
Offline Zom-B
« Reply #53 - Posted 2012-05-07 04:56:25 »

@Zom-B the original poster said he used a "lookup-table." Rainbow tables are lookup tables for hash codes.
Rainbow tables are clever structures that eliminate the need for memory-eating lookup tables (think in the range of several terabytes)

(@Riven do you have fast implementations for asin and acos?)
Would you use them?
Haha that's a good question, I don't see why I would ever use acos and asin in a performance critical application Grin
How about fractals / fractal art?  Wink (actually you'd need the highest possible precision here so the statement is void)

Fast/slow: it's all relative.  To put some numbers on it, here are instruction latency/throughput figures (both in cycles) for modern(ish) Intel.  The ranges cover a reasonable set of hardware.

SQRTSD: 20-39/14-39  (one compliant double sqrt)  On unspecified input ranges these can drop to ~6/~10.
SQRTSS: 23-32/23-32  (one compliant single sqrt)   On unspecified input ranges these can drop to ~6/~10.
RSQRTSS: 3-6/1-4       (approximate 1/sqrt for singles. We can't do this...big fat sadface)

So what I intended to say about sqrt being "hard" is that, unlike for lots of other functions, reducing the range or precision to get a faster version is much more difficult.  About the only way you can succeed is if you know that it's very near a specific value (like when renormalizing).  That old SGI trick (popularized by id) should always be slower than simply using the hardware.
I sometimes use "float d = 1f / (float)Math.sqrt(dx * dx + dy * dy);" (more often double actually), so hopefully the compiler uses RSQRTSS there.
Offline Zom-B
« Reply #54 - Posted 2012-05-07 11:59:36 »

Fast/slow: it's all relative.  To put some numbers on it, here are instruction latency/throughput figures (both in cycles) for modern(ish) Intel.  The ranges cover a reasonable set of hardware.

SQRTSD: 20-39/14-39  (one compliant double sqrt)  On unspecified input ranges these can drop to ~6/~10.
SQRTSS: 23-32/23-32  (one compliant single sqrt)   On unspecified input ranges these can drop to ~6/~10.
RSQRTSS: 3-6/1-4       (approximate 1/sqrt for singles. We can't do this...big fat sadface)

So what I intended to say about sqrt being "hard" is that, unlike for lots of other functions, reducing the range or precision to get a faster version is much more difficult.  About the only way you can succeed is if you know that it's very near a specific value (like when renormalizing).  That old SGI trick (popularized by id) should always be slower than simply using the hardware.

After some googling I found out that these are MMX instructions, and Java can't use those AFAIK. Normal math happens using FPU instructions. I copied this from the "Intel Architecture Optimization Reference Manual":

"Additionally, while the multi-cycle floating-point instructions, fdiv and fsqrt, execute in the floating-point unit pipe, integer instructions can be executed in parallel."

So it pays off to do lots of non-FP math in inner loops that use Math.sqrt and division; separating (or keeping) those in different inner loops would be bad.

And from "How to optimize for the Pentium family of microprocessors" I read that FSQRT takes 70 cycles and cannot overlap integer multiplication instructions (they will block the pipeline). No idea about modern processors, however; they should be much more efficient at this, especially the new Core iX and Celeron Sandy Bridge CPUs.
Offline Roquen
« Reply #55 - Posted 2012-05-07 12:20:51 »

I'm pretty sure that HotSpot does not use RSQRTSS, but even if it does, it won't approach that speed, as it would have to perform some Newton steps to increase the accuracy (since in Java you can't relax things like this). It would need to do one for singles and two for doubles (don't quote me on this).

HotSpot does use SSE instructions for a fair amount of FP computation...it simply uses the SISD versions (*SS, *SD). All basic operations are performed in SSE.