Java-Gaming.org Hi !
Featured games (90)
games approved by the League of Dukes
Games in Showcase (744)
Games in Android Showcase (225)
games submitted by our members
Games in WIP (825)
games currently in development
News: Read the Java Gaming Resources, or peek at the official Java tutorials
 
    Home     Help   Search   Login   Register   
Pages: [1]
  ignore  |  Print  
  Fast Atan2 [Again]  (Read 2849 times)
0 Members and 1 Guest are viewing this topic.
Offline mooman219
« Posted 2017-08-30 10:58:29 »

Background:
Back at it with some Atan2 benchmarks. I found the source for my old benchmarks and wanted to run this again with some different settings. The purpose of this post is to find a faster way to do Math.atan2(y,x). To compare various methods, I setup a simple JMH benchmark and accuracy test. Criticisms and new atan2 submissions are appreciated.


Outcome:
Fastest implementation with lookup table: Icecore
Fastest implementation without lookup table: Diamond

Note: Fastest doesn't mean most accurate. Please take into account the average and largest error for each algorithm.


Relevant Data:
1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
A lower average means higher accuracy.
Apache       : Average Error 0.00000 / Largest Error 0.00000 : 36,012,001 samples : 1488.165 ±   7.285  ms/op
Default      : Average Error 0.00000 / Largest Error 0.00000 : 36,012,001 samples : 1773.411 ± 193.228  ms/op
Diamond      : Average Error 0.04318 / Largest Error 0.07112 : 36,012,001 samples :   47.299 ±   1.467  ms/op
DSPAccurate  : Average Error 0.00344 / Largest Error 0.01015 : 36,012,001 samples :   72.539 ±   3.702  ms/op
DSPFast      : Average Error 0.04318 / Largest Error 0.07111 : 36,012,001 samples :   56.547 ±   0.662  ms/op
Icecore      : Average Error 0.00038 / Largest Error 0.00098 : 36,012,001 samples :   58.265 ±   0.686  ms/op
Kappa        : Average Error 0.00224 / Largest Error 0.00468 : 36,012,001 samples :   67.622 ±  21.004  ms/op
Riven        : Average Error 0.00291 / Largest Error 0.00787 : 36,012,001 samples :   99.701 ±   2.846  ms/op
Baseline     :                                               : 36,012,001 samples :   36.155 ±   4.251  ms/op


Note: Keep in mind these are micro benchmarks and not indicative of real world performance; the JMH can only do so much to make sure the JIT doesn't interfere. There may be a fair amount of branch prediction and caching that's going on, and lookup tables might just be sitting in a cpu cache.


Source:
1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
26  
27  
28  
29  
30  
31  
32  
33  
34  
35  
36  
37  
38  
39  
40  
41  
42  
43  
44  
45  
46  
47  
48  
49  
50  
51  
52  
53  
54  
55  
56  
57  
58  
59  
60  
61  
62  
63  
64  
65  
66  
67  
68  
69  
70  
71  
72  
73  
74  
75  
76  
77  
78  
79  
80  
81  
82  
83  
84  
85  
86  
87  
88  
89  
90  
91  
92  
93  
94  
95  
96  
97  
98  
99  
100  
101  
102  
103  
104  
105  
106  
107  
108  
109  
110  
111  
112  
113  
114  
115  
116  
117  
118  
119  
120  
121  
122  
123  
124  
125  
126  
127  
128  
129  
130  
131  
132  
133  
134  
135  
136  
137  
138  
139  
140  
141  
142  
143  
144  
145  
146  
147  
148  
149  
150  
151  
152  
153  
154  
155  
156  
157  
158  
159  
160  
161  
162  
163  
164  
165  
166  
167  
168  
169  
170  
171  
172  
173  
174  
175  
176  
177  
178  
179  
180  
181  
182  
183  
184  
185  
186  
187  
188  
189  
190  
191  
192  
193  
194  
195  
196  
197  
198  
199  
200  
201  
202  
203  
204  
205  
206  
207  
208  
209  
210  
211  
212  
213  
214  
215  
216  
217  
218  
219  
220  
221  
222  
223  
224  
225  
226  
227  
228  
229  
230  
231  
232  
233  
234  
235  
236  
237  
238  
239  
240  
241  
242  
243  
244  
245  
246  
247  
248  
249  
250  
251  
252  
253  
254  
255  
256  
257  
258  
259  
260  
261  
262  
263  
264  
265  
266  
267  
268  
269  
270  
271  
272  
273  
274  
275  
276  
277  
278  
279  
280  
281  
282  
283  
284  
285  
286  
287  
288  
289  
290  
291  
292  
293  
294  
295  
296  
297  
298  
299  
300  
301  
302  
303  
304  
305  
306  
307  
308  
309  
310  
311  
312  
313  
314  
315  
316  
317  
318  
319  
320  
321  
322  
323  
324  
325  
326  
327  
328  
329  
330  
331  
332  
333  
334  
335  
336  
337  
338  
339  
340  
341  
342  
343  
344  
345  
346  
347  
348  
349  
350  
351  
352  
353  
354  
355  
356  
357  
358  
359  
360  
361  
362  
363  
364  
365  
366  
367  
368  
369  
370  
371  
372  
373  
374  
375  
376  
377  
378  
379  
380  
381  
382  
383  
384  
385  
386  
387  
388  
389  
390  
391  
392  
393  
394  
395  
396  
397  
398  
399  
400  
401  
402  
403  
404  
405  
406  
407  
408  
409  
410  
411  
412  
413  
414  
415  
416  
417  
418  
419  
420  
421  
422  
423  
424  
425  
426  
427  
428  
429  
430  
431  
432  
433  
434  
435  
436  
437  
438  
439  
440  
441  
442  
443  
444  
445  
446  
447  
448  
449  
450  
451  
452  
453  
454  
455  
456  
457  
458  
459  
460  
461  
462  
463  
464  
465  
466  
467  
468  
469  
470  
471  
472  
473  
474  
475  
476  
477  
478  
479  
480  
481  
482  
483  
484  
485  
486  
487  
public class Atan2 {

    ///////////////////////////////////////
    // Benchmark Settings
    ///////////////////////////////////////

    private static final double LOW_D = -3;
    private static final double HIGH_D = 3;
    private static final double INC_D = 0.001D;
    private static final float LOW_F = (float) LOW_D;
    private static final float HIGH_F = (float) HIGH_D;
    private static final float INC_F = (float) INC_D;

    public static Options buildOptions() {
        return new OptionsBuilder()
                .include(Atan2.class.getSimpleName())
                .warmupTime(new TimeValue(2, TimeUnit.SECONDS))
                .warmupIterations(4)
                .measurementTime(new TimeValue(2, TimeUnit.SECONDS))
                .measurementIterations(4)
                .forks(1)
                .mode(Mode.AverageTime)
                .timeUnit(TimeUnit.MILLISECONDS)
                .threads(1)
                .build();
    }

    ///////////////////////////////////////
    // Accuracy
    ///////////////////////////////////////

    public static void printError() {
        Atan2Benchmark[] benchmarks = new Atan2Benchmark[]{
            new Apache(),
            new Default(),
            new Diamond(),
            new DSPAccurate(),
            new DSPFast(),
            new Icecore(),
            new Kappa(),
            new Riven(),};

        System.out.println(String.format("A lower average means higher accuracy."));
        for (Atan2Benchmark benchmark : benchmarks) {
            benchmark.printError(LOW_D, HIGH_D, INC_D);
        }
    }

    ///////////////////////////////////////
    // Baseline
    ///////////////////////////////////////

    @Benchmark
    public double baseline() {
        double sum = 0;
        for (double x = LOW_D; x < HIGH_D; x += INC_D) {
            for (double y = LOW_D; y < HIGH_D; y += INC_D) {
                sum += x + y;
            }
        }
        return sum;
    }

    ///////////////////////////////////////
    // Default atan2
    ///////////////////////////////////////

    public static final class Default extends Atan2Benchmark {

        public Default() {
            super("Default");
        }

        @Override
        public float test(float x, float y) {
            return (float) Math.atan2(x, y);
        }
    }

    @Benchmark
    public double atan2_default() {
        double sum = 0;
        for (double x = LOW_D; x < HIGH_D; x += INC_D) {
            for (double y = LOW_D; y < HIGH_D; y += INC_D) {
                sum += Math.atan2(x, y);
            }
        }
        return sum;
    }

    ///////////////////////////////////////
    // Apache's FastMath.atan2 ( http://commons.apache.org/proper/commons-math/javadocs/api-3.3/org/apache/commons/math3/util/FastMath.html )
    ///////////////////////////////////////

    public static final class Apache extends Atan2Benchmark {

        public Apache() {
            super("Apache");
        }

        @Override
        public float test(float x, float y) {
            return (float) FastMath.atan2(x, y);
        }
    }

    @Benchmark
    public double atan2_apache() {
        double sum = 0;
        for (double x = LOW_D; x < HIGH_D; x += INC_D) {
            for (double y = LOW_D; y < HIGH_D; y += INC_D) {
                sum += FastMath.atan2(x, y);
            }
        }
        return sum;
    }

    ///////////////////////////////////////
    // Riven's atan2 ( http://www.java-gaming.org/index.php?topic=14647.0 )
    ///////////////////////////////////////

    public static final class Riven extends Atan2Benchmark {

        public Riven() {
            super("Riven");
        }

        @Override
        public float test(float x, float y) {
            return Riven.atan2(x, y);
        }

        private static final int ATAN2_BITS = 7;

        private static final int ATAN2_BITS2 = ATAN2_BITS << 1;
        private static final int ATAN2_MASK = ~(-1 << ATAN2_BITS2);
        private static final int ATAN2_COUNT = ATAN2_MASK + 1;
        private static final int ATAN2_DIM = (int) Math.sqrt(ATAN2_COUNT);

        private static final float INV_ATAN2_DIM_MINUS_1 = 1.0f / (ATAN2_DIM - 1);

        private static final float[] atan2 = new float[ATAN2_COUNT];

        static {
            for (int i = 0; i < ATAN2_DIM; i++) {
                for (int j = 0; j < ATAN2_DIM; j++) {
                    float x0 = (float) i / ATAN2_DIM;
                    float y0 = (float) j / ATAN2_DIM;
                    atan2[j * ATAN2_DIM + i] = (float) Math.atan2(y0, x0);
                }
            }
        }

        public static final float atan2(float y, float x) {
            float add, mul;

            if (x < 0.0f) {
                if (y < 0.0f) {
                    x = -x;
                    y = -y;

                    mul = 1.0f;
                } else {
                    x = -x;
                    mul = -1.0f;
                }
                add = -3.141592653f;
            } else {
                if (y < 0.0f) {
                    y = -y;
                    mul = -1.0f;
                } else {
                    mul = 1.0f;
                }
                add = 0.0f;
            }

            float invDiv = 1.0f / (((x < y) ? y : x) * INV_ATAN2_DIM_MINUS_1);

            int xi = (int) (x * invDiv);
            int yi = (int) (y * invDiv);

            return (atan2[yi * ATAN2_DIM + xi] + add) * mul;
        }
    }

    @Benchmark
    public float atan2_riven() {
        float sum = 0;
        for (float x = LOW_F; x < HIGH_F; x += INC_F) {
            for (float y = LOW_F; y < HIGH_F; y += INC_F) {
                sum += Riven.atan2(x, y);
            }
        }
        return sum;
    }

    ///////////////////////////////////////
    // DSP's atan2 ( http://dspguru.com/dsp/tricks/fixed-point-atan2-with-self-normalization )
    ///////////////////////////////////////

    public static final class DSPFast extends Atan2Benchmark {

        public DSPFast() {
            super("DSPFast");
        }

        @Override
        public float test(float x, float y) {
            return DSPFast.atan2(x, y);
        }

        private static final float PI_2 = 1.5707963F; // Math.PI / 2
        private static final float PI_4 = 0.7853982F; // Math.PI / 4
        private static final float PI_3_4 = 2.3561945F; // (Math.PI / 4) * 3
        private static final float MINUS_PI_2 = -1.5707963F; // Math.PI / -2

        public static final float atan2(float y, float x) {
            float r;
            float abs_y = Float.intBitsToFloat(Float.floatToRawIntBits(y) << 1 >>> 1);
            if (x == 0.0F) {
                if (y > 0.0F) {
                    return PI_2;
                }
                if (y == 0.0F) {
                    return 0.0f;
                }
                return MINUS_PI_2;
            } else if (x > 0) {
                r = (x - abs_y) / (x + abs_y);
                r = PI_4 - PI_4 * r;
            } else {
                r = (x + abs_y) / (abs_y - x);
                r = PI_3_4 - PI_4 * r;
            }
            return y < 0 ? -r : r;
        }
    }

    public static final class DSPAccurate extends Atan2Benchmark {

        public DSPAccurate() {
            super("DSPAccurate");
        }

        @Override
        public float test(float x, float y) {
            return DSPAccurate.atan2(x, y);
        }

        private static final float PI_2 = 1.5707963F; // Math.PI / 2
        private static final float PI_4 = 0.7853982F; // Math.PI / 4
        private static final float PI_3_4 = 2.3561945F; // (Math.PI / 4) * 3
        private static final float MINUS_PI_2 = -1.5707963F; // Math.PI / -2

        public static final float atan2(float y, float x) {
            float r;
            float c;
            float abs_y = Float.intBitsToFloat(Float.floatToRawIntBits(y) << 1 >>> 1);
            if (x == 0.0F) {
                if (y > 0.0F) {
                    return PI_2;
                }
                if (y == 0.0F) {
                    return 0.0f;
                }
                return MINUS_PI_2;
            } else if (x > 0) {
                r = (x - abs_y) / (x + abs_y);
                c = PI_4;
            } else {
                r = (x + abs_y) / (abs_y - x);
                c = PI_3_4;
            }
            r = 0.1963F * r * r * r - 0.9817F * r + c;
            return y < 0 ? -r : r;
        }
    }

    @Benchmark
    public float atan2_dsp_fast() {
        float sum = 0;
        for (float x = LOW_F; x < HIGH_F; x += INC_F) {
            for (float y = LOW_F; y < HIGH_F; y += INC_F) {
                sum += DSPFast.atan2(x, y);
            }
        }
        return sum;
    }

    @Benchmark
    public float atan2_dsp_accurate() {
        float sum = 0;
        for (float x = LOW_F; x < HIGH_F; x += INC_F) {
            for (float y = LOW_F; y < HIGH_F; y += INC_F) {
                sum += DSPAccurate.atan2(x, y);
            }
        }
        return sum;
    }

    ///////////////////////////////////////
    // kappa's atan2 ( http://www.java-gaming.org/topics/extremely-fast-atan2/36467/msg/346112/view.html#msg346112 )
    ///////////////////////////////////////

    public static final class Kappa extends Atan2Benchmark {

        public Kappa() {
            super("Kappa");
        }

        @Override
        public float test(float x, float y) {
            return Kappa.atan2(x, y);
        }

        private static final float PI = 3.1415927f;
        private static final float PI_2 = PI / 2f;
        private static final float MINUS_PI_2 = -PI_2;
        private static final float C_M = 0.2808722f;
        private static final float C_A = 0.2808722f;

        public static final float atan2(float y, float x) {
            if (x == 0.0f) {
                if (y > 0.0f) {
                    return PI_2;
                }
                if (y == 0.0f) {
                    return 0.0f;
                }
                return MINUS_PI_2;
            }

            final float atan;
            final float z = y / x;
            if (Math.abs(z) < 1.0f) {
                atan = z / (z * z * C_M + 1.0f);
                if (x < 0.0f) {
                    return (y < 0.0f) ? atan - PI : atan + PI;
                }
                return atan;
            } else {
                atan = PI_2 - z / (z * z + C_A);
                return (y < 0.0f) ? atan - PI : atan;
            }
        }
    }

    @Benchmark
    public float atan2_kappa() {
        float sum = 0;
        for (float x = LOW_F; x < HIGH_F; x += INC_F) {
            for (float y = LOW_F; y < HIGH_F; y += INC_F) {
                sum += Kappa.atan2(x, y);
            }
        }
        return sum;
    }

    ///////////////////////////////////////
    // Icecore's atan2 ( http://www.java-gaming.org/topics/extremely-fast-atan2/36467/msg/346145/view.html#msg346145 )
    ///////////////////////////////////////

    public static final class Icecore extends Atan2Benchmark {

        public Icecore() {
            super("Icecore");
        }

        @Override
        public float test(float x, float y) {
            return Icecore.atan2(x, y);
        }

        private static final float PI = (float) Math.PI;
        private static final float PI_2 = PI / 2;
        private static final float PI_NEG_2 = -PI_2;
        private static final int SIZE = 1024;
        private static final float ATAN2[];

        static {
            final int Size_Ar = SIZE + 1;
            ATAN2 = new float[Size_Ar];
            for (int i = 0; i <= SIZE; i++) {
                double d = (double) i / SIZE;
                double x = 1;
                double y = x * d;
                ATAN2[i] = (float) Math.atan2(y, x);
            }
        }

        public static final float atan2(float y, float x) {
            if (y < 0) {
                if (x < 0) {
                    // (y < x) because == (-y > -x)
                    if (y < x) {
                        return PI_NEG_2 - ATAN2[(int) (x / y * SIZE)];
                    } else {
                        return ATAN2[(int) (y / x * SIZE)] - PI;
                    }
                } else {
                    y = -y;
                    if (y > x) {
                        return ATAN2[(int) (x / y * SIZE)] - PI_2;
                    } else {
                        return -ATAN2[(int) (y / x * SIZE)];
                    }
                }
            } else {
                if (x < 0) {
                    x = -x;
                    if (y > x) {
                        return PI_2 + ATAN2[(int) (x / y * SIZE)];
                    } else {
                        return PI - ATAN2[(int) (y / x * SIZE)];
                    }
                } else {
                    if (y > x) {
                        return PI_2 - ATAN2[(int) (x / y * SIZE)];
                    } else {
                        return ATAN2[(int) (y / x * SIZE)];
                    }
                }
            }
        }
    }

    @Benchmark
    public float atan2_icecore() {
        float sum = 0;
        for (float x = LOW_F; x < HIGH_F; x += INC_F) {
            for (float y = LOW_F; y < HIGH_F; y += INC_F) {
                sum += Icecore.atan2(x, y);
            }
        }
        return sum;
    }

    ///////////////////////////////////////
    // Diamond's atan2 ( https://stackoverflow.com/questions/1427422/cheap-algorithm-to-find-measure-of-angle-between-vectors/14675998#14675998 )
    ///////////////////////////////////////

    public static final class Diamond extends Atan2Benchmark {

        public Diamond() {
            super("Diamond");
        }

        @Override
        public float test(float x, float y) {
            return Diamond.atan2(x, y);
        }

        private static final float PI_2 = 1.5707963F; // Math.PI / 2

        public static final float atan2(float y, float x) {
            float angle;
            if (y == 0f && x >= 0f) {
                return 0;
            } else if (y >= 0f) {
                if (x >= 0f) {
                    angle = y / (x + y);
                } else {
                    angle = 1f - x / (-x + y);
                }
            } else {
                if (x < 0f) {
                    angle = -2f + y / (x + y);
                } else {
                    angle = -1f + x / (x - y);
                }
            }
            return angle * PI_2;
        }
    }

    @Benchmark
    public float atan2_diamond() {
        float sum = 0;
        for (float x = LOW_F; x < HIGH_F; x += INC_F) {
            for (float y = LOW_F; y < HIGH_F; y += INC_F) {
                sum += Diamond.atan2(x, y);
            }
        }
        return sum;
    }
}



Updates:
1: Added baseline benchmark. Cleaned up DSPFast/DSPAccurate implementation to look more sane with no performance loss.
2: Improved Kappa's accuracy with no performance loss. Updated Icecore's submission. Retested all algorithms.
3: Added the Diamond implementation.
Offline Icecore
« Reply #1 - Posted 2017-08-30 18:07:29 »

I also have this - it's more Cache friendly, but have more OPs ^^
1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
26  
27  
28  
29  
30  
31  
32  
33  
34  
35  
36  
37  
38  
39  
40  
41  
42  
private static final float Pi = (float) Math.PI;
private static final float Pi_H = Pi / 2;
//4 x 4000 = 16 kb L1 Cache
private static final int Size_Ac = 4000;
private static final float Atan2[];

static{
   final int Size_Ar = Size_Ac + 1;
   Atan2 = new float[Size_Ar];
   for(int i = 0; i <= Size_Ac; i++){
      double d = (double) i / Size_Ac;
      double x = 1;
      double y = x * d;
      Atan2[i] = (float) Math.atan2(y, x);
   }
}

static public final float atan2_1(float y, float x){
   if(y < 0){
      if(x < 0){
         // (y < x) because == (-y > -x)
         if(y < x) return -(Pi_H + Atan2[(int) (x / y * Size_Ac)]);
         else return Atan2[(int) (y / x * Size_Ac)] - Pi;
      }
      else{
         y = -y;
         if(y > x) return Atan2[(int) (x / y * Size_Ac)] - Pi_H;
         else return -Atan2[(int) (y / x * Size_Ac)];
      }
   }
   else{
      if(x < 0){
         x = -x;
         if(y > x) return Pi_H + Atan2[(int) (x / y * Size_Ac)];
         else return Pi - Atan2[(int) (y / x * Size_Ac)];
      }
      else{
         if(y > x) return Pi_H - Atan2[(int) (x / y * Size_Ac)];
         else return Atan2[(int) (y / x * Size_Ac)];
      }
   }
}

Last known State: Reassembled in Cyberspace
End Transmission....
..
.
Journey began Now)
Offline mooman219
« Reply #2 - Posted 2017-08-30 19:11:49 »

@Icecore
It appears to be slower.

1  
2  
3  
4  
Benchmark                  Mode  Cnt   Score   Error  Units
ATAN2.atan2_icecore        avgt    4  53.079 ± 0.157  ms/op
ATAN2.atan2_icecore_cache  avgt    4  58.945 ± 2.317  ms/op
ATAN2.baseline             avgt    4  34.576 ± 0.274  ms/op


Source: https://gist.github.com/mooman219/7a881b2d043354d44036e0df35d1e0b5
Games published by our own members! Check 'em out!
Legends of Yore - The Casual Retro Roguelike
Offline Icecore
« Reply #3 - Posted 2017-09-01 09:28:17 »

It have more accuracy because use bigger array and use less memory because its 1 array – not 8 )
(1000 x 8 x 4(f) = 32kb   4000 x 4(f) = 16 kb)
More array size more accuracy ^^

also may try this - less on one OP =)
1  
2  
3  
++      private static final float Pi_H_M = -Pi_H;
--      if(y < x) return -(Pi_H + Atan2[(int) (x / y * Size_Ac)]);
++      if(y < x) return Pi_H_M - Atan2[(int) (x / y * Size_Ac)];

Last known State: Reassembled in Cyberspace
End Transmission....
..
.
Journey began Now)
Offline mooman219
« Reply #4 - Posted 2017-09-09 00:49:33 »

I added your changes and updated the benchmark @icecore. I also improved the accuracy of Kappa's atan2. The 0.28 constant seemed suspicious, so I played around with it, getting a lower average error and lower largest error with the constant 0.2808722 instead.
Offline mooman219
« Reply #5 - Posted 2017-09-09 07:54:20 »

Had to fix some issues, but if you're willing to deal with an average error of 2.4 degrees, and a max error of 4.07 degrees, then boy do I have the algorithm for you. Adapted from "encoding 2d angles without trigonometry", I bring you the diamond method. It's the fastest by far.

1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
private static final float PI_2 = 1.5707963F; // Math.PI / 2

public static final float atan2(float y, float x) {
   float angle;
   if (x == 0f && y == 0f) {
      return 0;
   }
   if (y >= 0f) {
      if (x >= 0f) {
         angle = y / (x + y);
      } else {
         angle = 1f - x / (-x + y);
      }
   } else {
      if (x < 0f) {
         angle = -y / (-x - y) - 2f;
      } else {
         angle = x / (x - y) - 1f;
      }
   }
   return angle * PI_2;
}
Offline Roquen

JGO Kernel


Medals: 516



« Reply #6 - Posted 2017-09-09 11:38:04 »

Broken record time.  Any benchmark that doesn't reflect an actual usage pattern is wishful thinking.  If you're linearly walking through angles you can compute with tiny errors in a couple of products and adds (I'm tired and was thinking forward here, opps)..no lookups no branches.
Offline mooman219
« Reply #7 - Posted 2017-09-09 20:49:23 »

Broken record time.  Any benchmark that doesn't reflect an actual usage pattern is wishful thinking.  If you're linearly walking through angles you can compute with tiny errors in a couple of products and adds (I'm tired and was thinking forward here, opps)..no lookups no branches.

Completely fair and expected. I first made the post with the disclaimer below the results that this is a microbenchmark and has caveats. A LUT could improve the performance of your algorithm, but create some cache pollution that slows down other aspects of your program. LUTs aside, this post isn't a comprehensive search for the best Atan2, it takes a couple and tests some metrics on them. If you're writing a game, this will let you swap in various algorithms and you can see how they perform. It'll hopefully give you some reassurance that you're using a reasonable algorithm when compared to others.
Offline Roquen

JGO Kernel


Medals: 516



« Reply #8 - Posted 2017-09-11 09:00:58 »

My expectation is that the branches will be the much bigger problem in most cases.  If you're making enough atan/atan2 calls to be statistically important then it's pretty likely they will bunch up in time and the cost of loading the LUT to cache won't be a killer.  For the branches it's very unlikely that the probability of each won't be in the neighborhood of .5  so ~n/2 branch mispredictions per call where 'n' is the number of branches along a path.

EDIT: The way the benchmark is structured means the branch will statistically always be correctly predicted.
Offline Icecore
« Reply #9 - Posted 2017-09-12 10:04:53 »

Let make next Contest "Fast Sqrt" Smiley
one from the Baselines https://en.wikipedia.org/wiki/Fast_inverse_square_root
1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
float Q_rsqrt( float number )
{
   long i;
   float x2, y;
   const float threehalfs = 1.5F;

   x2 = number * 0.5F;
   y  = number;
   i  = * ( long * ) &y;                       // evil floating point bit level hacking
   i  = 0x5f3759df - ( i >> 1 );               // what the f**k?
   y  = * ( float * ) &i;
   y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//   y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

   return y;
}

Last known State: Reassembled in Cyberspace
End Transmission....
..
.
Journey began Now)
Games published by our own members! Check 'em out!
Legends of Yore - The Casual Retro Roguelike
Offline kappa
« League of Dukes »

JGO Kernel


Medals: 119
Projects: 15


★★★★★


« Reply #10 - Posted 2017-09-12 10:52:41 »

Let make next Contest "Fast Sqrt" Smiley
From my tests in the past, Java's Math.sqrt() is pretty damn fast these days, only worth it if you can beat that method substantially.
Offline Abuse

JGO Ninja


Medals: 60


falling into the abyss of reality


« Reply #11 - Posted 2017-09-12 13:03:08 »

Let make next Contest "Fast Sqrt" Smiley
From my tests in the past, Java's Math.sqrt() is pretty damn fast these days, only worth it if you can beat that method substantially.

My understanding is that the above approximation is no-longer relevant 'cos we have x86 instructional support (in various forms).
Offline Icecore
« Reply #12 - Posted 2017-09-12 16:44:04 »

Quote
From my tests in the past, Java's Math.sqrt() is pretty damn fast these days
...
we have x86 instructional support
Yes, ok, my mistake - why i even trying...
– no one cares)

Last known State: Reassembled in Cyberspace
End Transmission....
..
.
Journey began Now)
Offline SiestaGuru

Senior Newbie


Medals: 2



« Reply #13 - Posted 2017-12-06 11:01:17 »

These are the fastest I could find, they're slight modifications of algos made by others and found on this website.
Basically I removed a couple of tiny unnecessary calculations, like unneeded assignments and taking a negative of some value when it could just as easily be restructured to not need to do that.

With lookup table, change the SIZE for higher precision if desired.
At the current Size of 2520, the average error is ~0.0000819   max error ~0.000489
Dies at x=0,y=0 without an extra check

1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
26  
27  
28  
29  
30  
31  
32  
33  
34  
35  
36  
37  
38  
39  
40  
41  
42  
43  
44  
45  
46  
47  
48  
49  
50  
51  
52  
53  
54  
55  
56  
57  
58  
59  
60  
61  
62  
63  
64  
65  
66  
67  
68  
69  
   private static final int  SIZE = 2520; 

  //Based on a method by Zom-B
      public static final float atan2(float y, float x){
        if (x >= 0)
        {
            if (y >= 0)
            {
                if (x >= y)
                     //Add a check for x=0,y=0 here if needed
                    return ATAN2_TABLE_PPY[(int)(SIZE * y / x)];
                else
                    return ATAN2_TABLE_PPX[(int)(SIZE * x / y)];
            }
            else
            {
                if (x >= -y)
                    return ATAN2_TABLE_PNY[(int)(EZIS * y / x)];
                else
                    return ATAN2_TABLE_PNX[(int)(EZIS * x / y)];
            }
        }
        else if (y >= 0)
        {
            if (-x >= y)
                return ATAN2_TABLE_NPY[(int)(EZIS * y / x)];
            else
                return ATAN2_TABLE_NPX[(int)(EZIS * x / y)];
        }
        else
        {
            if (x <= y)
                return ATAN2_TABLE_NNY[(int)(SIZE * y / x)];
            else
                return ATAN2_TABLE_NNX[(int)(SIZE * x / y)];
        }
    }

private static final float PI = (float)Math.PI;


//If the static initiation is too much, you can also pre-calculate arrays and just hardcode the values in.

private static final float        STRETCH            = PI;
private static final float        NEGATIVE_STRETCH            = -PI;
private static final int        EZIS            = -SIZE;
private static final float[]    ATAN2_TABLE_PPY    = new float[SIZE + 1];
private static final float[]    ATAN2_TABLE_PPX    = new float[SIZE + 1];
private static final float[]    ATAN2_TABLE_PNY    = new float[SIZE + 1];
private static final float[]    ATAN2_TABLE_PNX    = new float[SIZE + 1];
private static final float[]    ATAN2_TABLE_NPY    = new float[SIZE + 1];
private static final float[]    ATAN2_TABLE_NPX    = new float[SIZE + 1];
private static final float[]    ATAN2_TABLE_NNY    = new float[SIZE + 1];
private static final float[]    ATAN2_TABLE_NNX    = new float[SIZE + 1];
static
{
    for (int i = 0; i <= SIZE; i++)
    {
        float f = (((float)i) + 0.5f) / SIZE;  //the + 0.5f is to account for the fact that the int cast in the atan2 function rounds down
        ATAN2_TABLE_PPY[i] = (float)(StrictMath.atan(f) * STRETCH / PI);
        ATAN2_TABLE_PPX[i] = STRETCH * 0.5f - ATAN2_TABLE_PPY[i];
        ATAN2_TABLE_PNY[i] = -ATAN2_TABLE_PPY[i];
        ATAN2_TABLE_PNX[i] = ATAN2_TABLE_PPY[i] + NEGATIVE_STRETCH * 0.5f;
        ATAN2_TABLE_NPY[i] = STRETCH - ATAN2_TABLE_PPY[i];
        ATAN2_TABLE_NPX[i] = ATAN2_TABLE_PPY[i] + STRETCH * 0.5f;
        ATAN2_TABLE_NNY[i] = ATAN2_TABLE_PPY[i] - STRETCH;
        ATAN2_TABLE_NNX[i] = NEGATIVE_STRETCH * 0.5f - ATAN2_TABLE_PPY[i];
    }
}


And without lookup table. This is faster, but rather inaccurate with an average error of ~0.021, max error for general usage ~0.071. However, if x and y are both 0 it'll die.

1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
private static final float PI = (float)Math.PI;
private static final float PIOVER4 = PI / 4f;
private static final float PITHREEQUARTERS = 3f * PIOVER4;

   //Based on https://dspguru.com/dsp/tricks/fixed-point-atan2-with-self-normalization/
    public static final float atan2(float y, float x) {
        if (y >=0) {
            if (x >= 0) {
                //Add a check for x=0,y=0 here if needed
                return PIOVER4 - PIOVER4 * ((x - y) / (x + y));
            } else {
                return PITHREEQUARTERS - PIOVER4 * ((x + y) / (y - x));
            }
        } else if (x >= 0) {
            return PIOVER4 * ((x + y) / (x - y)) - PIOVER4;
        } else {
            return PIOVER4 * ((y - x) / (y + x)) - PITHREEQUARTERS;
        }
    }

    
If you want to go really extreme, there may also be a tiny possible optimization which will be hardware dependent. On some architectures subtraction can take a tiny amount more than addition afaik. So instead of subtracting PIOVER4, it may be worth trying    + NEGATIVE_PIOVER4.   Though there may be a downside in that caching gets harder with 2 variables.
Offline Icecore
« Reply #14 - Posted 2017-12-06 21:40:49 »

1  
  //Based on a method by Zom-B

I f**** can’t believe….

How this even possible Its very weird
http://www.java-gaming.org/topics/13-8x-faster-atan2-updated/14647/view.html

because I develop same code after 5 years
(after many iterations)
 – I never see it or something similar
even
1  
2  
3  
4  
5  
6  
my
//(y < x) because == (-y > -x)
if (y < x)

Zom-B
if (x <= y) // (-x >= -y)

/////
if its was best solution why it was not included to test from start?
or maybe its fake (forum modify???) - not looks so

I even don’t know…

Or maybe its time Travelers XD
Zom-B – You cool, cheers, you’re the best =)
Maybe its even me? ^^ (if its really me - Zom-B : ha-ha I understand the humor behind it XDDDD)

p.s
its my last post
I leave this name and all previous (same as forum)
it was decision long before this accident (but this – show that I do all right)
Maybe we see again – or maybe not, who knows Wink

Last known State: Reassembled in Cyberspace
End Transmission....
..
.
Journey began Now)
Offline SiestaGuru

Senior Newbie


Medals: 2



« Reply #15 - Posted 2017-12-06 23:31:17 »

Haha, it does look quite similar. I was wondering if one of you had taken some bits off the other. I suppose it's just that doing a structure like that is sort of a natural conclusion to reach because it limits the amount of operations you need to do. Great minds think alike?
Mine ends up being a bit faster for my test case than yours, but I imagine it's possible that it may depend on the JIT compiler and architecture because while it saves on operations, it needs more arrays. I haven't tested whether Zom-B's version is faster, it might not be since it does an unnecessary +0.5 which could just be baked into the array generator.  I did notice that your version didn't account for the (int) cast rounding down, there's quite a bit of an average and max accuracy difference so it's probably worth doing.



Oh yeah for Math.Sqrt, it's almost unbeatable, but I found two approximation methods that are faster.
One is just a simple lookup table for the expected range of input values with a small check for anything above if you think the value can exceed the range.
Make sure to account for the int cast rounding down in the table generation again

1  
2  
3  
4  
5  
6  
7  
8  
    //Uses a lookup table with 3600 elements
    public static float sqrt(float x) {
        if(x >= 100f){
            return (float)Math.sqrt(x);
        }else{
            return sqrtTable[(int)(x*36f)];
        }
    }


The others are these abominations, which don't require tables and are slightly faster but have worse accuracy.
See http://h14s.p5r.org/2012/09/0x5f3759df.html for an explanation
For at least the inverted one you can add a step of Newton's method for more accuracy, but doing that makes the speed advantage disappear.

1  
2  
3  
4  
5  
6  
public static float sqrt(float x) {
        return Float.intBitsToFloat(0x1fbd1df5 + (Float.floatToIntBits(x) >> 1));
}
public static float invertedSqrt(float x) {
        return  Float.intBitsToFloat(0x5f3759df - (Float.floatToIntBits(x) >> 1));
}    


In both cases my benchmarks give me an about a tenfold speed increase compared to Math.sqrt(). However, Math.sqrt is comparatively so light, that the difference will usually seem way less because things like the benchmark-loop, method calls etc. are very significant. In my original benchmark I got only a 3-fold increase because I hadn't filtered out those factors.
So the tenfold increase isn't nearly as significant as it sounds, and in the vast majority of cases, Math.sqrt is more than enough.

Oh and btw,  don't use Math.hypot() to calculate a hypotenuse.  Math.sqrt(dx*dx + dy*dy) is waaay faster for whatever reason.

Edit: Corrections on the sqrt methods
Offline princec

« JGO Spiffy Duke »


Medals: 982
Projects: 3
Exp: 16 years


Eh? Who? What? ... Me?


« Reply #16 - Posted 2017-12-07 09:03:27 »

Math.hypot() is much more accurate (unnecessarily so for our purposes).

Cas Smiley

Offline SiestaGuru

Senior Newbie


Medals: 2



« Reply #17 - Posted 2017-12-07 12:32:58 »

Found another one worth considering. This one is based on the Icecore's method and uses only one table. The difference here is that by using signum, it avoids all but one if/else statement and so can do better at branch prediction with random data.
On some tests, this gets the highest speed of any of the lookup table methods, but does worse on others. Its best environment seems to be a lot of executions on random data. If there's a low amount of executions or the data is structured, other methods are better.
This happens to work fine for x=0,y=0


    
1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
26  
    public static final float atan2BranchpredictionHelper(float y, float x) {
        //Basically Math.signum(), except they're slightly faster and can't return 0
        float signX = Float.intBitsToFloat((Float.floatToIntBits(x) & DESTROY_ALL_BUT_SIGN) | SETTO1);
        float signY = Float.intBitsToFloat((Float.floatToIntBits(y) & DESTROY_ALL_BUT_SIGN) | SETTO1);

        if(Math.abs(y) >= Math.abs(x)){
            return signY*HALFPI - signX*signY*ATAN[(int) (Math.abs(x) / Math.abs(y) * SIZE)];
        }else{
            return (signX-1)*HALFPI_NEG*signY + signX*signY*ATAN[(int) (Math.abs(y) / Math.abs(x) * SIZE)];
        }
    }

    static {
        final int DESTROY_ALL_BUT_SIGN = 0b10000000000000000000000000000000;
        final int SETTO1 = Float.floatToIntBits(1f);
        final int  SIZE = 1024;
        final float HALFPI = (float)Math.PI / 2f;
        final float HALFPI_NEG = -HALFPI;
        final int Size_Ar = SIZE + 1;
        ATAN = new float[Size_Ar];
        for (int i = 0; i <= SIZE; i++) {
            ATAN[i] = (float) Math.atan2((((double)i)+0.5) / SIZE, 1);
        }
        ATAN[0] = 0;

    }




For reference, my nanosecs  benchmarks for 1 million and  100 million executions of different methods (includes the computations necessary for running through the loop)
The winner changes depending on the execution count, so I have no idea which one would win in a real-world scenario.

This new method seems terrible on low execution counts and is even worse than the default, but somehow becomes the winner at very high execution counts.
Icecore's takes over my 8-array method at some point (once caching kicks in properly?)
The arrayless version is always fast

1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
26  
27  
28  
29  
30  
31  
32  
33  
34  
35  
36  
10,000:
Math.atan2  1113710  #4                      111.4 ns / calc
Icecore  750496    #3                            75.0  ns / calc    
My 8-array lookup (based on Zom-Bs)  551749   #2       55.2     ns / calc    
This new method    2819284  #5                    281.9  ns / calc    
The array-less approximation   414997    #1        41.5    ns / calc
Dummy method (return x+y)   247613

1,000,000:
Math.atan2   87623524  #5                   87.6   ns / calc
Icecore   19311236  #3                      19.3   ns / calc
My 8-array lookup (based on Zom-Bs)  21043795  #4      21.0   ns / calc                
This new method   14324703  #2                 14.3   ns / calc    
The array-less approximation  13263142   #1         13.3   ns / calc  
Dummy method (return x+y) 6273097  

100,000,000:
Math.atan2  8471197796  #5                  84.7   ns / calc
Icecore  1446387298  #3                         14.5   ns / calc
My 8-array lookup (based on Zom-Bs)  1707782769  #4      17.1   ns / calc
This new method  932633308   #1                      9.3  ns / calc
The array-less approximation   1004009940  #2       10.0 ns/calc
Dummy method (return x+y)  138700292              

Errors on 1 million executions, using the same size array (1024) for all lookup tables. The only real difference among the lookup tables is correcting the int cast
Average:
This new method   0.000191705          #1/#2
Icecore   0.00038336442        #3
Arrayless  0.043115597          #4
Eight arrays  0.000191705             #1/#2

Max errors:
This new method  0.0004.8828125        #1/#2
Icecore  0.0009.765129          #3
Arrayless   0.07111478             #4
Eight arrays  0.0004.8828125          #1/#2









Offline Roquen

JGO Kernel


Medals: 516



« Reply #18 - Posted 2017-12-11 13:22:55 »

Yea: hypot avoids overflow & underflow and is significantly more accurate than sqrt of sum of squares...and like princec said it's very unlikely you care.

Those 'fast' square roots are good to about 11 or 12 bits and so are only useful for horseshoes and handgrenades kinda computations.  Take Haswell/Broadwell: sqrtss is 1 uOP to issue, completes in 11 cycles and is ulp correct.  (Yeah, it's too bad java doesn't expose ~sqrt and ~1/sqrt, well and a long list of other things).

Have I ever mentioned that microbenchmarks are only as meaningful as the testing scheme and how well they reflect an an usage?
Pages: [1]
  ignore  |  Print  
 
 

 
Ecumene (145 views)
2017-09-30 02:57:34

theagentd (213 views)
2017-09-26 18:23:31

cybrmynd (292 views)
2017-08-02 12:28:51

cybrmynd (284 views)
2017-08-02 12:19:43

cybrmynd (294 views)
2017-08-02 12:18:09

Sralse (287 views)
2017-07-25 17:13:48

Archive (966 views)
2017-04-27 17:45:51

buddyBro (1092 views)
2017-04-05 03:38:00

CopyableCougar4 (1663 views)
2017-03-24 15:39:42

theagentd (1425 views)
2017-03-24 15:32:08
Java Gaming Resources
by philfrei
2017-12-05 19:38:37

Java Gaming Resources
by philfrei
2017-12-05 19:37:39

Java Gaming Resources
by philfrei
2017-12-05 19:36:10

Java Gaming Resources
by philfrei
2017-12-05 19:33:10

List of Learning Resources
by elect
2017-03-13 14:05:44

List of Learning Resources
by elect
2017-03-13 14:04:45

SF/X Libraries
by philfrei
2017-03-02 08:45:19

SF/X Libraries
by philfrei
2017-03-02 08:44:05
java-gaming.org is not responsible for the content posted by its members, including references to external websites, and other references that may or may not have a relation with our primarily gaming and game production oriented community. inquiries and complaints can be sent via email to the info‑account of the company managing the website of java‑gaming.org
Powered by MySQL Powered by PHP Powered by SMF 1.1.18 | SMF © 2013, Simple Machines | Managed by Enhanced Four Valid XHTML 1.0! Valid CSS!