Java-Gaming.org    
  13.8x faster atan2 (updated)  (Read 16164 times)
Offline Riven
« League of Dukes »
JGO Overlord
Medals: 749 | Projects: 4 | Exp: 16 years
"Hand over your head."
« Posted 2006-08-11 12:04:21 »

Hi folks,

I've created an algorithm that allows atan2 to be calculated through a lookup-table.

All values of Y and X are allowed. You can trade RAM for accuracy by choosing the number of bits used in each dimension.


I used 7 bits per dimension - that means 2^7 (128) values for each dimension, so 128*128 = 16K entries in the lookup table.
This results in an accuracy of 0.0078125 (1.0 / 128) in each dimension after the coords are normalized.


Updated
   private static final int ATAN2_BITS = 7;

   private static final int ATAN2_BITS2 = ATAN2_BITS << 1;
   private static final int ATAN2_MASK = ~(-1 << ATAN2_BITS2);
   private static final int ATAN2_COUNT = ATAN2_MASK + 1;
   private static final int ATAN2_DIM = (int) Math.sqrt(ATAN2_COUNT);

   private static final float INV_ATAN2_DIM_MINUS_1 = 1.0f / (ATAN2_DIM - 1);
   private static final float DEG = 180.0f / (float) Math.PI;

   private static final float[] atan2 = new float[ATAN2_COUNT];



   static
   {
      for (int i = 0; i < ATAN2_DIM; i++)
      {
         for (int j = 0; j < ATAN2_DIM; j++)
         {
            float x0 = (float) i / ATAN2_DIM;
            float y0 = (float) j / ATAN2_DIM;

            atan2[j * ATAN2_DIM + i] = (float) Math.atan2(y0, x0);
         }
      }
   }


   /**
    * ATAN2
    */


   public static final float atan2Deg(float y, float x)
   {
      return FastMath.atan2(y, x) * DEG;
   }

   public static final float atan2DegStrict(float y, float x)
   {
      return (float) Math.atan2(y, x) * DEG;
   }

   public static final float atan2(float y, float x)
   {
      float add, mul;

      if (x < 0.0f)
      {
         if (y < 0.0f)
         {
            x = -x;
            y = -y;

            mul = 1.0f;
         }
         else
         {
            x = -x;
            mul = -1.0f;
         }

         add = -3.141592653f;
      }
      else
      {
         if (y < 0.0f)
         {
            y = -y;
            mul = -1.0f;
         }
         else
         {
            mul = 1.0f;
         }

         add = 0.0f;
      }

      float invDiv = 1.0f / (((x < y) ? y : x) * INV_ATAN2_DIM_MINUS_1);

      int xi = (int) (x * invDiv);
      int yi = (int) (y * invDiv);

      return (atan2[yi * ATAN2_DIM + xi] + add) * mul;
   }


The lengthy if/else-chain might seem inefficient, but it is really the fastest way to normalize X and Y.
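For anyone who wants to poke at the accuracy question: below is a trimmed, self-contained sketch of the same idea (fold into the first quadrant, normalize so the larger coordinate lands on the last table index, then undo the fold). The class and constant names are mine, not from the code above; the main method prints the worst-case error against Math.atan2 over a grid. Because atan2 only depends on the ratio y/x, building the table from the raw integer indices works just as well as normalized coordinates.

```java
public class LutAtan2Sketch {
    static final int DIM = 128; // plays the role of ATAN2_DIM
    static final float[] TABLE = new float[DIM * DIM];

    static {
        // atan2 is scale-invariant, so atan2(j, i) == atan2(j/DIM, i/DIM).
        for (int i = 0; i < DIM; i++)
            for (int j = 0; j < DIM; j++)
                TABLE[j * DIM + i] = (float) Math.atan2(j, i);
    }

    public static float atan2Lut(float y, float x) {
        // Fold into the first quadrant, remembering how to undo the fold.
        float add = x < 0 ? (float) -Math.PI : 0f;
        float mul = ((x < 0) ^ (y < 0)) ? -1f : 1f;
        x = Math.abs(x);
        y = Math.abs(y);
        // Normalize so the larger coordinate maps to the last table index.
        float scale = (DIM - 1) / Math.max(x, y);
        int xi = (int) (x * scale);
        int yi = (int) (y * scale);
        return (TABLE[yi * DIM + xi] + add) * mul;
    }

    public static void main(String[] args) {
        double maxErr = 0;
        for (float y = -10f; y <= 10f; y += 0.37f)
            for (float x = -10f; x <= 10f; x += 0.37f)
                maxErr = Math.max(maxErr, Math.abs(atan2Lut(y, x) - Math.atan2(y, x)));
        System.out.println("max abs error (radians) = " + maxErr);
    }
}
```

The worst-case error should come out on the order of a couple of table steps (roughly 1/128 radian), consistent with the accuracy figure quoted above.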


Benchmark:
      float min = -100;
      float max = +100;
      float step = 0.12f;

      for (int i = 0; i < 8; i++)
      {
         long t0A = System.nanoTime() / 1000000L;
         float sumA = 0.0f;
         for (float y = min; y < max; y += step)
            for (float x = min; x < max; x += step)
               sumA += FastMath.atan2(y, x);
         long t1A = System.nanoTime() / 1000000L;

         long t0B = System.nanoTime() / 1000000L;
         float sumB = 0.0f;
         for (float y = min; y < max; y += step)
            for (float x = min; x < max; x += step)
               sumB += Math.atan2(y, x);
         long t1B = System.nanoTime() / 1000000L;

         System.out.println();
         System.out.println("FastMath: " + (t1A - t0A) + "ms, sum=" + sumA);
         System.out.println("JavaMath: " + (t1B - t0B) + "ms, sum=" + sumB);
         System.out.println("factor: " + ((float) (t1B - t0B) / (t1A - t0A)));
      }



Results:
...

FastMath: 84ms, sum=-2705.0369
JavaMath: 1161ms, sum=-2690.9863
factor: 13.821428

FastMath: 84ms, sum=-2705.0369
JavaMath: 1164ms, sum=-2690.9863
factor: 13.857142


Have fun! (or beat me!)

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline Kova
Senior Member
« Reply #1 - Posted 2006-08-11 14:01:07 »

great work Riven... how about you do the same for other trigonometry ops?
How about a quick accuracy comparison for your 7-bit example: at which digit does it start to differ (become inaccurate) from Java's atan2? I don't really understand how it works yet (I've only glanced at it), so I can't follow your accuracy explanation.
Offline Riven
« Reply #2 - Posted 2006-08-11 16:44:02 »

great work Riven... how about you do the same for other trigonometry ops?

I think these are pretty common... anyway:

   private static final float RAD = (float) Math.PI / 180.0F;

   private static final int SIN_BITS = 12;
   private static final int SIN_MASK = ~(-1 << SIN_BITS);
   private static final int SIN_COUNT = SIN_MASK + 1;

   private static final float radFull = (float) (Math.PI * 2.0);
   private static final float degFull = (float) (360.0);
   private static final float radToIndex = SIN_COUNT / radFull;
   private static final float degToIndex = SIN_COUNT / degFull;

   private static final float[] sin = new float[SIN_COUNT];
   private static final float[] cos = new float[SIN_COUNT];

   static
   {
      for (int i = 0; i < SIN_COUNT; i++)
      {
         sin[i] = (float) Math.sin((float) (i + 0.5f) / SIN_COUNT * radFull);
         cos[i] = (float) Math.cos((float) (i + 0.5f) / SIN_COUNT * radFull);
      }
   }



   /**
    * SIN / COS (RAD)
    */


   public static final float sin(float rad)
   {
      return sin[(int) (rad * radToIndex) & SIN_MASK];
   }

   public static final float cos(float rad)
   {
      return cos[(int) (rad * radToIndex) & SIN_MASK];
   }



   /**
    * SIN / COS (DEG)
    */


   public static final float sinDeg(float deg)
   {
      return sin[(int) (deg * degToIndex) & SIN_MASK];
   }

   public static final float cosDeg(float deg)
   {
      return cos[(int) (deg * degToIndex) & SIN_MASK];
   }



   /**
    * SIN / COS (DEG - STRICT)
    */


   public static final float sinDegStrict(float deg)
   {
      return (float) Math.sin(deg * RAD);
   }

   public static final float cosDegStrict(float deg)
   {
      return (float) Math.cos(deg * RAD);
   }


My fairly solid benchmarks say it is about 60x faster than the Math operations, but I expect that to be some weird (read: unrealistic) optimisation in the JIT, although I tried to make that impossible.

It's fair to assume a factor 10 speedup though.
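As a sanity check on the sin table's accuracy: here is a stripped-down, self-contained version of the same centred-sample trick (names are mine, not from the post). With 12 bits the buckets are 2π/4096 wide, and sampling each bucket at its centre (i + 0.5) halves the worst-case error, so the lookup should stay well under 0.01 of the true value.

```java
public class SinLutSketch {
    static final int BITS = 12;
    static final int MASK = (1 << BITS) - 1;
    static final int COUNT = MASK + 1;
    static final float RAD_FULL = (float) (Math.PI * 2.0);
    static final float RAD_TO_INDEX = COUNT / RAD_FULL;
    static final float[] SIN = new float[COUNT];

    static {
        // Sample each bucket at its centre to halve the worst-case error.
        for (int i = 0; i < COUNT; i++)
            SIN[i] = (float) Math.sin((i + 0.5) / COUNT * RAD_FULL);
    }

    public static float sin(float rad) {
        // The mask wraps negative and out-of-range angles via two's complement.
        return SIN[(int) (rad * RAD_TO_INDEX) & MASK];
    }

    public static void main(String[] args) {
        double maxErr = 0;
        for (float a = -10f; a < 10f; a += 0.001f)
            maxErr = Math.max(maxErr, Math.abs(sin(a) - Math.sin(a)));
        System.out.println("max abs error = " + maxErr);
    }
}
```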

Offline Riven
« Reply #3 - Posted 2006-08-11 17:11:25 »

About the update:

1. doubled the accuracy while using the same bit-count
2. made some small optimisations, resulting in 40-50% faster code

Offline Mads
JGO Ninja
Medals: 26 | Projects: 3 | Exp: 6 years
One for all!
« Reply #4 - Posted 2011-07-26 19:08:40 »

Sorry for bumping this.

Is there a way to get the newest versions of your fast math operations? I know there are quite a few, or I might have dreamt that some time. Let's hope!

When does the accuracy become a problem? How many pixels off can we get on a 1000 x 1000 plane? I suspect it's not a problem at all.

Offline Riven
« Reply #5 - Posted 2011-07-26 19:13:23 »

http://riven8192.blogspot.com/2009/08/fastmath-atan2-lookup-table.html

Offline Spasi
« Reply #6 - Posted 2011-07-26 20:45:42 »

This is faster and a bit more accurate (at least on the -100.0f to 100.0f range you're testing).

int i;
if ( x < y )
   i = ATAN2_DIM_MINUS_1 * ATAN2_DIM + (int)(x * (ATAN2_DIM_MINUS_1 / y));
else
   i = (int)(y * (ATAN2_DIM_MINUS_1 / x)) * ATAN2_DIM + ATAN2_DIM_MINUS_1;

return (atan2[i] + add) * mul;
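What makes this faster: after normalization, the larger of the two coordinates always lands on the last table index, so the shared division and one of the two casts disappear. A self-contained sketch of just the index computation (DIM here plays the role of ATAN2_DIM; this is my reconstruction, not Spasi's full class), which also demonstrates that the index always stays in bounds:

```java
public class BoundaryIndexSketch {
    static final int DIM = 128;
    static final int DIM_MINUS_1 = DIM - 1;

    // Index into a DIM*DIM table for first-quadrant (x, y), both > 0:
    // the larger coordinate is pinned to the last row or column.
    public static int index(float y, float x) {
        if (x < y)
            return DIM_MINUS_1 * DIM + (int) (x * (DIM_MINUS_1 / y));
        else
            return (int) (y * (DIM_MINUS_1 / x)) * DIM + DIM_MINUS_1;
    }

    public static void main(String[] args) {
        java.util.Random rng = new java.util.Random(42);
        for (int k = 0; k < 100_000; k++) {
            float x = rng.nextFloat() * 100 + 1e-3f;
            float y = rng.nextFloat() * 100 + 1e-3f;
            int i = index(y, x);
            if (i < 0 || i >= DIM * DIM)
                throw new AssertionError("out of bounds: " + i);
        }
        System.out.println("all indices in bounds");
    }
}
```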
Offline Zom-B
Senior Newbie
« Reply #7 - Posted 2012-04-26 13:12:57 »

I improved it. It now uses O(N) memory instead of O(N^2), and because it eliminates unnecessary assignments, multiplications, additions and 2-dimensional addressing, it's a lot faster. I also increased the table size for more precision.

The theory: (diagram not preserved)
I added the option to not only return results from -Pi to Pi, but any range you want.

public class FastMath
{
    public static void main(String[] args)
    {
        float min = -100;
        float max = +100;
        float step = 0.12f;

        for (int i = 0; i < 8; i++)
        {
            long t0A = System.nanoTime() / 1000000L;
            float sumA = 0.0f;
            for (float y = min; y < max; y += step)
                for (float x = min; x < max; x += step)
                    sumA += atan2(y, x);
            long t1A = System.nanoTime() / 1000000L;

            long t0B = System.nanoTime() / 1000000L;
            float sumB = 0.0f;
            for (float y = min; y < max; y += step)
                for (float x = min; x < max; x += step)
                    sumB += Math.atan2(y, x);
            long t1B = System.nanoTime() / 1000000L;

            System.out.println();
            System.out.println("FastMath: " + (t1A - t0A) + "ms, sum=" + sumA);
            System.out.println("JavaMath: " + (t1B - t0B) + "ms, sum=" + sumB);
            System.out.println("factor: " + (float)(t1B - t0B) / (t1A - t0A));
        }
    }

    private static final int           SIZE                 = 1024;
    private static final float        STRETCH            = Math.PI;
    // Output will swing from -STRETCH to STRETCH (default: Math.PI)
   // Useful to change to 1 if you would normally do "atan2(y, x) / Math.PI"

    // Negative of SIZE, so the sign flip is folded into the index scaling
    private static final int        EZIS            = -SIZE;
    private static final float[]    ATAN2_TABLE_PPY    = new float[SIZE + 1];
    private static final float[]    ATAN2_TABLE_PPX    = new float[SIZE + 1];
    private static final float[]    ATAN2_TABLE_PNY    = new float[SIZE + 1];
    private static final float[]    ATAN2_TABLE_PNX    = new float[SIZE + 1];
    private static final float[]    ATAN2_TABLE_NPY    = new float[SIZE + 1];
    private static final float[]    ATAN2_TABLE_NPX    = new float[SIZE + 1];
    private static final float[]    ATAN2_TABLE_NNY    = new float[SIZE + 1];
    private static final float[]    ATAN2_TABLE_NNX    = new float[SIZE + 1];

    static
    {
        for (int i = 0; i <= SIZE; i++)
        {
            float f = (float)i / SIZE;
            ATAN2_TABLE_PPY[i] = (float)(StrictMath.atan(f) * STRETCH / StrictMath.PI);
            ATAN2_TABLE_PPX[i] = STRETCH * 0.5f - ATAN2_TABLE_PPY[i];
            ATAN2_TABLE_PNY[i] = -ATAN2_TABLE_PPY[i];
            ATAN2_TABLE_PNX[i] = ATAN2_TABLE_PPY[i] - STRETCH * 0.5f;
            ATAN2_TABLE_NPY[i] = STRETCH - ATAN2_TABLE_PPY[i];
            ATAN2_TABLE_NPX[i] = ATAN2_TABLE_PPY[i] + STRETCH * 0.5f;
            ATAN2_TABLE_NNY[i] = ATAN2_TABLE_PPY[i] - STRETCH;
            ATAN2_TABLE_NNX[i] = -STRETCH * 0.5f - ATAN2_TABLE_PPY[i];
        }
    }

    /**
     * ATAN2
     */


    public static final float atan2(float y, float x)
    {
        if (x >= 0)
        {
            if (y >= 0)
            {
                if (x >= y)
                    return ATAN2_TABLE_PPY[(int)(SIZE * y / x + 0.5)];
                else
                    return ATAN2_TABLE_PPX[(int)(SIZE * x / y + 0.5)];
            }
            else
            {
                if (x >= -y)
                    return ATAN2_TABLE_PNY[(int)(EZIS * y / x + 0.5)];
                else
                    return ATAN2_TABLE_PNX[(int)(EZIS * x / y + 0.5)];
            }
        }
        else
        {
            if (y >= 0)
            {
                if (-x >= y)
                    return ATAN2_TABLE_NPY[(int)(EZIS * y / x + 0.5)];
                else
                    return ATAN2_TABLE_NPX[(int)(EZIS * x / y + 0.5)];
            }
            else
            {
                if (x <= y) // (-x >= -y)
                   return ATAN2_TABLE_NNY[(int)(SIZE * y / x + 0.5)];
                else
                    return ATAN2_TABLE_NNX[(int)(SIZE * x / y + 0.5)];
            }
        }
    }
}
Offline Riven
« League of Dukes »

JGO Overlord


Medals: 749
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #8 - Posted 2012-04-26 13:28:26 »

Nice! I'll do some benchmarks when I get home. My atan2 was basically a huge hack that would explode when x or y approached zero.

Online Roquen
« Reply #9 - Posted 2012-04-26 14:25:08 »

WARNING: micro-benchmarking heavily table-based code is very misleading, as the tables will remain cached throughout the benchmark, which is unlikely under real usage patterns. Likewise, walking linearly through the test range will cause the branch predictors to guess right virtually 100% of the time, which will not be the case in real usage.
Offline Zom-B
« Reply #10 - Posted 2012-04-26 14:52:19 »

I just benchmarked my real application and it went from 20fps to 70fps. It's not only atan2: the 4th operation in the sequence below is atan2, and the other operations include loading an image (memory-intensive).

processing times in milliseconds.

before:
ms=1.298769   1.79073   1.291225   31.39561   0.686121   1.968686   0.729702      1.721866

after:
ms=1.283403   1.842692   1.211607   4.929677   0.801498   2.235479   0.755683      1.747324

From 31ms to 5ms is a 6x improvement.

My raw code+benchmark gives this:

FastMath: 42ms, sum=-2688.061
JavaMath: 562ms, sum=-2690.9863
factor: 13.380953

FastMath: 40ms, sum=-2688.061
JavaMath: 519ms, sum=-2690.9863
factor: 12.975


In contrast, on my machine, Riven's original algorithm+benchmark was 8.5x faster out of the box, as opposed to the claimed 13.8x.
Online Roquen
« Reply #11 - Posted 2012-04-26 15:32:46 »

Hum.  Why are you calling atan2 so much?  That seems excessive.
Offline Spasi
« Reply #12 - Posted 2012-04-26 16:01:00 »

Zom-B, your version is definitely more accurate and requires half the look-up table size, but it's also slower than Riven's version (~38% slower). Also, the JavaMath timing in your results looks suspiciously high; on which JVM and what kind of CPU did you run the test on? Did you use the -server flag?
Offline Zom-B
« Reply #13 - Posted 2012-04-26 20:01:28 »

I use JDK 1.7 with Eclipse Indigo on a 2.5GHz Core 2 Duo and 32-bit Windows XP. I don't use the -server flag.

I don't see how my version could possibly be slower, unless you copied the early double variant which I soon replaced with the float variant.

My algorithm uses 3 comparisons, 1 multiplication, 1 division, 1 addition, 1 table look-up and 0~1 negations.

Riven's version uses 3 comparisons, 4 multiplications, 1 division, 1 addition, 1 table look-up, 0~2 negations, 5~7 assignments and 2 casts.
Offline Riven
« Reply #14 - Posted 2012-04-26 20:05:21 »

As usual, and already mentioned in this thread, caching is the bottleneck in lookup tables. By making your access patterns very specific, you can make either of the two 'win' in performance.

Offline sproingie
JGO Kernel
Medals: 202
« Reply #15 - Posted 2012-04-26 21:32:11 »

Just FYI:

$ javac FastMath.java && java -cp . FastMath
FastMath.java:33: error: possible loss of precision
    private static final float        STRETCH            = Math.PI;
                                                               ^
  required: float
  found:    double
1 error

Offline sproingie
« Reply #16 - Posted 2012-04-26 21:50:21 »

Here it is with random data and 10 million iterations:


Running FastMath

FastMath: 1173ms, sum=157309.11
JavaMath: 2090ms, sum=158855.06
factor: 1.7817562


Modified code below:

import java.util.Random;

public class FastMath
{
    static Random rng = new Random();

    public static void main(String[] args)
    {
        int iterations = 10000000; // 10 million
       float min = -100;
        float max = +100;
        float step = 0.12f;


        long t0A = System.nanoTime() / 1000000L;
        float sumA = 0.0f;
        for (int i = 0; i < iterations; i++)
            sumA += atan2(randFloat(min,max), randFloat(min,max));
        long t1A = System.nanoTime() / 1000000L;
       
        long t0B = System.nanoTime() / 1000000L;
        float sumB = 0.0f;
        for (int i = 0; i < iterations; i++)
            sumB += Math.atan2(randFloat(min,max), randFloat(min,max));
        long t1B = System.nanoTime() / 1000000L;
       
        System.out.println();
        System.out.println("FastMath: " + (t1A - t0A) + "ms, sum=" + sumA);
        System.out.println("JavaMath: " + (t1B - t0B) + "ms, sum=" + sumB);
        System.out.println("factor: " + (float)(t1B - t0B) / (t1A - t0A));
    }

    private static float randFloat(float min, float max) {
        return min + (float)(rng.nextDouble() * ((double)max - (double)min));
    }


    private static final int           SIZE                 = 1024;
    private static final float         STRETCH            = (float)Math.PI;
    // Output will swing from -STRETCH to STRETCH (default: Math.PI)
   // Useful to change to 1 if you would normally do "atan2(y, x) / Math.PI"

    // Negative of SIZE, so the sign flip is folded into the index scaling
    private static final int        EZIS            = -SIZE;
    private static final float[]    ATAN2_TABLE_PPY    = new float[SIZE + 1];
    private static final float[]    ATAN2_TABLE_PPX    = new float[SIZE + 1];
    private static final float[]    ATAN2_TABLE_PNY    = new float[SIZE + 1];
    private static final float[]    ATAN2_TABLE_PNX    = new float[SIZE + 1];
    private static final float[]    ATAN2_TABLE_NPY    = new float[SIZE + 1];
    private static final float[]    ATAN2_TABLE_NPX    = new float[SIZE + 1];
    private static final float[]    ATAN2_TABLE_NNY    = new float[SIZE + 1];
    private static final float[]    ATAN2_TABLE_NNX    = new float[SIZE + 1];

    static
    {
        for (int i = 0; i <= SIZE; i++)
        {
            float f = (float)i / SIZE;
            ATAN2_TABLE_PPY[i] = (float)(StrictMath.atan(f) * STRETCH / StrictMath.PI);
            ATAN2_TABLE_PPX[i] = STRETCH * 0.5f - ATAN2_TABLE_PPY[i];
            ATAN2_TABLE_PNY[i] = -ATAN2_TABLE_PPY[i];
            ATAN2_TABLE_PNX[i] = ATAN2_TABLE_PPY[i] - STRETCH * 0.5f;
            ATAN2_TABLE_NPY[i] = STRETCH - ATAN2_TABLE_PPY[i];
            ATAN2_TABLE_NPX[i] = ATAN2_TABLE_PPY[i] + STRETCH * 0.5f;
            ATAN2_TABLE_NNY[i] = ATAN2_TABLE_PPY[i] - STRETCH;
            ATAN2_TABLE_NNX[i] = -STRETCH * 0.5f - ATAN2_TABLE_PPY[i];
        }
    }

    /**
     * ATAN2
     */


    public static final float atan2(float y, float x)
    {
        if (x >= 0)
        {
            if (y >= 0)
            {
                if (x >= y)
                    return ATAN2_TABLE_PPY[(int)(SIZE * y / x + 0.5)];
                else
                    return ATAN2_TABLE_PPX[(int)(SIZE * x / y + 0.5)];
            }
            else
            {
                if (x >= -y)
                    return ATAN2_TABLE_PNY[(int)(EZIS * y / x + 0.5)];
                else
                    return ATAN2_TABLE_PNX[(int)(EZIS * x / y + 0.5)];
            }
        }
        else
        {
            if (y >= 0)
            {
                if (-x >= y)
                    return ATAN2_TABLE_NPY[(int)(EZIS * y / x + 0.5)];
                else
                    return ATAN2_TABLE_NPX[(int)(EZIS * x / y + 0.5)];
            }
            else
            {
                if (x <= y) // (-x >= -y)
                   return ATAN2_TABLE_NNY[(int)(SIZE * y / x + 0.5)];
                else
                    return ATAN2_TABLE_NNX[(int)(SIZE * x / y + 0.5)];
            }
        }
    }
}


Offline sproingie
« Reply #17 - Posted 2012-04-27 00:45:56 »

I forgot to re-seed the RNG between each test.  So I decided to fix that, but rather than do the easy thing, I went with a big array of samples instead, so now it's 10 million iterations cycling over an array of 50,000 random floats.  The results were interesting: FastMath's relative performance more than doubles past its previous benchmark. 

FastMath:    366ms, sum=294547.25
JavaMath:   1333ms, sum=294536.22
factor: 3.6420765

Offline Riven
« Reply #18 - Posted 2012-04-27 00:53:52 »

Not (too) surprising (to me) actually.

java.util.Random is relatively slow, compared to calculating the index and fetching the value of an entry of a lookup table.

But as always, it shows that microbenchmarks are hard.



Having said that, how do my code and his compare to each other on your system (and in your test case)?

Offline sproingie
« Reply #19 - Posted 2012-04-27 01:03:11 »

Sure Random is expensive, but both should have been billed equally.  Cache effect I guess.

I'm just using ZomB's version; I didn't actually try the first chunk of code, as I thought they were basically the same. I'll try yours out when I get some time.
Offline Riven
« Reply #20 - Posted 2012-04-27 01:09:40 »

The two are not affected equally, for the following reason:


Let's argue that A is 10x as fast as B (A is 100ms, B is 1000ms).

Now we're going to add an overhead of 250ms: (let's name it: 'random number calculation overhead')
benchmark of A = 100ms + 250ms = 350ms
benchmark of B = 1000ms + 250ms = 1250ms
-> conclusion: "A is 3.57x as fast as B"

Let's instead add an overhead of 100ms: (let's call it: 'array lookup overhead')
benchmark of A = 100ms + 100ms = 200ms
benchmark of B = 1000ms + 100ms = 1100ms
-> conclusion: "A is 5.50x as fast as B"

Finally, add an overhead of 10000ms:
benchmark of A = 100ms + 10000ms = 10100ms
benchmark of B = 1000ms + 10000ms = 11000ms
-> conclusion: "A and B are about as fast"



The magnitude of the overhead matters, and can't be considered to be a factor that is 'billed equally'.
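The distortion Riven describes is just arithmetic on the measured totals; a tiny sketch makes it explicit (numbers taken from the example above):

```java
public class OverheadBias {
    // Apparent speedup when a shared, constant overhead is added to both
    // measurements. The true factor here is slowMs / fastMs.
    static double apparentFactor(double fastMs, double slowMs, double overheadMs) {
        return (slowMs + overheadMs) / (fastMs + overheadMs);
    }

    public static void main(String[] args) {
        // True factor is 10x (100ms vs 1000ms); the overhead shrinks it.
        System.out.println(apparentFactor(100, 1000, 250));   // ~3.57
        System.out.println(apparentFactor(100, 1000, 100));   // 5.5
        System.out.println(apparentFactor(100, 1000, 10000)); // ~1.09
    }
}
```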

Offline sproingie
« Reply #21 - Posted 2012-04-27 01:15:40 »

Oh nuts, you're right.  I would have to run a control and subtract it from both. 

Offline Riven
« Reply #22 - Posted 2012-04-27 01:17:08 »

It's funny how we're solving complex logic problems, and once in a while screw up on the basics.

Offline Zom-B
« Reply #23 - Posted 2012-04-27 06:23:51 »

In the past I have found a very elegant and reliable way to do microbenchmarking while compensating for the effects of overhead.

Calculate the time it takes one loop without the function being tested, and one with.

   public static void main(String[] args)
   {
      int samples = 1000000;
      long l1;
      long l2;
      int n = 0;

      for (int tries = 0; tries < 10; tries++)
      {
         l1 = System.nanoTime();
         for (int i = 0; i < samples; i++)
         {}
         l1 = System.nanoTime() - l1;

         l2 = System.nanoTime();
         for (int i = 0; i < samples; i++)
         {
            n++;
         }
         l2 = System.nanoTime() - l2;

         System.out.println("t0 = " + l1 + "\tt=" + l2 + "\tspeed=" + samples * 1e9f / (l2 - l1));
      }
   }

Result:
t0 = 1251556   t=779708   speed=-2.11932659E9
t0 = 548115   t=838654   speed=3.44187878E9
t0 = 546159   t=752889   speed=4.837227E9
t0 = 531632   t=784457   speed=3.95530496E9
t0 = 523251   t=883911   speed=2.77269453E9
t0 = 546159   t=807924   speed=3.82022042E9
t0 = 543924   t=804013   speed=3.84483763E9
t0 = 545879   t=803454   speed=3.88236442E9
t0 = 543365   t=804013   speed=3.83659187E9
t0 = 546159   t=804012   speed=3.87817856E9


My 2.5GHz Intel Core 2 (Duo) can do about 3.88 billion n++ operations per second in a single thread.

When removing n++ (making both loops the same), the result is:
t0 = 1259099   t=602311   speed=-1.52256128E9
t0 = 544762   t=556216   speed=8.730574E10
t0 = 551467   t=540851   speed=-9.4197441E10
t0 = 666007   t=552864   speed=-8.8383724E9
t0 = 545041   t=553982   speed=1.11844311E11
t0 = 545600   t=556496   speed=9.1776795E10
t0 = 546159   t=603708   speed=1.73764956E10
t0 = 561245   t=540571   speed=-4.8369934E10
t0 = 553701   t=542527   speed=-8.9493463E10
t0 = 551467   t=607619   speed=1.78088038E10


which are ridiculously high numbers (about 30 times higher), meaning the overhead compensation is fairly reliable relative to the measurement noise (increasing samples increases accuracy).

Now on to atan2.

   public static void main(String[] args)
   {
      Random RND = new Random();

      int samples = 1000000;
      long l1;
      long l2;
      long l3;
      float x;
      float y;

      for (int tries = 0; tries < 10; tries++)
      {
         l1 = System.nanoTime();
         for (int i = 0; i < samples; i++)
         {
            x = (RND.nextFloat() - 0.5f) * 200;
            y = (RND.nextFloat() - 0.5f) * 200;
         }
         l1 = System.nanoTime() - l1;

         l2 = System.nanoTime();
         for (int i = 0; i < samples; i++)
         {
            x = (RND.nextFloat() - 0.5f) * 200;
            y = (RND.nextFloat() - 0.5f) * 200;
            FastMath.atan2(x, y);
         }
         l2 = System.nanoTime() - l2;

         l3 = System.nanoTime();
         for (int i = 0; i < samples; i++)
         {
            x = (RND.nextFloat() - 0.5f) * 200;
            y = (RND.nextFloat() - 0.5f) * 200;
            StrictMath.atan2(x, y);
         }
         l3 = System.nanoTime() - l3;

         float fast = samples * 1e9f / (l2 - l1);
         float slow = samples * 1e9f / (l3 - l1);
         System.out.println("speed fast=" + fast + "\tspeed slow=" + slow + "\tfactor=" + fast / slow);
      }
   }

result:
speed fast=2.5838188E7   speed slow=3102130.2   factor=8.329176
speed fast=4.2304908E7   speed slow=6922386.0   factor=6.111319
speed fast=3.6428208E7   speed slow=6051220.5   factor=6.019977
speed fast=3.3007322E7   speed slow=5712438.0   factor=5.7781496
speed fast=3.6745696E7   speed slow=6234034.0   factor=5.8943686
speed fast=4.6704136E7   speed slow=6152503.5   factor=7.5910783
speed fast=3.9707864E7   speed slow=6488173.5   factor=6.1200376
speed fast=5.0430332E7   speed slow=6682681.5   factor=7.5464215
speed fast=4.770246E7   speed slow=6730517.0   factor=7.087488
speed fast=4.857506E7   speed slow=6979997.5   factor=6.9591804


I've also invented another, even more accurate, but more limited, microbenchmarking technique. Call the function once in the first loop and 10 times in the other, then solve for the run time of one execution. This is not applicable for atan2 because you can't call it multiple times in a row and expect the performance to be flat (and also calling random in between makes the technique useless).
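That 1-vs-10 technique amounts to solving two linear equations: t1 = overhead + f and t10 = overhead + 10*f, so f = (t10 - t1) / 9. A sketch with made-up timings, just to show the algebra (the numbers are mine, purely illustrative):

```java
public class OneVsTenSketch {
    // Given total times for a loop calling the function once vs. ten times
    // per iteration, eliminate the shared loop overhead and solve for the
    // cost of a single call.
    static double perCallTime(double t1, double t10) {
        return (t10 - t1) / 9.0;
    }

    public static void main(String[] args) {
        // Suppose the loop overhead is 40 units and each call costs 7 units:
        double t1 = 40 + 7;       // one call per iteration
        double t10 = 40 + 10 * 7; // ten calls per iteration
        System.out.println(perCallTime(t1, t10)); // recovers 7.0
    }
}
```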
Offline 65K
« Reply #24 - Posted 2012-04-27 09:08:04 »

In the past I have found a very elegant and reliable way to do microbenchmarking while compensating for the effects of overhead.

Calculate the time it takes one loop without the function being tested, and one with.



Measuring such loops does not mean a lot: being dead code, they are subject to elimination.
Compare runs with the client and server VM.
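A standard defence against the dead-code elimination described above is to fold every result into a value the program actually observes, and print it at the end. A minimal sketch (the workload and names are mine, purely illustrative):

```java
public class DceGuardSketch {
    // Accumulate every result into a value the program actually observes,
    // so the JIT cannot prove the loop body dead and eliminate it.
    static long run(int samples) {
        long sink = 0;
        for (int i = 0; i < samples; i++)
            sink += i * 31; // stand-in for the work under test
        return sink;
    }

    public static void main(String[] args) {
        long t = System.nanoTime();
        long sink = run(1_000_000);
        t = System.nanoTime() - t;
        // Printing the sink keeps the result (and thus the loop) observable.
        System.out.println("sink=" + sink + " ns=" + t);
    }
}
```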
 

Online Roquen
« Reply #25 - Posted 2012-04-27 10:04:04 »

The only situation where I'd use large tables for an approximation is if the target is interpreted. Data flow is too big an issue on modern architectures. sproingie's first test is the most reasonable so far, and still it shows the "fast" version in too good a light compared to realistic usage: it succeeds in making the branch predictors "stub their toes" and forces the higher-level caches to refill, BUT the lower-level caches will very quickly hold all the data. The second test is a step backwards. If you think too deeply about this stuff it will really make your head hurt, because (for instance) in compiled code under light temporal access patterns, the version that needs to load the least data (including code) will probably be the winner... and loading data on core A can impact the performance of core B, which may be wanting to access the memory architecture at the same time.

Quote
I forgot to re-seed the RNG between each test.
If you need to reseed, then either the test is broken or the random generator is.  The only reason to reseed is to keep people from complaining about the test not being "fair"... and a couple of lines of code is worth not having to explain some statistic.

Quote
So I decided to fix that, but rather than do the easy thing, I went with a big array of samples instead, so now it's 10 million iterations cycling over an array of 50,000 random floats.  The results were interesting: FastMath's relative performance more than doubles past its previous benchmark. 
So what you're doing is causing the hardware version to be drastically slowed down by introducing memory stalls (that code chunk has virtually nothing else to do to hide the stall).  The impact on the "fast" version will be less marked: out-of-order execution, hiding of stalls, etc.  This seems to me a highly unlikely situation: input from linear arrays of random data.  Even if the input did look somewhat like this, there would presumably be other work happening at the same time (again, same points: out-of-order execution, hiding of stalls, etc.).

Yes, dead code is useless, but don't forget legal transforms as well.  So this bias computation could be off:
        l1 = System.nanoTime();
         for (int i = 0; i < samples; i++)
         {
            x = (RND.nextFloat() - 0.5f) * 200;
            y = (RND.nextFloat() - 0.5f) * 200;
         }
         l1 = System.nanoTime() - l1;


can be legally transformed into:

        l1 = System.nanoTime();
         for (int i = 0; i < samples; i++) {
            RND.nextFloat();
            RND.nextFloat();
         }
         l1 = System.nanoTime() - l1;


WRT StrictMath: this class is bordering on useless.  At least I can't think of a valid place to use it.
Offline Chromanoid

Junior Member


Medals: 3



« Reply #26 - Posted 2012-04-27 11:11:45 »

For microbenchmarking you might want to look at http://code.google.com/p/caliper/
Offline sproingie

JGO Kernel


Medals: 202



« Reply #27 - Posted 2012-04-27 17:42:03 »

The reason I would need to re-seed is to get the same random data so that the "sum" is meaningful beyond avoiding the loop being optimized out, i.e. to compare the accuracy of the two methods.  Otherwise, I'd hope that 10 million randoms are distributed well enough.

As for the second, transforming linear arrays of numeric data is the bread and butter of visual algorithms.  If you're computing thousands of normals, it's going to have pretty similar access patterns (different trig functions of course).  It's rarely going to be as tight as a microbenchmark, but no one ever called microbenchmarks the last word.

As for StrictMath, it's being used to initialize the table to give it predictable behavior across platforms, that's all.  Seems the most appropriate place to use it, where you're only paying the cost of using it once up front.
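The pattern described here, paying for StrictMath only once at class-initialization time, looks roughly like this (a sketch with a made-up class name; the table dimensions follow the 128x128 layout from the first post):

```java
public class Atan2Table {
    static final int DIM = 128;
    static final float[] TABLE = new float[DIM * DIM];

    static {
        // StrictMath guarantees bit-identical results on every platform;
        // the extra cost is paid exactly once, when the class is loaded.
        for (int y = 0; y < DIM; y++) {
            for (int x = 0; x < DIM; x++) {
                TABLE[y * DIM + x] = (float) StrictMath.atan2(
                        y / (float) (DIM - 1),
                        x / (float) (DIM - 1));
            }
        }
    }
}
```

Lookups at runtime only index the array, so the slower strict intrinsics never appear on the hot path.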

Offline Cero
« Reply #28 - Posted 2012-04-27 18:42:28 »

duuuuh, so is the new atan2 now faster or not ? =D

Online Roquen
« Reply #29 - Posted 2012-04-28 21:21:38 »

.... so that the "sum" is meaningful beyond avoiding the loop being optimized out, i.e. to compare the accuracy of the two methods. 
But the sum is useless for testing accuracy.  The only thing it can tell you is whether your approximation is really broken, nothing else.

Quote
As for the second, transforming linear arrays of numeric data is the bread and butter of visual algorithms.  If you're computing thousands of normals, it's going to have pretty similar access patterns (different trig functions of course). 
In which case there is virtually always other exploitable information.

Quote
As for StrictMath, it's being used to initialize the table to give it predictable behavior across platforms, that's all. 
Bit-exact results for floats is a crazy goal and was never the intent.  Remember in this case that we're talking about singles, so the results will be bit-exact regardless.