Java-Gaming.org / Show Posts
1  Game Development / Shared Code / Re: 13.8x faster atan2 (updated) on: 2012-05-07 09:59:36
Fast/slow.  It's all relative.  To put some numbers on it in terms of instruction latency/throughput (both in cycles) for modern(ish) Intel.  Ranges are to cover a reasonable set of hardware.

SQRTSD: 20-39/14-39  (IEEE-compliant double sqrt)  On certain input ranges these can drop to ~6/~10.
SQRTSS: 23-32/23-32  (IEEE-compliant single sqrt)  On certain input ranges these can drop to ~6/~10.
RSQRTSS: 3-6/1-4     (approximate 1/sqrt for singles. We can't do this...big fat sadface)

So what I intended to say about sqrt being "hard" is that, unlike many other functions, reducing the range or precision to get a faster version is much more difficult.  About the only way you can succeed is if you know the input is very near a specific value (like when renormalizing).  That old SGI trick (popularized by id Software) should always be slower than simply using the hardware.
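(For reference, the "SGI trick" mentioned above is the inverse-sqrt bit hack with the well-known 0x5f3759df magic constant. A Java port, purely for illustration; as argued above, the hardware path should beat it:)

```java
public class InvSqrtTrick {
    // The classic SGI/Quake III approximate 1/sqrt, ported to Java.
    // Shown only for comparison; the plain hardware route,
    // 1f / (float)Math.sqrt(x), should be faster on modern CPUs.
    static float fastInvSqrt(float x) {
        int i = Float.floatToIntBits(x);
        i = 0x5f3759df - (i >> 1);            // initial bit-level guess
        float y = Float.intBitsToFloat(i);
        return y * (1.5f - 0.5f * x * y * y); // one Newton-Raphson refinement
    }

    public static void main(String[] args) {
        System.out.println(fastInvSqrt(4f));           // roughly 0.499
        System.out.println(1f / (float)Math.sqrt(4f)); // 0.5
    }
}
```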

After some googling I found out these are actually SSE instructions (not MMX as I first thought), and Java doesn't let you pick them explicitly afaik; normal math on my 32-bit JVM happens using x87 FPU instructions. I copied this from the "Intel Architecture Optimization Reference Manual":

"Additionally, while the multi-cycle floating-point instructions, fdiv and fsqrt, execute in the floating-point unit pipe, integer instructions can be executed in parallel."

So it pays off to mix plenty of non-FP math into inner loops that use Math.sqrt and division; splitting the FP and integer work into separate inner loops would waste that parallelism.

And from "How to optimize for the Pentium family of microprocessors" I read that FSQRT uses 70 cycles and cannot overlap integer multiplication instructions (they will block the pipeline). No idea about modern processors however. They should be much more efficient in this, especially the new Core Ix and Celeron Sandy Bridge CPUs.
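To make the manual's point concrete, here is a hypothetical micro-benchmark sketch (class and method names invented) where the cheap integer work in the loop body can overlap the long-latency sqrt, so it is nearly free:

```java
public class OverlapSketch {
    // Mixes a long-latency FP sqrt with cheap integer work in one loop body,
    // the pattern the quoted manual says can execute in parallel.
    static double sumSqrt(int n) {
        double acc = 0;
        int checksum = 0;
        for (int i = 1; i <= n; i++) {
            acc += Math.sqrt(i); // multi-cycle FP instruction
            checksum += i;       // integer work the CPU may overlap with it
        }
        return acc + (checksum & 1); // consume checksum so it can't be optimized away
    }

    public static void main(String[] args) {
        long t = System.nanoTime();
        double r = sumSqrt(1_000_000);
        t = System.nanoTime() - t;
        System.out.println("sum=" + r + " in " + (t / 1_000_000) + " ms");
    }
}
```

Timing both variants (with and without the checksum line) should show the integer work costing little extra.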
2  Game Development / Shared Code / Re: 13.8x faster atan2 (updated) on: 2012-05-07 02:56:25
@Zom-B the original poster said he used a "lookup-table." Rainbow tables are lookup tables for hash codes.
Rainbow tables are clever structures that eliminate the need for memory-eating lookup tables (think in the range of several terabytes)

Haha, that's a good question; I don't see why I would ever use acos and asin in a performance-critical application :D
How about fractals / fractal art? ;) (actually you'd need the highest possible precision there, so the statement is void)

Fast/slow.  It's all relative.  To put some numbers on it in terms of instruction latency/throughput (both in cycles) for modern(ish) Intel.  Ranges are to cover a reasonable set of hardware.

SQRTSD: 20-39/14-39  (IEEE-compliant double sqrt)  On certain input ranges these can drop to ~6/~10.
SQRTSS: 23-32/23-32  (IEEE-compliant single sqrt)  On certain input ranges these can drop to ~6/~10.
RSQRTSS: 3-6/1-4     (approximate 1/sqrt for singles. We can't do this...big fat sadface)

So what I intended to say about sqrt being "hard" is that, unlike many other functions, reducing the range or precision to get a faster version is much more difficult.  About the only way you can succeed is if you know the input is very near a specific value (like when renormalizing).  That old SGI trick (popularized by id Software) should always be slower than simply using the hardware.
I sometimes use "float d = 1f / (float)Math.sqrt(dx * dx + dy * dy);" (more often double actually), so hopefully the compiler uses RSQRTSS there.
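Concretely, the normalization pattern I mean looks like this (class and method names are made up for the sketch):

```java
public final class Norm {
    // Inverse length of (dx, dy), kept in single precision throughout so a JIT
    // could in principle use a fast approximate path (no guarantee it does).
    static float invLength(float dx, float dy) {
        return 1f / (float) Math.sqrt(dx * dx + dy * dy);
    }

    public static void main(String[] args) {
        float dx = 3f, dy = 4f;
        float d = invLength(dx, dy);                // 1/5 = 0.2
        System.out.println(dx * d + ", " + dy * d); // unit vector, approx (0.6, 0.8)
    }
}
```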
3  Game Development / Shared Code / Re: 13.8x faster atan2 (updated) on: 2012-05-02 08:29:26
Who got what idea? I don't see anything in this thread even remotely related to rainbow tables.



A side side note. At least on my implementation, Sun/Oracle's Java for Windows, all Math.xxx methods in rt.jar (the part of the JVM that is pure Java) delegate to StrictMath.xxx. So calling StrictMath.xxx directly saves one method call.
StrictMath is almost never actually called from Math; the JVM inlines Math calls with its own faster native implementations.

EDIT: Hehe, looks like Roquen's link already says that ^^

I didn't know that. They aren't flagged as native, so I never expected the JVM to make an exception for them.

I did some quick tests, and indeed the trig functions are on the order of 80 times faster when calling Math.

On the other hand, some functions like min/max are 35% faster when calling StrictMath (I haven't tested others yet).

Edit: A more advanced test (using my second test type, and preventing code from being optimized away) now shows that StrictMath is about 10% faster than Math.

Edit2: (StrictMath, compared to Math, is ...)
sin,cos: 50% slower
tan: 10% faster
atan: about the same
sqrt: 9000% slower
cbrt: about the same
floor: about 10% slower
ceil: about the same
max,min: about the same
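For anyone who wants to reproduce these numbers, a minimal Math-vs-StrictMath timing sketch (hypothetical harness, not my actual test code; results will vary by JVM and CPU) could look like:

```java
public class MathVsStrict {
    public static void main(String[] args) {
        int n = 2_000_000;
        double sink = 0; // accumulate results so the JIT can't discard the loops
        for (int round = 0; round < 3; round++) { // repeat so later rounds run JIT-compiled
            long tm = System.nanoTime();
            for (int i = 0; i < n; i++) sink += Math.sin(i * 1e-6);
            tm = System.nanoTime() - tm;

            long ts = System.nanoTime();
            for (int i = 0; i < n; i++) sink += StrictMath.sin(i * 1e-6);
            ts = System.nanoTime() - ts;

            System.out.println("Math.sin: " + tm / 1_000_000
                    + " ms, StrictMath.sin: " + ts / 1_000_000 + " ms");
        }
        System.out.println(sink); // print the sink so nothing is dead code
    }
}
```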

And I think we're drifting too far off topic. Shouldn't someone make a new topic for this?
4  Game Development / Shared Code / Re: 13.8x faster atan2 (updated) on: 2012-05-01 05:16:36
A side side note. At least on my implementation, Sun/Oracle's Java for Windows, all Math.xxx methods in rt.jar (the part of the JVM that is pure Java) delegate to StrictMath.xxx. So calling StrictMath.xxx directly saves one method call.
5  Game Development / Shared Code / Re: 13.8x faster atan2 (updated) on: 2012-04-27 04:23:51
In the past I have found a very elegant and reliable way to do microbenchmarking while compensating for the effects of overhead.

Time one loop without the function being tested and one with it; the difference isolates the function's cost.

   public static void main(String[] args)
   {
      int samples = 1000000;
      long l1;
      long l2;
      int n = 0;

      for (int tries = 0; tries < 10; tries++)
      {
         l1 = System.nanoTime();
         for (int i = 0; i < samples; i++)
         {}
         l1 = System.nanoTime() - l1;

         l2 = System.nanoTime();
         for (int i = 0; i < samples; i++)
         {
            n++;
         }
         l2 = System.nanoTime() - l2;

         System.out.println("t0 = " + l1 + "\tt=" + l2 + "\tspeed=" + samples * 1e9f / (l2 - l1));
      }
   }

Result:
t0 = 1251556   t=779708   speed=-2.11932659E9
t0 = 548115   t=838654   speed=3.44187878E9
t0 = 546159   t=752889   speed=4.837227E9
t0 = 531632   t=784457   speed=3.95530496E9
t0 = 523251   t=883911   speed=2.77269453E9
t0 = 546159   t=807924   speed=3.82022042E9
t0 = 543924   t=804013   speed=3.84483763E9
t0 = 545879   t=803454   speed=3.88236442E9
t0 = 543365   t=804013   speed=3.83659187E9
t0 = 546159   t=804012   speed=3.87817856E9


My 2.5 GHz Intel Core 2 Duo can do about 3.88 billion n++ operations per second in a single thread.

When removing n++ (making both loops the same), the result is:
t0 = 1259099   t=602311   speed=-1.52256128E9
t0 = 544762   t=556216   speed=8.730574E10
t0 = 551467   t=540851   speed=-9.4197441E10
t0 = 666007   t=552864   speed=-8.8383724E9
t0 = 545041   t=553982   speed=1.11844311E11
t0 = 545600   t=556496   speed=9.1776795E10
t0 = 546159   t=603708   speed=1.73764956E10
t0 = 561245   t=540571   speed=-4.8369934E10
t0 = 553701   t=542527   speed=-8.9493463E10
t0 = 551467   t=607619   speed=1.78088038E10


which are ridiculously high numbers (about 30 times higher), meaning the overhead compensation is fairly reliable relative to the measurement noise (increasing samples increases accuracy).

Now on to atan2.

   public static void main(String[] args)
   {
      Random RND = new Random();

      int samples = 1000000;
      long l1;
      long l2;
      long l3;
      float x;
      float y;

      for (int tries = 0; tries < 10; tries++)
      {
         l1 = System.nanoTime();
         for (int i = 0; i < samples; i++)
         {
            x = (RND.nextFloat() - 0.5f) * 200;
            y = (RND.nextFloat() - 0.5f) * 200;
         }
         l1 = System.nanoTime() - l1;

         l2 = System.nanoTime();
         for (int i = 0; i < samples; i++)
         {
            x = (RND.nextFloat() - 0.5f) * 200;
            y = (RND.nextFloat() - 0.5f) * 200;
            FastMath.atan2(x, y);
         }
         l2 = System.nanoTime() - l2;

         l3 = System.nanoTime();
         for (int i = 0; i < samples; i++)
         {
            x = (RND.nextFloat() - 0.5f) * 200;
            y = (RND.nextFloat() - 0.5f) * 200;
            StrictMath.atan2(x, y);
         }
         l3 = System.nanoTime() - l3;

         float fast = samples * 1e9f / (l2 - l1);
         float slow = samples * 1e9f / (l3 - l1);
         System.out.println("speed fast=" + fast + "\tspeed slow=" + slow + "\tfactor=" + fast / slow);
      }
   }

result:
speed fast=2.5838188E7   speed slow=3102130.2   factor=8.329176
speed fast=4.2304908E7   speed slow=6922386.0   factor=6.111319
speed fast=3.6428208E7   speed slow=6051220.5   factor=6.019977
speed fast=3.3007322E7   speed slow=5712438.0   factor=5.7781496
speed fast=3.6745696E7   speed slow=6234034.0   factor=5.8943686
speed fast=4.6704136E7   speed slow=6152503.5   factor=7.5910783
speed fast=3.9707864E7   speed slow=6488173.5   factor=6.1200376
speed fast=5.0430332E7   speed slow=6682681.5   factor=7.5464215
speed fast=4.770246E7   speed slow=6730517.0   factor=7.087488
speed fast=4.857506E7   speed slow=6979997.5   factor=6.9591804


I've also invented another, even more accurate but more limited, microbenchmarking technique: call the function once per iteration in the first loop and 10 times in the other, then solve for the run time of a single execution. This isn't applicable to atan2 because you can't call it multiple times in a row and expect the performance to be flat (and calling random in between makes the technique useless).
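The idea sketched in code (a hypothetical example using Math.sqrt, which, unlike the table-based atan2, can be called back-to-back with roughly flat cost):

```java
public class PerCallTime {
    public static void main(String[] args) {
        int samples = 1_000_000, k = 10;
        double sink = 0; // consumed at the end so the calls aren't optimized away

        long t1 = System.nanoTime();
        for (int i = 0; i < samples; i++)
            sink += Math.sqrt(i);         // one call per iteration
        t1 = System.nanoTime() - t1;

        long tk = System.nanoTime();
        for (int i = 0; i < samples; i++)
            for (int j = 0; j < k; j++)
                sink += Math.sqrt(i + j); // k calls per iteration
        tk = System.nanoTime() - tk;

        // The loop overhead cancels out: the extra time covers (k - 1) extra calls.
        double nsPerCall = (tk - t1) / ((k - 1) * (double) samples);
        System.out.println("~" + nsPerCall + " ns per call (sink=" + sink + ")");
    }
}
```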
6  Game Development / Shared Code / Re: 13.8x faster atan2 (updated) on: 2012-04-26 18:01:28
I use JDK 1.7 with Eclipse Indigo on a 2.5 GHz Core 2 Duo and 32-bit Windows XP. I don't use the -server flag.

I don't see how my version could possibly be slower, unless you copied the early double variant which I soon replaced with the float variant.

My algorithm uses 3 comparisons, 1 multiplication, 1 division, 1 addition, 1 table look-up and 0~1 negations.

Riven's version uses 3 comparisons, 4 multiplications, 1 division, 1 addition, 1 table look-up, 0~2 negations, 5~7 assignments and 2 casts.
7  Game Development / Shared Code / Re: 13.8x faster atan2 (updated) on: 2012-04-26 12:52:19
I just benchmarked my real application and it went from 20 fps to 70 fps. It's not only atan2; the 4th operation in the following sequence is the atan2 step. Other operations include loading an image (memory-intensive).

processing times in milliseconds.

before:
ms=1.298769   1.79073   1.291225   31.39561   0.686121   1.968686   0.729702      1.721866

after:
ms=1.283403   1.842692   1.211607   4.929677   0.801498   2.235479   0.755683      1.747324

From 31ms to 5ms is a 6x improvement.

My raw code+benchmark gives this:

FastMath: 42ms, sum=-2688.061
JavaMath: 562ms, sum=-2690.9863
factor: 13.380953

FastMath: 40ms, sum=-2688.061
JavaMath: 519ms, sum=-2690.9863
factor: 12.975


In contrast, on my machine, Riven's original algorithm+benchmark was 8.5x faster out-of-the-box, as opposed to the claimed 13.8x
8  Game Development / Shared Code / Re: 13.8x faster atan2 (updated) on: 2012-04-26 11:12:57
I improved it. Now it uses O(N) memory instead of O(N²), and because it eliminates unnecessary assignments, multiplications, additions, and two-dimensional addressing, it's a lot faster. I increased the table size for more precision.

The theory: (the quadrant diagram that was here is missing from this archive; the idea is one precomputed octant table mirrored into the other seven, as the code below shows)
I added the option to return results not only from -Pi to Pi, but in any range you want.

public class FastMath
{
    public static void main(String[] args)
    {
        float min = -100;
        float max = +100;
        float step = 0.12f;

        for (int i = 0; i < 8; i++)
        {
            long t0A = System.nanoTime() / 1000000L;
            float sumA = 0.0f;
            for (float y = min; y < max; y += step)
                for (float x = min; x < max; x += step)
                    sumA += atan2(y, x);
            long t1A = System.nanoTime() / 1000000L;

            long t0B = System.nanoTime() / 1000000L;
            float sumB = 0.0f;
            for (float y = min; y < max; y += step)
                for (float x = min; x < max; x += step)
                    sumB += Math.atan2(y, x);
            long t1B = System.nanoTime() / 1000000L;

            System.out.println();
            System.out.println("FastMath: " + (t1A - t0A) + "ms, sum=" + sumA);
            System.out.println("JavaMath: " + (t1B - t0B) + "ms, sum=" + sumB);
            System.out.println("factor: " + (float)(t1B - t0B) / (t1A - t0A));
        }
    }

    private static final int   SIZE    = 1024;
    private static final float STRETCH = (float)Math.PI;
    // Output will swing from -STRETCH to STRETCH (default: Math.PI)
    // Useful to change to 1 if you would normally do "atan2(y, x) / Math.PI"

    // EZIS is -SIZE ("SIZE" reversed); it negates the ratio in the mixed-sign octants
    private static final int EZIS = -SIZE;
    private static final float[] ATAN2_TABLE_PPY = new float[SIZE + 1];
    private static final float[] ATAN2_TABLE_PPX = new float[SIZE + 1];
    private static final float[] ATAN2_TABLE_PNY = new float[SIZE + 1];
    private static final float[] ATAN2_TABLE_PNX = new float[SIZE + 1];
    private static final float[] ATAN2_TABLE_NPY = new float[SIZE + 1];
    private static final float[] ATAN2_TABLE_NPX = new float[SIZE + 1];
    private static final float[] ATAN2_TABLE_NNY = new float[SIZE + 1];
    private static final float[] ATAN2_TABLE_NNX = new float[SIZE + 1];

    static
    {
        for (int i = 0; i <= SIZE; i++)
        {
            float f = (float)i / SIZE;
            ATAN2_TABLE_PPY[i] = (float)(StrictMath.atan(f) * STRETCH / StrictMath.PI);
            ATAN2_TABLE_PPX[i] = STRETCH * 0.5f - ATAN2_TABLE_PPY[i];
            ATAN2_TABLE_PNY[i] = -ATAN2_TABLE_PPY[i];
            ATAN2_TABLE_PNX[i] = ATAN2_TABLE_PPY[i] - STRETCH * 0.5f;
            ATAN2_TABLE_NPY[i] = STRETCH - ATAN2_TABLE_PPY[i];
            ATAN2_TABLE_NPX[i] = ATAN2_TABLE_PPY[i] + STRETCH * 0.5f;
            ATAN2_TABLE_NNY[i] = ATAN2_TABLE_PPY[i] - STRETCH;
            ATAN2_TABLE_NNX[i] = -STRETCH * 0.5f - ATAN2_TABLE_PPY[i];
        }
    }

    /**
     * ATAN2
     */


    public static final float atan2(float y, float x)
    {
        if (x >= 0)
        {
            if (y >= 0)
            {
                if (x >= y)
                    return ATAN2_TABLE_PPY[(int)(SIZE * y / x + 0.5)];
                else
                    return ATAN2_TABLE_PPX[(int)(SIZE * x / y + 0.5)];
            }
            else
            {
                if (x >= -y)
                    return ATAN2_TABLE_PNY[(int)(EZIS * y / x + 0.5)];
                else
                    return ATAN2_TABLE_PNX[(int)(EZIS * x / y + 0.5)];
            }
        }
        else
        {
            if (y >= 0)
            {
                if (-x >= y)
                    return ATAN2_TABLE_NPY[(int)(EZIS * y / x + 0.5)];
                else
                    return ATAN2_TABLE_NPX[(int)(EZIS * x / y + 0.5)];
            }
            else
            {
                if (x <= y) // i.e. -x >= -y
                    return ATAN2_TABLE_NNY[(int)(SIZE * y / x + 0.5)];
                else
                    return ATAN2_TABLE_NNX[(int)(SIZE * x / y + 0.5)];
            }
        }
    }
}