Arithmetic Performance  (Read 16066 times)
Offline Roquen
« Reply #30 - Posted 2009-08-08 11:02:56 »

FYI, some easy intros:

http://www.research.scea.com/gdc2003/fast-math-functions_p1.pdf
http://www.research.scea.com/gdc2003/fast-math-functions_p2.pdf


Note:
1) It's very rare for a well designed approximation not to be much faster than a general library call.  These general calls are (almost always) designed to have well defined results for all inputs.  Also, Java only supports the common analytic functions for doubles.  So you could write a function mySin(float) such that mySin(x)==(float)Math.sin(x) for all values of 'x', which I would expect to be about 2x faster (see the sketch after these notes).  That can be improved further by throwing out exact results for one or more of the following: NaNs, denormals, and inputs outside the common or specifically required range.
2) The timing methods used here will tend to show table based methods in a more favorable light than might be the case under real usage patterns.
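
A minimal sketch of the baseline in note 1 (my illustration, not part of the original post): the delegating version below is the correctness oracle that a hand-rolled float sine must match, not the fast implementation itself.

   // The contract from note 1: a purpose-built float sine is free to be
   // ~2x faster as long as it matches the narrowed library result.
   static float mySinReference(float x)
   {
      return (float) Math.sin(x);
   }

   // Hypothetical check against the oracle (bit compare also handles NaN).
   static boolean matchesOracle(float x, float candidate)
   {
      return Float.floatToIntBits(candidate) == Float.floatToIntBits(mySinReference(x));
   }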
Offline DzzD
« Reply #31 - Posted 2009-08-08 13:17:49 »

You seem to have swapped the implementations Smiley

wow, really sorry man, good that you saw it Smiley



Quote
Your benchmark was fundamentally flawed. It allowed the JVM to completely remove the Math.cos() call.

not mine, I just copy/pasted the one posted in this thread

good catch, as it was a big surprise to see Math.cos be so fast... and the results now seem to be what we would all expect them to be


EDIT :

just one point: you are measuring the error between mine and Java's, but Java does not give the exact result either (as for 0.5 * PI, where the Taylor version is more accurate than the original Java cos), though not that far off, yes... (anyway, this Taylor cos version uses float, so it should be quite a bit less accurate)

Offline DzzD
« Reply #32 - Posted 2009-08-08 14:10:09 »

about sqrt

as for me, it is mainly used to normalize vectors; I use a lookup table of ints where lookup[n] = 2147483647/sqrt(n*k) // k is greater than 1 and allows scaling the range of the input
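
A minimal sketch of that table (sizes, names, and the 16.16 output scale are my assumptions, not DzzD's actual code):

   static final int N = 1 << 16;
   static final int K = 16;                  // scales the usable len^2 range up to N*K
   static final int[] INV_SQRT = new int[N];
   static
   {
      for (int n = 1; n < N; n++)            // INV_SQRT[n] ~= 2^31 / sqrt(n*K)
         INV_SQRT[n] = (int) (2147483647.0 / Math.sqrt((double) n * K));
   }

   // Normalize an integer vector into 16.16 fixed point (unit length = 65536).
   static void normalize(int[] v)
   {
      long len2 = (long) v[0] * v[0] + (long) v[1] * v[1] + (long) v[2] * v[2];
      if (len2 < K || len2 >= (long) N * K) return; // outside table range
      long s = INV_SQRT[(int) (len2 / K)];          // ~= 2^31 / length
      v[0] = (int) ((v[0] * s) >> 15);              // x/len scaled by 2^16
      v[1] = (int) ((v[1] * s) >> 15);
      v[2] = (int) ((v[2] * s) >> 15);
   }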

there is also an iterative algorithm using a dichotomic (bisection) approach, something like:

    static double approxSqrt(double n, double error)
    {
       double r = n / 2;     // initial guess
       double r2 = r * r;
       double d = r * 0.5;   // step size, halved every iteration
       int loop = 0;
       do
       {
          System.out.println("(" + loop++ + ") r=" + r);
          if (r2 > n)
             r -= d;
          else
             r += d;
          r2 = r * r;
          d *= 0.5;
       }
       while (Math.abs(r2 - n) > error);
       return r;
    }


output for approxSqrt(100,0.1)

Quote
(0) r=50.0
(1) r=25.0
(2) r=12.5
(3) r=6.25
(4) r=9.375
(5) r=10.9375
(6) r=10.15625
(7) r=9.765625
(8) r=9.9609375
(9) r=10.05859375
(10) r=10.009765625
(11) r=9.9853515625
result=9.99755859375
result²=99.95117783546448

EDIT:

and the same for int:

   
static int intSQRT(int input)
   {
      int r = input >> 1;   // initial guess
      int r2 = r * r;
      int d = r >> 1;       // step size, halved every iteration
      int loop = 0;
      do
      {
         System.out.println("(" + loop++ + ") r=" + r + " r*r=" + (r * r));
         r += (r2 > input) ? -d : d;
         r2 = r * r;
         d >>= 1;
      }
      while (r2 != input && d != 0);
      return r;
   }


output for intSQRT(100):

Quote
(0) r=50 r*r=2500
(1) r=25 r*r=625
(2) r=13 r*r=169
(3) r=7 r*r=49
result int =10
result²=100

this is only the base idea; the first value of r should be optimised by using a lookup table or any other way to make it smarter, for instance:
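
One hypothetical way to seed r (my sketch, which also ties into the zero-counting discussion later in the thread):

   // For input > 0: 2^(floor(log2(input))/2) is guaranteed to be <= sqrt(input)
   // and within a factor of 2 of it, which roughly halves the bisection steps.
   static int firstGuess(int input)
   {
      int log2 = 31 - Integer.numberOfLeadingZeros(input);
      return 1 << (log2 >> 1);
   }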

Offline Roquen
« Reply #33 - Posted 2009-08-08 15:13:44 »

A better sin minimax example; the maximum error is 0x1.8p-23 at x ~1.466838 (again on +/- pi/2)

    float x2, r;

    x2 = x*x;
    r  = -2.39e-08f;
    r *= x2;
    r += 2.7526e-06f;
    r *= x2;
    r -= 1.98409e-04f;
    r *= x2;
    r += 8.3333315e-03f;
    r *= x2;
    r -= 1.666666664e-01f;
    r *= x2;
    r += 1.f;
    r *= x;    // r now approximates sin(x) on [-pi/2, pi/2]
Offline DzzD
« Reply #34 - Posted 2009-08-08 15:19:09 »

Quote
A better sin minimax example; the maximum error is 0x1.8p-23 at x ~1.466838 (again on +/- pi/2)
(code snipped; see above)



OK, but then the original Java cos error is greater, no? Also, the version I usually use is based on double, not float (which on the computer I tested ran faster than float)

EDIT :

System.out.println("Math.cos(Math.PI*0.5)=" + Math.cos(Math.PI*0.5));

outputs this:
Math.cos(Math.PI*0.5)=6.123233995736766E-17


and should be 0.0

Offline Roquen
« Reply #35 - Posted 2009-08-08 16:14:51 »

The "error" you're seeing is an argument reduction issue: Math.PI/2 is not pi/2 (I agree your version makes more sense).  Example, where the two printed results straddle zero because the exact pi/2 lies between two adjacent doubles:

double d = Math.PI*0.5;
System.out.printf("%s\n", Double.toHexString(Math.cos(d)));
System.out.printf("%s\n", Double.toHexString(Math.cos(Math.nextUp(d))));

Most hardware will execute floats faster than doubles (notable counter-example is intel-a-likes using x87 instead of SSE).

Note, I'm only attempting to compare truncated power series vs. other polynomial approximation methods.  Truncated power series ignore finite precision and are centered on a single point, while most numerical-approximation methods take finite precision into account and target a range.  I'm ignoring stuff like argument reduction and dealing with any special values (NaN, +/-Infinity, denormals, -zero)
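
To make the contrast concrete, a hedged sketch (mine, not from the thread) of the truncated power series in the same Horner form as the minimax snippet above; the coefficients are the exact Taylor terms 1/(2k+1)!, where the minimax version instead nudges each coefficient to spread the error across +/- pi/2:

   static float taylorSin(float x)
   {
      float x2 = x * x;
      float r  = -2.5052108e-08f;   // -1/11!
      r = r * x2 + 2.7557319e-06f;  //  1/9!
      r = r * x2 - 1.9841270e-04f;  // -1/7!
      r = r * x2 + 8.3333333e-03f;  //  1/5!
      r = r * x2 - 1.6666667e-01f;  // -1/3!
      r = r * x2 + 1.0f;
      return r * x;                 // centered on 0; error grows toward +/- pi/2
   }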

Stuck a sqrt(x) and 1/sqrt(x) here: http://www.java-gaming.org/topics/sqrt-x-and-1-sqrt-x/20997/view.html
Offline DzzD
« Reply #36 - Posted 2009-08-08 16:56:36 »

Most hardware will execute floats faster than doubles (notable counter-example is intel-a-likes using x87 instead of SSE).
sorry but I don't agree; IMHO double is probably faster, or at least the same speed (and for sure will become faster)

you can check a nice discussion on gamedev about that (back in 2005) : http://www.gamedev.net/community/forums/topic.asp?topic_id=343479

according to this discussion, for example, double is native on the Xbox 360




Offline Roquen
« Reply #37 - Posted 2009-08-08 17:24:02 »

sorry but I don't agree; IMHO double is probably faster, or at least the same speed (and for sure will become faster)

Most of these comments are ill informed.  For example, an SSE machine can perform 4 float ops faster than the 2 double ops that will fit in the same registers.  Example: dividing 4x4 floats has a throughput of 36 cycles and 2x2 doubles is 62 (for CPUID 69), so 9 cycles/divide for floats and 31 cycles/divide for doubles.  I know of no hardware on which doubles are faster than floats and don't expect to see it in my lifetime (at least for consumer hardware).
Offline DzzD
« Reply #38 - Posted 2009-08-08 19:00:36 »

I played with Riven's lookup version and tried a little improvement in accuracy by adding linear interpolation on the results; it is twice as slow but a little more accurate

I just replaced:
 public static final float cos(float rad)
   {
      return cos[(int) (rad * radToIndex) & SIN_MASK];
   }


by:

 
public static final float cos(float rad)
   {
      int idx = (int) (rad * radToIndex * 256f); // 8 extra sub-sample bits

      float r = (idx & 0xFF) * 0.00390625f;      // fractional position (1/256)
      idx >>= 8;
      float c1 = cos[idx & SIN_MASK];
      float c2 = cos[(idx + 1) & SIN_MASK];
      return c1 * (1f - r) + c2 * r;             // linear interpolation
   }


and re-ran the benchmark code

it gives the following results:

before:
time between 290 and 300 us
Riven: -0.78881806, avg.error=6.071037300951922E-4, max.error=0.0022978310394030435

after adding linear interpolation:
time between 550 and 560 us
Riven: 0.9108251, avg.error=4.901244049417835E-4, max.error=7.731781248493941E-4


EDIT: maybe not really useful Smiley but it makes it act as a continuous function rather than a sampled one

Offline Riven
« Reply #39 - Posted 2009-08-08 19:15:53 »

Most of these comments are ill informed.  For example, an SSE machine can perform 4 float ops faster than the 2 double ops that will fit in the same registers.  Example: dividing 4x4 floats has a throughput of 36 cycles and 2x2 doubles is 62 (for CPUID 69), so 9 cycles/divide for floats and 31 cycles/divide for doubles.  I know of no hardware on which doubles are faster than floats and don't expect to see it in my lifetime (at least for consumer hardware).


Right... forgetting the fact that Java has no SIMD? Besides that, the statement 'has a throughput of 36 cycles' makes no sense, at all.

Benchmark to show that float / double performance is nearly identical:
 public static void main(String[] args)
   {
      int elems = 1024;
      int runs = 16;

      Random r = new Random();

      for (int k = 0; k < 8; k++)
      {
         float[] f1 = new float[elems];
         float[] f2 = new float[elems];
         float[] f4x4 = new float[16];
         for (int i = 0; i < 16; i++)
            f4x4[i] = r.nextFloat();
         long float_ms = benchFloat(f1, f2, f4x4, elems, runs);
         System.out.println("float performance: " + float_ms / 1000000L + "ms (midpoint)");

         double[] d1 = new double[elems];
         double[] d2 = new double[elems];
         double[] d4x4 = new double[16];
         for (int i = 0; i < 16; i++)
            d4x4[i] = r.nextDouble();
         long double_ms = benchDouble(d1, d2, d4x4, elems, runs);
         System.out.println("double performance: " + double_ms / 1000000L + "ms (midpoint)");
      }
   }

   public static long benchFloat(float[] f1, float[] f2, float[] mat, int elems, int runs)
   {
      long[] ts = new long[runs];
      for (int i = 0; i < ts.length; i++)
      {
         long a = System.nanoTime();
         for (float t1 = 0.0f; t1 < 1.0f; t1 += 0.01f)
         {
            for (float t2 = 0.0f; t2 < 1.0f; t2 += 0.02f)
            {
               fiddleFloat(t1, t2, elems, f1, f2, mat);
            }
         }
         long b = System.nanoTime();
         ts[i] = b - a;
      }
      Arrays.sort(ts);
      return ts[ts.length / 2];
   }

   public static long benchDouble(double[] d1, double[] d2, double[] mat, int elems, int runs)
   {
      long[] ts = new long[runs];
      for (int i = 0; i < ts.length; i++)
      {
         long a = System.nanoTime();
         for (double t1 = 0.0; t1 < 1.0; t1 += 0.01f)
         {
            for (double t2 = 0.0; t2 < 1.0; t2 += 0.02f)
            {
               fiddleDouble(t1, t2, elems, d1, d2, mat);
            }
         }
         long b = System.nanoTime();
         ts[i] = b - a;
      }
      Arrays.sort(ts);
      return ts[ts.length / 2];
   }

   public static float fiddleFloat(float t1, float t2, int elems, float[] op1, float[] op2, float[] m4x4)
   {
      float sum = 0.0f;
      for (int i = 0; i < elems; i++)
      {
         float f1 = op1[i];
         float f2 = op2[i];
         float diff1 = f2 - f1;
         float f3 = t1 * diff1 + f1;
         float diff2 = f3 - f2;

         sum += t2 * diff2 + f2;
         sum -= m4x4[0] * f1 + m4x4[1] * f2 + m4x4[2] * f3 + m4x4[3];
         sum += m4x4[4] * f1 + m4x4[5] * f2 + m4x4[6] * f3 + m4x4[7];
         sum -= m4x4[8] * f1 + m4x4[9] * f2 + m4x4[10] * f3 + m4x4[11];
      }
      return sum;
   }

   public static double fiddleDouble(double t1, double t2, int elems, double[] op1, double[] op2, double[] m4x4)
   {
      double sum = 0.0;
      for (int i = 0; i < elems; i++)
      {
         double f1 = op1[i];
         double f2 = op2[i];
         double diff1 = f2 - f1;
         double f3 = t1 * diff1 + f1;
         double diff2 = f3 - f2;

         sum += t2 * diff2 + f2;
         sum -= m4x4[0] * f1 + m4x4[1] * f2 + m4x4[2] * f3 + m4x4[3];
         sum += m4x4[4] * f1 + m4x4[5] * f2 + m4x4[6] * f3 + m4x4[7];
         sum -= m4x4[8] * f1 + m4x4[9] * f2 + m4x4[10] * f3 + m4x4[11];
      }
      return sum;
   }


float performance: 51ms (midpoint)
double performance: 52ms (midpoint)
float performance: 75ms (midpoint)
double performance: 52ms (midpoint)
float performance: 50ms (midpoint)
double performance: 51ms (midpoint)
float performance: 50ms (midpoint)
double performance: 52ms (midpoint)
float performance: 50ms (midpoint)
double performance: 50ms (midpoint)
float performance: 50ms (midpoint)
double performance: 51ms (midpoint)
float performance: 50ms (midpoint)
double performance: 52ms (midpoint)
float performance: 50ms (midpoint)
double performance: 72ms (midpoint)

Offline Roquen
« Reply #40 - Posted 2009-08-09 13:53:50 »

There's a wide misconception that doubles are somehow 'better' than floats, and this is simply not the case.  They simply have more precision, and you pay for that in terms of memory usage and (usually) execution time (some ops will have the same execution time for both).  It's true that for some usages the extra precision is required, but the same is likewise true of doubles (sometimes even they are not precise enough).  I find this somewhat strange, because you don't find people thinking that 64-bit integers are somehow 'better' than lower bit-width integers; they will happily use the one which is most appropriate for the given usage.

Quote from: Riven
Right... forgetting the fact that Java has no SIMD?

Not at all, they are simply not exposed at the high level.  It's the compiler's job to use these instructions (the runtime's, in the case of a VM).  I'd expect vectorization support to improve, as runtimes like Mono have been working on this front.  It'll probably require a new compiler framework, as Sun's seems to be showing its age (I haven't paid attention to what the next gen is up to).  Both vmkit and Shark/Zero are based on LLVM, so they have "promise" in this respect (although LLVM needs work on auto-vectorization as well).  If we had a VM that performed a reasonable amount of vectorization, there would be a huge speed difference between floats and doubles on SIMD hardware.  But ignoring SIMDifying, scalar float operations are generally faster than double (and never slower) on all CPU architectures that I'm familiar with.  Couple this with the ever increasing speed gap between main memory and the CPU, and moving data becomes more and more of an issue.

Quote from: Riven
Besides that, the statement 'has a throughput of 36 cycles' makes no sense, at all.

Not sure what doesn't make sense.  These are measures of the time required for the execution unit to complete the computation.

Quote from: Riven
Benchmark to show that float / double performance is nearly identical:

Your example appears to be memory bound.  As a simple counter-example: the speed difference between float and double multiply is much narrower than divide, but if you take my last minimax sin example and simply change from float to double, you're likely to see a speed difference of 1.2-2.0x depending on your hardware (and assuming the Sun VM).

The question of the performance of floats vs. doubles is really a hardware question (and the indicated thread was about hardware), and my original statement was to that effect.  My follow-up was in reply to "IMHO double is probably faster or at least same speed (and for sure will become faster)", which I claim to be incorrect.  Is it possible to have two otherwise identical routines where the double version is faster than the float one? Yes, but that will be in the case where stalls of the double version are hidden by some other stall (probably a memory read or write) that the float version doesn't have.  This will be a rare occurrence and tightly coupled to an exact hardware configuration.  This is more of a desktop issue, for CPUs with many functional units.

I'd be happy to be proven wrong with a counter-example or a pointer to a CPU specification! Wink
Offline Riven
« Reply #41 - Posted 2009-08-09 14:07:55 »

@SIMD
Not at all, they are simply not exposed at the high level.
FYI, the Sun JVM doesn't have any SIMD code. Maybe in System.arraycopy() but that's it.


Not sure what doesn't make sense.
Because 'throughput' cannot be expressed in 'cycles'. Throughput is the inverse of the number of cycles it takes to perform some operation.

Your example appears to be memory bound.
Appears. Yeah. Vague handwaving is a perfectly solid foundation of an argument. To make all data fit in cache, reduce the 'elems' variable and test again...  Roll Eyes I hope I'm not asking too much.

Further, more than 75% of the operations are on [local variables + a 4x4 matrix], if that is going to be memory bound...



Running with 256 'elems' (2x float[256] = 2K RAM and 2x double[256] = 4K RAM)
float performance: 12ms (midpoint)
double performance: 13ms (midpoint)
float performance: 12ms (midpoint)
double performance: 13ms (midpoint)
float performance: 19ms (midpoint)
double performance: 19ms (midpoint)
float performance: 12ms (midpoint)
double performance: 12ms (midpoint)
float performance: 12ms (midpoint)
double performance: 12ms (midpoint)
float performance: 13ms (midpoint)
double performance: 19ms (midpoint)
float performance: 19ms (midpoint)
double performance: 19ms (midpoint)
float performance: 19ms (midpoint)
double performance: 19ms (midpoint)

Offline DzzD
« Reply #42 - Posted 2009-08-09 14:18:05 »

Quote
I'd be happy to be proven wrong with a counter-example or pointer to CPU specification!


I will try to when I get some time :p

the transition from float to double is just the same story as the old transition (before the FPU was born) of the CPU word from 16-bit to 32-bit, and the present/near future is the transition from 32-bit to 64-bit.

15-20 years ago accessing a 16-bit word was faster, and today it has become slower, as all architectures have now reached, and been designed around, 32-bit

Offline Roquen
« Reply #43 - Posted 2009-08-09 18:16:36 »

@Riven: Showing code that runs faster in float vs. double is easy.

On: Early P4, Last P4, Core Duo (2 core), Core Duo (4 core) - Sun JRE 1.6.0_13-b03:

  the sin core is between 1.5-1.8 times faster in float.
  the ortho polynomial is between 1.2-1.5 times faster.

public class Foo
{
  public static float approxF(float x)
  {
    float x2, r;
   
    x2 = x*x;
    r  = -2.39e-08f;
    r *= x2;
    r += 2.7526e-06f;
    r *= x2;
    r -= 1.98409e-04f;
    r *= x2;
    r += 8.3333315e-03f;
    r *= x2;
    r -= 1.666666664e-01f;
    r *= x2;
    r += 1.f;
    r *= x;

    return r;
  }

  public static double approxD(double x)
  {
    double x2, r;
   
    x2 = x*x;
    r  = -2.39e-08f;
    r *= x2;
    r += 2.7526e-06f;
    r *= x2;
    r -= 1.98409e-04f;
    r *= x2;
    r += 8.3333315e-03f;
    r *= x2;
    r -= 1.666666664e-01f;
    r *= x2;
    r += 1.f;
    r *= x;

    return r;
  }

  public static double evalD(int n, double x)
  {
    double r = 1;
    double a = 1;
    double b = x;
    int    i;
   
    if (n >= 2) {
      for (i = 2; i <= n; i++) {
        r = ((2 * i - 1) * x * b - (i - 1) * a) / i;
        a = b;
        b = r;
      }
      return r;
    }
   
    if (n == 0)
      return a;
   
    return b;
  }

  public static float evalF(int n, float x)
  {
    float r = 1;
    float a = 1;
    float b = x;
    int   i;
   
    if (n >= 2) {
      for (i = 2; i <= n; i++) {
        r = ((2 * i - 1) * x * b - (i - 1) * a) / i;
        a = b;
        b = r;
      }
      return r;
    }
   
    if (n == 0)
      return a;
   
    return b;
  }

  public static void main(String[] args)
  {
    double x = 1.5707;
    int    i;

    try {
      if (args.length != 0)
        x = Double.parseDouble(args[0]);
    }
    catch(Exception e) {}

    float  f = approxF((float)x);
    double d = approxD(x);
    int    e = 100000;

    long t0 = System.nanoTime();

    for(i=0; i<e; i++)
      d = approxD(d);

    long t1 = System.nanoTime();
   
    for(i=0; i<e; i++)
      f = approxF(f);

    long t2 = System.nanoTime();

    t0 = t1-t0;
    t1 = t2-t1;
   
    System.out.printf("double = %d (x %f)\n", t0, (1.f*t0)/t1);
    System.out.printf("float  = %d\n", t1);

    e = 20000;
    d = evalD(10, d);
    f = evalF(10, f);

    t0 = System.nanoTime();

    for(i=10; i<e; i++)
      d += evalD(i, .332);
   
    t1 = System.nanoTime();
   
    for(i=10; i<e; i++)
      f += evalF(i, .332f);
   
    t2 = System.nanoTime();
   
    t0 = t1-t0;
    t1 = t2-t1;

    System.out.printf("double = %d (x %f)\n", t0, 1.0*t0/t1);
    System.out.printf("float  = %d\n", t1);

    System.out.printf("%f\n", f);
    System.out.printf("%f\n", d);
  }
}


@DzzD: I started programming in the 8-bit days, but that isn't my problem. I use doubles and multi-precision elements all the time...when needed. 

There are three important trends that form my opinion that doubles are unlikely to outperform floats on consumer hardware in my lifetime.  The first two have already been mentioned: SIMD and the speed gap with main memory.  The other is die size reduction.  Making the channels narrower increases energy consumption (and therefore heat).  Thus it is becoming more and more important to shut down subsystems which are not in use, to reduce heat (and battery drain in the case of notebooks).  Computation with doubles requires wider data paths and more stages (to oversimplify slightly).

In real life I do a fair amount of low level and optimization work, so I read a fair number of CPU design specs and semi-keep up with hardware design research, and I'm seeing nothing to contradict this opinion.  However, I'll admit that I never expected to see anything like the new decimal formats from IEEE 754-2008 either.
Offline Riven
« Reply #44 - Posted 2009-08-09 18:40:24 »

@Riven: Showing code that runs faster in float vs. double is easy.

On: Early P4, Last P4, Core Duo (2 core), Core Duo (4 core) - Sun JRE 1.6.0_13-b03:

  the sin core is between 1.5-1.8 times faster in float.
  the ortho polynomial is between 1.2-1.5 times faster.

  public static double approxD(double x)
  {
    double x2, r;
   
    x2 = x*x;
    r  = -2.39e-08f;
    r *= x2;
    r += 2.7526e-06f;
    r *= x2;
    r -= 1.98409e-04f;
    r *= x2;
    r += 8.3333315e-03f;
    r *= x2;
    r -= 1.666666664e-01f;
    r *= x2;
    r += 1.f;
    r *= x;

    return r;
  }



Do you notice you are mixing doubles and floats here??

Offline Roquen
« Reply #45 - Posted 2009-08-09 18:58:41 »

Oops. I'd like to say that was on purpose, but it wasn't.  It doesn't change the timing though.
Offline pjt33
« Reply #46 - Posted 2009-08-09 20:55:07 »

However I'll admit that I never expected to see anything like the new decimal formats from IEEE 745-2008 either.
Should that read IEEE 754-2008?
Offline Roquen
« Reply #47 - Posted 2009-08-09 21:52:27 »

Yep, that should be 754.  It's probably more commonly known by its working title of 754r.
Offline Nate

JGO Kernel


Medals: 145
Projects: 4
Exp: 14 years


Esoteric Software


« Reply #48 - Posted 2009-08-09 22:51:48 »

For sqrt, I tried Float.floatToRawIntBits, but it is slow on Android. I also played with DzzD's code, but as he indicated, it needs some smarts to be fast enough. Here is Riven's benchmark code, modified to show DzzD's sqrt algorithm:

   public static void testSqr (boolean log) {
      int count = 50000;

      long s, e;

      float[] numbers = new float[count];
      for (int i = 0; i < count; i++)
         numbers[i] = (float)(Math.random() * 65536) + 1;
// for (int i = 0; i < count / 2; i++)
// numbers[i] = (float)(Math.random() * -65536);

      // ensure the JVM doesn't optimize those silly calls away!!
     double[] java_dst = new double[count];
      float[] dzzd_dst = new float[count];
      float[] roquen_dst = new float[count];

      s = System.nanoTime();
      for (int i = 0; i < count; i++)
         java_dst[i] = Math.sqrt(numbers[i]);
      e = System.nanoTime();
      if (log) System.out.println("FloatMath.sqrt:  " + (e - s) / 1000 + "us");

      s = System.nanoTime();
      for (int i = 0; i < count; i++)
         roquen_dst[i] = sqrtDzzD(numbers[i]);
      e = System.nanoTime();
      if (log) System.out.println("sqrtDzzD: " + (e - s) / 1000 + "us");

      if (log) {
         double dzzdAvgErr = 0.0;
         double dzzdMaxErr = 0.0;
         for (int i = 0; i < count; i++) {
            double dzzdErr = Math.abs(Math.sqrt(numbers[i]) - sqrtDzzD(numbers[i]));

            dzzdAvgErr += dzzdErr;
            if (dzzdErr > dzzdMaxErr) dzzdMaxErr = dzzdErr;
         }
         dzzdAvgErr /= count;

         System.out.println("Input: " + numbers[3]);
         System.out.println("DzzD: " + sqrtDzzD(numbers[3]) + ", avg.error=" + dzzdAvgErr + ", max.error=" + dzzdMaxErr);
         System.out.println("FloatMath: " + Math.sqrt(numbers[3]));
         System.out.println("~~~prevent opt. ~~~" + dzzd_dst[13] + "~~~" + java_dst[13] + "~~~" + roquen_dst[13]);
         System.out.println();
      }
   }

   static public float sqrtDzzD (float n) {
      if (n == 0) return 0;
      float r = n * 0.5f;
      float r2 = r * r;
      float d = r * 0.5f;
      float error;
      do {
         r += r2 > n ? -d : d;
         r2 = r * r;
         d *= 0.5f;
         if (d == 0) return r;
         error = r2 - n;
      } while ((error < 0 ? -error : error) > 1);
      return r;
   }

Offline Roquen
« Reply #49 - Posted 2009-08-10 09:05:10 »

On sqrt I have a couple of questions, and maybe I can come up with a different initial guess for an N-R based method.

1) How many N-R steps can you perform before they equal the speed of the default sqrt? And 1/sqrt? (See the sketch below.)
2) If they exist, Math.getExponent and Math.scalb could be used instead of bit inspection, but I'd expect them to be slow as well (worth checking).
3) Speed of conversion: fp to int and int to fp.
4) Speed of either leading or trailing zero count of 32-bit ints.
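
To ground question 1, a sketch of the textbook Newton-Raphson step for y = 1/sqrt(x) (the same iteration appears in the isqrt code later in the thread):

   // One N-R step for r ~= 1/sqrt(x): r' = r * (1.5 - 0.5*x*r*r).
   // Each step roughly doubles the number of correct bits, so the question
   // is how many such steps fit in the cost of one Math.sqrt call.
   static float nrStep(float x, float r)
   {
      return r * (1.5f - 0.5f * x * r * r);
   }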
Offline pjt33
« Reply #50 - Posted 2009-08-10 09:41:53 »

Oops. I'd like to say that was on purpose, but it wasn't.  It doesn't change the timing though.
Doesn't change the bytecode - javac will create double entries in the constant pool even though the constants themselves were floats, because they're extended before they're used.

Trailing zero count is pretty fast. Code is adapted from http://graphics.stanford.edu/~seander/bithacks.html
private static final int[] trailLut = new int[] {
    0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
    31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
  };

// (x & -x) isolates the lowest set bit; multiplying by the de Bruijn
// constant 0x077CB531 puts a unique 5-bit index in the top bits.
public static int countTrailingBits(int x) {
  return trailLut[((x & -x) * 0x77cb531) >>> 27];
}


Finding the integer log base 2 of an integer is also covered in the page linked, although it's a bit slower.
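
A hedged aside: since Java 5 the JDK also exposes both counts directly, which the thread returns to below.

   // JDK 1.5+ equivalents of the bit hacks above; on hardware with a
   // count-leading-zeros instruction the VM may map these to it.
   static int trailingZeros(int x) { return Integer.numberOfTrailingZeros(x); }
   static int log2(int x)          { return 31 - Integer.numberOfLeadingZeros(x); } // x > 0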
Offline Roquen
« Reply #51 - Posted 2009-08-10 09:57:14 »

I finally got 'unlazy' and did a quick web search on Android devices.  All that I saw are based on the ARM11.  So leading zero counting is a hardware instruction and should be fast (unless someone forgot to hook it in).

I didn't get any idea about float support.  Not being able to find tech specs is a big pet peeve of mine.  The ARM11 does not have an FPU, but ARM provides various FPUs as coprocessors.

I'll try to throw together a zero-counting-based-guess version.  (On the log-2, I guess you see where I'm going with the zero counting.)
Offline pjt33
« Reply #52 - Posted 2009-08-10 12:27:13 »

So leading zero counting is a hardware instruction and should be fast (unless someone forgot to hook it in). [...] I'll try to throw together a zero-counting-based-guess version.  (On the log-2, I guess you see where I'm going with the zero counting.)
When you say "forgot to hook it in" - to what? There isn't a corresponding bytecode, so are you referring to a hypothetical android.util.IntMath? I don't see why it should be there - the spec surely isn't designed around a particular CPU?

Re log 2 - yes, it was pretty obvious Smiley  I implemented a fixed-point sqrt once, but I can't remember what I did about the initial guess; just that it involved a lookup table.
Offline Roquen
« Reply #53 - Posted 2009-08-10 12:54:55 »

When you say "forgot to hook it in" - to what? There isn't a corresponding bytecode, so are you referring to a hypothetical android.util.IntMath? I don't see why it should be there - the spec surely isn't designed around a particular CPU?

That Integer.numberOfLeadingZeros is replaced at link time by a native method, rather than executing the bytecode.  I haven't looked into the guts of Dalvik, but I'd expect it to do this.
Offline Hansdampf

Senior Member


Projects: 3


too offending?


« Reply #54 - Posted 2009-08-10 13:04:59 »

for those who need a faster integer sqrt:
http://atoms.alife.co.uk/sqrt/SquareRoot.java
I did not test it under Android. With jre1.6 the performance factor is 2.4 (5.3 with -server)

Offline pjt33
« Reply #55 - Posted 2009-08-10 13:32:05 »

That Integer.numberOfLeadingZeros is replaced at link time by a native method, rather than executing the bytecode.  I haven't looked into the guts of Dalvik, but I'd expect it to do this.
Didn't realise it was in Integer. Ooh. I'll have to look and see what else they added in 1.5.
Offline Roquen
« Reply #56 - Posted 2009-08-12 15:59:29 »


This is a quick hack to test a workaround for slow floatToRawIntBits on Android.  As I don't have hardware I can't do timing tests myself.  If someone with hardware is willing to test please try timing "getExponent" against "timeMe".

It has the following methods: getExponent, getSignificand, scalb, and a first-pass isqrt.  These were thrown together, so they probably have bugs, and isqrt is just a quick pass at making the reference version (included) not completely dog-slow.


  /**
   * Float value x = f 2<sup>k</sup>, where 1 ≤ f < 2 and k is an integer.
   * <p>
   * Returns k.
   * <p>
   * Differs from JDK version in that the exponents of denormals
   * are returned as if representable.
   * @see #getSignificand(float)
   */


  public static int getExponent(float x)
  {
    int bs = 31;

    if (x < 0)
      x = -x;
   
    if (x < Float.POSITIVE_INFINITY) {
      if (x >= 1) {
        while (x > 0x1p30) {
          x  *= 0x1p-30;
          bs += 30;
        }
      }
      else {
        if (x == 0) return 0;

        do {
          x  *= 0x1p30;
          bs -= 30;
        } while(x < 1);
      }
     
      return bs-Integer.numberOfLeadingZeros((int)x);
    }

    // x is +/-infinity or NaN
   return 128;
  }
 


  /**
   * Float value x = f 2<sup>k</sup>, where 1 ≤ f < 2 and k is an integer.
   * <p>
   * Returns abs(f).
   * @see #getExponent(float)
   */

  public static float getSignificand(float x)
  {    
    if (x < 0) x = -x;
   
    // infinity and NaN
   
    if (x >= 1) {
      if (x >= 0x1p30) {
        if (x == Float.POSITIVE_INFINITY) return -1;
        do {
          x *= 0x1p-30;
        } while (x > 0x1p30);
      }
    }
    else {
      if (x == 0) return 0;

      do {
        x *= 0x1p30;
      } while(x < 1);
    }
   
    if (x < 2) return x;
   
    // scale to [1,2)
   int bs = Integer.numberOfLeadingZeros((int)x);
    x  *= (1<<bs) * 0x1p-31f;
   
    return x;
  }

  /**
   * Returns: x * 2<sup>n</sup>
   */

  public static float scalb(float x, int n)
  {
    if (n >= 0) {
      while(n > 30) {
        x *= 0x1p30f;
        n -= 30;
      }
      return x * (1 << n);
    }
   
    while(n < -30) {
      x *= 0x1p-30f;
      n += 30;
    }
   
    return x * (1<<(31+n)) * 0x1p-31f;
  }

  // constants for 1st order inverse sqrt guess
 private static final float a0 =  0x1.439cfep+0f;
  private static final float a1 = -0x1.253f20p-2f;

  // constants for 2nd order inverse sqrt guess
 private static final float b0 =  0x1.946338p+0f;
  private static final float b1 = -0x1.7605f6p-1f;
  private static final float b2 =  0x1.2e76d0p-3f;
 
  /** sqrt(2)/2 */
  private static final float sqrt2o2 = 0x1.6a09e6p-1f;
 
  // reference version: shows what's happening in isqrt.
 public static float isqrtRef(float x)
  {
    // get the same info as Float.floatToRawIntBits
    // (but in fp instead of integer form)
   float man = getSignificand(x);
    int   exp = getExponent(x);
   
    float r;

    // make an initial guess:  1 <= man < 2
   // two examples:
 //r =  a0 + a1 * man;
   r =  b0 + man*(b1 + man*b2);
   
    // add in the 1/sqrt(2^(k/2)) term
   if (exp != 0) {
      r = scalb(r,-(exp>>1));
      if ((exp & 1) != 0) r *= sqrt2o2;
    }
   
    float hx  = x * 0.5f;

    // perform some number of N-R steps
   r = r * (1.5f - hx * r * r);
    r = r * (1.5f - hx * r * r);
    r = r * (1.5f - hx * r * r);
   
    return r;
  }
 
  // temp hack for unoptimized isqrt
 private static float scalb_temp_hack(float x, int n)
  {
    if (n >= 0)
      return x * (1 << n);
 
    return x * (1<<(31+n)) * 0x1p-31f;
  }
 
  public static float isqrt(float y)
  {
    float x  = y;
    int   bs = 31;

    // scale if needed to bring into integer range
   if (x >= 1) {
      while (x > 0x1p30) {
        x  *= 0x1p-30;
        bs += 30;
      }
    }
    else {
      if (x == 0) return 0;

      do {
        x  *= 0x1p30;
        bs -= 30;
      } while(x < 1);
    }

    // isolate exponent and significand
   int   lzc = Integer.numberOfLeadingZeros((int)x);
    int   exp = bs-lzc;
    float man = x;
    float hx = 0.5f*y;
   
    if (x >= 2)
      man  *= (1<<lzc) * 0x1p-31f;
   
    // make an initial guess
    float r =  a0 + a1 * man;
   //float r =  b0 + man*(b1 + man*b2);

    // add in the 1/sqrt(2^(k/2)) term
   if (exp != 0) {
      r = scalb_temp_hack(r, -(exp >> 1));
      if ((exp & 1) != 0) r *= sqrt2o2;
    }

    // perform some number of N-R steps
   r = r * (1.5f - hx * r * r);
    //r = r * (1.5f - hx * r * r);
   //r = r * (1.5f - hx * r * r);
   
    return r;
  }
   
  @SuppressWarnings("boxing")
  public static float timeSqrt()
  {
    float x,r0=0;
    x = 65536;

    long t0,t1;

    // test some large values
   x  = 0x1.133321p122f;
    t0 = System.nanoTime();
   
    do {
    //r0 += (1.f / (float) Math.sqrt(x));
     r0 += isqrt(x);
      x   = Math.nextUp(x);
    } while (x <= 0x1.133321p122f * 2);

    t1 = System.nanoTime();
   
    System.out.printf("time (big)   = %f (%f)\n", (t1-t0)*(1/1000000.0), r0);
   
    // test some small values
   x  = 0x1.133321p-122f;
    t0 = System.nanoTime();
   
    do {
    //r0 += (1.f / (float) Math.sqrt(x));
     r0 += isqrt(x);
      x   = Math.nextUp(x);
    } while (x <= 0x1.133321p-122f * 2);

    t1 = System.nanoTime();
   
    System.out.printf("time (small) = %f (%f)\n", (t1-t0)*(1/1000000.0), r0);
   
    return r0;
  }

  public static float timeMe(float f)
  {
    return Float.intBitsToFloat(Float.floatToRawIntBits(f) + 1);
  }
Offline Nate

JGO Kernel


Medals: 145
Projects: 4
Exp: 14 years


Esoteric Software


« Reply #57 - Posted 2009-08-15 06:46:47 »

It took me a bit to figure out the 1.6 Math.nextUp call was causing Android to throw VerifyError!

Here are the milliseconds I got on the G1 with 10000 iterations (passing the loop index i to the method):
Quote
timeMe: 40.924072
getExponent: 60.88257

timeMe: 40.46631
getExponent: 57.250977

timeMe: 42.297363
getExponent: 58.135983

Maybe FloatMath.sqrt is the fastest we are going to get? I don't have any proof it is a bottleneck.

Quote
FloatMath.sqrt: 49.102783
isqrt: 134.52148

FloatMath.sqrt: 45.98999
isqrt: 139.4043

What do you guys think of these methods? I forgot where I scrounged them up.

   /**
    * Fixed point multiply (16.16).
    */

   static public int multiply (int x, int y) {
      long z = (long)x * (long)y; // 32.32 intermediate
      return ((int)(z >> 16));    // back to 16.16
   }

   /**
    * Fixed point divide (16.16).
    */

   static public int divide (int x, int y) {
      return (int)(((((long)x) << 32) / y) >> 16);
   }
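
For context, a usage sketch of the 16.16 convention these methods assume (values scaled by 2^16; the literals are mine):

   int a = (int) (1.5f * 65536); // 1.5  in 16.16
   int b = (int) (2.0f * 65536); // 2.0  in 16.16
   int p = multiply(a, b);       // 196608 == 3.0  in 16.16
   int q = divide(a, b);         // 49152  == 0.75 in 16.16
   float back = p / 65536f;      // 3.0f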


Offline pjt33
« Reply #58 - Posted 2009-08-15 12:41:32 »

   /**
    * Fixed point divide.
    */

   static public int divide (int x, int y) {
      return (int)(((((long)x) << 32) / y) >> 16);
   }

Not sure why you wouldn't just do:
   /**
    * Fixed point divide.
    */

   static public int divide (int x, int y) {
      return (int)((((long)x) << 16) / y);
   }


Not that I tend to favour 16.16 fixed-point anyway. 24.8 is usually good enough, and most of the time it allows you to fit intermediate results in 32 bits; see the sketch below.
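
A sketch of the 24.8 variant (my code, under the stated assumption of modest magnitudes):

   // 24.8 multiply: the raw product fits an int whenever the real-valued
   // product is below 2^15 (~32768), so no long is needed.
   static int mul24_8(int x, int y) { return (x * y) >> 8; }

   // 24.8 divide: the pre-shift overflows only once |x| reaches 2^23,
   // i.e. a real value of 2^15.
   static int div24_8(int x, int y) { return (x << 8) / y; }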
Offline Nate

JGO Kernel


Medals: 145
Projects: 4
Exp: 14 years


Esoteric Software


« Reply #59 - Posted 2009-08-15 21:39:41 »

Thanks pjt33. Your method seems to work just as well and is simpler. Smiley

In general, I haven't found using fixed point to be any faster than just using floats, even though the G1 doesn't have an FPU. I only use fixed point with OpenGL ES, which is 16.16. I found that any real number crunching is going to have to be in native code. I didn't try using fixed point in native code, as it doesn't seem to be a bottleneck.
