Raghar
Junior Member  
Ue ni taete 'ru hitomi ni kono mi wa dou utsuru
|
 |
«
Posted
2005-02-28 21:53:02 » |
|
I tried implementing frame scaling with floats, just to see how it would perform. It was funny: I disabled it after a few tenths of a second and tried it with doubles instead. That was still 2x slower than packed signed bytes in an int, but it was considerably faster than float. I looked at some of my old benchmarks, and wow, float arithmetic is slower than double. It doesn't matter to me, since doubles are much more precise than floats and more useful for calculations, but I remember something from the XITH3D manual: "We are using floats, because they are faster..."
So I wonder: is there some reason for this float performance, or is it just a leftover in 5.0?
|
|
|
|
|
digitprop
|
 |
«
Reply #1 - Posted
2005-03-01 07:16:21 » |
|
Most CPUs these days support double operation in silico, but not float operations. Therefore, for float operations floats will be converted to doubles internally, then reconverted to floats which takes some time (not much, though).
The only factor in favour of floats these days is the lower memory consumption for applications with large floating point arrays.
Also, with a platform-independent system like Java, I wouldn't bet on any implementation-specific performance paradigms such as 'always use floats' (or 'always use doubles', for that matter, or the 'avoid object creation at all costs' which was popular some while ago). The various VM implementations are always good for bizarre surprises performance-wise, and things that work splendidly on one platform fail abysmally on another.
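As a language-level illustration of the float-to-double-and-back conversion mentioned above, here is a minimal sketch. Whether a particular CPU or VM really widens floats internally is hardware- and implementation-specific; the class name `WideningDemo` and the sample value are just assumptions for the example. It only shows that such a round trip leaves a single value unchanged, so any cost of the conversion is time, not accuracy.

```java
final class WideningDemo {
    /** Widens a float to double and narrows it back again. */
    static float roundTrip(float f) {
        double widened = f;      // float -> double widening is exact
        return (float) widened;  // narrowing a value that started as a float restores it
    }

    public static void main(String[] args) {
        float f = 0.97543f;      // sample value, borrowed from the benchmarks in this thread
        System.out.println("unchanged after round trip: " + (roundTrip(f) == f));
    }
}
```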
|
|
|
|
Markus_Persson
|
 |
«
Reply #2 - Posted
2005-03-01 07:41:24 » |
|
"Most CPUs these days support double operation in silico, but not float operations."
.. really? Doubles are 64 bits.
|
|
|
|
princec
|
 |
«
Reply #3 - Posted
2005-03-01 09:51:04 » |
|
The main reason for using floats is bus bandwidth when you're blasting things down to a graphics card. For calculations, doubles may well be faster on current VM/CPU architectures. Dunno about the Mac, mind you. Cas
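To put a rough number on the bandwidth argument, here is a small sketch that compares the size of the same vertex data stored as floats and as doubles. The class name, the vertex count and the three-components-per-vertex layout are hypothetical, and `Float.BYTES` / `Double.BYTES` need Java 8 or later (newer than the VMs discussed in this thread).

```java
final class BufferFootprint {
    /** Bytes needed for a tightly packed array of vertex components. */
    static long bytesFor(int vertexCount, int componentsPerVertex, int bytesPerComponent) {
        return (long) vertexCount * componentsPerVertex * bytesPerComponent;
    }

    public static void main(String[] args) {
        int vertices = 100_000;  // hypothetical mesh size
        int components = 3;      // x, y, z per vertex (assumed layout)

        System.out.println("float : " + bytesFor(vertices, components, Float.BYTES) + " bytes");
        System.out.println("double: " + bytesFor(vertices, components, Double.BYTES) + " bytes");
    }
}
```

Whatever the arithmetic speed turns out to be, the double version always moves twice as many bytes.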
|
|
|
|
Markus_Persson
|
 |
«
Reply #4 - Posted
2005-03-01 10:25:44 » |
|
Interesting!
Seems like my last remaining preemptive optimisation ("Of course I'll speed up my app by using floats!") has been shot down as well.
|
|
|
|
Raghar
Junior Member  
Ue ni taete 'ru hitomi ni kono mi wa dou utsuru
|
 |
«
Reply #5 - Posted
2005-03-01 20:54:11 » |
|
Actually, Intel's CPUs use 80 bits. Their floating point registers are aliased with the MMX registers, and they use a revolver-like structure. (This doesn't mean you could use assembly and, wow, have the full 80 bits of precision; the lowest bits are just there to avoid bad rounding errors.) So conversion from and to floats shouldn't be as bad as 120/40, or is it?
This reminds me that I should look up the FP number formats in the NVIDIA FX5700. It has some internal support for them, but I forgot the precision and maximum range. Are they in a 10-6 format, or in some variation of the FP32 format?
|
|
|
|
|
Vorax
Senior Member    Projects: 1
System shutting down in 5..4..3...
|
 |
«
Reply #6 - Posted
2005-03-01 22:16:24 » |
|
Average taken over 100 loops of a test run of 1,000,000 calculations of:

    result = value * result
    result = value / result

Tested with two values (2 runs): 0.00001 and .234323 of the type being tested. No difference in time results from different values. Times are in milliseconds on an AMD2600.

Java 1.4.2
-------------
** Float Test **
Average Time: 12    # of divisions: 1000000
Average Time: 6     # of multiplications: 1000000
** Double Test **
Average Time: 13    # of divisions: 1000000
Average Time: 6     # of multiplications: 1000000

Java 1.5
-----------
** Float Test **
Average Time: 11    # of divisions: 1000000
Average Time: 6     # of multiplications: 1000000
** Double Test **
Average Time: 13    # of divisions: 1000000
Average Time: 6     # of multiplications: 1000000

The results seem clear to me... Even if doubles were faster (which, at least in Java, doesn't seem to be the case: division was slower and multiplication was the same), floats would be better for 3D graphics anyway because it is half the amount of data to transfer on the bus.
In C/C++ the preference for floats over doubles comes strictly from the bus cost of transferring doubles around. A lot of the argument comes from nVidia, which is a strong proponent of floats over doubles (or even half-floats over floats). The performance point seems to be even more true in Java.
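The benchmark source was not posted, so the following is only a guess at its shape: a minimal sketch of the measurement as described (100 runs of 1,000,000 multiplications or divisions, averaged). The class and method names are hypothetical, only the float half is shown, and the double variant would be the same code with the type swapped.

```java
final class MulDivSketch {
    // Volatile sink so the JIT cannot discard the loop as dead code.
    static volatile float floatSink;

    /** Runs n multiplications of the form result = value * result; returns elapsed nanoseconds. */
    static long timeFloatMul(float value, int n) {
        float result = value;
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            result = value * result;  // for |value| < 1 this underflows to 0 after enough iterations
        }
        long elapsed = System.nanoTime() - start;
        floatSink = result;
        return elapsed;
    }

    /** Runs n divisions of the form result = value / result; returns elapsed nanoseconds. */
    static long timeFloatDiv(float value, int n) {
        float result = value;
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            result = value / result;  // this sequence simply alternates between 1 and value
        }
        long elapsed = System.nanoTime() - start;
        floatSink = result;
        return elapsed;
    }

    public static void main(String[] args) {
        final int runs = 100;
        final int n = 1_000_000;
        long mulNanos = 0;
        long divNanos = 0;
        for (int r = 0; r < runs; r++) {
            mulNanos += timeFloatMul(0.234323f, n);
            divNanos += timeFloatDiv(0.234323f, n);
        }
        System.out.println("float mul, average ms per run: " + mulNanos / runs / 1e6);
        System.out.println("float div, average ms per run: " + divNanos / runs / 1e6);
    }
}
```

Note that with these formulas the operands settle into a very narrow set of values almost immediately, which may be one reason the two test values gave identical timings.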
|
|
|
|
K.I.L.E.R
Senior Member   
Java games rock!
|
 |
«
Reply #7 - Posted
2005-03-01 23:06:55 » |
|
Sorry but I don't believe your results. Not because you are intentionally trying to give bad results, but because micro benchmarks can't be trusted when you give out little information about them.
I'm going to try my own results later on to see if I can get faster float performance vs double performance on my A64 3K+.
|
Vorax: Is there a name for a "redneck" programmer? Jeff: Unemployed. 
|
|
|
Vorax
Senior Member    Projects: 1
System shutting down in 5..4..3...
|
 |
«
Reply #8 - Posted
2005-03-01 23:25:18 » |
|
"Sorry but I don't believe your results. Not because you are intentionally trying to give bad results, but because micro benchmarks can't be trusted when you give out little information about them.
I'm going to try my own results later on to see if I can get faster float performance vs double performance on my A64 3K+."
It's not a bad microbenchmark. I did 100 million calculations in each loop before determining the final times using another run of 100 million calcs, ran the test several times, and varied the values to calculate, and had identical results... but run your own.
|
|
|
|
phazer
Junior Member  
Come get some
|
 |
«
Reply #9 - Posted
2005-03-02 11:32:44 » |
|
I'm seeing some weird floating point benchmark results on my Athlon 1.4 GHz using Java 5 -server. I'm benchmarking a very simple loop using floats, doubles and fixed point. The fixed point loop is as fast as similar C code compiled with GCC or Visual Studio, but the floating point math is running at half the speed of similar C code. That is the same speed as when "Strict" floating point is turned on in Visual Studio. Here's the loop I'm benchmarking:

    private static final float x = 0.7456f;
    private static final float y = 0.97543f;
    private static final int count = 100000;
    private static float f1;

    void runTest() {
        float a = x;
        float b = y;

        for (int i=0; i<count; i++) {
            b = a * b + a;
        }

        f1 = b;
    }

The same code using doubles is even slower. Running the benchmark in JRockit produces the same results as compiled C code = about twice as fast as Hotspot server. Why can't Hotspot optimize the Java code as well as a C compiler?
|
|
|
|
Markus_Persson
|
 |
«
Reply #10 - Posted
2005-03-02 11:40:26 » |
|
100k iterations is very few. The overhead of compiling that into native code (if it even bothers doing so; enable some profiling) will take up a large percentage of the total time spent in that loop.
Point being: it quite possibly already is compiling that loop into native code as fast as a C compiler, but your microbenchmark is broken.
|
|
|
|
K.I.L.E.R
Senior Member   
Java games rock!
|
 |
«
Reply #11 - Posted
2005-03-02 11:59:33 » |
|
Doubles have higher performance. Replace the variables used in the calculation with 'float' to bench the float performance.
Float performance = 22 seconds. Double performance = 13 seconds.
I ran the bench a few times for consistency.
Athlon64 3000+ @ 2109.34MHz, 1GB RAM PC 3600, Nforce 3 mobo.

    public class Main {
        private static final double x = 0.7456f;
        private static final double y = 0.97543f;
        private static final long count = 1000000000;
        private static double f1;

        public static void main( String[] args ) {
            Main main = new Main();
            // Wall-clock time for the whole run, reported in seconds.
            long start = System.nanoTime();
            main.runTest();
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.println(seconds);
        }

        void runTest() {
            double a = x;
            double b = y;

            for ( int i = 0; i < count; i++ ) {
                b = (a * b + a) / 2;
            }

            // Store the result so the loop is not optimized away.
            f1 = b;
        }
    }

|
Vorax: Is there a name for a "redneck" programmer? Jeff: Unemployed. 
|
|
|
EgonOlsen
|
 |
«
Reply #12 - Posted
2005-03-02 12:43:18 » |
|
"Doubles have higher performance. Replace the variables used in the calculation with 'float' to bench the float performance.
Float performance = 22 seconds. Double performance = 13 seconds.
I ran the bench a few times for consistency."
Try to start your app with -Xcompile and see what happens to your results... You are obviously measuring a difference in hotspot's handling of floats vs. doubles here, not a real performance plus of doubles.
|
|
|
|
K.I.L.E.R
Senior Member   
Java games rock!
|
 |
«
Reply #13 - Posted
2005-03-02 13:05:40 » |
|
With XCompile:
Float: 12.944601059 seconds. Double: 13.299814082 seconds.
|
Vorax: Is there a name for a "redneck" programmer? Jeff: Unemployed. 
|
|
|
phazer
Junior Member  
Come get some
|
 |
«
Reply #14 - Posted
2005-03-02 13:26:19 » |
|
"100k iterations is very few. The overhead of compiling that into native code (if it even bothers doing so; enable some profiling) will take up a large percentage of the total time spent in that loop.
Point being: it quite possibly already is compiling that loop into native code as fast as a C compiler, but your microbenchmark is broken."
It's not broken; on the contrary, I'm quite certain that my benchmark is correct. The code I posted is the loop I'm benchmarking, it's not the entire benchmark application. I do 10 s warmup and 10 s benchmarking. My tests seem to indicate that there is a flaw in the Hotspot optimizer.
|
|
|
|
Markus_Persson
|
 |
«
Reply #15 - Posted
2005-03-02 13:39:11 » |
|
s/your microbenchmark is broken/the code you posted is broken/ 
|
|
|
|
erikd
|
 |
«
Reply #16 - Posted
2005-03-02 13:41:11 » |
|
When I run a simple micro benchmark on a P4, double performance is consistently slightly slower after warm-up.
It could be that this is an Athlon-only issue though (I've seen cases before where some numeric operations run slower than on a P4 because of some P4 specific shortcuts which are impossible on an Athlon), but I have the feeling the results are misleading somehow. I have to test at home (where I have an Athlon too).
Could you post the entire benchmark?
|
|
|
|
erikd
|
 |
«
Reply #17 - Posted
2005-03-02 13:46:19 » |
|
"Why can't Hotspot optimize the Java code as well as a C compiler?"
It can. As a matter of fact, I once converted a little C/ASM fractal program to Java where the Java version ran as fast as the ASM version of the program and even faster than the C compiled one. It surprised me almost as much as the author of the original program, who wanted to show how much faster C is compared to Java.
|
|
|
|
phazer
Junior Member  
Come get some
|
 |
«
Reply #18 - Posted
2005-03-02 13:57:31 » |
|
Ok, I've put together the entire benchmark into a single class (see below). On Java 5 -server I get:

Float: 776 iterations / s
Double: 642 iterations / s
Fixed point: 1793 iterations / s
2.9308171, 2.8822784, 2.9308173801885786

With JRockit I get:

Float: 1572 iterations / s
Double: 1575 iterations / s
Fixed point: 1800 iterations / s
2.9308174, 2.8822784, 2.9308173801885786

Note the 2x speed improvement in the float and double benchmarks. I get the same scores for a similar C test compiled with GCC or Visual Studio. It's a VERY simple loop to optimize (fmul, fadd, jump in the assembler output from GCC), so it's really surprising that Hotspot can't optimize it properly. You're more than welcome to try to tweak the code to make it run fast in Hotspot.

    public class MathTest {
        private static final float x = 0.7456f;
        private static final float y = 0.97543f;
        private static final int count = 100000;
        private static float f1;
        private static float f2;
        private static double f3;

        private static void run(String text, Runnable runnable) {
            long time = System.currentTimeMillis();

            while (System.currentTimeMillis() - time < 10000) {
                runnable.run();
            }

            time = System.currentTimeMillis();
            long count = 0;

            while (System.currentTimeMillis() - time < 10000) {
                runnable.run();
                count++;
            }

            System.out.println(text + ": " + count * 1000L / (System.currentTimeMillis() - time) + " iterations / s");
        }

        public static void main(String[] args) {
            run("Float", new Runnable() {
                public void run() {
                    float a = x;
                    float b = y;

                    for (int i=0; i<count; i++) {
                        b = a * b + a;
                    }

                    f1 = b;
                }
            });
            run("Double", new Runnable() {
                public void run() {
                    double a = x;
                    double b = y;

                    for (int i=0; i<count; i++) {
                        b = a * b + a;
                    }

                    f3 = b;
                }
            });
            run("Fixed point", new Runnable() {
                public void run() {
                    int a = (int) (x * 65536.0f);
                    int b = (int) (y * 65536.0f);

                    for (int i=0; i<count; i++) {
                        b = (a >> 8) * (b >> 8) + a;
                    }

                    f2 = (float) b / 65536.0f;
                }
            });

            System.out.println(f1 + ", " + f2 + ", " + f3);
        }
    }

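The fixed point loop in the benchmark above uses a 16.16 format (scale factor 65536) and shifts each operand right by 8 bits before multiplying, which probably explains why its printed result differs slightly from the float and double ones. A minimal sketch of that arithmetic follows; the class `Fixed16` and the method names are hypothetical, and `mulExact` is an alternative added here for comparison, not something used in the thread.

```java
final class Fixed16 {
    static final int ONE = 1 << 16;  // 1.0 in 16.16 fixed point

    static int fromFloat(float f) {
        return (int) (f * 65536.0f);
    }

    static float toFloat(int fixed) {
        return fixed / 65536.0f;
    }

    /** Multiply as in the benchmark loop: drop 8 fractional bits from each operand first. */
    static int mulFast(int a, int b) {
        return (a >> 8) * (b >> 8);
    }

    /** Multiply at full precision using a 64-bit intermediate (assumed alternative). */
    static int mulExact(int a, int b) {
        return (int) (((long) a * b) >> 16);
    }

    public static void main(String[] args) {
        int a = fromFloat(0.7456f);
        int b = fromFloat(0.97543f);
        System.out.println("float   : " + (0.7456f * 0.97543f));
        System.out.println("mulFast : " + toFloat(mulFast(a, b)));
        System.out.println("mulExact: " + toFloat(mulExact(a, b)));
    }
}
```

The pre-shift keeps the product inside 32 bits at the price of the low 8 fractional bits of each operand; the 64-bit variant keeps them but costs a wider multiply.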
|
|
|
|
erikd
|
 |
«
Reply #19 - Posted
2005-03-02 16:24:09 » |
|
Your benchmark gave these results on my Athlon 2200 on the server VM:

Float: 2230 iterations / s
Double: 941 iterations / s
Fixed point: 2553 iterations / s
2.9308171, 2.8822784, 2.9308173801885786

A slightly modified version of your benchmark gave these results (after some warm-up):

Float : 6690 iterations / s
Double : 3975 iterations / s
Fixed : 27322 iterations / s
2.9308171, 2.8822784, 2.9308173801885786

So whatever is causing this difference, I don't know, but the only thing you can conclude from your benchmark is that JRockit optimizes it better (for whatever that's worth). The modified (uglified) version:

    public class MathTest2 {
        private float x = 0.7456f;
        private float y = 0.97543f;
        private int count = 100000;
        private float f1;
        private float f2;
        private double f3;

        public void runFloat() {
            long time = System.currentTimeMillis();
            int count = 0;
            while (System.currentTimeMillis() - time < 10000) {
                float a = x;
                float b = y;

                for (int i = 0; i < count; i++) {
                    b = a * b + a;
                }

                f1 = b;
                count++;
            }
            System.out.println(
                "Float : " + count * 1000L / (System.currentTimeMillis() - time) + " iterations / s");
        }

        public void runDouble() {
            long time = System.currentTimeMillis();
            int count = 0;

            while (System.currentTimeMillis() - time < 10000) {
                double a = x;
                double b = y;

                for (int i = 0; i < count; i++) {
                    b = a * b + a;
                }

                f3 = b;
                count++;
            }
            System.out.println(
                "Double : " + count * 1000L / (System.currentTimeMillis() - time) + " iterations / s");
        }

        public void runFixed() {
            long time = System.currentTimeMillis();

            while (System.currentTimeMillis() - time < 10000) {
                int a = (int) (x * 65536.0f);
                int b = (int) (y * 65536.0f);

                for (int i = 0; i < count; i++) {
                    b = (a >> 8) * (b >> 8) + a;
                }

                f2 = (float) b / 65536.0f;
                count++;
            }
            System.out.println(
                "Fixed : " + count * 1000L / (System.currentTimeMillis() - time) + " iterations / s");
        }

        public static void main(String[] args) {
            MathTest2 m = new MathTest2();

            for (int i = 0; i < 10; i++) {
                m.runFloat();
                m.runDouble();
                m.runFixed();

                System.out.println(m.f1 + ", " + m.f2 + ", " + m.f3);

                m.f1=m.f2=0;
                m.f3=0;
            }

            System.out.println("----------------");

            m.runFloat();
            m.runDouble();
            m.runFixed();

            System.out.println(m.f1 + ", " + m.f2 + ", " + m.f3);
        }
    }

|
|
|
|
Azeem Jiva
Junior Member  
Java VM Engineer, Sun Microsystems
|
 |
«
Reply #20 - Posted
2005-03-02 16:55:05 » |
|
"When I run a simple micro benchmark on a P4, double performance is consistently slightly slower after warm-up.
It could be that this is an Athlon-only issue though (I've seen cases before where some numeric operations run slower than on a P4 because of some P4 specific shortcuts which are impossible on an Athlon), but I have the feeling the results are misleading somehow. I have to test at home (where I have an Athlon too).
Could you post the entire benchmark?"
The reason double performance on Athlons (and P3s for that matter) is worse than P4s, is because we use SSE2 style registers (which are not available for Athlon/P3). Those double registers greatly speed up performance.
|
|
|
|
|
blahblahblahh
|
 |
«
Reply #21 - Posted
2005-03-02 17:31:25 » |
|
"I do 10 s warmup and 10 s benchmarking."
10 seconds? I've seen server VM take 5 minutes to really get going...
|
malloc will be first against the wall when the revolution comes...
|
|
|
NVaidya
Junior Member  
Java games rock!
|
 |
«
Reply #22 - Posted
2005-03-02 17:46:45 » |
|
"The reason double performance on Athlons (and P3s for that matter) is worse than P4s, is because we use SSE2 style registers (which are not available for Athlon/P3). Those double registers greatly speed up performance."
Need a quick clarification on this if you don't mind... The Athlon64s do have SSE2 support, don't they? My 2.0GHz socket 939 Winchester 3200+ appears to have it for sure. I've been benching a P4 1.6GHz Willamette against the Winchester 3200+, and the P4 doesn't seem to be doing that badly in comparison on particle tracking systems involving lots of double-based number crunching. Thanks
|
Gravity Sucks !
|
|
|
princec
|
 |
«
Reply #23 - Posted
2005-03-02 18:40:20 » |
|
For microbenchmarking I advise you tweak -XX:CompileThreshold=500 or so. Hotspot doesn't need to collect very much information to optimise this kind of stuff. Cas 
|
|
|
|
Azeem Jiva
Junior Member  
Java VM Engineer, Sun Microsystems
|
 |
«
Reply #24 - Posted
2005-03-02 19:25:05 » |
|
"Need a quick clarification on this if you don't mind... The Athlon64s do have SSE2 support, don't they? My 2.0GHz socket 939 Winchester 3200+ appears to have it for sure. I've been benching a P4 1.6GHz Willamette against the Winchester 3200+, and the P4 doesn't seem to be doing that badly in comparison on particle tracking systems involving lots of double-based number crunching. Thanks"
Yes, the Athlon64s have SSE2, but I'd need more info to say why you're not seeing a large performance difference. If I were to guess, I'd say the generated code might be slightly different, and that might be causing the performance anomaly.
|
|
|
|
|
NVaidya
Junior Member  
Java games rock!
|
 |
«
Reply #25 - Posted
2005-03-02 20:09:10 » |
|
OK ! so with the Athlon64s - with SSE2 support - can I take it that the JVM will (indeed) use SSE2-style registers so that the double performance of the Athlon64s will be comparable to P4s ?
In general, is the SSE2 performance of Athlon64 as good as the P4 ? And specifically, when running Java apps with the -server option ?
And for the couple of reasons mentioned earlier 1) some CPUs use extended 80-bit precision and 2) some do all the computations in doubles and reconvert to floats (the IBM RS6000 workstation, IIRC, used to do that), is it worth the trouble to stick to floats for speed benefits if memory size is not a consideration ?
|
Gravity Sucks !
|
|
|
Vorax
Senior Member    Projects: 1
System shutting down in 5..4..3...
|
 |
«
Reply #26 - Posted
2005-03-02 22:11:39 » |
|
"With XCompile:
Float: 12.944601059 seconds. Double: 13.299814082 seconds."
Believe me now?
|
|
|
|
K.I.L.E.R
Senior Member   
Java games rock!
|
 |
«
Reply #27 - Posted
2005-03-03 06:54:53 » |
|
"Believe me now?"
No difference in using doubles or floats. Since people don't use apps with -XCompile, using doubles is faster.
|
Vorax: Is there a name for a "redneck" programmer? Jeff: Unemployed. 
|
|
|
phazer
Junior Member  
Come get some
|
 |
«
Reply #28 - Posted
2005-03-03 07:05:14 » |
|
"Your benchmark gave these results on my Athlon 2200 on the server VM: Float: 2230 iterations / s Double: 941 iterations / s Fixed point: 2553 iterations / s 2.9308171, 2.8822784, 2.9308173801885786
A slightly modified version of your benchmark gave these results (after some warm-up): Float : 6690 iterations / s Double : 3975 iterations / s Fixed : 27322 iterations / s 2.9308171, 2.8822784, 2.9308173801885786
So whatever it's causing this difference, I don't know, but the only thing you can conclude from your benchmark is that JRockit optimizes it better (for whatever that's worth)."
I was a bit confused why your code gave such a big speed improvement (10x faster fixed point), then I noticed that you use the 'count' variable both for iteration count and inner loop count. This is a bug, right? Fixing the code I get the following results with Java 5 -server:

Float : 417 iterations / s
Double : 596 iterations / s
Fixed : 2253 iterations / s
2.9308171, 2.8822784, 2.9308173801885786

JRockit:

Float : 1697 iterations / s
Double : 1699 iterations / s
Fixed : 1940 iterations / s
2.9308174, 2.8822784, 2.9308173801885786

Now JRockit is 4 times faster than Hotspot on the float benchmark! Remember that this is a simple fmul, fadd loop. Something weird is happening here...
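The corrected code was not posted, so this is only a guess at what the fix might look like: a minimal sketch of the float test with separate names for the inner loop length and the pass counter, so neither can shadow or overwrite the other. The class `ShadowFix`, the names `INNER` and `passes`, and the shortened durations in `main` are assumptions.

```java
final class ShadowFix {
    private static final float X = 0.7456f;
    private static final float Y = 0.97543f;
    private static final int INNER = 100000;  // inner loop length, never modified
    static float f1;

    /** One benchmark pass: 'inner' multiply-add steps. */
    static float pass(int inner) {
        float a = X;
        float b = Y;
        for (int i = 0; i < inner; i++) {
            b = a * b + a;
        }
        return b;
    }

    /** Repeats passes for the given duration and returns how many completed. */
    static long runFloat(long durationMillis) {
        long start = System.currentTimeMillis();
        long passes = 0;  // separate counter, distinct from INNER
        while (System.currentTimeMillis() - start < durationMillis) {
            f1 = pass(INNER);
            passes++;
        }
        return passes;
    }

    public static void main(String[] args) {
        runFloat(2000);                // warm-up (the thread used 10 s)
        long passes = runFloat(2000);  // measured run
        System.out.println("Float : " + passes * 1000L / 2000 + " iterations / s");
        System.out.println(f1);
    }
}
```

With the two counters separated, every pass does the same amount of work, which is what makes the iterations-per-second figure comparable between runs and VMs.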
|
|
|
|
Vorax
Senior Member    Projects: 1
System shutting down in 5..4..3...
|
 |
«
Reply #29 - Posted
2005-03-03 10:18:36 » |
|
"No difference in using doubles or floats. Since people don't use apps with -XCompile, using doubles is faster."
I thought -Xcompile just makes the VM do its optimizations on the first pass so you don't need to wait for the JIT to warm up? If so, then it just means your microbenchmark isn't running optimized the way a real-world app would be. If not, what does it do?
|
|
|
|
|