altair (Senior Newbie) « Posted 2003-05-24 02:15:07 »
Hi guys,
I was trying to make speed improvements in some code and got unexpected results. First look at some code:
int[] iv = new int[1000]; iv[0] = 8; iv[1] = 9;

// int
long before = System.currentTimeMillis();
for (int ii = 0; ii < 100000; ii++) {
    for (int i = 2; i < iv.length; i++) {
        iv[i] = (5 * iv[i-1] - 10 * iv[i-2]) / 3;
        iv[i] *= iv[i];
    }
}
long after = System.currentTimeMillis();
System.out.println(after - before);

// int aliased: the previous element is kept in a local instead of re-read from the array
before = System.currentTimeMillis();
for (int ii = 0; ii < 100000; ii++) {
    int previous = iv[1];
    for (int i = 2; i < iv.length; i++) {
        int v = (5 * previous - 10 * iv[i-2]) / 3;
        v *= v;
        iv[i] = v;
        previous = v;
    }
}
after = System.currentTimeMillis();
System.out.println(after - before);

float[] fv = new float[1000]; fv[0] = 8.0f; fv[1] = 9.0f;

// float
before = System.currentTimeMillis();
for (int ii = 0; ii < 100000; ii++) {
    for (int i = 2; i < fv.length; i++) {
        fv[i] = (5.0f * fv[i-1] - 10.0f * fv[i-2]) * 0.333333333f;
        fv[i] *= fv[i];
    }
}
after = System.currentTimeMillis();
System.out.println(after - before);

// float aliased
before = System.currentTimeMillis();
for (int ii = 0; ii < 100000; ii++) {
    float previous = fv[1];
    for (int i = 2; i < fv.length; i++) {
        float f = (5.0f * previous - 10.0f * fv[i-2]) * 0.333333333f;
        f *= f;
        fv[i] = f;
        previous = f;
    }
}
after = System.currentTimeMillis();
System.out.println(after - before);
OK, basically it is some math computation inside loops. I ran these tests several times in the same process and got the following results:
int : 5203 int aliased : 4750 float : 4172 float aliased : 5031
The absolute values by themselves are not really important but they reveal a pattern.
First conclusion: the floats are faster than the ints on my platform (WinXP, Athlon XP, Java 1.4.2 beta)!!! Incredible, isn't it?
Naively, I thought otherwise. I guess the SSE2 instructions now used by the JVM kick in to dramatically boost the float performance. It would be interesting to see the results on other platforms (SSE or no SSE). That is good news for OpenGL Java ;-)
Second conclusion: aliasing is always better with ints but counterproductive with floats, which also was not obvious to me.
I am perfectly aware that these results should be taken with a grain of salt (if you slightly modify the code inside the loop you may end up with other conclusions).
The bottom line: it is going to be very difficult to optimize code, since an optimization on one platform may end up decreasing performance on another. The only way to know is to test!
I'd be interested in other people testing with other hardware, OS and JVM versions.

AndersDahlberg « Reply #1 - Posted 2003-05-24 03:57:58 »
Tried this test on my machine (adding the usual repeat loop). Java 1.4.2 beta, RedHat 9.0, Athlon 900:
int: 7769 int aliased: 7409 float: 4250 float aliased: 5273
int: 8080 int aliased: 7172 float: 4325 float aliased: 5263
int: 8075 int aliased: 7160 float: 4322 float aliased: 5268
int: 8072 int aliased: 7173 float: 4312 float aliased: 5282
java -server:
int: 3530 int aliased: 2501 float: 8488 float aliased: 7738
int: 1937 int aliased: 1779 float: 8154 float aliased: 7718
int: 1983 int aliased: 1764 float: 8171 float aliased: 7722
int: 1945 int aliased: 1767 float: 8189 float aliased: 7712
Server versus client... ROFL - benchmarks are fun.
For even more laughs I tried it with gcj 3.2.2:
int: 7893 int aliased: 4297 float: 9014 float aliased: 5995
int: 9787 int aliased: 4584 float: 9833 float aliased: 6528
Pretty bad, but wait - here is more. Compiled with "gcj -msse2 -m3dnow -O2 -o test --main=Test Test.java":
int: 2698 int aliased: 1798 float: 3997 float aliased: 1892
int: 2684 int aliased: 1790 float: 3987 float aliased: 1900
gcj rocks on this test - is it cheating or is Sun Java just being slow? Maybe I should try it with IBM, as they (?) are faster on calculations.

altair (Senior Newbie) « Reply #2 - Posted 2003-05-24 05:57:54 »
Geeeeee! Your results are puzzling because in the end you do not know what to shoot for. With both integers and floats, the values span (roughly) from 2 to 9 seconds! And the floats can be faster or slower than the ints. I note that with the new JVM, the floats 'can be' REALLY fast... How can you be sure that the optimizations you made on your platform are relevant at all? The only constant thing is that aliased integers are always faster than non-aliased ones.
Food for thought, people ...

aikarele (Senior Newbie) « Reply #3 - Posted 2003-05-24 11:04:49 »
I did some microbenchmarking too (with J2SE 1.4.2 + WinXP) and I found that floating point division was faster than integer division. However, floating point multiplication was slower than integer multiplication.
By the way, you can find an interesting table comparing the relative performance of common operations on a PIII-733 using C++ here (appendix B): http://www.tantalon.com/pete/cppopt/appendix.htm
It states that "In fact, floating-point division is as fast or faster than integer division". So this is not a Java-only "feature". Somebody should do a table like that for Java and post it here.

Abuse « Reply #4 - Posted 2003-05-24 11:33:16 »
Isn't it normal that floating point is better for division/multiplication and integer is better for addition/negation? I thought it was common knowledge.
|
Make Elite IV:Dangerous happen! Pledge your backing at KICKSTARTER here! 

aikarele (Senior Newbie) « Reply #5 - Posted 2003-05-24 13:17:26 »
> Isn't it normal that floating point is better for division/multiplication and integer is better for addition/negation? I thought it was common knowledge.
Read the earlier posts. It depends. For example, I just said in my earlier post that "floating point multiplication is SLOWER than integer multiplication".

Abuse « Reply #6 - Posted 2003-05-24 17:33:03 »
Oh yeah, soz. Well... you've proved one thing: micro-optimisations are a waste of time, and attempting to benchmark micro-optimisations is an even larger waste of time.
|
Make Elite IV:Dangerous happen! Pledge your backing at KICKSTARTER here! 

jbanes « Reply #7 - Posted 2003-05-24 17:48:30 »
If I may, I think the real reason that floats appear to be so much faster has less to do with SIMD/SSE, and more to do with floating-point coprocessors. The idea that only integers should be used in games comes from back in the days of 386 machines, where floating point had to be simulated in software. Unfortunately, it was one of those ideas that the development community never got past. I remember only one game that ever advertised the fact that it used floating point math. It was some sort of 3D car/shoot'em up game that ran on 486s and Pentiums. It actually ran quite well, but the market (to my knowledge) never picked up on this little tidbit.

swpalmer « Reply #8 - Posted 2003-05-24 20:01:21 »
The fact that the server VM did float operations so much slower than the client VM is a performance issue that I would file a bug report on.
Performance in that area should be the same or better.
GCC on Intel is known to suck. But for floats MSVC 6.0 is also known to suck. The Intel compiler produces much better code, although I hear that MS has caught up with the .NET compiler.
I like that the server compiler outperformed GCJ on ints... this sort of information is helpful in dispelling the myth that Java is just all around "slow".
Here are some numbers from the above code for Mac OS X 1GHz PowerPC G4 client VM
test repeated 4 times.. these are the typical results...
int: 3809 int a: 2276 float: 4653 float a: 3410
int: 3803 int a: 2268 float: 4619 float a: 3424
With -server ALL numbers are HIGHER by about 100-200.

altair (Senior Newbie) « Reply #9 - Posted 2003-05-24 20:23:55 »
swpalmer,
On Mac, the floats are somewhat slower than the ints, but all in all, the JVM/Mac OS X/G4 combination kicks butt!!!
Abuse,
micro-optimizations may be a waste of time but benchmarking them is not, IMO: you are always safe with aliasing int arrays, and the floats are 'usually' pretty fast (at least on the 3 configs that have been tested so far).
The -server option seems to dramatically decrease the speed (unless the micro benchmark was too short for the JVM optimizations to kick in).
I would not have bet on these facts before.

AndersDahlberg « Reply #10 - Posted 2003-05-25 00:07:07 »
Well, my take on this issue is that Java is really fast even on this microbenchmark! As we all know (?), any Java VM becomes more competitive the more complex the problem it faces. As this test is a very easy one, the gcj native compiler should IMHO outperform the Sun and IBM Java VMs (as it is able to compile with all optimizations, the "easy to calculate" ones). As I (and you) noticed, Sun Java did quite well, despite the big difference between java -server and client! As Altair said, I believe this big server/client difference to be a Java 1.4.2 bug - could anyone test this with another Java version?
Anyway, I don't really understand you, Altair, when you said I didn't know what I was testing. I was just running your test on a different architecture and trying to get as much data as possible - then it's up to the experts to get something worth mentioning out of this data! I.e. I'm not "shooting" for anything except more data!

swpalmer « Reply #11 - Posted 2003-05-25 00:47:22 »
> On Mac, the floats are somewhat slower than the ints, but all in all, the JVM/Mac OS X/G4 combination kicks butt!!!
I think this is more because the PowerPC is likely much easier to compile decent code for, since it has a much better design than Intel (despite being behind in terms of clock speed). I mean, when you don't have to worry so much about what you will keep in registers and what you need to shove on the stack, because you actually have a decent number of general purpose registers... well, it just seems like optimizing on that architecture would be easier.

altair (Senior Newbie) « Reply #12 - Posted 2003-05-25 01:19:34 »
AndersDahlberg
I did not make myself clear. " Your results are puzzling because in the end 'we' do not know what to shoot for ".
I meant that it is not obvious from your results what to target to achieve the best speed : use floats or integers ? It depends on the platform (hardware, JVM, OS). Alias / not alias ? it all depends on the code inside of the loop AND the platform.

AndersDahlberg « Reply #13 - Posted 2003-05-25 04:46:57 »
altair: OK, then we understand each other... For my part I don't really care which one is faster - it will probably never become a big issue for me anyway (1x or 2x slower on a test like this is "almost nothing").

Mark Thornton « Reply #14 - Posted 2003-05-28 09:02:22 »
On my 3.06GHz P4 (Windows XP) I get
int: 2374 int aliased: 1421 float: 843 float aliased: 828 float (div): 1374 float aliased (div): 1343
The extra pair of float results uses /3f instead of *0.333... These results are with the server VM; for the client VM the results are:
int: 4716 int aliased: 4372 float: 153369 float aliased: 154689 float (div): 155082 float aliased (div): 156864
Ouch!
The results for double are essentially the same as for float (both client and server).

Kevdog « Reply #15 - Posted 2003-05-28 22:30:38 »
When you guys are using the -server option, are you putting code in there to "warm it up" before actually running it?
Maybe run the tests twice and throw out the first results?
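A minimal sketch of that warm-up idea, assuming a modern JVM (method references need Java 8, which postdates this thread). The class name WarmupTimer and its methods are hypothetical; the timed loop is the thread's aliased int test with a smaller outer count so it finishes quickly. The warm-up passes run untimed so the JIT has a chance to compile the loop before any number is recorded.

```java
final class WarmupTimer {
    static int sink; // keeps the result alive so the loop is not optimized away

    /** Runs the task `warmups` times untimed, then returns `runs` timings in milliseconds. */
    static long[] time(Runnable task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.run(); // results of these passes are thrown away
        }
        long[] millis = new long[runs];
        for (int i = 0; i < runs; i++) {
            long before = System.currentTimeMillis();
            task.run();
            millis[i] = System.currentTimeMillis() - before;
        }
        return millis;
    }

    /** The "int aliased" loop from the first post, with a reduced outer count (assumption). */
    static void intAliased() {
        int[] iv = new int[1000];
        iv[0] = 8;
        iv[1] = 9;
        for (int ii = 0; ii < 1000; ii++) {
            int previous = iv[1];
            for (int i = 2; i < iv.length; i++) {
                int v = (5 * previous - 10 * iv[i - 2]) / 3;
                v *= v;
                iv[i] = v;
                previous = v;
            }
        }
        sink = iv[iv.length - 1];
    }

    public static void main(String[] args) {
        // Two untimed warm-up passes, then three timed passes.
        for (long ms : time(WarmupTimer::intAliased, 2, 3)) {
            System.out.println("int aliased: " + ms + " ms");
        }
    }
}
```

Comparing the timed passes against each other gives a rough idea of how much run-to-run noise is left after warm-up.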
|
There are only 10 types of people, those who understand binary and those who don't!

Mark Thornton « Reply #16 - Posted 2003-05-29 08:56:43 »
> When you guys are using the -server option, are you putting code in there to "warm it up" before actually running it?
> Maybe run the tests twice and throw out the first results?
Yes, but the length of the loop is sufficiently long that the changes aren't large.

erikd « Reply #17 - Posted 2003-05-30 18:51:11 »
> Yes, but the length of the loop is sufficiently long that the changes aren't large.
There's not much to warm up in a micro benchmark so that makes sense.
Erik

Kevdog « Reply #18 - Posted 2003-05-30 20:54:56 »
Okay, just had to give it a try on my work machine: P3 1GHz, 512MB mem, WinNT
java 1.4.1_01 -client
Run #1 int: 6409 int alias: 5939 float: 66125 float alias: 64413
Run #2 int: 6819 int alias: 5799 float: 64873 float alias: 64703
Run #3 int: 6850 int alias: 5858 float: 64834 float alias: 65914
-server
Run #1 int: 5668 int alias: 5658 float: 62330 float alias: 62229
Run #2 int: 5598 int alias: 6259 float: 66416 float alias: 66996
Run #3 int: 5828 int alias: 6079 float: 62210 float alias: 62239
Looks like under WinNT the SSE instructions aren't being used? Float math is horrible! Maybe that's why some of the demo games run very slow and jerky on my system. Hopefully we're upgrading to WinXP by the end of the year!
|
There are only 10 types of people, those who understand binary and those who don't!

altair (Senior Newbie) « Reply #19 - Posted 2003-05-31 02:15:57 »
Unlike previous results, the results on the P3/NT would be enough to ban floats (or remove this platform from the targets). You did not give the version of the JVM though (upgrading could help improve the score).
Consider a game that uses floats heavily: it would fly on the Mac and the fast P4s (with XP) but would be unplayable on a 'slow' PC with NT. Less so with integers.
"Write once run anywhere" seems really not to be an easy task as far as performance is concerned ...

swpalmer « Reply #20 - Posted 2003-05-31 04:00:02 »
> You did not give the version of the JVM though (upgrading could help improve the score).
Look again, it says 1.4.1_01.

Mark Thornton « Reply #21 - Posted 2003-05-31 11:59:46 »
> Okay, just had to give it a try on my work machine: P3 1GHz, 512MB mem, WinNT
> java 1.4.1_01 -client ... Looks like under WinNT the SSE instructions aren't being used? Float math is horrible! Maybe that's why some of the demo games run very slow and jerky on my system. Hopefully we're upgrading to WinXP by the end of the year!
The use of SSE is new in 1.4.2 beta and even then only in the server version.

princec « Reply #22 - Posted 2003-05-31 13:26:39 »
> only in the server version
Fools. Java gaming once again takes a poke in the eye.
Cas

Mark Thornton « Reply #23 - Posted 2003-05-31 15:20:54 »
One serious problem with this benchmark is that the floating point calculation overflows (and then becomes NaN). The processor timing for the subsequent operations may not be very representative of normal calculation.
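The overflow is easy to check with a minimal sketch, which runs one pass of the thread's float recurrence and reports where the values stop being finite. The class and method names here are hypothetical; only the recurrence and its constants come from the first post. The squaring step makes the values grow so quickly that they should leave the float range within the first ten or so elements.

```java
final class FloatOverflowDemo {
    /** Fills a fresh array with one pass of the float recurrence from the first post. */
    static float[] runOnce() {
        float[] fv = new float[1000];
        fv[0] = 8.0f;
        fv[1] = 9.0f;
        for (int i = 2; i < fv.length; i++) {
            fv[i] = (5.0f * fv[i - 1] - 10.0f * fv[i - 2]) * 0.333333333f;
            fv[i] *= fv[i]; // squaring roughly doubles the exponent at every step
        }
        return fv;
    }

    /** Index of the first element that is infinite or NaN, or -1 if all are finite. */
    static int firstNonFinite(float[] values) {
        for (int i = 0; i < values.length; i++) {
            if (Float.isNaN(values[i]) || Float.isInfinite(values[i])) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        float[] fv = runOnce();
        System.out.println("first non-finite index: " + firstNonFinite(fv));
        System.out.println("last element is NaN: " + Float.isNaN(fv[fv.length - 1]));
    }
}
```

Once two consecutive elements are infinite, the subtraction gives infinity minus infinity, which is NaN, and every later element inherits it. Nearly all of the 1000 elements per pass are therefore NaN arithmetic rather than ordinary float math.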

Mark Thornton « Reply #24 - Posted 2003-05-31 15:29:55 »
I've now changed the constants slightly and found that on my P2/400 the floating point time changes from 155 seconds down to 13 seconds. Curiously, that 155 second time for the original benchmark is almost the same as for the 3.06GHz P4 I have at work when SSE is not used. Evidently the SSE path is much faster at handling NaN values than the ordinary case, but this doesn't tell us much about real life floating point speed.

Mark Thornton « Reply #26 - Posted 2003-06-02 09:54:44 »
> > only in the server version
> Fools. Java gaming once again takes a poke in the eye. Cas
The standard fp performance is reasonable provided that you avoid the NaN case. Of course the -server version is faster (wouldn't be much point otherwise), but that is also true of the integer results. So perhaps we should look for a benchmark which does something realistic, with the values finite and preferably non-zero. My slight variation on this benchmark results in the values converging to zero, which isn't ideal either (too easy). In this case on my P4 the floating point calculation is faster than the integer method. Any suggestions for a relevant calculation which has both integer and fp forms?

princec « Reply #27 - Posted 2003-06-02 11:20:56 »
Mandelbrot? And of course, vertex transformation is where it's at these days. And the reason why the client FP performance isn't reasonable at all, just slow  Cas 
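As a rough illustration of the Mandelbrot suggestion, the sketch below writes the escape-time iteration in two forms: one in float, and one in 16.16 fixed-point integers. Everything here is an assumption for illustration (the class name, the 16.16 format, the use of a long intermediate for the multiply); nobody in the thread posted such code. The values stay finite and mostly non-zero, which is what the NaN discussion above asks for.

```java
final class MandelbrotForms {
    static final int SHIFT = 16;          // 16.16 fixed point (assumed format)
    static final long FOUR = 4L << SHIFT; // escape radius squared, in fixed point

    /** Converts a double to 16.16 fixed point. */
    static int toFixed(double x) {
        return (int) Math.round(x * (1 << SHIFT));
    }

    /** Escape-time iteration count using float arithmetic. */
    static int mandelFloat(float cr, float ci, int maxIter) {
        float zr = 0f, zi = 0f;
        int iter = 0;
        while (iter < maxIter) {
            float zr2 = zr * zr;
            float zi2 = zi * zi;
            if (zr2 + zi2 > 4f) break;
            zi = 2f * zr * zi + ci; // uses the old zr, so update zi first
            zr = zr2 - zi2 + cr;
            iter++;
        }
        return iter;
    }

    /** The same iteration using 16.16 fixed-point ints, with a long for each product. */
    static int mandelFixed(int cr, int ci, int maxIter) {
        int zr = 0, zi = 0;
        int iter = 0;
        while (iter < maxIter) {
            long zr2 = ((long) zr * zr) >> SHIFT;
            long zi2 = ((long) zi * zi) >> SHIFT;
            if (zr2 + zi2 > FOUR) break;
            long zrzi = ((long) zr * zi) >> SHIFT;
            zr = (int) (zr2 - zi2) + cr;
            zi = (int) (2 * zrzi) + ci;
            iter++;
        }
        return iter;
    }

    public static void main(String[] args) {
        System.out.println(mandelFloat(-0.5f, 0.5f, 200));
        System.out.println(mandelFixed(toFixed(-0.5), toFixed(0.5), 200));
    }
}
```

Near the boundary of the set the two forms can disagree by a few iterations because of their different rounding, so a fair timing comparison would run both over the same grid of points rather than compare individual counts.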

Mark Thornton « Reply #28 - Posted 2003-06-02 12:35:00 »
> Mandelbrot? And of course, vertex transformation is where it's at these days.
Probably a bad example, as you would really like the transformations to be done by all those transform pipelines on the graphics card. I wonder if the graphics card FP could usefully be used for general purpose fp - I seem to recall that the Sony game systems being hooked up as a 'supercomputer' were doing something like that.

cfmdobbie « Reply #29 - Posted 2003-06-02 14:47:52 »
Coincidentally, there was a thread about that a couple of days ago!
The conclusion was "technically yes, but getting the results out again is too slow", I believe.
|
Hellomynameis Charlie Dobbie.