Java-Gaming.org
  SSE with GCC vs Java Server VM  (Read 5085 times)
Riven
« Posted 2008-12-16 16:33:14 »

I have this simple C source code that is supposed to be
faster than Java, as the Java VM can't do SIMD yet.

The problem is that the non-SIMD code in C (1100ms),
  is faster than the SIMD code in C (1750ms).
And both are beaten by the Java Server VM (750ms)


Now, the JVM can't possibly be faster, as the C version
is supposed to do 2-4x as much in every operation
(2x on P3/P4, 4x on C2D/C2Q).

So, where am I screwing up?
Even in C the SSE version is slower...






C initialization code:
#include <stdio.h>
#include <xmmintrin.h>
#include <time.h>
#include <sys/time.h>
#include <errno.h>
#include <windows.h>

__m128 a, b, c;

float f4a[4] __attribute__((aligned(16))) = { +1.2, +3.5, +1.7, +2.8 };
float f4b[4] __attribute__((aligned(16))) = { -0.7, +2.6, +3.3, -4.0 };
float f4c[4] __attribute__((aligned(16))) = { -0.7, +2.6, +3.3, -4.0 };

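/* Wall-clock time in milliseconds. FILETIME counts 100 ns ticks since 1601-01-01. */
unsigned long long System_currentTimeMillis() {
    FILETIME t;
    unsigned long long c;
    GetSystemTimeAsFileTime(&t);
    /* Combine the two 32-bit halves into one 64-bit tick count. */
    c = (unsigned long long) t.dwHighDateTime << 32;
    return (c | t.dwLowDateTime) / 10000; /* 10,000 ticks per millisecond */
}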
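
For readers following along on the Java side, the arithmetic in System_currentTimeMillis() can be sketched as below. This is only an illustration: the class and method names are made up for this note. It relies on a Windows FILETIME being a 64-bit count of 100 ns ticks, delivered as two 32-bit halves. The result counts from 1601 rather than 1970, which does not matter for the benchmark because only differences between two readings are used.

```java
/** Illustrative only: mirrors the C helper's FILETIME-to-milliseconds arithmetic. */
final class FileTimeMillis {
    /** FILETIME ticks are 100 ns each, so 10,000 ticks make one millisecond. */
    static final long TICKS_PER_MILLI = 10_000L;

    /**
     * Combines the two 32-bit halves into one tick count and scales it to ms.
     * Java ints are signed, so the low word is masked to avoid sign extension.
     */
    static long toMillis(int highDateTime, int lowDateTime) {
        long ticks = ((long) highDateTime << 32) | (lowDateTime & 0xFFFFFFFFL);
        return ticks / TICKS_PER_MILLI;
    }

    public static void main(String[] args) {
        // Two hand-picked inputs: exactly one millisecond of ticks, then one
        // unit in the high word (2^32 ticks).
        System.out.println(toMillis(0, 10_000));
        System.out.println(toMillis(1, 0));
    }
}
```

The mask on the low word is the one detail that differs from the C version, where DWORD is already unsigned.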


Normal (x86) C code: (takes 1100ms)
   t0 = System_currentTimeMillis();
   for(i = 0; i < end; i++)
   {
      f4c[0] = f4a[0] + f4b[0];
      f4c[1] = f4a[1] + f4b[1];
      f4c[2] = f4a[2] + f4b[2];
      f4c[3] = f4a[3] + f4b[3];

      f4a[0] = f4c[0] - f4b[0];
      f4a[1] = f4c[1] - f4b[1];
      f4a[2] = f4c[2] - f4b[2];
      f4a[3] = f4c[3] - f4b[3];

      f4c[0] = f4a[0] * f4c[0];
      f4c[1] = f4a[1] * f4c[1];
      f4c[2] = f4a[2] * f4c[2];
      f4c[3] = f4a[3] * f4c[3];
   }
   t1 = System_currentTimeMillis();
   printf("x86 took: %dms\n", (int)(t1-t0));


SIMD SSE code: (takes 1750ms)
   t0 = System_currentTimeMillis();
   a = _mm_load_ps(f4a);
   b = _mm_load_ps(f4b);
   c = _mm_load_ps(f4c);
   for(i = 0; i < end; i++)
   {
      c = _mm_add_ps(a, b);
      a = _mm_sub_ps(c, b);
      c = _mm_mul_ps(a, c);
   }
   _mm_store_ps(f4c,c);
   t1 = System_currentTimeMillis();
   printf("SSE took: %dms\n", (int)(t1-t0));



Java code: (takes 750ms)
   static void _mm_mul_ps(float[] a, float[] b, float[] dst)
   {
      dst[0] = a[0] * b[0];
      dst[1] = a[1] * b[1];
      dst[2] = a[2] * b[2];
      dst[3] = a[3] * b[3];
   }

   static void _mm_add_ps(float[] a, float[] b, float[] dst) { .......... }
   static void _mm_sub_ps(float[] a, float[] b, float[] dst) { .......... }

   static float[] run()
   {
      float[] a = { 1.2f, 3.5f, 1.7f, 2.8f };
      float[] b = { -0.7f, 2.6f, 3.3f, -4.0f };
      float[] c = { -0.7f, 2.6f, 3.3f, -4.0f };

      int end = 1024 * 1024 * 64;

      for (int i = 0; i < end; i++)
      {
         _mm_add_ps(a, b, c);
         _mm_sub_ps(c, b, a);
         _mm_mul_ps(a, c, c);
      }
     
      return c;
   }



I am compiling with:
M:\MinGW_C_compiler\bin\gcc -Wall -Wl,-subsystem,console -march=pentium3 -mfpmath=sse -fomit-frame-pointer -funroll-loops sse.c -o "sse.exe"

Orangy Tang
« Reply #1 - Posted 2008-12-16 16:57:07 »

I haven't used the gcc command line for ages, but you don't appear to be compiling with optimisations? Try sticking in a -O3 (max optimisations) arg.

Without -O3, functions won't be inlined, and your SSE version would appear to suffer more from that.

Riven
« Reply #2 - Posted 2008-12-16 17:03:32 »

Damn! Silly me!

Java took: 750ms
C x86 took: 484ms
C SSE took: 297ms

Thanks for that!

bienator
« Reply #3 - Posted 2008-12-16 17:30:43 »

Seriously, I expected that Java would be far slower compared to pure SSE instructions.

Now try this test case on a GPU for comparison ;) (estimate: at least 20x faster than SSE on a mainstream card)

princec
« Reply #4 - Posted 2008-12-16 17:56:21 »

What x86 code does the Java compiler emit? Be interesting to see where it could be optimised. You could give this little benchmark to the VM team. (I assume there's some use for this particular bit of code somewhere?)

Cas :)

erikd
« Reply #5 - Posted 2008-12-16 19:32:09 »

I'd also be interested in what GCJ would do with bounds checks enabled and disabled. :)

Riven
« Reply #6 - Posted 2008-12-16 19:48:12 »

Mind you that this was run on a P4, and run with the Server VM of Java 1.6.

On a P4 @ 2.4GHz:
ClientVM took: 3840ms
ServerVM took:  750ms
C x86 took:     484ms
C SSE took:     297ms


On a Q6600 @ 2.4GHz:
ClientVM took: 3350ms
ServerVM took:  650ms
C x86 took:     328ms
C SSE took:     200ms
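
To put those timings next to the 2-4x hoped for in the first post, the ratios can be worked out with a few lines of Java. This is a sketch for illustration only: the class name and helper are made up, and the numbers are simply the ones posted above. On both machines the SSE build comes out roughly 1.6x faster than the plain C build.

```java
/** Illustrative only: speed-up ratios computed from the timings posted above. */
final class Speedups {
    /** How many times faster the second timing is than the first. */
    static double speedup(double slowerMs, double fasterMs) {
        return slowerMs / fasterMs;
    }

    public static void main(String[] args) {
        // Timings in ms as posted: { ClientVM, ServerVM, C x86, C SSE }
        double[] p4 = { 3840, 750, 484, 297 };
        double[] q6600 = { 3350, 650, 328, 200 };
        for (double[] t : new double[][] { p4, q6600 }) {
            System.out.printf("C SSE vs C x86: %.2fx, C SSE vs ServerVM: %.2fx%n",
                    speedup(t[2], t[3]), speedup(t[1], t[3]));
        }
    }
}
```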

Riven
« Reply #7 - Posted 2008-12-16 20:02:52 »

Quote from: princec
What x86 code does the Java compiler emit? Be interesting to see where it could be optimised. You could give this little benchmark to the VM team. (I assume there's some use for this particular bit of code somewhere?)

Cas :)


IIRC that requires the Debug JDK. I don't have it, and it's quite a hassle to get the assembler code out of it. And then, I never did any ASM, so I could only post it here; somebody with the Debug JDK should.

Anyway, the VM simply cannot use SIMD, as the float[] is not guaranteed to be aligned on 16 bytes, and if it is, we're dealing with offsets in the float[] that must be multiples of 4, AND have at least 3 more elements, AND you'd have to execute the same instructions on all 4. Pretty heavy stuff for the VM to figure out.
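
The first three of those conditions can be written down as a small check. This is only a sketch of the reasoning, not anything HotSpot actually does: the class, method and parameter names are hypothetical, and the native address of element 0 is something plain Java code cannot normally see. The fourth condition (the same operation applied to all four lanes) depends on the surrounding code and is not captured here.

```java
/** Illustrative only: the preconditions listed above for one packed 4-float operation. */
final class PackedEligibility {
    static final int LANES = 4;            // floats per 128-bit SSE register
    static final int ALIGNMENT_BYTES = 16; // required by aligned loads such as MOVAPS

    /**
     * @param firstElementAddress hypothetical native address of element 0
     * @param index               element index where the 4-wide access would start
     * @param length              array length in elements
     */
    static boolean canUseAlignedPackedAccess(long firstElementAddress, int index, int length) {
        boolean baseAligned = firstElementAddress % ALIGNMENT_BYTES == 0;
        boolean indexOnLaneBoundary = index % LANES == 0;
        boolean enoughElements = index >= 0 && index + (LANES - 1) < length;
        return baseAligned && indexOnLaneBoundary && enoughElements;
    }
}
```

For example, an 8-element array whose data starts at address 0x1000 passes for index 4, but fails for index 2 (not on a lane boundary) or if the data starts at 0x1008 (not 16-byte aligned).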

Further, I already filed an RFE about manual SIMD (a library) in the bugparade, but it was closed, mentioning 'the JVM should be able to make this optimization itself' - well, I guess that's not going to happen in the next 10 years.

Last but not least, if you compare C x86 vs. ServerVM, I guess GCC uses SIMD behind the scenes..., so it wouldn't be a fair comparison.

bienator
« Reply #8 - Posted 2008-12-16 21:09:35 »

Quote from: Riven
Further, I already filed an RFE about manual SIMD (a library) in the bugparade, but it was closed, mentioning 'the JVM should be able to make this optimization itself' - well, I guess that's not going to happen in the next 10 years.

Ironically, they implemented an SSE CPU pipeline behind the scenes for Decora, the backend for JavaFX's graphical gimmicks.

Riven
« Reply #9 - Posted 2008-12-16 21:15:11 »

Yeah, ironically... but that's very specialized stuff (decoding video), nothing like taking your average bytecode, which may just as well be decoding XML, and releasing an SSE-enabled JIT on it. I'm really sure we won't see stuff like this any time soon: it's just too hard with too little gain. Only vector math takes advantage of it (while web server and database performance is where all the money is), and the vector code there is normally so small that you can write a native lib for it.

GKW
« Reply #10 - Posted 2008-12-17 00:07:12 »

Have you tried 1.7?  Supposedly it can generate SIMD code.  I'm not sure if you have to use a command line flag to turn it on.

VeaR
« Reply #11 - Posted 2008-12-17 09:17:08 »

Quote from: Riven
Anyway, the VM simply cannot use SIMD, as the float[] is not guaranteed to be aligned on 16 bytes, and if it is, we're dealing with offsets in the float[] that must be multiples of 4, AND have at least 3 more elements, AND you'd have to execute the same instructions on all 4. Pretty heavy stuff for the VM to figure out.

If a C compiler can optimize to SIMD, Java should be able to do it even more easily.

princec
« Reply #12 - Posted 2008-12-17 11:34:46 »

You might notice rather a lot of very specific and fiddly-looking flags and macros that you need in order to allow SIMD optimisations in C. In other words, the C compiler isn't really creating SIMD code automatically at all; you're giving it a ton of specific hints, e.g.
float f4a[4] __attribute__((aligned(16)))

...in which case Java will be needing similar hints, and is therefore probably better off, as Riven reckons, with a library to deal with it explicitly.

Cas :)

Riven
« Reply #13 - Posted 2008-12-17 17:52:54 »

Quote from: GKW
Have you tried 1.7?  Supposedly it can generate SIMD code.  I'm not sure if you have to use a command line flag to turn it on.

I just ran the (extremely simplistic) micro benchmark on non-debug JDK 1.7 EA, and I didn't see any performance improvement. I think this simple piece of code doesn't show any of the performance improvements in the VM.

trembovetski
« Reply #14 - Posted 2008-12-17 22:35:08 »

At the risk of stating the obvious, you did warm up the VM prior to taking the measurement, right?

Dmitri

VeaR
« Reply #15 - Posted 2008-12-18 00:26:24 »



http://kohlerm.blogspot.com/2008/12/how-much-memory-is-used-by-my-java.html

Array objects are aligned at 8 bytes, but it also means that the first element in the array is never aligned at 16 bytes. Ahh well.
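
A quick worked check of that claim, as a sketch. The 12-byte array header used in main is an assumption made for this illustration (it is not stated in this thread), and the class and method names are made up. With 8-byte object alignment, an object's base address modulo 16 is either 0 or 8, so the first element lands at 12 or 4 modulo 16 and never at 0.

```java
/** Illustrative only: can element 0 of an array ever land on a 16-byte boundary? */
final class ArrayAlignment {
    /**
     * Tries every object base address (a multiple of objectAlignment, which must
     * be positive) within one 16-byte window and reports whether
     * base + headerBytes is ever 16-byte aligned.
     */
    static boolean firstElementCanBe16Aligned(int objectAlignment, int headerBytes) {
        for (int base = 0; base < 16; base += objectAlignment) {
            if ((base + headerBytes) % 16 == 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumption: 8-byte object alignment and a 12-byte array header.
        System.out.println(firstElementCanBe16Aligned(8, 12));
    }
}
```

With a hypothetical 16-byte header the same check succeeds for half of the possible base addresses, so the conclusion depends entirely on the header size of the VM in question.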

What kind of library is it?

For example this

http://www.javaworld.com/javaworld/jw-12-2008/jw-12-year-in-review-2.html?page=5

kind of API could be built around a SIMD library, or even built into the ParallelArray. It would nicely hide any technicalities, which is necessary for any RFE to go through. :P

Riven
« Reply #16 - Posted 2008-12-18 08:23:30 »

Quote from: trembovetski
At the risk of stating the obvious, you did warm up the VM prior to taking the measurement, right?

Dmitri


Yup, it runs the benchmark 16 times and prints how long each run took.

But again, this is a very simplistic benchmark, only doing + - * on the same float[]s
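
A harness along those lines might look like the sketch below. It is not the original benchmark source, which was not posted in full: the add and sub bodies are assumed to mirror the posted _mm_mul_ps element by element, and the class name, the shortened method names, the `end` parameter on run() and the nanoTime-based timing are all choices made for this illustration. Repeating the run 16 times lets the early, still-warming-up runs be told apart from the later ones.

```java
/** Illustrative only: a warmed-up version of the Java micro-benchmark from the first post. */
final class SimdBench {
    // The add/sub bodies were elided in the original post; these are assumed
    // to follow the same element-by-element pattern as the posted multiply.
    static void add(float[] a, float[] b, float[] dst) {
        dst[0] = a[0] + b[0];
        dst[1] = a[1] + b[1];
        dst[2] = a[2] + b[2];
        dst[3] = a[3] + b[3];
    }

    static void sub(float[] a, float[] b, float[] dst) {
        dst[0] = a[0] - b[0];
        dst[1] = a[1] - b[1];
        dst[2] = a[2] - b[2];
        dst[3] = a[3] - b[3];
    }

    static void mul(float[] a, float[] b, float[] dst) {
        dst[0] = a[0] * b[0];
        dst[1] = a[1] * b[1];
        dst[2] = a[2] * b[2];
        dst[3] = a[3] * b[3];
    }

    /** Same loop body as the posted run(), with the iteration count as a parameter. */
    static float[] run(int end) {
        float[] a = { 1.2f, 3.5f, 1.7f, 2.8f };
        float[] b = { -0.7f, 2.6f, 3.3f, -4.0f };
        float[] c = { -0.7f, 2.6f, 3.3f, -4.0f };
        for (int i = 0; i < end; i++) {
            add(a, b, c);
            sub(c, b, a);
            mul(a, c, c);
        }
        return c;
    }

    public static void main(String[] args) {
        int end = 1024 * 1024 * 64;
        // 16 runs: the first few include JIT warm-up, the later ones are the ones to compare.
        for (int run = 0; run < 16; run++) {
            long t0 = System.nanoTime();
            float[] c = run(end);
            long ms = (System.nanoTime() - t0) / 1_000_000;
            // Printing a result element keeps the computation from being discarded.
            System.out.println("run " + run + " took: " + ms + "ms (c[0]=" + c[0] + ")");
        }
    }
}
```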

GKW
« Reply #17 - Posted 2008-12-18 14:47:24 »

If I'm reading the source correctly, I think that if you use the 1.7 debug build with the flag -XX:TraceSuperWord, you can see what HotSpot is doing behind the scenes.  If I find some time later today I might give that a try.

Riven
« Reply #18 - Posted 2008-12-18 20:22:35 »

The DebugVM doesn't recognize that parameter and refuses to launch.

Removing that parameter and just running the VM instantly crashes with a nasty native crash. :)

GKW
« Reply #19 - Posted 2008-12-18 20:44:18 »

I guess I had a little better luck than you.  The command line flag is actually -XX:+TraceSuperWord.  I left out the plus before.  I got a dump from the SuperWord opto.  I don't really have any idea what it did, but the JVM seems to recognize that it should generate SIMD instructions for that pattern.  I'd like to take a look at the JIT'ed code but I've never done that before.  I tried running the program with UseSSE=0 and UseSSE=3 and there was a whole 2 ms difference between them, 618 (no SSE) to 616 (SSE3).  Small enough to be noise, I would think.

Riven
« Reply #20 - Posted 2008-12-18 21:38:46 »

I really can't get it to work, but well, I guess I can wait until 1.7 is out

GKW
« Reply #21 - Posted 2008-12-19 18:48:15 »

After a little searching on the web to figure out how to look at the assembly that HotSpot is generating, I found a blog post that used the flag -XX:+PrintOptoAssembly, and that works quite nicely.  So if the printout via this flag is correct, it looks like the code generated by HotSpot is just using scalar SSE instructions (movss/mulss, one float at a time) for all the math, which is why we have not noticed any improvement in application speed despite the SuperWord (SIMD) optimizations that are in 1.7.  The run that I dumped to the log actually went through the SuperWord optimizer twice (at least SuperWord dumped twice to the log via -XX:+TraceSuperWord), but it never generated any packed SIMD instructions.

It seems like a waste to not use SIMD at all.  Even in a worst-case scenario with unaligned loads we would still see a little speed bump over scalar code.  Did someone at Sun forget to turn this bit of code all the way on?  Please correct me if I am reading this wrong.  I don't know anything about HotSpot and I have not spent much time looking at assembly since college.

Here's a bit of JIT.  There is a longer block that is basically the same, I assume from loop unrolling.
04e      mulss   XMM1, [RAX + #24 (8-bit)]
053      movss   [RAX + #24 (8-bit)], XMM1   # float
058      movss   XMM0, [R9 + #28 (8-bit)]   # float
05e      mulss   XMM0, [RAX + #28 (8-bit)]
063      movss   [RAX + #28 (8-bit)], XMM0   # float
068      movss   XMM1, [R9 + #32 (8-bit)]   # float
06e      mulss   XMM1, [RAX + #32 (8-bit)]
073      movss   [RAX + #32 (8-bit)], XMM1   # float
078      movss   XMM0, [RAX + #36 (8-bit)]   # float
07d      mulss   XMM0, [R9 + #36 (8-bit)]
083      movss   [RAX + #36 (8-bit)], XMM0   # float
088      incl    R11   # int
08b      cmpl    R11, #1
08f      jl,s   B2   # loop end  P=0.500000 C=1180672.000000


GKW
« Reply #22 - Posted 2008-12-19 23:07:02 »

I dug a little further and it looks like even though the optimizer recognizes that this code should be SIMD'd it decides against it in the end.  It rejects the arithmetic as unsupported, which it is but without directly debugging the C code I'm not sure why it thinks vector adds/muls/etc are not supported, and it thinks the vector loads/stores are not worth the effort for some reason.  At this point I'd love to see the test code that was used to verify that this optimization actually works.  Is this optimization too conservative to be useful or just buggy?

trembovetski
« Reply #23 - Posted 2008-12-20 00:30:18 »

Quote from: GKW
I dug a little further and it looks like even though the optimizer recognizes that this code should be SIMD'd it decides against it in the end.  It rejects the arithmetic as unsupported, which it is but without directly debugging the C code I'm not sure why it thinks vector adds/muls/etc are not supported, and it thinks the vector loads/stores are not worth the effort for some reason.  At this point I'd love to see the test code that was used to verify that this optimization actually works.  Is this optimization too conservative to be useful or just buggy?

Why not ask hotspot developers directly? http://mail.openjdk.java.net/mailman/listinfo/hotspot-dev

Dmitri

GKW
« Reply #24 - Posted 2008-12-20 03:00:36 »

I suppose that is the next course of action.  From what I can tell, the compiler doesn't know how to generate ADDPS/MULPS/SUBPS/DIVPS opcodes.  That would be the reason for the vector arithmetic being rejected as unsupported by SuperWord.  It looks like the compiler can use MOVAPS but appears to only support moving data from XMM register to XMM register, so that is probably the reason that the vector moves are rejected as unproductive.  I probably just missed something obvious, but it looks like no one added the arithmetic SIMD opcodes to the compiler.

GKW
« Reply #25 - Posted 2008-12-22 19:57:45 »

I'm guessing that a response might not arrive until after the holidays, so I'm going to try and figure out how to add the opcodes to HotSpot myself.  Unless I'm totally wrong, and I certainly hope I am, the SuperWord optimization won't actually do anything useful on an x86 CPU.