Java-Gaming.org    
Featured games (81)
games approved by the League of Dukes
Games in Showcase (487)
Games in Android Showcase (110)
games submitted by our members
Games in WIP (553)
games currently in development
News: Read the Java Gaming Resources, or peek at the official Java tutorials
 
    Home     Help   Search   Login   Register   
Pages: [1]
  ignore  |  Print  
  SSE / SSE2 - SIMD wrapper  (Read 6674 times)
0 Members and 1 Guest are viewing this topic.
Offline Riven
« League of Dukes »

JGO Overlord


Medals: 783
Projects: 4
Exp: 16 years


Hand over your head.


« Posted 2006-02-03 15:27:06 »

In several topics the performance of tight loops were discussed. The current implementation of the Sun JVM can't take advantage of the SIMD properties of SSE(2). When one would need raw performance, SIMD would be a nice solution. Ofcourse these instructions cannot be wrapped directly, but a proper API around them would mean they would become accessible.

Ofcourse not all CPUs have SSE/SSE2 so to keep it platform- (or CPU-) independant, we'd have a fallback to Java-instructions.


1  
2  
3  
4  
5  
6  
FloatMath.add(FloatBuffer a, FloatBuffer b);
FloatMath.sub(FloatBuffer a, FloatBuffer b);
FloatMath.mul(FloatBuffer a, FloatBuffer b);
FloatMath.madd(FloatBuffer a, FloatBuffer b);

FloatMath.lerp(FloatBuffer a, FloatBuffer b, float w, FloatBuffer r); // combination of several SSE instructions


The code would try to do 4 calculations at once, until the remaining data in the buffers is less than 4.



This idea is so obvious, that I feel I'm forgetting something. Also I lack any C-skills and thus have never done any projects involving JNI.
From my point of view it looks like this could be programmed in a rainy afternoon.

Could anybody shed some light on this?

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline pepe

Junior Member




Nothing unreal exists


« Reply #1 - Posted 2006-02-03 15:44:10 »

Does not go in your path, but the direction looks like the same.
http://java.sun.com/j2se/1.4.2/changes.html#vm

Home page: http://frederic.barachant.com
------------------------------------------------------
GoSub: java2D gamechmark http://frederic.barachant.com/GoSub/GoSub.jnlp
Offline Riven
« League of Dukes »

JGO Overlord


Medals: 783
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #2 - Posted 2006-02-03 15:46:59 »

Sun's JVM supports SSE(2), yes, but not the SIMD part of it. That's the whole point Smiley

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Games published by our own members! Check 'em out!
Legends of Yore - The Casual Retro Roguelike
Offline pepe

Junior Member




Nothing unreal exists


« Reply #3 - Posted 2006-02-03 15:54:41 »

The whole thing with jvm optimisations like that is to code in patterns that the JIT will recognize and optimise.
Which ones of us tried to change a decrementing loop to an incrementing one in order to get rid of bound checks? Most of those that needed to. This is the same principle. (btw, is that still needed?)
Once we know the pattern that the JIT uses to do what is needed, we'll have it optimized the same way, but will not need to code using special tools. Hopefully, those patterns are generally those of a correct object design.
IIRC, two years ago this was discussed here. We finally came to the idea that such operations should be done on big arrays of data, instead of operations by operations. Thus that would have been the job of a special API instead of special instructions that would be regognized by jit.

Home page: http://frederic.barachant.com
------------------------------------------------------
GoSub: java2D gamechmark http://frederic.barachant.com/GoSub/GoSub.jnlp
Offline Riven
« League of Dukes »

JGO Overlord


Medals: 783
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #4 - Posted 2006-02-03 16:01:28 »

I don't really see your point Smiley

We are both saying these kinds of optimisation can't (or shouldn't) be done by the JIT, but by some specialized API.
And we agree both that it is only useful to do on large sets of data.
So it seems we share the same ideas Smiley

How hard would it be for anybody to code this API? As I see it, it would be a very small effort.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline zingbat

Senior Member




Java games rock!


« Reply #5 - Posted 2006-02-03 17:58:37 »

I don't really see your point Smiley

We are both saying these kinds of optimisation can't (or shouldn't) be done by the JIT, but by some specialized API.
And we agree both that it is only useful to do on large sets of data.
So it seems we share the same ideas Smiley

How hard would it be for anybody to code this API? As I see it, it would be a very small effort.

Your idea looks interesting but i doubt you will get much help here. Try posting this on an assembler forum like fasm for example and ask for help there:

http://flatassembler.net/

Or you can try to generate the machine code yourself into memory and then call your generated code from java. Don't know if this can be done in java. You will find in this resource page a link to Intel docs describing their machine code encoding:

http://board.flatassembler.net/topic.php?t=2530

Finnaly if there is any way to extend the jvm Jit with a dll this would be the best solution. The JIT would be trained to recognized certain specific classes like 'vector' and 'matrix' and automatically select SIMD instructions if available to do the math. This may be very hard to impossible to do without any support from Sun.
Offline Raghar

Junior Member




Ue ni taete 'ru hitomi ni kono mi wa dou utsuru


« Reply #6 - Posted 2006-02-03 18:15:38 »

1  
2  
3  
4  
5  
6  
FloatMath.add(FloatBuffer a, FloatBuffer b);
FloatMath.sub(FloatBuffer a, FloatBuffer b);
FloatMath.mul(FloatBuffer a, FloatBuffer b);
FloatMath.madd(FloatBuffer a, FloatBuffer b);

FloatMath.lerp(FloatBuffer a, FloatBuffer b, float w, FloatBuffer r); // combination of several SSE instructions


The code would try to do 4 calculations at once, until the remaining data in the buffers is less than 4.



This idea is so obvious, that I feel I'm forgetting something. Also I lack any C-skills and thus have never done any projects involving JNI.
From my point of view it looks like this could be programmed in a rainy afternoon.



What you described would be hugely uneffective. Better way would be something like SIMD program ARB. You know, write down some SSE3 instructions, add pipes, and be happy.

SIMDAbstractModule.addProgram(float[] pipes, SIMDProgramARB2 program, int IN_PLACE | LOCAL)

Of course SIMD program would need to have intel syntax to make it more easy to work with intel manuals and decrease amount on unnecessary locks.

instruction destination, source

{
pipe uint p0
pipe uint p1

mov r0, p0
mul p1, p1
add p0, p1
add p0, r0
}

{
p0 data {int}    //This doesn't imply any conversion it's simply Java way to describe layering of the pipe.
p1 data {int}
}
Offline Linuxhippy

Senior Member


Medals: 1


Java games rock!


« Reply #7 - Posted 2006-02-06 07:47:44 »

Quote
What you described would be hugely uneffective.
Why? This methods could be represented by instrincs and therefor run for "free".
Furthermore the suggested API has a quite clear design and would also be portable over many architectures that support SIMD instruction, and even on older CPUs there could be many levels of fall-backs.

Quote
Better way would be something like SIMD program ARB. You know, write down some SSE3 instructions, add pipes, and be happy.[/
Don't agree here. You completly loose platform-independence but furthermore executing your own assembly would render the whole jvm concept useless. no sandbox could be trusted anymore, JIT optimizations must take care of eventuall nonsence happening in the assembly (from where should the JIT know which registers are used or cpu states have been changed?).

lg Clemens
Offline Riven
« League of Dukes »

JGO Overlord


Medals: 783
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #8 - Posted 2006-02-06 08:21:19 »

I'm glad you said that, as I just couldn't figure out how it could be that inefficient.

The C compiler should be smart enough to generate reasonably fast ASM code using xmmintrinc.h. I've tried it, but I get compiler-errors from those very same header-files. I fear this will requires some effort of somebody that actually knows what he's doing, hehe, so not me.

If anybody is interested in doing it, I (and I suppose many others) would be delighted to hear it Smiley

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline AndersDahlberg

Junior Member





« Reply #9 - Posted 2006-02-06 16:41:41 »

Simple benchmark - regular disclaimers about benchmark implementation, coding misstakes etc. apply Smiley

Code implements vector multiplication A = A*B for 4*1024*1024 length vectors A and B populated with randomly distributed floats [0,100).
First numbers is with float arrays, second with (direct) FloatBuffers and third is a native implementation (direct bytebuffers used to transfer data).
"Well..." don't really serve any purpose except possibly preventing the number crunching process from being optimized away (?).

java version:
java version "1.6.0-rc"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.6.0-rc-b68)
Java HotSpot(TM) Client VM (build 1.6.0-rc-b68, mixed mode, sharing)

gcc version:
gcc (GCC) 3.4.5 (Gentoo 3.4.5, ssp-3.4.5-1.0, pie-8.7.9)

command used to compile shared library:
1  
gcc -pipe -msse -march=pentium4 -O3 simd_SimdTest.c -o libtestsimd.so -shared -ljvm -lverify -lnet -ljava -lnio -I/opt/sun-jdk-1.6.0/include -I/opt/sun-jdk-1.6.0/include/linux -L/opt/sun-jdk-1.6.0/jre/lib/i386 -L/opt/sun-jdk-1.6.0/jre/lib/i386/client -Wl,-soname,libtestsimd.so -lc  -fPIC  -D_REENTRANT  -W -Wall -Wno-unused -Wno-parentheses -Werror 



1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
<snip>
Well...:                        7412,38965.
Array calc took 34631 us.
Well...:                        4627,65234.
Array calc took 35024 us.
Well...:                        1983,89075.
Array calc took 34717 us.
<snip>
Well...:                        578,43506.
Buffer calc took 34490 us.
Well...:                        5744,37891.
Buffer calc took 34895 us.
Well...:                        3716,52808.
Buffer calc took 33800 us.
<snip>
Well...:                        1946,09839.
SIMD calc took 12619 us.
Well...:                        657,79602.
SIMD calc took 12674 us.
Well...:                        3669,88257.
SIMD calc took 12666 us.


Java:
1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
26  
27  
28  
29  
30  
31  
32  
33  
34  
35  
36  
37  
38  
39  
40  
41  
42  
43  
44  
45  
46  
47  
48  
49  
50  
51  
52  
53  
54  
55  
56  
57  
58  
59  
60  
61  
62  
63  
64  
65  
66  
67  
68  
69  
70  
71  
72  
73  
74  
75  
76  
77  
78  
79  
80  
81  
82  
83  
84  
85  
86  
package simd;

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;
import java.nio.FloatBuffer;

public class SimdTest {
   private static final int ITERS = 10;
   private static final int SIZE = 4*1024*1024;
    private static float[] dataA = new float[SIZE];
   private static float[] dataB = new float[SIZE];

   private static FloatBuffer bufferA = ByteBuffer.allocateDirect(SIZE*4).
                               order(ByteOrder.nativeOrder()).
                               asFloatBuffer();
   private static FloatBuffer bufferB = ByteBuffer.allocateDirect(SIZE*4).
                                 order(ByteOrder.nativeOrder()).
                                 asFloatBuffer();
   
   public static void main(String[] args) {
     
        // Test ARRAY based
        int random = (int) (SIZE*Math.random());
        for (int i = 0; i < ITERS; i++) {
         initArrays();
         long before = System.nanoTime();
         calcArrays();
         long after = System.nanoTime();
            System.out.printf("Well...:\t\t\t%4.5f.\n", dataA[random]);
         long diffInMs = (after - before) / 1000;
         System.out.printf("Array calc took %d us.\n", diffInMs);
      }
     
        // Test BUFFER based
     for (int i = 0; i < ITERS; i++) {
         initBuffers();
         long before = System.nanoTime();
         calcBuffers();
         long after = System.nanoTime();
            System.out.printf("Well...:\t\t\t%4.5f.\n", bufferA.get(random));
         long diffInMs = (after - before) / 1000;
         System.out.printf("Buffer calc took %d us.\n", diffInMs);
      }
     
        // Test SIMD based
        System.loadLibrary("testsimd");
        for (int i = 0; i < ITERS; i++) {
         initBuffers();
         long before = System.nanoTime();
         calcSIMD(bufferA, bufferB);
         long after = System.nanoTime();
         System.out.printf("Well...:\t\t\t%4.5f.\n", bufferA.get(random));
         long diffInMs = (after - before) / 1000;
         System.out.printf("SIMD calc took %d us.\n", diffInMs);
      }
   }

   private static void calcArrays() {
      for (int i = 0; i < SIZE; i++) {
         dataA[i] = dataA[i]*dataB[i];
      }
   }

   private static void calcBuffers() {
      for (int i = 0; i < SIZE; i++) {
         bufferA.put(i, bufferA.get(i)*bufferB.get(i));
      }
   }

   private static native void calcSIMD(FloatBuffer bufferA, FloatBuffer bufferB);
   
   private static void initArrays() {
      for (int i=0; i < SIZE; i++) {
            dataA[i] = (float) (Math.random() * 100);
            dataB[i] = (float) (Math.random() * 100);
      }
    }

   private static void initBuffers() {
      for (int i=0; i < SIZE; i++) {
            bufferA.put(i, (float) (Math.random() * 100));
            bufferB.put(i, (float) (Math.random() * 100));
      }
   }
}


C header:
1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
/* DO NOT EDIT THIS FILE - it is machine generated */
#include <jni.h>
/* Header for class simd_SimdTest */

#ifndef _Included_simd_SimdTest
#define _Included_simd_SimdTest
#ifdef __cplusplus
extern "C" {
#endif
#undef simd_SimdTest_SIZE
#define simd_SimdTest_SIZE 409600L
/*
 * Class:     simd_SimdTest
 * Method:    calcSIMD
 * Signature: (Ljava/nio/DoubleBuffer;Ljava/nio/DoubleBuffer;)V
 */

JNIEXPORT void JNICALL Java_simd_SimdTest_calcSIMD
  (JNIEnv *, jclass, jobject, jobject);

#ifdef __cplusplus
}
#endif
#endif


C-impl:
1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
26  
27  
28  
29  
30  
31  
32  
33  
34  
35  
36  
37  
38  
39  
40  
41  
42  
43  
#include <jni.h>
#include <stdio.h>
#include "simd_SimdTest.h"

//typedef float v4sf __attribute__ ((mode(V4SF)));

JNIEXPORT void JNICALL Java_simd_SimdTest_calcSIMD(JNIEnv *env, jclass cls, jobject ba, jobject bb)
{
    /*
    // Float vectors of size 4
    v4sf a;
    v4sf b;
    v4sf result;
    */

   
    float* bufferA = (float*) (*env)->GetDirectBufferAddress(env, ba);
    int sizeA = (*env)->GetDirectBufferCapacity(env, ba);
    float* bufferB = (float*) (*env)->GetDirectBufferAddress(env, bb);
   
    // assumes buffers is equal in size.
    int i = 0;
    for (; i < sizeA; i++) // i += 4
    {
        *bufferA = (*bufferA) * (*bufferB);
        bufferA++; bufferB++;
       
        /*
        // GCC 3.4 compiles above to same speed as the more
        // verbose version below.
       
        // Loads 4 floats from adress bufferA
        a = __builtin_ia32_loadups(bufferA);
        b = __builtin_ia32_loadups(bufferB);
        // a * b
        result = __builtin_ia32_mulps (a, b);   
        //Stores 4 floats at  adress bufferA
        __builtin_ia32_storeups(bufferA, result);
        bufferA = bufferA + 4;
        bufferB = bufferB + 4;
        */

    }
   
}
Games published by our own members! Check 'em out!
Legends of Yore - The Casual Retro Roguelike
Offline Riven
« League of Dukes »

JGO Overlord


Medals: 783
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #10 - Posted 2006-02-06 17:20:40 »

Thank you very much!

According to the results SIMD is 2.6x faster, which is really worth it, IMHO.

I think the more complex math is done in native code, the more the speedup will be (up to 4x?), like multiplying an array of vectors by one matrix, or lerping between 2 data-sets, writing to the 3rd (a * (1-w) + b * w).


Was it much work? Will it be much work to add other operations? If not, could you wrap quite a few basic math operations in a native lib (DLL / SO)? Smiley

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline zingbat

Senior Member




Java games rock!


« Reply #11 - Posted 2006-02-06 23:29:28 »

That would be really great.  Smiley
Offline pepe

Junior Member




Nothing unreal exists


« Reply #12 - Posted 2006-02-07 10:02:58 »

I don't really see your point Smiley

We are both saying these kinds of optimisation can't (or shouldn't) be done by the JIT, but by some specialized API.
And we agree both that it is only useful to do on large sets of data.
So it seems we share the same ideas Smiley
awww.. errr.
Lips Sealed
I didn't see the 'buffer' part of the names of your variables. Yes, we agree.

Home page: http://frederic.barachant.com
------------------------------------------------------
GoSub: java2D gamechmark http://frederic.barachant.com/GoSub/GoSub.jnlp
Offline Riven
« League of Dukes »

JGO Overlord


Medals: 783
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #13 - Posted 2006-02-07 11:31:30 »

Would something like this work too? I'm not too sure I did the pointer-offsets right in the loop.

1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
// cross
JNIEXPORT void JNICALL Java_simd_BufferMath_cross(JNIEnv *env, jclass cls, jobject a, jobject b, jobject c) {
    int size = (*env)->GetDirectBufferCapacity(env, a);

    float* pA = (float*) (*env)->GetDirectBufferAddress(env, a);
    float* pB = (float*) (*env)->GetDirectBufferAddress(env, b);
    float* pC = (float*) (*env)->GetDirectBufferAddress(env, c);
   
    size = size / 3;

    for (int i = 0; i < size; i++) {
       (*pC+0) = (*pA+1) * (*pB+2) - (*pA+2) * (*pB+1); // are these pointer-offsets valid?
       (*pC+1) = (*pA+2) * (*pB+0) - (*pA+0) * (*pB+2); // are these pointer-offsets valid?
       (*pC+2) = (*pA+0) * (*pB+1) - (*pA+1) * (*pB+0); // are these pointer-offsets valid?
       *pA = (*pA) + 3;
       *pB = (*pB) + 3;
       *pC = (*pC) + 3;
    }
}


Also it would be good to only pass direct pointers to the function, so that the java-code can figure out the position & limit of the FloatBuffer and take that into account before entering the loop.

1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
JNIEXPORT void JNICALL Java_simd_BufferMath_cross(JNIEnv *env, jclass cls, jlong a, jlong b, jlong c, jint vectors) {
    int size = (int) vectors * 3;

    float* pA = (float*) a;
    float* pB = (float*) b;
    float* pC = (float*) c;

    for (int i = 0; i < size; i++) {
       (*pC+0) = (*pA+1) * (*pB+2) - (*pA+2) * (*pB+1); // are these pointers valid?
       (*pC+1) = (*pA+2) * (*pB+0) - (*pA+0) * (*pB+2); // are these pointers valid?
       (*pC+2) = (*pA+0) * (*pB+1) - (*pA+1) * (*pB+0); // are these pointers valid?
       *pA = (*pA) + 3;
       *pB = (*pB) + 3;
       *pC = (*pC) + 3;
    }
}


I think this was how LWJGL handled this scenario.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline AndersDahlberg

Junior Member





« Reply #14 - Posted 2006-02-07 20:03:42 »

Hmm, I *think* that should be:
1  
2  
3  
*(pC+0) = *(pA+1) * ...
*(pC+1) = *(pA+2) * ...
*(pC+2) = *(pA+0) * ...


I.e. IIRC the dereferencing operator * have higher priority than + (disclaimer: I've only read a course in uni, never done any real work in c/c++)

EDIT: I really see no added benefit (only "trouble" Smiley ) by passing direct pointers instead of direct bytebuffers - as the buffers should have adequate performance (and, I believe LWJGL now uses only direct bytebuffers as well).

EDIT2: I.e. you could use limit()/capacity()/etc at the java-side and still send bytebuffers + extra integer "size" parameters and not bother with using a pointer Smiley
Offline Riven
« League of Dukes »

JGO Overlord


Medals: 783
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #15 - Posted 2006-02-07 20:53:25 »

Hmm, I *think* that should be:
1  
2  
3  
*(pC+0) = *(pA+1) * ...
*(pC+1) = *(pA+2) * ...
*(pC+2) = *(pA+0) * ...

Yes, I feared that.

I.e. IIRC the dereferencing operator * have higher priority than + (disclaimer: I've only read a course in uni, never done any real work in c/c++)

EDIT: I really see no added benefit (only "trouble" Smiley ) by passing direct pointers instead of direct bytebuffers - as the buffers should have adequate performance (and, I believe LWJGL now uses only direct bytebuffers as well).

EDIT2: I.e. you could use limit()/capacity()/etc at the java-side and still send bytebuffers + extra integer "size" parameters and not bother with using a pointer Smiley


My idea was to have this at the java-side:

1  
2  
3  
4  
5  
6  
7  
PUBLIC static final void mul(FloatBuffer a, FloatBuffer b, FloatBuffer c)
{
     int size = a.remaining(); // and check b, and c..
    mul(a, b, c, a.position() << 2, b.position() << 2, c.position() << 2, size);
}

PRIVATE static native void mul(FloatBuffer a, FloatBuffer b, FloatBuffer c, int aOff, int bOff, int cOff, int size);


For a moment I was considering using the Unsafe class to get the pointer at the java-side, but that call is done through reflection, which means a waste of performance, as it will get called very often.

This all means there will be quite a few method-parameters (one offset INT for every FloatBuffer, and 1 size) but that will be hidden for the java-side.


Sounds like a plan? Smiley

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline Raghar

Junior Member




Ue ni taete 'ru hitomi ni kono mi wa dou utsuru


« Reply #16 - Posted 2006-02-07 21:00:11 »

Quote
What you described would be hugely uneffective.
Why?
1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
26  
27  
28  
29  
30  
31  
32  
33  
1.
for 0 buffer length
mul a, b

for 0 buffer length
add a, c


is significantly less efficient than
2.
for 0 buffer length
mul a, b
add a, c


and of course better way is
3.
for 0 buffer length += 4
mov128 XMM0, [pointer to a + ofset]
mulf128D32 XMM0, [pointer to b + ofset]
addf128D32 XMM0, [pointer to c + ofset]
mov128 [pointer to a + ofset], XMM0

pipe SIMDOperationMadInAcion(a,b,c, maxLength) {
pipe a float
pipe b float
pipe c float

a = a*b + c //or a *= b;  a += c;

} (float[] a, float[] b, float[] c, length)

Would be easily converted into 3rd case, 4 pipes at once. And of course it would have nice bonus to work better on 256 bit SIMD register. If CPU will not have any SIMD ability, it would fall back just to one pipe at once.

It might be actually more helpful to convert
for(int d = 0....){
a[d] = a[d] * b[d] + c[d]
}

into JIT recognized SIMD "intristic" with asumption that neither array was volatile, and enable this type of optimisation "automatically".
Of course the above pipe syntax has advantage of being extremely clear to compiler. Obviously such analisys should be done at bytecode compilation time, so PIPE like hint to the compiler could be encoded into the .class file, and save some time of the JIT compiler.
Quote
Quote
Better way would be something like SIMD program ARB. You know, write down some SSE3 instructions, add pipes, and be happy.[/
Don't agree here. You completly loose platform-independence but furthermore executing your own assembly would render the whole jvm concept useless. no sandbox could be trusted anymore, JIT optimizations must take care of eventuall nonsence happening in the assembly (from where should the JIT know which registers are used or cpu states have been changed?).

I didn't say compile SSE3 commands into binary, I said use SSE3 like syntax for something like "vertex program/GLSL".. Obviously Sparcs has it's own version of SSE, and it's not exactly difficult to convert from one restricted command set into another.

Offline AndersDahlberg

Junior Member





« Reply #17 - Posted 2006-02-07 21:41:35 »

Regarding many parameters:
I believe the simd versions of math will only be relevant if the number of calculations is >> 1000 which seems to imply that even if you go crazy with number of extra checks you do at the java side / extra parameters you pass to native side, it will only have a slight impact on performance (and probably a big impact on increased usability!).

Hopefully all of this will soon (tm) become irrelevant, by the work of our sun jit-gods Wink. I agree though that it still could be a helpful library at the present time.
Offline Riven
« League of Dukes »

JGO Overlord


Medals: 783
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #18 - Posted 2006-02-07 22:14:29 »

Raghar, your idea is clean, and above all, it's safe (from sandbox POV) but either it requires a change in the JIT and in the class-spec, or dynamic compilation of SSE code done by the app (requires a C compiler!). Those are not going to happen in the next few years, while the alternative is so incredibly trival, as AndersDahlberg showed us.

If I had to choose, I'd go for *now*, instead of waiting for Sun to make (more or less) complex changes to their JVM, released in Java 7 I think.


AndersDahlberg, would you like to invest some time in this? If you want to, I could write all Java-code and C-code, I just can't manage to compile it as my compiler doesn't like the header-files that are required for SSE/SIMD. You might have to fix some trival bugs, as I can't test it, but the code is so simple that won't be a problem.

I really hope we can get this thing off the ground Cool

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline Riven
« League of Dukes »

JGO Overlord


Medals: 783
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #19 - Posted 2006-02-10 16:17:31 »

I'll take that as a "no" then.. ? Undecided

Anybody else? Kiss

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline rreyelts

Junior Member




There is nothing Nu under the sun


« Reply #20 - Posted 2006-02-10 16:39:10 »

Raghar, your idea is clean, and above all, it's safe (from sandbox POV) but either it requires a change in the JIT and in the class-spec, or dynamic compilation of SSE code done by the app (requires a C compiler!). Those are not going to happen in the next few years, while the alternative is so incredibly trival, as AndersDahlberg showed us.

Actually, I think I'm going to have a go at making a SIMD DSL for Java. It shouldn't require a pre-compiler, but it would obviously require you to load a native library. It would probably have an optional post-compilation step that would allow you to verify that your SIMD program was syntactically correct, and probably create a simpler/quicker representation of the program for runtime. I'll probably do something trivial just to demonstrate the interface/model is workable.


About me: http://jroller.com/page/rreyelts
Jace - Easier JNI: http://jace.reyelts.com/jace
Retroweaver - Compile on JDK1.5, and deploy on 1.4: http://retroweaver.sf.net.
Offline AndersDahlberg

Junior Member





« Reply #21 - Posted 2006-02-12 18:40:53 »

Sorry, no, I've got my hands full with finishing my final courses at university and at the same time trying to get a "thesis-job".
Pages: [1]
  ignore  |  Print  
 
 
You cannot reply to this message, because it is very, very old.

 

Add your game by posting it in the WIP section,
or publish it in Showcase.

The first screenshot will be displayed as a thumbnail.

CopyableCougar4 (23 views)
2014-08-22 19:31:30

atombrot (34 views)
2014-08-19 09:29:53

Tekkerue (30 views)
2014-08-16 06:45:27

Tekkerue (28 views)
2014-08-16 06:22:17

Tekkerue (18 views)
2014-08-16 06:20:21

Tekkerue (27 views)
2014-08-16 06:12:11

Rayexar (65 views)
2014-08-11 02:49:23

BurntPizza (41 views)
2014-08-09 21:09:32

BurntPizza (33 views)
2014-08-08 02:01:56

Norakomi (42 views)
2014-08-06 19:49:38
List of Learning Resources
by Longor1996
2014-08-16 10:40:00

List of Learning Resources
by SilverTiger
2014-08-05 19:33:27

Resources for WIP games
by CogWheelz
2014-08-01 16:20:17

Resources for WIP games
by CogWheelz
2014-08-01 16:19:50

List of Learning Resources
by SilverTiger
2014-07-31 16:29:50

List of Learning Resources
by SilverTiger
2014-07-31 16:26:06

List of Learning Resources
by SilverTiger
2014-07-31 11:54:12

HotSpot Options
by dleskov
2014-07-08 01:59:08
java-gaming.org is not responsible for the content posted by its members, including references to external websites, and other references that may or may not have a relation with our primarily gaming and game production oriented community. inquiries and complaints can be sent via email to the info‑account of the company managing the website of java‑gaming.org
Powered by MySQL Powered by PHP Powered by SMF 1.1.18 | SMF © 2013, Simple Machines | Managed by Enhanced Four Valid XHTML 1.0! Valid CSS!