Java OpenGL Math Library (JOML)
Offline BurntPizza
« Reply #120 - Posted 2015-07-14 16:55:13 »

Quote from: theagentd
Your previous post was a bit confusing to me. If I understood it right, you're saying that I could create a small native function by giving JOML an opcode buffer which is then somehow compiled to a native function? Huh So... software compute shaders for Java? O_o

It's a JIT: https://github.com/JOML-CI/JOML/blob/native-interpreter/native/src/codegen.dasc

The idea is that your regular Java code (chains of JOML methods) produces a native function in memory at runtime.

Here's the pure Java test case: https://github.com/JOML-CI/JOML/blob/native-interpreter/test/org/joml/NativeMatrix4f.java#L108
In the current impl, the Sequence object is essentially a function pointer to the native function; you set its arguments and call it.
NativeMatrix is a builder pattern for an opcode buffer that gets compiled to SSE code when you call terminate().
Offline Roquen
« Reply #121 - Posted 2015-07-14 17:04:50 »

Consider that an animation hierarchy is most often a static tree. A breadth-first walk yields a flat linear ordering of memory chunks. Each logical node can be a quaternion, a vector and the number of children, which, although an integer, can be stored as a single-precision float. It's cheaper to transform one vector directly than to go through the linear-algebra layer, so no struct library is really needed in this case. There are a number of variants on this theme.
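A minimal sketch of one such layout (the class, the 8-floats-per-node packing and all names are illustrative, not from any existing library): quaternion, translation and child count flattened into one float[] in breadth-first order, with vectors rotated by the quaternion directly instead of through a matrix.

// Flat bone store: one float[] in breadth-first order, 8 floats per node:
// qx,qy,qz,qw, tx,ty,tz, childCount (an integer stored as a float).
final class FlatBones {
    static final int STRIDE = 8;
    final float[] data;

    FlatBones(int nodeCount) { data = new float[nodeCount * STRIDE]; }

    // Rotates (v[0],v[1],v[2]) by the quaternion of node n, in place, using
    // v' = v + qw*t + q x t where t = 2*(q x v) -- no matrix is ever built.
    void rotateByNode(int n, float[] v) {
        int o = n * STRIDE;
        float qx = data[o], qy = data[o + 1], qz = data[o + 2], qw = data[o + 3];
        float tx = 2f * (qy * v[2] - qz * v[1]);
        float ty = 2f * (qz * v[0] - qx * v[2]);
        float tz = 2f * (qx * v[1] - qy * v[0]);
        v[0] += qw * tx + (qy * tz - qz * ty);
        v[1] += qw * ty + (qz * tx - qx * tz);
        v[2] += qw * tz + (qx * ty - qy * tx);
    }
}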
Offline Roquen
« Reply #122 - Posted 2015-07-14 17:13:10 »

Oh, and I meant to point out that even if you go with a structure where you need to transform multiple vectors, the actual matrix doesn't need to be stored anywhere... it's just temporarily expanded in registers.

Offline theagentd
« Reply #123 - Posted 2015-07-14 17:17:14 »

Quote from: Roquen
Consider that an animation hierarchy is most often a static tree. A breadth-first walk yields a flat linear ordering of memory chunks. Each logical node can be a quaternion, a vector and the number of children, which, although an integer, can be stored as a single-precision float. It's cheaper to transform one vector directly than to go through the linear-algebra layer, so no struct library is really needed in this case. There are a number of variants on this theme.
Oh, and I meant to point out that even if you go with a structure where you need to transform multiple vectors, the actual matrix doesn't need to be stored anywhere... it's just temporarily expanded in registers.
I'm not entirely sure what you're talking about, but I'm not using MappedObjects for my skinning. I experimented with it for a physics simulation which had a much higher number of cache misses, and the gains were only on AMD processors which have a much worse memory controller.

Offline Spasi
« Reply #124 - Posted 2015-07-14 17:56:34 »

Has anyone considered using OpenCL with a CPU device for this type of thing? What Kai is doing is basically compiling and running a simple OpenCL kernel.
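For illustration, roughly what such a kernel could look like: plain OpenCL C held in a Java string, with all host-side setup (choosing a CL_DEVICE_TYPE_CPU device, building the program, setting kernel arguments) omitted. The kernel source and names are a sketch, not JOML or LWJGL code.

// One work-item multiplies one pair of column-major 4x4 matrices.
static final String MUL4X4_SRC =
    "__kernel void mul4x4(__global const float* a,\n" +
    "                     __global const float* b,\n" +
    "                     __global float* dst) {\n" +
    "    int m = get_global_id(0) * 16;\n" +
    "    for (int c = 0; c < 4; c++)\n" +
    "        for (int r = 0; r < 4; r++) {\n" +
    "            float s = 0.0f;\n" +
    "            for (int k = 0; k < 4; k++)\n" +
    "                s += a[m + k * 4 + r] * b[m + c * 4 + k];\n" +
    "            dst[m + c * 4 + r] = s;\n" +
    "        }\n" +
    "}";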
Offline basil_
« Reply #125 - Posted 2015-07-14 18:10:14 »

I was already wondering, too: if you go down the rabbit hole that deep, why not use C or OpenCL in the first place?
Offline KaiHH
« Reply #126 - Posted 2015-07-14 18:22:26 »

Quote from: Spasi
Has anyone considered using OpenCL with a CPU device for this type of thing? What Kai is doing is basically compiling and running a simple OpenCL kernel.

Yes, OpenCL would be very cool! If anyone wants to contribute an OpenCL backend for JOML, that'd be awesome!
But for the time being I really enjoy x86 assembly programming, even though I have never done any assembly programming before.
And it is really fun! All those segmentation faults. Smiley
Offline KaiHH
« Reply #127 - Posted 2015-07-14 19:25:22 »

Quote from: theagentd
A 4.5x faster matrix multiplication method would however be nice to have...

Just changed the code generator to use unconditional jumps with DynASM dynamic labels (I just have to say, DynASM is an AMAZING tool!) and, for the same 100 bulked operations with 10,000 executions, got a speedup of over 600%!!!
(edit: if the JVM has a bad day, it's even over 760% sometimes!)
Exactly the same resulting matrix, just 600% faster. Cheesy
The code generator no longer emits linear code; instead it emits each used matrix function only once and uses unconditional jmps to generated labels/addresses to get back to the next statement.
I am sure there is much more that can be improved (like what @Roquen rightly pointed out), and I may sometime end up at even a 1000% speedup.

EDIT: A bit on how it's done:
I used a slightly different method than this awesome tutorial (edited!, was wrong link) did. They emitted an immediate jump to a dynamic label address, because they emit the code for every loop separately (they use that technique for nested loops).
This I could not do, sadly.
I needed a way to do what call/ret does: store the address of a label in a register and then unconditionally jump to it.
However, some Stack Overflow thread pointed out that a push followed by ret would be suboptimal for the CPU's return-address prediction.
Therefore I use rdx to store the address/label of the next operation in the batch, as that was the next free register (rcx holds the "arguments" ByteBuffer), and let DynASM generate a new dynamic label for each statement.
Each one-and-only generated matrix-operation function then simply ends with "jmp rdx".
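For intuition only, a rough Java analogue of that dispatch scheme (the names are invented; the real generated x64 has no dispatch loop at all, every emitted operation just ends in "jmp rdx"):

// Each matrix operation exists exactly once; an opcode buffer threads
// execution through the batch. In the generated native code the "next op"
// address lives in rdx instead of a loop counter.
final class ThreadedSketch {
    interface Op { void apply(float[] args); }

    static void runBatch(Op[] emittedOnce, int[] opcodeBuffer, float[] args) {
        for (int pc = 0; pc < opcodeBuffer.length; pc++) {
            emittedOnce[opcodeBuffer[pc]].apply(args); // native: ...; jmp rdx
        }
    }
}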

EDIT2: @Spasi, I do not deserve that star: the code generator actually had a bug, and now it's back to "just" 260% faster, but actually correct. Smiley

EDIT3: However, since the code size does not grow as fast as before, doing a batch of 1,000 matrix multiplications is now 360% faster, and the same goes for 10,000.
Offline BurntPizza
« Reply #128 - Posted 2015-07-14 21:00:34 »

Now with the dynamic label jumping (aka threaded code) this is closer to an embedded Forth than ever.
Offline KaiHH
« Reply #129 - Posted 2015-07-14 21:08:54 »

Yes. And thanks for that "threaded code" link! I did not know of that code-generation "scheme".
I am still new to native code, and that technique seemed like the most natural way to get small and fast code. Smiley
I also think one can find resemblances to it in many other languages and runtime environments.
Offline theagentd
« Reply #130 - Posted 2015-07-14 22:09:11 »

It'd be interesting if you could make something I could test in the same way I tested the earlier stuff. I'd just need native implementations of mul (preferably mul4x3) and translationRotateScale().

Offline BurntPizza
« Reply #131 - Posted 2015-07-14 22:15:12 »

I'm working on making the jitted 'matrix construction templates' (as I call them) work for transforming FloatBuffers containing many matrices at once. Would that be a relevant use case, @theagentd? That's what I got out of your prior example with the queues, and it seems like a use that would actually benefit from this tech (one JNI call to transform N matrices per unique op sequence).
Offline KaiHH
« Reply #132 - Posted 2015-07-14 22:17:50 »

Quote from: theagentd
It'd be interesting if you could make something I could test in the same way I tested the earlier stuff. I'd just need native implementations of mul (preferably mul4x3) and translationRotateScale().

A working implementation of mul is already in.
(edit: I am adding a translationRotateScale as well, though I doubt it will be any faster than Java's; we'll see. Just need to figure out how best to vectorize it. Smiley)

I keep a basic test always up to date:
https://github.com/JOML-CI/JOML/blob/native-interpreter/test/org/joml/NativeMatrix4f.java
(the main method at the end)

The API is still very far from beautiful, though: it needs some boilerplate code and is easily used very wrongly. Smiley
One thing I had recently was completely stalling my PC by doing 10 million invocations of NativeMatrix4f.mul(), which was quite bad, since that created a very large opcode buffer and then tried to generate executable code from it... Cheesy
It is safe to do around (at most) 10,000 operations on a matrix before terminate()'ing it to JIT the native code.
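If a client ever needs more than that, it can simply chunk the work. A sketch under that assumption (the helper and the loop shape are mine; Sequence, NativeMatrix4f, mul() and call() are from the test linked above):

// Hypothetical chunking to stay under the ~10,000-op limit per jitted sequence.
static void mulManyTimes(int total) {
    final int MAX_OPS = 10_000;
    for (int start = 0; start < total; start += MAX_OPS) {
        Sequence seq = new Sequence();
        NativeMatrix4f m = new NativeMatrix4f(seq);
        NativeMatrix4f operand = new NativeMatrix4f(seq);
        int end = Math.min(start + MAX_OPS, total);
        for (int i = start; i < end; i++) {
            m.mul(operand); // each recorded op becomes part of this chunk's batch
        }
        seq.call(); // terminates (jits) and executes this chunk
    }
}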

Do you need a pre-compiled Win64 library?
(edit: oh, and I think I am only going to support Windows x64 during the development phase; it's just so much pain to spool up a Linux VM :-)
Offline theagentd
« Reply #133 - Posted 2015-07-14 22:45:34 »

Ehhhhh, I think I'll wait until it's a bit more ready then. X___X

To describe my entire use case a bit more: each bone is represented by a translation, an orientation and a scale. A bone's parameters are interpolated using slerp/lerp from permanently loaded animation bones, which have the same parameters. I also have some other functions that allow the user of the engine to override the parameters of a bone with a fixed user-specified value.

After a bone's parameters have been interpolated/updated, the bone is transformed by its parent.
parentOrientation.transform(translation).scl(parentScale).add(parentTranslation); // rotate translation by parentOrientation, then scale by parentScale, then add parentTranslation
orientation.mulLeft(parentOrientation);
scale.scl(parentScale);


After that, all the bones are converted to matrices, multiplied with the (precomputed) inverse bind-pose matrices of the model, and finally stored in a ByteBuffer. This is by far the most expensive part, actually.
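For reference, that last step looks roughly like this with plain JOML types (translationRotateScale(), mul() and get(int, ByteBuffer) are Matrix4f methods; the helper class and arrays are illustrative):

import java.nio.ByteBuffer;
import org.joml.Matrix4f;
import org.joml.Quaternionf;
import org.joml.Vector3f;

final class SkinningPack {
    // Convert each bone to a matrix, fold in its precomputed inverse
    // bind-pose matrix, and write the result into the ByteBuffer.
    static void pack(Vector3f[] t, Quaternionf[] q, Vector3f[] s,
                     Matrix4f[] invBindPose, ByteBuffer buf) {
        Matrix4f tmp = new Matrix4f();
        for (int i = 0; i < t.length; i++) {
            tmp.translationRotateScale(t[i], q[i], s[i]) // bone -> matrix
               .mul(invBindPose[i])
               .get(i * 16 * 4, buf); // 16 floats = 64 bytes per bone
        }
    }
}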

Myomyomyo.
Offline KaiHH
« Reply #134 - Posted 2015-07-15 20:11:23 »

@theagentd: Your first proposal of an API, with your use case of the queues and the matrices being created on that queue, was actually pretty good, and I have done it this way now. Feels nice.

You can now do something like this:
Sequence seq = new Sequence();
NativeMatrix4f m1 = new NativeMatrix4f(seq);
NativeMatrix4f m2 = new NativeMatrix4f(seq);
{
    m1.identity().rotateZ((float) Math.PI / 2.0f);
    m2.identity().rotateX((float) Math.PI).transpose();
    m1.mul(m2);
    ...
}
seq.call();
// Store the NativeMatrix4f 'm1' into a Matrix4f:
Matrix4f m3 = new Matrix4f();
m1.get(m3);

The sequence shown above is of course absurdly simple, but you get the point. Smiley
I am about to translate the first (simple) Matrix4f methods to NativeMatrix4f. Those provide the same flexibility as the existing Matrix4f: operate on "this" or store the operation's result in a "dest" matrix.
You can build pretty complex transformations with that, which execute when you invoke "call()" on the sequence.
Every NativeMatrix4f holds its own NIO buffer of at least 4*16 bytes. This buffer can of course be shared among many NativeMatrix4f instances, each of them using its own offset into the buffer.
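Sharing could then look something like this; note that the three-argument constructor below is purely my guess at such an API's shape (only NativeMatrix4f(Sequence) appears in the snippet above):

// Hypothetical: two matrices backed by one direct buffer at different offsets.
ByteBuffer pool = ByteBuffer.allocateDirect(2 * 16 * 4) // two 4x4 float matrices
                            .order(ByteOrder.nativeOrder());
NativeMatrix4f a = new NativeMatrix4f(seq, pool, 0);      // guessed constructor
NativeMatrix4f b = new NativeMatrix4f(seq, pool, 16 * 4); // guessed constructor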

EDIT: The API is beginning to solidify (if one can say that this early). Once a single Sequence instance is terminated (either by explicitly calling terminate() or implicitly by the first call()), it can now be reused (as intended in the first place) for multiple invocations of the same sequence of operations that are already stored in it and jitted. This is necessary in order to update the actual arguments of the sequenced operations via the builder classes (NativeMatrix4f, NativeVector4f, ...) so that the client code looks the same in all cases (terminated and not terminated); i.e., terminating an already terminated Sequence does nothing.
There are validations in place to check that you always invoke the same sequence of methods on that Sequence, because breaking the sequence can have unpredictable consequences (most likely a process crash).
This is because once the sequence is terminated, its operations are jitted to a native function, which then simply expects the actual arguments in a given order that was determined when the sequence was first recorded.
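In code, that reuse would look something like this (a sketch; the variables are illustrative and the methods are the ones shown above):

Sequence seq = new Sequence();
NativeMatrix4f m = new NativeMatrix4f(seq);
m.identity().rotateZ(0.5f);
seq.call();                 // first call() implicitly terminates (jits) the sequence
m.identity().rotateZ(1.5f); // same op order as before: only the arguments change
seq.call();                 // re-executes the already-jitted native function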
Offline theagentd
« Reply #135 - Posted 2015-07-15 23:07:07 »

The reason I proposed that was the simplicity of it. It might not perform as insanely well as possible alternatives, but it does seem extremely easy to implement. The only real addition you have to make is to create a Sequence and make the matrices "native". That's convenient enough for people to be able to use it in everyday scenarios.

Offline KaiHH
« Reply #136 - Posted 2015-07-16 09:34:51 »

Okay, having investigated the code HotSpot generates, I must say it is doing a pretty darn good job. Like @Riven said, HotSpot keeps fields in registers, even across many method invocations. It looks like it knows that the same object is used again in a subsequent method and uses barrier checks to see whether that is still the case; if it is not, it spills the values to main memory.
HotSpot also seems to "evolve" the native code it generates for a method.
For a simple test case I did with Matrix4f.mul(), it generated that method four times, each time with slightly more optimized instructions, keeping more things in registers instead of on the stack or generally in main memory.
So far, JOML only outperforms HotSpot for the first 10,000 or so invocations.
After that, JOML is on par with and sometimes even slower than HotSpot, and it looks like this is because HotSpot keeps all the matrix fields in XMM registers, which is an amazing job.
However, JOML can also make use of that: if, for example, it sees a sequence of two matrix operations that operate on the same source and destination matrices, it need not spill the computed matrix columns to main memory, but can keep them in XMM registers.
Let's see how that will improve things, since I do believe memory latency is currently the bottleneck.
Once that is eliminated, in the optimal case the only differences between HotSpot and JOML would be that HotSpot still emits only scalar instructions whereas JOML uses vector instructions, and that JOML pays the JNI overhead. So in theory, JOML could be at most 4x as fast as HotSpot, minus the JNI overhead.
Offline CommanderKeith
« Reply #137 - Posted 2015-07-16 10:03:57 »

Your experiments are very interesting. While I don't make 3D games, it's nice to see how code can be heavily optimised.

I've read that Java's JIT compiler HotSpot uses SIMD (single instruction, multiple data) in array copies and fills:
http://psy-lob-saw.blogspot.com.au/2015/04/on-arraysfill-intrinsics-superword-and.html

I haven't heard of HotSpot using SIMD for anything else, so your changes are a very cutting-edge optimisation. It will be interesting to see whether there will actually be a significant speed-up.


Offline KaiHH
« Reply #138 - Posted 2015-07-16 10:17:02 »

Yepp. I am beginning to doubt that there will be really significant benefits from this, probably 2x as fast, and the amount of work needed is now becoming a bit disproportionate to the actual gains. But so far it has been an interesting venture, and I've ordered "Modern X86 Assembly Language Programming" for future projects.

So the bottom line I take from all this is probably: whoever thinks that Java is slow... think again! Smiley
Offline Riven
« Reply #139 - Posted 2015-07-16 10:47:07 »

I went through similar discoveries when working on LibStruct.

Main lessons learnt would be:
1) retaining data in registers is crucial for performance.
2) memory access (even an L1 cache hit) is slow.
3) most diehard optimizations will be hidden/thwarted by memory latency.

The only way to do vector math really quickly is to write dedicated ASM that takes #1 into account and works on bulk data, with loop unrolling in ASM, minimizing the number of jumps, conditional or not. Also note that SSE processes 128 bits in one go (256 bits with AVX), but due to latency my typical measured speedup was a factor of 1.7-2.2 in real-world scenarios; in micro-benchmarks the gains were greater, but... who cares.

Usually it's more productive to create/generate HotSpot-friendly bytecode than to generate ASM, as your ASM is bound to lack all kinds of peephole optimizations (like complex reordering of instructions) or maybe even basic stuff like memory prefetching. Let HotSpot do its magic, even if that means forgoing SIMD.

Offline KaiHH
« Reply #140 - Posted 2015-07-16 11:05:51 »

I am now thinking of analyzing the operation sequence in Java and inserting "hints" into the final "instructions" buffer that goes to the native JIT, indicating whether a jump should be performed or the code unrolled, and also whether it is safe to keep the matrices in registers across subsequent operations.
Preferably, for ease of implementation and portability, I would do most of the work in Java and submit only the "IDs" of the raw assembly-instruction templates (possibly parameterized) to the JIT, which it can then stitch together.
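Such an instruction/hint encoding might look like this (the opcode ids and hint bits are all invented for illustration):

import java.nio.IntBuffer;

final class OpEncoding {
    // Invented opcode ids and hint bits, purely illustrative.
    static final int OP_MUL_MATRIX     = 0x01;
    static final int OP_ROTATE_Z       = 0x02;
    static final int HINT_KEEP_IN_REGS = 1 << 16; // operands may stay in xmm registers
    static final int HINT_UNROLL       = 1 << 17; // emit inline instead of a jmp

    // The Java side decides the hints; the native JIT just stitches templates.
    static void encode(IntBuffer instructions) {
        instructions.put(OP_MUL_MATRIX | HINT_KEEP_IN_REGS);
        instructions.put(OP_ROTATE_Z | HINT_KEEP_IN_REGS | HINT_UNROLL);
    }
}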
That'll be fun! Smiley
Offline Riven
« Reply #141 - Posted 2015-07-16 11:16:03 »

And down the rabbit hole we go Pointing

See ya in a few months! Tongue



Update: keep in mind you are creating one of these side projects that only a handful of people will understand and that eventually nobody will use. I've been there with LibStruct. I think it's cool tech, but even though I'd deem it simple to use, it's too complex for 'the 99%', and the remaining people can't be bothered jumping through hoops, as the benefits of the library don't seem to outweigh the complexity cost.

Offline KaiHH
« Reply #142 - Posted 2015-07-16 11:21:52 »

Hehe. More like in a few "years". Wink

...wait, why not just embed LLVM and emit LLVM bitcode, and not tinker with those low-level optimizations anymore... Cheesy

Update on your update: I agree. The complexity for the end user is far greater than the gains. Especially given that even with "classic" JOML, every piece of code I have seen so far used it against its design, leading to worse performance.
So, it's back to glTranslatef() and gluLookAt(), I guess. Cheesy
Offline theagentd
« Reply #143 - Posted 2015-07-16 12:27:08 »

Quote from: Riven
Update: keep in mind you are creating one of these side projects that only a handful of people will understand and that eventually nobody will use. I've been there with LibStruct. I think it's cool tech, but even though I'd deem it simple to use, it's too complex for 'the 99%', and the remaining people can't be bothered jumping through hoops, as the benefits of the library don't seem to outweigh the complexity cost.
LibStruct was a bit of an extreme case, though. Having classes converted to structs is a gigantic minefield, as you don't have any compile-time checks, and a huge number of things that Java programmers are used to doing are simply not supported. That, in combination with the sometimes confusing ways it failed, made it hard for me to justify the overhead, especially since the gains were minimal on my own CPU (but substantial on AMD CPUs).

The complexity of the system I proposed is nowhere near as high or intrusive. If you know how to use the standard Matrix4f class, it's not a big leap to understand the Sequence object and that all operations are queued in it. It's also not very confusing or full of pitfalls if you use builder methods that take care of argument filling etc. The only thing that can cause VM crashes is feeding invalid opcodes or arguments to the native processor, and that can be hidden by abstraction.

Offline KaiHH
« Reply #144 - Posted 2015-07-16 15:41:41 »

So I've decided to move to using GCC intrinsics from now on: I write the code with intrinsics, let GCC compile it to plain asm, and then use that with DynASM to emit it at runtime.
Also, most of the code out there uses GCC intrinsics; no one seems to be writing plain assembly anymore. Smiley

Using intrinsics you can define your own variable names, easily initialize vectors with constant values without jumping through hoops, and have GCC do the register allocation for you. This is just so much easier and at times also gives surprising results. So now I just write the matrix operations as C functions with SSE intrinsics, look at what gets emitted, and use that as a template for DynASM. Smiley
Should've chosen this path earlier.
Because, so far, it's been such a pain keeping track of all the free/used registers myself all the time. That really gives you a headache.

The next big thing would be making use of AVX when bulk-multiplying eight matrices at once, as described here: https://software.intel.com/en-us/articles/benefits-of-intel-avx-for-small-matrices
But for this, JOML somewhat needs an operation scheduler that can detect and interleave independent operations, to make use of such parallel bulk operations even when the client does not explicitly anticipate them.

But before I start anything new, I will research a bit more what the actual possibilities are and what reasonably bulky and yet simple-to-concatenate "atomic" operations there could be for vector math.
Based on that, it should become clear which code-generation architecture to choose.

EDIT: Okay, so far GCC's translation of the intrinsics version of this matrix multiplication into assembly has been more than disappointing.
At -O3 the code looks like my amateurishly handcrafted code, but is actually consistently 50% slower than mine... gcc's code has 32 memory reads in it, whereas I used 20; on the other hand, I made use of the full width of all 8 xmm registers, whereas gcc only uses 4 (which is good, too, since the other four could then hold a matrix permanently...).
The -O2 code did not even unroll the loops and had two nested jumps in it. So, even worse.
Maybe clang/llvm does a better job of it.
Offline KaiHH
« Reply #145 - Posted 2015-07-17 18:35:04 »

So, I went deeper into that rabbit hole again. Cheesy

I decided to make JOML-native require at least SSE3, which has been available since the x64 Intel Pentium 4 "Prescott" family processors introduced in 2004.
AMD has supported SSE3 since 2005.
So whoever has a processor from the last ten years or so should be fine. Smiley

But why do I want to require SSE3?
Mainly because the x64 processors of that generation have a whole bunch more SIMD registers available, namely XMM0-XMM15 (strictly speaking, the eight extra registers come with 64-bit mode rather than with SSE3 itself).
Most native operations, such as matrix multiplication, use XMM0-XMM7 to avoid main-memory access as much as possible. With the 8 additional registers (XMM8-XMM15) we can hold *two* matrices permanently in registers, which is great.
I measured main memory accesses and those are really slow, like @Riven said, even L1 hits.

So I am using registers XMM8-XMM11 to hold the 4 columns of the primary/first matrix, which corresponds to Java's 'this' when invoking methods on a Matrix4f.
The other 4 registers, XMM12-XMM15, I use either for a second matrix operand (such as for matrix multiplication) or to hold a 'dest' matrix when 'dest' differs from 'this'.

Now the native code actually does less than it did before, and all instruction scheduling has moved into the Java "Sequence" class, which decides when to load/store registers from/to main memory based on the actual arguments of the issued matrix operations.
If you use the NativeMatrix4f class optimally, by only issuing matrix operations that store their results back into the same 'this' matrix, then only a single load from main memory is issued for the whole sequence of operations!
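For example (using only methods shown earlier; 'm', 'm2' and 'ang' are illustrative):

// Optimal pattern: every operation writes back into 'this' (m), so the four
// columns can stay in xmm8-xmm11 across the whole sequence -- one load at the
// start, one spill back to memory when the sequence terminates.
m.identity().rotateZ(ang).mul(m2).transpose();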

When the Sequence is terminated, it is checked whether the registers are still in sync with main memory (which is known from the operation semantics); if they are not, they are finally spilled back to main memory.
I am also emitting the instructions linearly again, to avoid jumps.

Also, having learnt a bit more about x64 calling conventions, the generated native function now actually has a correct prologue and epilogue; when these were wrong, the process could crash due to non-volatile registers not being saved. Cheesy

I will do some performance testing once the main matrix methods are implemented and I have validated their correctness.
Offline KaiHH
« Reply #146 - Posted 2015-07-17 19:23:18 »

Okay, here are the real numbers for Matrix4f.mul(Matrix4f).
The test fixture gives HotSpot enough opportunity to generate optimal code by first warming it up with 100,000,000 invocations of the classic Matrix4f.mul(Matrix4f), using the same objects that are also used for the timed runs later.

So here are the timings (average of 10 runs):
num of inv.  native /    classic    % faster
50x mul:       1520 /    5702 ns -> 275% faster
100x mul:      2281 /    8742 ns -> 283% faster
250x mul:     4.561 /  15.964 µs -> 250% faster
500x mul:     9.123 /  29.648 µs -> 225% faster
1000x mul:   17.105 /  53.215 µs -> 211% faster
2000x mul:   31.929 / 133.418 µs -> 317% faster
3000x mul:   47.134 / 187.393 µs -> 297% faster
5000x mul:   76.782 / 296.865 µs -> 287% faster
8000x mul:  123.155 / 442.446 µs -> 259% faster
10000x mul: 155.664 / 545.835 µs -> 259% faster

avg: 266% faster!


EDIT:

Just for fun, here are some numbers without warming HotSpot:
50x      1,521 /   48,654 ns -> 2,459%
100x     2,281 /   87,045 ns -> 3,716%
500x     8.743 /  432.563 µs -> 4,847%
1000x   16.724 /  778.841 µs -> 4,557%
2000x   33.450 / 1260.819 µs -> 3,669%
5000x   76.782 / 1752.679 µs -> 2,182%
8000x  122.015 / 1589.992 µs -> 1,203%
10000x 153.944 / 1701.744 µs -> 1,005%
20000x 489.960 / 2282.170 µs ->   365%

So, if you are planning on doing "only" 20,000 invocations or so in total, or you need the utmost performance right from the start, then JOML-native is definitely a win over HotSpot.
Offline KaiHH
« Reply #147 - Posted 2015-07-18 22:47:37 »

Since I am really fed up with writing assembly and doing register allocation in my head, or with pencil and paper, for even the simple functions I want to use SSE for, I am planning to add a small register allocator with a simple iterative liveness dataflow analysis.

Before that, I tried using GCC to convert simple functions using SSE intrinsics into assembly, but GCC has no way to require that particular values live in particular registers. It can "fix" certain registers so that its register allocator will not touch them (i.e., it assumes they do not exist), but that is not what I want.
There is also the possibility to "annotate" a variable (one with the "register" storage qualifier) with the name of a specific register, using the asm("regname") syntax after the variable declaration. That, however, did not work and was ignored by both GCC 4.9 and the recent LLVM/clang 3.6.2.
I wanted to be able to pin a certain variable to a specific register, since the interface of the assembly template snippets I use expects the matrix in xmm8-xmm11.

So what I really want now is something like DynASM for Java: being able to dynamically JIT native code at runtime with pure Java syntax! There are two possibilities for how to go about this. Both would provide an API through which the user accesses the SSE functionality via intrinsics named after those in GCC and MSVC; so there would be a class "Intrinsics" with static methods such as _mm_add_ps or _mm_set_ps.

Provided this, there are two ways:

1. Do static analysis of Java bytecode and literally translate each method marked as "should be jitted" into native code, by translating the JVM bytecode instructions to x86 instructions. Of course, no invocations of pure-Java methods would be allowed in those methods.

2. Provide a dynamic runtime where the intrinsic methods are actually implemented and, when invoked, write a representation of themselves into an "instructions" list. This is how the NativeMatrix4f builder concept currently works, except that the operations would no longer be complex matrix methods but lower-level SSE intrinsics.

Both methods have advantages and disadvantages:

Method 1 has the advantage that it can account for dynamic control flow in the execution of the jitted function, as it would also jit Java control-flow bytecode instructions, translating e.g. "goto" into x86 "jmp".
It does, however, have the disadvantage that the code it generates is fixed and static, because Java bytecode is static and (normally) does not change at runtime. We therefore lose the ability to dynamically alter the generated code based on runtime conditions.

Method 2 allows exactly that. But it has the disadvantage that we lose control flow in the generated jitted code, since we have no way of expressing it without making the Java code that represents it very ugly by introducing Java methods for control-flow instructions.
DynASM actually uses Method 2, but since it uses a preprocessor with special non-C syntax, it avoids uglifying the source language too much.

So, which method to choose? I tend towards Method 2. It gives the client the ability to use it more like a template/dynamic JIT and to generate instructions based on conditions known only at runtime. This method also does not need classpath scanning to find the classes/methods that should be jitted, and it simplifies the whole process a lot by not having to translate the entire JVM bytecode instruction set. We just need to implement the provided API so that it records the invoked intrinsics and then spits out a jitted native function at the end, just like the current NativeMatrix4f interface does.

In the end, this will allow you to make use of SSE/AVX and whatnot with pure Java syntax, for any use case that requires it. That's the goal.

EDIT:

Ideally, it would look something like this:
import org.joml.jit.JIT;
import org.joml.jit.JITContext;
import org.joml.jit.NativeCode;
import org.joml.jit.__m128;

import static org.joml.jit.Intrinsics.*;
import static org.joml.jit.Registers.*;

public class JitTest {
    public NativeCode generateMatrixIdent() {
        JITContext jit = JIT.newContext();

        // Generate the identity matrix in xmm8-xmm11
        __m128 one = _mm_set_ps(0.0f, 0.0f, 0.0f, 1.0f);
        __m128 xmm8 = new __m128(XMM8);
        __m128 xmm9 = new __m128(XMM9);
        __m128 xmm10 = new __m128(XMM10);
        __m128 xmm11 = new __m128(XMM11);
        _mm_move_aps(xmm8, one);
        one = _mm_shuffle_ps(one, one, _MM_SHUFFLE(2, 1, 0, 3));
        _mm_move_aps(xmm9, one);
        one = _mm_shuffle_ps(one, one, _MM_SHUFFLE(2, 1, 0, 3));
        _mm_move_aps(xmm10, one);
        one = _mm_shuffle_ps(one, one, _MM_SHUFFLE(2, 1, 0, 3));
        _mm_move_aps(xmm11, one);

        return jit.generate();
    }
}
Offline theagentd
« Reply #148 - Posted 2015-07-19 02:55:31 »

Wait, so I could literally convert any function to use SSE instructions? Would this cause some kind of JNI overhead? I am not very intimate with these kinds of low-level instructions, so I'm sorry if I'm misunderstanding something.

Offline CommanderKeith
« Reply #149 - Posted 2015-07-19 03:10:18 »

I only vaguely understand what you're doing, but it sounds like it would be a great tool and a commendable engineering feat!  Cool
Being able to batch up operations for SSE would benefit lots of areas where long calculations are done.
An advantage of the tool would be that noobs like me could use SSE without figuring out the vagaries of SSE on different architectures.
But would a disadvantage for the experts be that the tool sacrifices some control, so that the most efficient SSE code could not be expressed from Java? Perhaps a solution like embedding pure SSE code directly in the Java source file would be better. I saw a language that did that somewhere, by preceding each such line with a pipe | symbol; but even doing it in a String would suffice. Then, I assume, finer control could be achieved.
Some other thoughts:
- I wonder how fast SSE is compared to using OpenCL, as @Spasi hinted at above. Since most performance-intensive programs will use the graphics card anyway, OpenCL would be available to them and might even be faster than SSE, given the much larger number of cores on graphics cards and the possibility of keeping OpenCL results on the graphics card for drawing in OpenGL.
- I assume the tool would only be available for Java programs, not for Android or JavaScript code. It blows against the wind a little, since Java as a client-side technology is dying. The best ways to distribute our Java programs these days are as WebGL/HTML5/JavaScript or as an Android app.
