  Java OpenGL Math Library (JOML)  (Read 218787 times)
Offline Spasi
« Reply #150 - Posted 2015-07-19 06:52:18 »

@Kai, assuming what you describe above is doable, you could then wrap the low-level interface in a user-friendly API similar to SIMD.js.

Would this cause some kind of JNI overhead?

It doesn't change anything performance-wise, you'd still have to batch enough work to trivialize the native call overhead.

An advantage of the tool would be that noobs like me can use SSE without figuring out the vagaries of SSE on different architectures.
But would a disadvantage for the experts be that your tool would sacrifice some control so that the most efficient SSE code could not be made using java code?

What Kai is describing is direct access to SIMD instructions from Java. So full control and full performance. It doesn't make it any easier for people not familiar with SSE. But of course you could hide the nastiness in JOML or something like SIMD.js.

I wonder how fast SSE is compared to using OpenCL, as @Spasi hinted at above. Since most performance-intensive programs will use the graphics card, OpenCL would be available to them and might even be faster than SSE, given the much larger number of cores on graphics cards and the possibility of OpenCL calculation results being kept on the graphics card for drawing in OpenGL.

The benefit of OpenCL is transparent use of SSE->AVX2 and multithreading at the same time. The disadvantages are a) you'd have to batch up more work, going through OpenCL will have more overhead than what Kai is doing and b) OpenCL availability; SSE is everywhere, but OpenCL requires a runtime to be available at the user's machine.

Also note that using the GPU with OpenCL is not an option in this case. I'm assuming that whatever you're doing with JOML needs to be CPU-accessible, for culling, collision-detection, etc. Otherwise, you'd be using OpenGL shaders directly.

Offline Roquen (JGO Kernel, Medals: 518)
« Reply #151 - Posted 2015-07-19 07:31:49 »

I have been using SIMD ops since MMX was only available on prototype hardware. Something that's been true since then: if you want real performance gains, the data has to be laid out in a friendly manner.
Offline basil_ (« JGO Bitwise Duke », Medals: 418, Exp: 13 years)
« Reply #152 - Posted 2015-07-19 08:32:03 »

just a side note: if you want to see the power of SIMD, try the game "From Dust" (https://en.wikipedia.org/wiki/From_Dust) - it shows crazy large-scale deformable volumetric meshes computed entirely on the CPU. side note to the side note: it's made by Éric Chahi, the creator of "Another World", if you are old enough to remember that ;)

on topic: i would love to use a "general purpose" SIMD native API!
Offline KaiHH (JGO Kernel, Medals: 819)
« Reply #153 - Posted 2015-07-19 12:32:13 »

Wait, so I could literally convert any function to use SSE instructions? Would this cause some kind of JNI overhead? I am not very intimate with these kinds of low-level instructions, so I'm sorry if I'm misunderstanding something.
It would allow you to use the SSE instructions, preferably in the form of the GCC/msvc intrinsics to make use of three-operand instructions and to not care about register allocation. So, anyone wanting to use this would first have to familiarize him/herself with SSE, which is not very hard.
The plan however is not to allow "any" currently scalar operations to be automagically converted to use SSE instructions. That would be "auto-vectorization" and that is a non-trivial task. Currently, even state-of-the-art C/C++ compilers support it in very limited form, such as "loop-vectorization."
So, to use this you would come up with algorithms (or search for one) that already make use of SSE instructions and the plan is then to make those algorithms easily portable/accessible to Java.

An advantage of the tool would be that noobs like me can use SSE without figuring out the vagaries of SSE on different architectures.
Yepp. But there are not many different architectures, since SSE is x86-only. So there is just x86 and x86-64, the latter with more registers and wider basic datatypes. And AFAIK there is only the difference in calling conventions and register usage between Windows and not-Windows. :)
But that would be hidden by the library, yes.

But would a disadvantage for the experts be that your tool would sacrifice some control so that the most efficient SSE code could not be made using java code?
I plan to support all SSE instructions. The only difference is in the expressiveness of control flow: I am not planning to emit control-flow instructions. Any control flow a client needs has to happen during generation of the native code, by branching in Java when calling the SSE (builder) methods.

Perhaps a solution like embedding pure SSE code in the java source file directly would be better. I saw a language that did that somewhere by preceding the line with a pipe | symbol. But even doing it in a String would suffice. Then, I assume, finer control could be achieved.
Yes, you are talking about DynASM here. And since DynASM provides the whole x86 instruction set (and others), it allows control flow in the generated code. But DynASM basically "only" provides encoding of the low-level instructions and relocation of labels (so it literally IS an assembler). This is too low-level in my opinion, but of course it is the most expressive and flexible way, if one really wants to code in assembly.
What I would like is a library that is one level above that by providing simple register allocation, so you don't have to do that anymore. To do that we would need to lift the abstraction level of the SSE instructions one level higher and make those really source-level functions/methods that do not operate on registers anymore but on "variables" (i.e. Java "locals").
This is what the native compilers do with their "intrinsics" functions. They make the SSE/AVX instructions available via source-level functions. This is a very elegant way. And it seems that when people are talking about SIMD algorithms they do so in terms of those source-level intrinsics. So as stated above, the real motivation for me to create a library as described is to be able to translate those GCC/msvc intrinsics directly to Java.

Some other thoughts:
-I wonder how fast SSE is compared to using OpenCL
EDIT: After a helpful correction from @Spasi (see the post below), the following is wrong and therefore struck through:

I now see OpenCL as a different field which needs special treatment. The real driving force behind the library for me currently is easy porting of SSE intrinsics from GCC/MSVC to Java.
A disadvantage of OpenCL in addition to what @Spasi already mentioned is for me: complexity in the writing of algorithms for it.
Remember that OpenCL conceptually (and with modern GPUs also physically) only has scalar operations.
Now to actually do SIMD, you would need to think about multithreading in a clever way to actually use 4 threads to simultaneously operate on a 4-component vector (or bigger).
And to have the best possible performance, those threads will likely want to be in the same wavefront and the data they access would need to be consecutive.
Also, the operational semantics and the concept of OpenCL are quite different from x86 SSE, though both provide SIMD.
If you want to think of SSE in terms of OpenCL, it would be that SSE only has 4 threads. Those 4 threads are synchronized "implicitly" after each SSE instruction. Each such instruction has lock semantics (meaning changes to memory are visible immediately) within the same thread group (those 4 threads).
In SSE you express your algorithm in SIMD form, whereas in OpenCL you express it in scalar form and hope that the runtime will execute multiple such instructions in parallel.
All in all this makes SSE a lot easier to use than OpenCL, but of course also limits you to 4 threads.
But if you want more than 4 "conceptual" threads, you can of course always use operating-system threads to parallelize your SSE algorithm even further over more data, having "more" SIMD. :)
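That last idea, OS threads each running their own SIMD-style inner loop over a slice of the data, can be sketched in plain Java. This is an illustrative scalar stand-in only (the chunked inner loop is the part an SSE kernel would vectorize), not JOML or JNI code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkedAdd {
    // Split a big element-wise addition across OS threads; each thread's
    // inner loop is exactly the kind of code an SSE routine would handle.
    public static float[] addParallel(float[] a, float[] b, int threads) {
        float[] r = new float[a.length];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Future<?>[] parts = new Future<?>[threads];
            int chunk = (a.length + threads - 1) / threads;
            for (int t = 0; t < threads; t++) {
                final int from = Math.min(a.length, t * chunk);
                final int to = Math.min(a.length, from + chunk);
                parts[t] = pool.submit(() -> {
                    for (int i = from; i < to; i++) r[i] = a[i] + b[i]; // the SSE-friendly inner loop
                });
            }
            for (Future<?> p : parts) p.get(); // wait for all slices
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return r;
    }
}
```

Each slice is independent and contiguous, which is also what makes the per-thread loop SSE-friendly in the first place.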


-I assume that the tool would only be available for Java programs and not for Android or JavaScript code. It blows against the wind a little, since Java as a client-side technology is dying. The best ways to distribute our Java programs these days are as WebGL/HTML5/JavaScript or as an Android program.
Android support is currently not planned, since I have absolutely no idea how to do dynamic native code generation on Android. :)
Even if used on an x86 architecture, I don't know whether one can mmap and mprotect on Android, or whether it is at all possible for user applications to generate native code at runtime. There seem to be very few resources about that online.

About JavaScript: if you are talking about libGDX with its GWT backend, that GWT backend would need to be augmented to recognize the SSE instructions of JOML and convert them to, for example, SIMD.js. This would allow Java-to-JavaScript-converted code to use SSE. At least on Firefox Nightly.
Offline CommanderKeith
« Reply #154 - Posted 2015-07-19 13:57:59 »

The benefit of OpenCL is transparent use of SSE->AVX2 and multithreading at the same time. […]

Also note that using the GPU with OpenCL is not an option in this case. […]

I see OpenCL a different field now which needs special treatment. […]

Good points, I understand the distinction better now. Thanks for the explanations.

@KaiHH Maybe I'm missing something, but I can see from the Jittest NativeCode sample you posted that you will give access to the registers as static variables of the class org.joml.jit.Registers. This makes sense for a single-threaded application, but for multi-threaded use of an SSE setup like you hinted at, shouldn't these be local variables?

Do you think it would be worth making a version that operates on 2 doubles rather than 4 floats? I notice that SIMD SSE2 allows 2 doubles as well as 4 floats: https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions#Registers
Obviously there would be a 50% slowdown doing 2 doubles instead of 4 floats, but it would be nice to have. I only ask because I've found that doing intermediate calculations with floats can cause rounding problems compared to using doubles all the way through.
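The float-rounding point is easy to demonstrate with a small, self-contained example (unrelated to JOML's classes): accumulating many small float increments drifts far from the exact result, while a double accumulator stays close.

```java
public class Precision {
    // Accumulate `step` n times in float precision; rounding error grows
    // noticeably once the running sum dwarfs the step.
    public static float sumFloat(int n, float step) {
        float s = 0f;
        for (int i = 0; i < n; i++) s += step;
        return s;
    }

    // Same accumulation in double precision; the error stays tiny.
    public static double sumDouble(int n, double step) {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += step;
        return s;
    }
}
```

Summing 0.1 a million times should give 100000; the float version ends up off by several hundred, while the double version is off by far less than a hundredth.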

This is a very exciting project.

Offline Spasi
« Reply #155 - Posted 2015-07-19 14:09:13 »

I see OpenCL a different field now which needs special treatment. […] Remember that OpenCL conceptually (and with modern GPUs also physically) only has scalar operations. […] If you want to think of SSE in terms of OpenCL, it would be that SSE only has 4 threads. […]

Everything you said here is wrong, you're missing the point of OpenCL entirely. Multithreading happens at the workgroup level. Everything within a workgroup runs on the same thread with appropriate SIMD instructions for the kernel workload and target CPU. Read this for details. The Intel OpenCL compiler can even output 512-bit SIMD on Xeon Phi.

You really cannot beat OpenCL in terms of usability either:

float4 a = ...;
float4 b = ...;
float4 r = a + b; // it doesn't get more friendly than this


float16 a = ...;
float16 b = ...;
float16 r = a + b; // this will be 1 instruction on Phi or 4 "unrolled" instructions with plain SSE

GPUs are indeed scalar these days. For example AMD cards had vector registers up to the Radeon HD 6000 series (VLIW5 originally, then VLIW4), but switched to a scalar architecture with GCN. A vec4 operation will indeed be translated to 4 scalar instructions, but that's an implementation detail. In both OpenGL shaders and OpenCL kernels, you're using vector types for the semantics; it is the compiler's job to translate that to the optimal hardware instructions. That means SIMD on Intel CPUs.
Offline KaiHH (JGO Kernel, Medals: 819)
« Reply #156 - Posted 2015-07-19 14:18:29 »

@KaiHH Maybe I'm missing something, but I can see from the Jittest NativeCode sample you posted that you will give access to the registers as static variables of the class org.joml.jit.Registers. This makes sense for a single-threaded application, but for multi-threaded use of an SSE setup like you hinted at, shouldn't these be local variables?
Your Java code will not have access to the registers. The native code that the Java method represents, and that gets built by executing the Java "builder" methods, will.
Those __m128 Java locals/variables in the example only "represent" variables, which will eventually be allocated to registers when generating the native code.
But those registers will not be accessible from Java and are not linked to those Java variables/locals.
Accessing registers from Java is not the goal of JOML, as that is partly not possible and would also require each register access to go through a JNI interface function, which would be dramatically slow.

Do you think it would be worth making a version that operates on 2 doubles rather than 4 floats? I notice that SIMD SSE2 allows 2 doubles as well as 4 floats: https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions#Registers
Obviously there would be a 50% slowdown doing 2 doubles instead of 4 floats, but it would be nice to have. I only ask because I've found that doing intermediate calculations with floats can cause rounding problems compared to using doubles all the way through.
We could do this, of course, if it will be useful for someone.

@Spasi: Ahh... thank you for your clarifications! I was somehow under the impression that OpenCL has no vector primitive types when I wrote the post. Should've looked that up, though. :)

EDIT:

Just actively searched for existing projects that promise to do what I want. So far I only found jssembly. However, it is in a very early pre-alpha state with not much actually working. And it is also too low-level for me, as it seems to plan on having only an instruction encoder and a simple mapper from instruction text mnemonics to operations using an ANTLR grammar; so no relocation and linking to make it a full assembler.
Offline Kefwar
« Reply #157 - Posted 2015-07-20 13:57:51 »

It would be very useful if every class had the get(FloatBuffer buffer) method.
Currently only Matrix4f has it (correct me if I'm wrong).

Offline KaiHH (JGO Kernel, Medals: 819)
« Reply #158 - Posted 2015-07-20 14:08:46 »

That is correct. Currently, only the matrix classes have those methods.
The rationale behind this is that it is considered a low burden for a client to do:
Vector3f v = ...;
FloatBuffer fb = ...;
// put vector into fb
fb.put(v.x).put(v.y).put(v.z);

whereas with the 4x4 matrix it can be difficult and cumbersome because of the 16 values and having to take care of correct column-major layout.
But if you really need it, it can of course be added.
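For illustration, here is roughly the by-hand work the matrix get() methods spare the client: writing a 4x4 matrix (sketched here as a plain row-major float[16], not the real Matrix4f fields) into a FloatBuffer in the column-major order OpenGL expects, using absolute puts so the position is left untouched.

```java
import java.nio.FloatBuffer;

public class ColumnMajor {
    // Transpose-on-write: element (row, col) of the row-major input lands at
    // column-major index col*4 + row. Absolute puts leave fb's position as-is.
    public static FloatBuffer put(float[] rowMajor, FloatBuffer fb) {
        int base = fb.position();
        for (int col = 0; col < 4; col++)
            for (int row = 0; row < 4; row++)
                fb.put(base + col * 4 + row, rowMajor[row * 4 + col]);
        return fb;
    }
}
```

Getting both the index arithmetic and the position behavior right by hand, sixteen times over, is exactly the cumbersome part.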
Offline Kefwar
« Reply #159 - Posted 2015-07-20 14:14:03 »

whereas with the 4x4 matrix it can be difficult and cumbersome because of the 16 values and having to take care of correct column-major layout.
But if you really need it, it can of course be added.
I had already fixed it that way. I just thought it would be useful if every class had the method, since it is used quite often.

Offline KaiHH (JGO Kernel, Medals: 819)
« Reply #160 - Posted 2015-07-20 14:28:57 »

Okay, org.joml:joml:1.4.1-SNAPSHOT and org.joml:joml-mini:1.4.1-SNAPSHOT contain the changes to Vector3f and Vector4f. Additionally, there is also a getTransposed in Matrix3f and Matrix4f.
Fetch it from https://oss.sonatype.org/content/repositories/snapshots if you are using Maven; or just from the sources.
Offline cylab (JGO Kernel, Medals: 196)
« Reply #161 - Posted 2015-07-20 15:11:45 »

If you now put in some static versions for use with collections and arrays, that would also be great:
// Create Buffers from Vectors and return the new Buffer
public static FloatBuffer Vector3f.toBuffer(Vector3f[] vectors)
public static FloatBuffer Vector3f.toBuffer(Collection<Vector3f> vectors)

// Add Vectors to a Buffer and return it with the position after the added vectors (+last stride)
public static FloatBuffer Vector3f.toBuffer(Vector3f[] vectors, FloatBuffer buffer)
public static FloatBuffer Vector3f.toBuffer(Vector3f[] vectors, FloatBuffer buffer, int position)
public static FloatBuffer Vector3f.toBuffer(Vector3f[] vectors, FloatBuffer buffer, int position, int stride)

// same with a collection of Vectors
public static FloatBuffer Vector3f.toBuffer(Collection<Vector3f> vectors, FloatBuffer buffer)
public static FloatBuffer Vector3f.toBuffer(Collection<Vector3f> vectors, FloatBuffer buffer, int position)
public static FloatBuffer Vector3f.toBuffer(Collection<Vector3f> vectors, FloatBuffer buffer, int position, int stride)

// Create Arrays of Vectors from a Buffer
public static Vector3f[] Vector3f.toArray(FloatBuffer buffer)
//  position and stride are for the buffer
public static Vector3f[] Vector3f.toArray(FloatBuffer buffer, int position)
public static Vector3f[] Vector3f.toArray(FloatBuffer buffer, int position, int stride)

// Create ArrayLists of Vectors from a Buffer
public static List<Vector3f> Vector3f.toCollection(FloatBuffer buffer)
//  position and stride are for the buffer
public static List<Vector3f> Vector3f.toCollection(FloatBuffer buffer, int position)
public static List<Vector3f> Vector3f.toCollection(FloatBuffer buffer, int position, int stride)

// CopyVectors from a Buffer to an existing Array, until it's filled, starting at index 0
public static Vector3f[] Vector3f.toArray(FloatBuffer buffer, Vector3f[] vectors)
//  position and stride are for the buffer
public static Vector3f[] Vector3f.toArray(FloatBuffer buffer, int position, Vector3f[] vectors)
public static Vector3f[] Vector3f.toArray(FloatBuffer buffer, int position, int stride, Vector3f[] vectors)

// Copy Vectors from a Buffer to an existing Collection and return it
public static <T extends Collection<Vector3f>> T Vector3f.toCollection(FloatBuffer buffer, T vectors)
// position and stride are for the buffer
public static <T extends Collection<Vector3f>> T Vector3f.toCollection(FloatBuffer buffer, int position, T vectors)
public static <T extends Collection<Vector3f>> T Vector3f.toCollection(FloatBuffer buffer, int position, int stride, T vectors)

// Same with matrix etc.
...


Also, beware of feature creep ;)

Mathias - I Know What [you] Did Last Summer!
Offline KaiHH (JGO Kernel, Medals: 819)
« Reply #162 - Posted 2015-07-20 16:01:27 »

JOML was designed to be a 3D math library, easily usable with Java/OpenGL bindings such as JOGL or LWJGL, which make use of Java NIO to efficiently interface with native OpenGL.
The last part of the above sentence is also the only reason why JOML even provides NIO FloatBuffer/ByteBuffer conversion methods.
If it wasn't for NIO being used by Java/OpenGL bindings, then even those conversion methods would not be in JOML.

Therefore I'm afraid you're going to have to build those utility methods yourself, as JOML is unlikely to interface with the whole Java Collections API just for converting JOML classes to vectors/lists/arrays/maps/sets and whatnot; that has little to do with a 3D OpenGL math library and resembles more an Apache-Commons-style library.

So, if you need such functionality for your projects, you can create a joml-util (or such) library containing those methods.
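As a sketch of how small such a joml-util could start: the helper below packs a collection of vectors tightly into an existing buffer with relative puts. Vec3 is a minimal stand-in defined here for the example, not org.joml.Vector3f, and the toBuffer signature is hypothetical.

```java
import java.nio.FloatBuffer;
import java.util.Collection;

public class JomlUtil {
    // Minimal stand-in for a vector class, just for this sketch.
    public static class Vec3 {
        public final float x, y, z;
        public Vec3(float x, float y, float z) { this.x = x; this.y = y; this.z = z; }
    }

    // Pack vectors tightly into the buffer; relative puts advance the position,
    // so the caller can keep appending (or flip() before uploading).
    public static FloatBuffer toBuffer(Collection<Vec3> vectors, FloatBuffer fb) {
        for (Vec3 v : vectors) fb.put(v.x).put(v.y).put(v.z);
        return fb;
    }
}
```

The stride/position variants cylab listed would just be index arithmetic layered on top of this.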
Offline KaiHH (JGO Kernel, Medals: 819)
« Reply #163 - Posted 2015-07-20 18:24:23 »

Had to take today and tomorrow off from work. So I used the time to read about the x86 instruction encoding rules, which are nicely explained here and here, and debugged DynASM's Lua sources.
My plan is still to build a small x86 and x86-64 assembler for Java as a first step towards a nice "high-level" assembly language for Java. "High-level" in a sense that one can use "variables" with automatic register allocation and one can use the SSE/AVX instructions via intrinsic functions.
The x86 encoding rules are wicked and "historically evolved," but they can still be schematically laid out in the form of a table, like this awesome site does.
Now I am working with jsoup to auto-generate an encoding algorithm from that table, which will use pattern matching to find the correct encoding rule for e.g. "mov eax, ebx", because "mov" has various possible encodings based on the operand kinds.
DynASM does the same in its Lua sources, only differently.
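To make the "various possible encodings" point concrete: "mov eax, ebx" can be emitted either as opcode 0x89 (MOV r/m32, r32) or 0x8B (MOV r32, r/m32), so the encoder has to pick a rule from the operand kinds. Below is a tiny sketch of the 0x89 form, with register numbers as defined in the x86 manuals (eax=0, ebx=3); this is illustrative, not JOML's actual encoder.

```java
public class MovEncoder {
    // For opcode 0x89 the ModRM byte is mod(2 bits) | reg(3 bits) | rm(3 bits):
    // mod=11 selects register-direct operands, reg holds the source register
    // number and rm the destination.
    public static byte[] movRegReg(int dstReg, int srcReg) {
        byte modrm = (byte) (0b11_000_000 | (srcReg << 3) | dstReg);
        return new byte[] { (byte) 0x89, modrm };
    }
}
```

A table-driven encoder generated from the site mentioned above would essentially pick between such per-operand-kind emit rules.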
Offline KaiHH (JGO Kernel, Medals: 819)
« Reply #164 - Posted 2015-07-21 12:00:52 »

@theagentd: NativeMatrix4f.translationRotateScale() is in. Once you've finished your WSW Demo 7, you might want to have a look at it.
It times at around 486% faster in 20,000 invocations with cold HotSpot and stabilizes at around 260% faster in 20,000 invocations with warmed HotSpot.
Now I will take care of doing a Linux version of all that.
Okay, the Linux x64 version works :) and it should be even faster than the win64 version, since Linux's calling convention, unlike win64's, has no non-volatile XMM registers that must be saved to the stack.
Offline theagentd
« Reply #165 - Posted 2015-07-21 14:02:04 »

I'm not entirely sure how to try it out. I could extend the benchmarks I've made so far to also test NativeMatrix4f I guess, but I'm not sure how the code works anymore? @__@

Myomyomyo.
Offline KaiHH (JGO Kernel, Medals: 819)
« Reply #166 - Posted 2015-07-21 14:03:38 »

Here are functioning JUnit testcases for all so far implemented native methods: https://github.com/JOML-CI/JOML/blob/native-interpreter/test/org/joml/test/NativeMatrix4fTest.java
Prebuilt win64 and linux64 shared libraries are here: https://github.com/JOML-CI/JOML/tree/native-interpreter/native

It works like you laid out in your initial API design. :)
Generally, what one wants to do with NativeMatrix4f is to batch as many invocations as possible on it via a given Sequence (passed as a constructor argument), and at the end call Sequence.call() to execute that sequence. The .get() methods are also delayed/batched.
Offline theagentd
« Reply #167 - Posted 2015-07-21 16:48:41 »

Bad news. It's slow.

Library                              | bones per second
LibGDX                               | 12 452k
JOML (mul4x3)                        | 31 251k
JOML native                          |  6 346k
JOML native, only sequence building  | 11 339k
JOML native, only sequence.call()    | 13 601k

It looks hopeless. The native version is 1/5th as fast as the normal version. Just building the sequence is 1/3rd as fast as the normal version, and even if I benchmark only sequence.call(), it's still less than half as fast as the normal version. Looking at the code, I can see that the argument and opcode buffer building is very inefficient: putArgs() is called 12 times for translationRotateScale(), which checks the buffer size 12 times when once would be enough. I tried to optimize that to see if it made a difference, but only managed to crash java.exe. Still, it looks hopeless, since call() itself is half as fast as the normal version.

EDIT: My test source code: http://www.java-gaming.org/?action=pastebin&id=1310

Offline KaiHH (JGO Kernel, Medals: 819)
« Reply #168 - Posted 2015-07-21 16:54:59 »

Many thanks for your test!

One thing though that makes joml-native horrendously inefficient is that you are swapping the operated-on matrix with each iteration via
int id = i % BONES;
NativeMatrix4f bone = jomlNativeBoneMatrices[id];

This causes the registers/memory to be constantly synched, and that is slow.
Could you try building a Sequence that only operates on a single bone?
But maybe this is not a likely usecase for you?

EDIT: But you are right, I also see that the argument buffer building is by far the most expensive part of it all. Should've timed that, too. :) Now with everything counted, the native version is only 0.03x as fast as HotSpot. :'(
So in the end: it doesn't gain a thing.
Offline theagentd
« Reply #169 - Posted 2015-07-21 17:30:30 »

With only 1 bone, JOML goes up to 33 344k and native JOML to 9 290k. Benchmarking only sequence.call() gives 16 284k bones per second, which is still ~half as fast as normal JOML. Argument buffer building is of course unaffected.

My guess is that the only way to benefit from SSE would be to port the entire skeleton animation to native code. That of course means that I might as well write the entire game in C instead. >____>

Still, please don't be discouraged. JOML still has a huge number of advantages for the average user.

EDIT: Also, the reason LibGDX uses a native function for matrix multiplications could be because they're faster on Android. I've heard that HotSpot is much more flaky there, and it's possible the overhead of calling a native function is lower (just a guess), so it might simply be an optimization for low-end hardware at the cost of high-end performance.

Offline ra4king (JGO Kernel, Medals: 508, Projects: 3, Exp: 5 years, "I'm the King!")
« Reply #170 - Posted 2015-07-22 09:55:02 »

I've been lurking and watching this thread for a while, and I want to finally post saying that I'm really liking this library after looking through its code for a bit.

Quick questions I had:
- Why did you choose to not increment the buffer's position in the constructors and methods that read from/write to NIO Buffers?
- Why do the Matrix/Vector constructors not call their respective methods rather than re-implementing the same operation?
- Is a conditional really cheaper than calling 'dest.set(...)', especially considering HotSpot inlines aggressively: https://github.com/JOML-CI/JOML/blob/master/src/org/joml/Matrix4f.java#L492

Some inconsistencies (I'll be making a pull request for these):
- Some classes are missing constructors reading from Buffers
- Some classes use both ByteBuffer and FloatBuffer, some only FloatBuffer, and some are missing Buffer overloads outright

Offline cylab (JGO Kernel, Medals: 196)
« Reply #171 - Posted 2015-07-22 10:02:38 »

I've heard that HotSpot is much more flaky there, and it's possible the overhead of calling a native function is lower (just a guess), so it might simply be an optimization for low-end hardware at the cost of high-end performance.
There is no HotSpot on Android at all... There was Dalvik until KitKat (4.4), and from Lollipop (5.0) on there is the Android Runtime (ART).

Offline KaiHH (JGO Kernel, Medals: 819)
« Reply #172 - Posted 2015-07-22 10:07:28 »

Thank you, ra4king, for giving JOML a try! :)

On to your questions:

- Why did you choose to not increment the buffer's position:
This was chosen to be compliant with how LWJGL is doing it. Since those methods are likely being used to get a Matrix4f into a ByteBuffer before uploading it to OpenGL, incrementing the buffer's position would require the client to do a rewind(), flip(), position(int), clear() on the buffer before handing it to a Java/OpenGL binding.
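The difference is easy to see with plain NIO buffers, independent of JOML: relative puts advance the position (so the client would have to rewind before handing the buffer to a binding), while absolute index-based puts, the style the matrix get() methods follow, leave it untouched.

```java
import java.nio.FloatBuffer;

public class PositionDemo {
    // Relative puts: position advances by one per value written.
    public static int relativePut(FloatBuffer fb, float[] values) {
        for (float v : values) fb.put(v);
        return fb.position();
    }

    // Absolute puts: data is written by index, position stays where it was.
    public static int absolutePut(FloatBuffer fb, float[] values) {
        for (int i = 0; i < values.length; i++) fb.put(i, values[i]);
        return fb.position();
    }
}
```

With absolute puts the buffer can be passed straight to glUniformMatrix4fv-style binding methods without a rewind()/flip() in between.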

- Why do the Matrix/Vector constructors not call the functions their respective methods:
This was rather an insignificant design detail, but I think I was following Effective Java's Item 17, which states that a class should be designed for inheritance or else forbid it. JOML is designed for inheritance (for whatever reason :) ) and therefore allows overriding Matrix4f and the other classes, including their methods. This would be dangerous had a constructor called an overridden method. That's the point of Item 17.
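Item 17's danger can be reproduced in a few lines (a generic sketch, unrelated to JOML's classes): the overridden method runs during the superclass constructor, before the subclass's field initializers, so its effect is silently clobbered.

```java
public class CtorOverride {
    static class Base {
        Base() { init(); } // virtual call from a constructor: dangerous
        void init() {}
    }

    static class Sub extends Base {
        // Field initializers run only AFTER the super constructor returns,
        // overwriting whatever the overridden init() did.
        String label = "default";
        @Override void init() { label = "from init"; }
    }
}
```

new Sub() runs Base() first, which dispatches to Sub.init() and sets label; the field initializer then resets it, so the override's write is lost.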

- Is a conditional really cheaper than calling 'dest.set(...)':
I honestly don't know. :) Maybe someone can try it out. I just did it this way because it reads a little better in the case where dest != this.

About the last two points:

- Some classes are missing constructors reading from Buffers:
That's true. If you feel that you need those other classes to also read from buffers in their constructors, then this can be added quickly. Or make a PR.

- Some classes use both ByteBuffer and FloatBuffer and some only FloatBuffer:
Yepp. Those features were added as the requirements for them came up, rather than with consistency in mind. Smiley So again, if you need other classes to handle it the same way, we can of course add it.

Thanks!
Offline ra4king

JGO Kernel


Medals: 508
Projects: 3
Exp: 5 years


I'm the King!


« Reply #173 - Posted 2015-07-22 10:19:19 »

- Why did you choose to not increment the buffer's position:
This was chosen to be compliant with how LWJGL does it. Since those methods are likely used to get a Matrix4f into a ByteBuffer before uploading it to OpenGL, incrementing the buffer's position would require the client to call rewind(), flip(), position(int) or clear() on the buffer before handing it to a Java/OpenGL binding.

Ahh makes sense... although I did not like that design at all either Smiley


- Why do the Matrix/Vector constructors not call their respective methods:
This was a rather insignificant design detail, but I think I had Effective Java's Item 17 in mind, which states that a class should be designed for inheritance or forbid it. JOML is designed for inheritance (for whichever reason Smiley ) and therefore allows overriding Matrix4f and the other classes, including their methods. This would be dangerous had a constructor called an overridden method. That's the point of Item 17.

Design principles getting in the way of clean code! Tongue I see the reasoning though, thanks.


- Is a conditional really cheaper than calling 'dest.set(...)':
I honestly don't know. Smiley Maybe someone can try it out. Just did it this way, because it reads a little bit better in the case where dest != this.

Readability should be no excuse for adding a branch to critical code Smiley .... I'll write a small benchmark next time I have a chance, although I expect the impact is minimal.

Offline Spasi
« Reply #174 - Posted 2015-07-22 10:26:58 »

- Why did you choose to not increment the buffer's position:
This was chosen to be compliant with how LWJGL does it. Since those methods are likely used to get a Matrix4f into a ByteBuffer before uploading it to OpenGL, incrementing the buffer's position would require the client to call rewind(), flip(), position(int) or clear() on the buffer before handing it to a Java/OpenGL binding.

Ahh makes sense... although I did not like that design at all either Smiley

Note that all those increments, rewinds and flips do memory writes. They do have an unnecessary performance cost.

- Is a conditional really cheaper than calling 'dest.set(...)':
I honestly don't know. Smiley Maybe someone can try it out. Just did it this way, because it reads a little bit better in the case where dest != this.

Readability should be no excuse for adding a branch to critical code Smiley .... I'll write a small benchmark next time I have a chance, although I expect the impact is minimal.

It's not about inlining. Besides, if you used dest.set(...), then there would be no difference between the two branches.

It's about semantics. With dest.set(...) you're ensuring that all arguments will be evaluated before any memory write happens in the destination. Even if the method call is inlined, this invariant will make it all the way to the JITed code. The conditional detects that there's no aliasing and the semantics can be different: note how this.m00 is read 3 times after dest.m00 has been written to. This results in more efficient code (less CPU registers required).

The increased bytecode size and the branch itself have a cost though, which may negate any benefits. It might be better to offer a version of mul with explicit no-aliasing semantics. The user would be responsible to use it when appropriate.
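A 2×2 sketch of the two variants described above (class shape and names invented for the example; this is not JOML's code). The safe version evaluates everything into locals before any write, which is the invariant dest.set(...) would enforce; the no-aliasing version writes as it goes, re-reading this.m00 after dest.m00 has been written.

```java
class Mat2 {
    float m00, m01, m10, m11;

    Mat2(float m00, float m01, float m10, float m11) {
        this.m00 = m00; this.m01 = m01; this.m10 = m10; this.m11 = m11;
    }

    // Aliasing-safe: compute all products into locals before any write to
    // dest, so "this", "right" and "dest" may all be the same object.
    Mat2 mulSafe(Mat2 right, Mat2 dest) {
        float n00 = m00 * right.m00 + m01 * right.m10;
        float n01 = m00 * right.m01 + m01 * right.m11;
        float n10 = m10 * right.m00 + m11 * right.m10;
        float n11 = m10 * right.m01 + m11 * right.m11;
        dest.m00 = n00; dest.m01 = n01; dest.m10 = n10; dest.m11 = n11;
        return dest;
    }

    // Explicit no-aliasing variant: fewer live temporaries (fewer CPU
    // registers), but this.m00 is re-read after dest.m00 has been written,
    // so it is only correct when dest != this && dest != right.
    Mat2 mulNoAlias(Mat2 right, Mat2 dest) {
        dest.m00 = m00 * right.m00 + m01 * right.m10;
        dest.m01 = m00 * right.m01 + m01 * right.m11;
        dest.m10 = m10 * right.m00 + m11 * right.m10;
        dest.m11 = m10 * right.m01 + m11 * right.m11;
        return dest;
    }
}
```

Offering both, with the user responsible for picking the no-aliasing one only when it applies, is exactly the API split suggested above.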
Offline KaiHH

JGO Kernel


Medals: 819



« Reply #175 - Posted 2015-07-22 11:03:59 »

Ahh makes sense... although I did not like that design at all either Smiley
I did not like it first either. But you grow accustomed to it very quickly and then really dislike the other way. Smiley
Offline ra4king

JGO Kernel


Medals: 508
Projects: 3
Exp: 5 years


I'm the King!


« Reply #176 - Posted 2015-07-23 03:07:51 »

Note that all those increments, rewinds and flips do memory writes. They do have an unnecessary performance cost.

Memory writes = only the position, limit and/or mark variables, so not really. Also, most of my use cases aren't uploading one Vector/Matrix to GL; they're filling up a big buffer with many Vectors/Matrices. It's much cleaner for me to do:
for(MyObject m : objects)
    m.position.get(buffer);
buffer.flip();

than
int position = 0;
for(MyObject m : objects) {
    m.position.get(position, buffer); // assuming Vec4
    position += 4;
}


1 flip() after filling my buffer is better than continuously setting the position, and it is especially less error prone in case I add another get(buffer) in there in complex code and forget to adjust the position increment properly.

It's not about inlining. Besides, if you used dest.set(...), then there would be no difference between the two branches.

It's about semantics. With dest.set(...) you're ensuring that all arguments will be evaluated before any memory write happens in the destination. Even if the method call is inlined, this invariant will make it all the way to the JITed code. The conditional detects that there's no aliasing and the semantics can be different: note how this.m00 is read 3 times after dest.m00 has been written to. This results in more efficient code (less CPU registers required).

The increased bytecode size and the branch itself have a cost though, which may negate any benefits. It might be better to offer a version of mul with explicit no-aliasing semantics. The user would be responsible to use it when appropriate.


Ah, you misunderstood me, even though you worded my intentions exactly. I did notice that this.m00 is read after dest.m00 has been set, which is why the conditional is there: to make sure it's not written to before it is read again when the destination is the same as either operand. I wasn't arguing for the inlining; I was arguing that avoiding a function call by adding a branch is not worth it, as the function call will most likely be inlined by HotSpot anyway.

Offline Spasi
« Reply #177 - Posted 2015-07-23 06:46:02 »

1 flip() after filling my buffer is better than continuously setting the position, and it is especially less error prone in case I add another get(buffer) in there in complex code and forget to adjust the position increment properly.

There's probably no right answer here; it comes down to personal opinion. My personal opinion is:

- Having to deal with flip() means having to mentally track two variables; position and limit. This is more complex than dealing with position() only.
- Having a method call mutate arguments (i.e. changing a buffer's current position) is fundamentally bad practice and bad API design. I don't know about you, but I can reason better about method calls that are free of side-effects and any mutations to my objects are explicit.
- Last but not least, I've spent 12 years of my life reading posts on the LWJGL forum about people forgetting to .flip() a buffer.

Ah, you misunderstood me, even though you worded my intentions exactly. I did notice that this.m00 is read after dest.m00 has been set, which is why the conditional is there: to make sure it's not written to before it is read again when the destination is the same as either operand. I wasn't arguing for the inlining; I was arguing that avoiding a function call by adding a branch is not worth it, as the function call will most likely be inlined by HotSpot anyway.

I must still be misunderstanding you. As I explained above, you cannot use a method call there as that would lead to different semantics (and defeat the optimization).
Offline theagentd
« Reply #178 - Posted 2015-07-23 08:38:07 »

@Spasi

- You always need a position to know where to write. For limit, either you already know the size of your data and have a buffer that has the exact same size so limit==capacity, or your data is smaller than your buffer, in which case you probably NEED a limit to show the number of relevant bytes in the buffer.

- That's how the entirety of NIO works though.

- It's a newbie problem, sure, but knowing that you should always flip after writing is easier than having to figure out whether you should flip or not. Not writing the position just makes you keep track of it yourself, and possibly set the limit and position at the end with flip anyway. Following NIO's way is better IMO.


My vote goes to incrementing position for consistency with NIO.

Myomyomyo.
Offline Riven
Administrator

« JGO Overlord »


Medals: 1371
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #179 - Posted 2015-07-23 09:14:23 »

Note that all those increments, rewinds and flips do memory writes. They do have an unnecessary performance cost.
Memory writes = only the position, limit, and/or mark variables so not really.
This is exactly what Spasi meant: memory writes to position/limit are relatively expensive, even if they hit L1. It is cheaper to keep track of the position with a local variable. HotSpot may already have hoisted position/limit into locals, of course, but that is not guaranteed: at some point they have to be read from memory, and eventually the locals have to be written back.
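The two styles side by side (helper names invented for the example), filling a FloatBuffer in a loop:

```java
import java.nio.FloatBuffer;

class FillStyles {
    // Relative puts: each put() stores the updated position field back into
    // the Buffer object -- a memory write HotSpot may or may not hoist away.
    static void fillRelative(FloatBuffer fb, float[] data) {
        for (float f : data) {
            fb.put(f);
        }
        fb.flip();
    }

    // Absolute puts with a local index: the running position lives in a
    // local variable (a register in JITed code); position and limit are
    // each written exactly once at the end.
    static void fillAbsolute(FloatBuffer fb, float[] data) {
        int p = fb.position();
        for (float f : data) {
            fb.put(p++, f);
        }
        fb.limit(p);
        fb.position(0);
    }
}
```

Both leave the buffer in the same ready-to-read state; the difference is only where the bookkeeping writes happen, which is the convenience/performance tradeoff in question.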

It's a convenience/performance tradeoff.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings!