Java-Gaming.org
  JOGL JSR implementation performance with JNI  (Read 4476 times)
Offline K.I.L.E.R (Senior Member)
« Posted 2005-11-22 03:09:43 »

If the JNI part of JOGL becomes part of the standard JDK, I assume Sun will allow inlining and optimisation of JOGL's JNI code, therefore making it a lot faster?
Not that I have performance issues with JOGL, but I'd like to know whether its JNI code will be inlined like the rest of Sun's JNI code.

Also: Ken Russell, you have a PM.
Thanks.

Vorax:
Is there a name for a "redneck" programmer?

Jeff:
Unemployed. Wink
Offline Ken Russell (JGO Coder)
« Reply #1 - Posted 2005-11-23 06:49:44 »

The performance of the current JOGL and JNI in general are just fine in my opinion. We built a prototype of JOGL earlier this year using a radically different and more efficient native method calling interface than JNI and weren't able to see any significant speedups on Jake2 on modern processors so it was hard to justify pushing the prototype further. If you have a real-world application showing a significant performance difference between some C/C++ OpenGL code and JOGL-based OpenGL code which is attributable to JNI overhead then please post or file a bug about it and we'll be glad to look into it.
Offline uran (Senior Newbie)
« Reply #2 - Posted 2005-11-23 09:04:38 »

A recent change in JOGL (JSR-231), the one which made all NIO Buffer pointers significant, made me recode a good chunk of GL invocations from within a tight context to C++.

In the process, I was also wondering whether a noticeable performance improvement would result from this change.

Surprisingly, the performance angle of this change was insignificant. (I had been wondering lately how JNI affects performance; it appears JNI has improved.)

On a quick note, if it's not a secret, what is this radically new native calling interface of which you speak, Ken?
I looked at CNI briefly, but after determining for myself that the improvement from moving GL invocations in tight loops to C++ was almost negligible, I was pretty much left with the impression that JNI does the job.
Offline cwei (Senior Newbie)
« Reply #3 - Posted 2005-11-24 01:20:34 »

Hi Ken,

Nice to hear that you use Jake2 as a benchmark.

Quote
The performance of the current JOGL and JNI in general are just fine in my opinion. We built a prototype of JOGL earlier this year using a radically different and more efficient native method calling interface than JNI and weren't able to see any significant speedups on Jake2 on modern processors so it was hard to justify pushing the prototype further.

Have you tested your prototype with the renderer named "jogl"?
This one produces a lot of simple OpenGL calls. (ca. 25000 per frame)
If you want to stress the JNI layer, you can run the demos with 32 player models.
Type to console:
timedemo 1
cl_testentities 1
map q2demo1.dm2
...

The "fastjogl" renderer uses FloatBuffers and vertex arrays (ca. 5000 OpenGL calls per frame).
That's why it's possible that you can't see any significant speedups.
But this impl uses a lot of FloatBuffer puts and gets.
(For FloatBuffer stress tests you can use this renderer (or lwjgl) and the same commands as above.)

Are there any optimizations for FloatBuffers in jdk1.6.0?
(like the MappedByteBuffer-cast-trick for direct ByteBuffers)

bye
Carsten




Offline Ken Russell (JGO Coder)
« Reply #4 - Posted 2005-11-27 17:23:43 »

Quote
Have you tested your prototype with the renderer named "jogl"?
This one produces a lot of simple OpenGL calls. (ca. 25000 per frame)
If you want to stress the JNI layer, you can run the demos with 32 player models.
Type to console:
timedemo 1
cl_testentities 1
map q2demo1.dm2

We used the jogl renderer and the timedemo mode, but didn't try / know about the cl_testentities command. We tested on a few different processors, and for any recent processor with good branch prediction (Pentium 4, Pentium M, or Opteron) the speedup with the faster native interface layer was no more than 5%, to the best of my recollection. This was several months ago, however, so I don't remember all of the details. I do remember that for any processor without full SSE2 support, the speed difference between Java and native C code was huge, due to the inefficient code that must be generated to make the Intel x87 floating-point unit produce Java-compliant floating-point results.

Quote
The "fastjogl" renderer uses FloatBuffers and vertex arrays. (ca. 5000 OpenGL calls per frame)
That's why it's possible that you can't see any significant speedups.
But this impl uses a lot of FloatBuffer puts and gets.
(for FloatBuffer stress tests you can use this renderer (or lwjgl) and the same commands as above)

As mentioned above we deliberately used the jogl renderer, not the fastjogl renderer, to do our comparisons.

Quote
Are there any optimizations for FloatBuffers in jdk1.6.0?
(like the MappedByteBuffer-cast-trick for direct ByteBuffers)

The Java HotSpot server compiler implements bimorphic call inlining in Mustang, meaning that if there are, for example, two hot data types at a particular call site, it will inline both with a type test at the top. This speeds up certain uses of NIO where direct and non-direct buffers are mixed in the same application.
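To make that concrete, here's a small illustrative sketch (mine, not from the thread): a single hot call site that sees exactly two FloatBuffer implementations, one heap-backed and one direct. This is the shape of call site that bimorphic inlining can compile down to one type test plus two inlined bodies.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class BimorphicSite {
    // A hot call site that sees exactly two receiver types
    // (heap FloatBuffer and direct FloatBuffer). The server
    // compiler can inline both get() implementations behind
    // a single type test at this site.
    static float sum(FloatBuffer buf) {
        float s = 0f;
        for (int i = 0; i < buf.limit(); i++) {
            s += buf.get(i);   // virtual call, bimorphic in this program
        }
        return s;
    }

    public static void main(String[] args) {
        FloatBuffer heap = FloatBuffer.wrap(new float[] {1f, 2f, 3f});
        FloatBuffer direct = ByteBuffer.allocateDirect(12)
                .order(ByteOrder.nativeOrder()).asFloatBuffer();
        direct.put(new float[] {4f, 5f, 6f});
        direct.flip();
        System.out.println(sum(heap) + " " + sum(direct)); // prints "6.0 15.0"
    }
}
```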
Offline rexguo (Junior Member)
« Reply #5 - Posted 2005-11-28 09:19:47 »

Quote
I do remember that for any processor without full SSE2 support the speed difference between Java and native C code was huge due to the inefficient code that must be generated in order to make the Intel x87 floating-point unit produce Java-compliant floating point results.

Hi Ken, would you care to elaborate on what you mean by the above? AFAIK there are still a lot of CPUs out there that do not have SSE2, and we'll have to support those and make sure there's an acceptable level of performance for them.

So my questions are:
1. How is SSE2 used in JSR-231?
2. How large are the performance differences, and in what context?
3. What is the 'inefficient code that must be generated'?

Many thanks in advance!

.rex

http://www.rexguo.com - Technologist + Designer
Offline princec (JGO Kernel)


« Reply #6 - Posted 2005-11-28 11:35:35 »

This is a red herring; we are talking about sub-5% performance degradation, and if the machine doesn't have SSE2 it's also unlikely to have a particularly speedy graphics card, which is probably going to be the bottleneck anyway.

Cas

Offline K.I.L.E.R (Senior Member)
« Reply #7 - Posted 2005-11-28 15:24:42 »

A 5% JNI performance impact would equate to a 5% hit in a very small part of the code (the JNI layer).
It's added latency at worst.

Offline Ken Russell (JGO Coder)
« Reply #8 - Posted 2005-11-30 10:33:34 »

The Quake II engine is fairly floating-point intensive, and when the x87 floating-point stack is used for floating-point computations there are frequent stores of intermediate floating-point values to and from main memory in order to make the values visible to the program IEEE-compliant. It is well known that this incurs a significant amount of overhead in FP-intensive Java programs, and while I don't have any concrete numbers in front of me, it can easily be much more than 5%. Do a Google search for "java floating point x87" for references. The good news is that with the Pentium III and later, single-precision Java FP computations can use the IEEE-compliant SSE registers, and with the Pentium 4 and later, both single- and double-precision Java FP computations can use the IEEE-compliant SSE2 registers. The use of these registers eliminates this overhead.

As I stated above, on a PIII the overhead of using the x87 expression stack and the associated stores to main memory appeared to be very high; Jake2 ran at roughly half the speed of the C version. On this processor, eliminating the JNI overhead yielded roughly a 15% speedup to the best of my recollection, but the scores were still pretty far from those of the C version. On a Pentium M, Pentium 4 or Opteron processor (all of which support SSE2), the differences between the Java and C versions of Quake II were very small (actually, I think the Java version ran faster on some of the processors, possibly because it was using the SSE registers and the C version wasn't), and eliminating the JNI overhead didn't yield a significant speedup, indicating that the better branch prediction in the P4 and later processors already does a good enough job of reducing the multiple-function-call overhead of JNI.
Offline uran (Senior Newbie)
« Reply #9 - Posted 2005-11-30 16:16:07 »

Although moving the bulk of GL invocations down to native code on a P3 @ 500 MHz didn't yield a very significant improvement, I'm still pursuing this path, mainly due to the scenario I'm in:

Basically, with JSR-231 I have to synchronize two threads, and although lack of synchronization can in general be considered dangerous, it works very well and is stable in JOGL 1.1.1. I have one thread which is a filler, and it pipes new texture data into a direct ByteBuffer every frame. Another thread is a renderer, which actually sends this ByteBuffer to GL.

I must disclaim that this problem is very domain-specific, so don't take my writing as sour grapes, but rather as an experience.
In JOGL 1.1.1, at worst you can get some garbled data, which in itself is very unlikely, but the application is stable.

In JSR-231, due to the pointer significance (sorry if this sounds like a broken record by now), the app crashes non-deterministically. (Especially pronounced on my dual-CPU test machine.)

By adding synchronization to this process, I'm sure the stability problem can be eliminated, though at the cost of a good bit of performance. I'm mostly speaking from experience and haven't actually taken the sync route in this case; what I did instead was basically move the GL invocations that were required down to the native layer.

This has removed the pointer significance problem and even yielded an improvement, albeit a slight one (not too many invocations, and not FP-intensive). Although the performance improvement was slight, that is only when dealing with a single stream; I have to process multiple streams, so the improvement adds up quickly. Overall, I'm happy with this development.

So basically the Java end of things handles all the management of textures, shaders, and other data ... and establishes the GL context on the EDT; then I call into native and do the actual scene invocations from there.

Anyway, a 15% JNI overhead, if correct, is IMO still quite significant. I guess part of me still feels that I should squeeze out as much as I can, since I'm targeting lowish-end hardware (CPU/GPU). (I'm developing on a 500 MHz P3 using a Radeon 9000.)

Good to know your statistics though.
Offline Ken Russell (JGO Coder)
« Reply #10 - Posted 2005-11-30 17:02:03 »

How many OpenGL calls are you making per stream? I would have expected something like five: glVertexPointer, glNormalPointer, glTexCoordPointer, glDrawElements, and so on. I have a hard time believing that trading so few native method calls for one (are you calling down into native code to make your OpenGL calls, or do you have a loop down in your native code?) will yield a significant performance improvement.

Likewise for synchronization. It is possible to perform many, many monitor notifications per second, and if the app is structured properly you could have a round-robin pool of buffers to fill, so that your data streams down from your compute thread to your rendering thread without forcing the compute thread to explicitly rendezvous with the rendering thread (unless the rendering thread doesn't keep up and you run out of available buffers in the compute thread). The Java2D OpenGL pipeline works like this to the best of my knowledge, and I have personally worked on Java-based 3D apps which did this (see this paper) to achieve high frame rates on 1998-era JVMs and hardware. Basically my point is that you should avoid prematurely optimizing your application by making it unsafe (eliding necessary synchronization); instead, first concentrate on making it correct.
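The round-robin pool described above can be sketched with two blocking queues from java.util.concurrent (JDK 5). This is a minimal illustrative version; the class and method names are mine, not from any JOGL or Java2D API.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of a round-robin buffer pool between a filler thread and a
// renderer thread: the filler only blocks when the renderer falls
// behind and every buffer in the pool is already in flight.
public class BufferPool {
    private final BlockingQueue<ByteBuffer> free;
    private final BlockingQueue<ByteBuffer> filled;

    public BufferPool(int buffers, int capacity) {
        free = new ArrayBlockingQueue<ByteBuffer>(buffers);
        filled = new ArrayBlockingQueue<ByteBuffer>(buffers);
        for (int i = 0; i < buffers; i++)
            free.add(ByteBuffer.allocateDirect(capacity));
    }

    // Called by the compute/filler thread.
    public ByteBuffer acquireForFill() throws InterruptedException {
        return free.take();          // blocks only when the pool is exhausted
    }
    public void submitFilled(ByteBuffer b) { filled.add(b); }

    // Called by the rendering thread.
    public ByteBuffer takeFilled() throws InterruptedException {
        return filled.take();
    }
    public void recycle(ByteBuffer b) { b.clear(); free.add(b); }
}
```

A filler loops acquireForFill/submitFilled while the renderer loops takeFilled/recycle; neither touches a buffer the other currently owns, so no further locking on the buffer contents is needed.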

I'd be interested to know how your project and its performance is going. Please post updates as your work continues.
Offline uran (Senior Newbie)
« Reply #11 - Posted 2005-11-30 18:11:08 »

> How many OpenGL calls are you making per stream?

Worst case (using all 6 texture units on R200-class hardware), I'm making on the order of 70 GL calls per stream.
Three of the textures are planar YUV data; the other three are RGB gamma ramps. The shader does all the conversion.

Then I have to send some geometry, not to mention set up state (6 texture units), render, and tear down. So, if using gamma correction, ~70 calls per stream, per frame.

Additionally, there is world setup, render, teardown (shadows, etc.). So, all in all, quite a few GL calls, although I've only moved the stream-based GL invocations into the native world; the rest is still done from Java. It's important to underline my dedication to doing as much as I can from Java, but sometimes I just have to do what I must.

As for the circular buffer, yep, I'm using it. And as for synchronization, trust me, where it really counts I'm using all the proper threading techniques. I can skimp on the native buffers since the behaviour seems to be defined and has been thoroughly tested (at least wrt JOGL 1.1.1) over a period of time. As for being premature, I guess it can be considered so, although some parts are mature in terms of features, and I'm doing optimizations on those parts in tandem with the development of other elements of the system, which overall is quite diverse.

Also, I've made a simple test case about the JDesktopPane behaviour, if you want to take a look at it. I'm going to look at it this weekend, as it's the only time I get to fix bugs. Fun fun.

Regards.

EDIT:

As for a loop in native code: yes, I have a version which does JAWT acquisition/release from native code. But so far I've found this method to be very unstable, so for now I'm letting JOGL manage the GL context. I intend to do more research into this when time allows, but this I do consider a premature optimization, as overall the performance (with all the latest developments) is acceptable.
Offline princec (JGO Kernel)


« Reply #12 - Posted 2005-11-30 18:46:56 »

For some reason I seriously don't believe you, at all. You must be abusing the GL API horribly to get anywhere near this kind of performance degradation through JNI. In fact I'm absolutely 100% certain of it, so perhaps you could find out where you're abusing it and fix that?

Cas

Offline uran (Senior Newbie)
« Reply #13 - Posted 2005-11-30 19:26:56 »

What?

See if I care whether or not you believe me. Here I think I'm having an intelligent discussion, only to be accused of lying for (insert your reason).

...
Offline kaffiene
« Reply #14 - Posted 2005-12-01 00:33:03 »

Quote
What?

See if I care whether or not you believe me. Here I think I'm having an intelligent discussion, only to be accused of lying for (insert your reason).

...

Probably because there's really no way that the overhead of 70 method calls per frame can have that much significance.

Maybe if you were doing thousands of operations, then sure...
Offline uran (Senior Newbie)
« Reply #15 - Posted 2005-12-01 02:34:26 »

Somehow the premise got skewed here...

I've never said that the overhead was _tremendous_. However, it is my job to squeeze every ounce of performance out of the CPU (since the CPU is going to be doing a heck of a lot more than simply pushing verts to the GPU; in fact, one of the reasons I went with GL is to avoid the rendering hit), and since I've already ported a ton of native code to Java, for reasons I won't even begin to list, I'm going to stick with it, making appropriate adjustments where necessary. (It's been a hybrid all along, heh; politics play no role in the development of this project.)

I think the earlier attempt, or should I say jab, was coarse; I couldn't care less.
I've already said that my task is rather specific, and without being in a position to know what I must accomplish, simply throwing out baseless assumptions is kind of silly.

I joined these forums when I started out with JOGL, to report a bug. Forums like these shouldn't have a barrier to entry.

Auf Wiedersehen
Offline Ken Russell (JGO Coder)
« Reply #16 - Posted 2005-12-01 03:22:55 »

Getting back to the topic at hand, have you considered modeling the OpenGL state in your engine to avoid possibly-redundant native method calls to OpenGL? This might reduce the 70 method calls per stream to something smaller so that as your number of streams gets larger you have overall fewer calls to make. Additionally you might consider sorting your streams if you aren't already so that similarly-enabled ones are rendered sequentially to take advantage of such an optimization.
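The state modeling Ken suggests can be sketched as a thin shadowing layer: each enable/disable first checks a cached copy of the GL server state and only crosses into native code when the state would actually change. This is an illustrative stand-in of mine; the call counter takes the place of the real gl.glEnable/glDisable JNI calls, and the class name is invented.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of OpenGL state shadowing: redundant enable/disable requests
// are filtered out in Java before they ever reach the JNI layer.
public class GLStateCache {
    private final Map<Integer, Boolean> enabled = new HashMap<Integer, Boolean>();
    private int nativeCalls = 0;          // stands in for the real JNI calls

    public void enable(int cap) {
        Boolean cur = enabled.get(cap);
        if (cur == null || !cur.booleanValue()) {
            nativeCalls++;                // gl.glEnable(cap) would go here
            enabled.put(cap, Boolean.TRUE);
        }
    }

    public void disable(int cap) {
        Boolean cur = enabled.get(cap);
        // Unknown state also forces a call, to stay correct.
        if (cur == null || cur.booleanValue()) {
            nativeCalls++;                // gl.glDisable(cap) would go here
            enabled.put(cap, Boolean.FALSE);
        }
    }

    public int nativeCallCount() { return nativeCalls; }
}
```

Combined with sorting streams so similarly-configured ones render back to back, most of the 70 per-stream calls collapse into no-ops in the cache.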
Offline uran (Senior Newbie)
« Reply #17 - Posted 2005-12-01 03:59:42 »

I've done some preliminary state management, such as enabling the textures backwards and coalescing calls, among other minor things. Good suggestion; I will look into it more thoroughly for sure.

The gamma-corrective op that runs on the R200 seems kind of bloated, but it's not. There was an "option" of packing the RGB ramps into a single texture (saving some texture binds and state calls), but in order to do a dependent texture fetch on the Radeon 9000, which doesn't support ARB fragment programs but rather a semi-limited (instruction-count-wise, 8 per pass) ATI proprietary shader, I kept running out of instructions in the second pass to do a dependent texture fetch. (The first pass is maxed out for YUV & EQ.)

So that's why I opted for using 6 textures: each of the top 3 textures contains the R, G, B ramps, while the lower textures contain the YUV planes. I've also considered packing the YUV into an interleaved format, but that would induce a hit on the CPU since the data comes from the source in a planar format, so I've been mucking with things like that. (The CPU will be doing quite a bit of DCT, so no go.) (I'm also trying to minimize the number of contenders for the branch prediction table; I hear that even on modern CPUs it's a limited resource, so with all those other things looping everywhere, cache coherency can be a problem.)

I have a branch (in source control) which does most of this stuff using ARB programs, where you have the POW instruction, so gamma correction does not require a dependent fetch into separate ramps ... and it's so much nicer, I must say. (As well as other optimizations which I couldn't pull off on such ancient HW, but it's a requirement.)

I figured if the R200 has 6 texture units, why not use them. So with those 6 units, I must enable each, set up pixel alignment & unpack modes, ... then set up the VBO, send some geometry and restore states.

All in all, I want to minimize the amount of time that gets spent in GL overall, but it's inevitable (feature creep made me add features I wouldn't otherwise have opted for myself), so I'm spinning my wheels looking for ways. But I'll be doing some thorough profiling to iron out the remaining hotspots; currently a few preliminary profiler passes have been done.

There was a good quote about time or lack thereof, can't remember now. So much to do, so little time.
Offline Vorax (Senior Member)


« Reply #18 - Posted 2005-12-01 04:07:18 »


Quote
Probably because there's really no way that the overhead of 70 method calls per frame can have that much significance.

Maybe if you were doing thousands of operations, then sure...

I also agree: something else is causing the performance problem. I would go to a profiler for this one to see what's causing the delays, or strip out *everything* except the calls you think are causing problems, then test your FPS. My guess is that even on a PIII 500 you will be over 1000 FPS with just 70 calls and an ATI 9000. Using stencil shadows, for example, will cause a huge fill-rate hit and some nasty state changes (and probably CPU load, depending on how you did it); depending on the textures and your bus speed and video RAM, those can also be a huge hit; texture filtering is another potential hit. You said in another post that you have a thread rendering to a byte buffer and another thread uploading that; if that happens per frame, it will be a HUGE, ugly, must-never-do-on-a-low-end-computer type of performance hit ... etc., etc.

Offline Ken Russell (JGO Coder)
« Reply #19 - Posted 2005-12-01 07:13:35 »

I think people are missing the point that this application has 70 OpenGL calls per *object*, not per frame. If lots of objects are on the screen then that factor of 70 can add up pretty quickly. I generally agree that JNI overhead should be pretty minimal but it doesn't sound like this is an unfounded complaint.
Offline kaffiene
« Reply #20 - Posted 2005-12-01 10:47:49 »

Quote
I think people are missing the point that this application has 70 OpenGL calls per *object*, not per frame. If lots of objects are on the screen then that factor of 70 can add up pretty quickly. I generally agree that JNI overhead should be pretty minimal but it doesn't sound like this is an unfounded complaint.

Ahh... mea culpa.  I did indeed misread that.
Offline princec (JGO Kernel)


« Reply #21 - Posted 2005-12-01 14:29:18 »

Sorry Uran, I think something may have gotten lost in the tone of my post; I don't think you're lying at all, just that you hadn't quite divulged all the pertinent information. The picture is a little clearer now, but I still believe that if you are calling OpenGL 70 times to render a single thing, you could be doing it better.

You might, for example, run it in 1.4.2 and show us the output of -Xprof, which shows the amount of time spent in stubs and native code for JNI calls. If it turned out you were calling glVertex3f 70 times per object, you'd be rightly accused of abusing the OpenGL binding; but if, on the other hand, you were actually managing to make 70 client-side state changes per object, then I'd stand corrected.

Cas


Offline Mithrandir (Senior Member)


« Reply #22 - Posted 2005-12-02 02:09:38 »

I can't see how anyone could do a serious app with only 70 calls per frame anyway! Even if so, there are some very significant ways the rendering can be sped up. With that few calls, it sounds like all of your geometry is in one big array, which means there's no culling, state sorting or any of the other very significant ways of speeding up the rendering process.

That said, Uran, you've gone about this completely the wrong way. You think that multithreaded access is the problem, when it isn't; it's the way you've designed your code. Anyone who has multithreaded access to a data structure and doesn't protect it, either by running multiple buffers (the one being filled versus the one being rendered) or by some form of mutex system, doesn't know how to code multithreaded applications. It's the classic simple producer-consumer problem.

How do you expect shifting to native code to solve this problem? You haven't solved anything at all, just moved the failure point into a different set of code; it appears fixed because right now something else is slowing the system down, or the timing is marginally different such that the problem is not currently evident, due to platform-specific behaviour. What happens when you shift this code to a machine with a marginally different architecture, for example a dual core or a slightly faster CPU? Those multithreaded crashes are going to come right back at you with an even bigger vengeance than the mild ones you've got now. With native code you're simply not going to know what crashed; at least keeping it in Java will give you some nicely readable stack traces. You'd be far, far better off properly architecting the code first and then worrying about performance later. This seems very much a case of the tradesman blaming his tools.

Kaffiene: the one place this could slow down is the writing to the NIO buffers themselves. If there are thousands of put() calls, each with a single byte/float/etc., then this will affect performance relative to a bulk put(). But, agreed, there's no way the GL calls themselves could be the cause of a significant slowdown. There's something else in the way he's coded the app that is causing the issue.
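The per-element-versus-bulk put point can be shown directly. Both variants below produce identical buffer contents, but one makes thousands of calls into the buffer implementation and the other makes one. (Illustrative sketch; method names are mine.)

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class BulkPut {
    static FloatBuffer newDirectFloatBuffer(int floats) {
        return ByteBuffer.allocateDirect(floats * 4)
                .order(ByteOrder.nativeOrder()).asFloatBuffer();
    }

    // Per-element puts: one call (and one bounds check) per float.
    static FloatBuffer fillOneByOne(float[] src) {
        FloatBuffer b = newDirectFloatBuffer(src.length);
        for (int i = 0; i < src.length; i++) b.put(src[i]);
        b.flip();
        return b;
    }

    // Bulk put: a single call that can validate the range once
    // and copy the whole array.
    static FloatBuffer fillBulk(float[] src) {
        FloatBuffer b = newDirectFloatBuffer(src.length);
        b.put(src);
        b.flip();
        return b;
    }

    public static void main(String[] args) {
        float[] verts = new float[3000];
        for (int i = 0; i < verts.length; i++) verts[i] = i;
        // Same contents either way; only the call count differs.
        System.out.println(fillOneByOne(verts).equals(fillBulk(verts))); // prints "true"
    }
}
```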

The site for 3D Graphics information http://www.j3d.org/
Aviatrix3D JOGL Scenegraph http://aviatrix3d.j3d.org/
Programming is essentially a markup language surrounding mathematical formulae and thus, should not be patentable.
Offline uran (Senior Newbie)
« Reply #23 - Posted 2005-12-02 05:16:51 »

OK, while I appreciate the helpful comments, I will say that aside from Ken, nobody seems to have grasped the situation. No worries though, heh; initially I wasn't even going to spill anything, since I'm doing quite all right when it comes to optimizations.

Anywho, thanks anyway.

EDIT:

@princec

No worries.

@Mithrandir

Again, it's 70 calls per object; that's one thing. Secondly, what you perceive to be a problem of "shifting a problem to native code" is not the case. As for thread synchronization, where it is required it is in place; that goes without saying.

And finally, a subset of this project's capabilities is targeted at low-end hardware, while dual-CPU is one of the multiple architectures on which this system is being developed and tested. Oh, one more thing ... there are places where the producer/consumer paradigm is warranted, and used. There are places where a synchronized version of this paradigm is not warranted, and there are many ways to perform synchronization without using explicit sync primitives. Doug Lea's book Concurrent Programming in Java, 2nd Edition, is a good read.
Also, as for blaming my tools? Heh, hardly; I brag about having the best tools in the industry.
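For readers following the synchronization-without-explicit-primitives point: one of the idioms Lea's book covers can be sketched as a single-slot volatile mailbox. This is safe only for one producer and one consumer under the Java 5 memory model (JSR-133), and the class name is invented for illustration.

```java
// Single-producer/single-consumer handoff without locks: the volatile
// write to 'slot' happens-after the frame array is filled, so a consumer
// that reads the non-null reference is guaranteed to also see the
// array's contents. Sketch only; SPSC usage is assumed.
public class VolatileHandoff {
    private volatile float[] slot;   // null means "empty"

    // Producer side: returns false if the consumer hasn't caught up yet.
    public boolean offer(float[] frame) {
        if (slot != null) return false;
        slot = frame;                // volatile write publishes the contents
        return true;
    }

    // Consumer side: returns null if nothing is waiting.
    public float[] poll() {
        float[] f = slot;
        if (f != null) slot = null;  // mark the mailbox empty again
        return f;
    }
}
```

With more than one thread on either side this breaks (the check-then-act in offer races), which is exactly why the "where it counts, use proper primitives" caveat matters.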