Java-Gaming.org    
Featured games (81)
games approved by the League of Dukes
Games in Showcase (487)
Games in Android Showcase (112)
games submitted by our members
Games in WIP (553)
games currently in development
News: Read the Java Gaming Resources, or peek at the official Java tutorials
 
    Home     Help   Search   Login   Register   
Pages: [1]
  ignore  |  Print  
  OpenGL Perf Tuning  (Read 5955 times)
0 Members and 1 Guest are viewing this topic.
Offline darkprophet

Senior Member




Go Go Gadget Arms


« Posted 2006-04-07 12:42:56 »

Hi all,
In my opinion and limited experience, it seems that Java programmers make weird and wonderful algorithms and ideas to increase the performance out of OpenGL, while C/C++ programmers dont need to do that just yet; but when simple java benchmarks become as fast as C/C++ applications, Java's scenegraphs and such will have all these cool new ways to do stuff and hence are going to be faster than the equivilant C/C++ applications.

So feel free to post here any cool ideas regarding how you got some performance out of OpenGL that you think is uber cool. My contribution is state sorting:

From what ive read (very little), it seems that state changes flag the server to do a complete validation of the server states and the client states so that they both conform, and OpenGL seems to take this time to also do some memory juggling and that can take a while. So, a good way to avoid this is minimise the state changes that do occur. The obvious solution is to have some sort of bucket, clear the bucket everyframe, use some sort of cool sorting algorithm to sort your objects, then loop through the bucket and render. All is well and the kittens are drinking milk...

However, all this sorting every frame can be a little computationaly expensive especially when you have alot of objects, and from what ive seen, the fastest sort is the Radix sort which has a performance of O(n). My Contribution has an init sorting time of O (n log m) whereby n is the number of objects and m is the number of states, but during run time (i.e. frame to frame), there is no overhead. This does come with a limitation, i'll discuss that later.

As an example, lets say you have four states: Texture State, Lighting State, Pixel/Vertex State, and Depth state. They vary in their performance cost, but lets say for simplicity sake, that the above list is ordered from highest to lowest expense (nothing is better than actual costs, so do your own benchmarks and see which is more expensive)

Now during runtime you init a sort of tree, each node reflects a state change (i.e. an enable and a disable of that state) and the nodes contain a list of objects they hold (Look At the picture). You create your material with its fancy state changes and such, then you recurse the tree until you find that no more nodes below the current one support the material (a simple "isStateAvailable(Material)" method will do in each node that returns a boolean). In the diagram, if your object has Texture state, and lighting, the object will be stored in lighting node (on the left); As you can see, the tree is ordered from top-bottom and left-right. Now all of this is during init time, once you have found the tree, store the node of the render tree into the material.

When rendering, do "material.getStateNode().add(material);" for every object, then recurse the entire tree and draw the children at every node. No sorting during run time.

Now the only limitation of this is texture ID's, glBindTexture can also flag a validation and the expense is the equivilant of state switch, the only way left to do is do a proper sort on the children of the Texture State node. Now the diagram is just a little sub tree, next to the texture you would also have Lighting with its Depth and Shader children as nodes describe a glEnable function.

Whats your clever cool idea/algorithm ?

DP

Friends don't let friends make MMORPGs.

Blog | Volatile-Engine
Offline princec

JGO Kernel


Medals: 364
Projects: 3
Exp: 16 years


Eh? Who? What? ... Me?


« Reply #1 - Posted 2006-04-07 14:39:05 »

My clever algorithm uses RadixSort too, to sort a triangle pump. You pour in as many Triangles as you like, and each Triangle is associated with a RenderingState.

The triangle pump then sorts the triangles on rendering state but in no concretely defined manner, up to "n" times. The advantage of the radix sort algorithm is that it can be run sequentially to sort any number of times, y'see. The final sorting pass at the very least sorts triangles by their first index to generally get the best automatic stripification possible (hint to GL programmers: triangle strips are no longer worth doing; just use GL_TRIANGLES and you'll find the hardware vertex cache does it all for you).

Because figuring out what state changes are expensive and what changes aren't is largely guesswork in OpenGL, the trick is not to think too hard about it all. I think in terms of only 3 different levels: memory uploads to GL, being the most expensive - so that's texture state changes basically; serverside state changes; then clientside state changes.

There are states above and below that which relate to the order in which things should be drawn. A simple integer for a rendering order, for example.

Code's all buried in SPGL somewhere.

Cas Smiley

Offline darkprophet

Senior Member




Go Go Gadget Arms


« Reply #2 - Posted 2006-04-07 21:25:10 »

I thought the reason for using triangle strips is to reduce the data you send down the pipe ? It would be around a third since you only need to define an extra point for a further triangle...

Anyways, keep them coming!

DP

Friends don't let friends make MMORPGs.

Blog | Volatile-Engine
Games published by our own members! Check 'em out!
Legends of Yore - The Casual Retro Roguelike
Offline princec

JGO Kernel


Medals: 364
Projects: 3
Exp: 16 years


Eh? Who? What? ... Me?


« Reply #3 - Posted 2006-04-08 11:15:22 »

If you think about how the driver actually sends triangles you'll realise why strips are an anachronism: these days people don't use glBegin(GL_TRIANGLES), instead they use glDrawElements using an array of indices. The GPU has the last few converted and transformed points available in a hardware cache and associated with an index number; so if you simply send a triangle 1, 2, 3 followed by another triangle 2, 3, 4, you'll find that vertices 2 & 3 are already transformed and in the hardware cache, so the GPU doesn't need any of the vertex data from the client apart from vertex 4 - in other words exactly the same operation as if you've sent a triangle strip, it just did the work for you. So sort on one of the triangle indices and you'll find that 90% of your triangles are automatically stripified. Result Smiley

Cas Smiley

Offline Orangy Tang

JGO Kernel


Medals: 56
Projects: 11


Monkey for a head


« Reply #4 - Posted 2006-04-08 12:44:30 »

I thought the reason for using triangle strips is to reduce the data you send down the pipe ? It would be around a third since you only need to define an extra point for a further triangle...
It'll only be a third less data if you get perfect stripification. If you look at the output from Stripe thats obviously not likely. Every strip begin/end is going to add quite a bit of overhead as you either make another draw call or add in a whole bunch of degenerate tris.

[ TriangularPixels.com - Play Growth Spurt, Rescue Squad and Snowman Village ] [ Rebirth - game resource library ]
Online Spasi
« Reply #5 - Posted 2006-04-08 14:40:27 »

So sort on one of the triangle indices and you'll find that 90% of your triangles are automatically stripified.

90% sounds excessive. Finding the optimal indices order for any mesh is an NP-complete problem. There's also the problem of different cache sizes on different GPUs.

The way I deal with this is a bit tricky, but might be interesting to some of you. When a dotXSI mesh is converted to our internal format, I calculate the average cache miss rate per triangle (ACMR). This is calculated by going through the indices, calculating how many misses we had (for a given cache size) and dividing by the triangle count (IIRC). The result is a number between 0.5 and 3.0 (the lower the better). Then I run the indices through a little program that is used for solving NP-complete sorting problems (can't remember which exactly, will re-post when I get back from Crete). The result is a new indices order, that produces a new ACMR. This is done recursively until it can't lower the ACMR any further (a user specified threshold applies here).

With the above technique I've seen models with a >2.0 ACMR fall to under 1.0. And even though the ACMR is calculated using a specific cache size, the final order is independent of it, so the optimization is general.
Offline princec

JGO Kernel


Medals: 364
Projects: 3
Exp: 16 years


Eh? Who? What? ... Me?


« Reply #6 - Posted 2006-04-08 23:37:04 »

The nifty thing about simply sorting on triangle index A is that it can be done on the fly, is incredibly fast, amazingly simple to understand, and 95% as effective in 95% of situations Smiley

Cas Smiley

Offline g666

Junior Member





« Reply #7 - Posted 2006-04-09 13:13:22 »

just wondering, would that work with lots of polys, isnt it a bit expensive to sort 10s of 1000s of tris every frame?

desperately seeking sanity
Offline princec

JGO Kernel


Medals: 364
Projects: 3
Exp: 16 years


Eh? Who? What? ... Me?


« Reply #8 - Posted 2006-04-09 19:04:14 »

Nope, it's a very inexpensive operation to perform considering the performance gains on the GPU, and the fact it's just the last step in a chain of radixsorts (imagine you had 6 state layers, you'd add this sort at the end to make it 7, so total sorting time would be increased by 1/6th). O'course you might want to benchmark it to be sure.

Cas Smiley

Online Spasi
« Reply #9 - Posted 2006-04-10 21:42:20 »

Link to paper and software: Universal Rendering Sequences for Transparent Vertex Caching of Progressive Meshes. I'm using the recursive cut algorithm, although the MLA one should be a little better.

Cas, you're right, sorting on index A is amazingly simple and fast, especially if it needs to be done at runtime. But it doesn't even come close to what the above algorithms can do. The difference is tiny on simple meshes, but my tests on more complex meshes (> 1000 triangles) showed significant gains. For example, I didn't see any ACMR under ~1.5 with index A sorting, whereas the recursive cut could easily get around 0.9.

Highly recommended for offline use. Wink
Games published by our own members! Check 'em out!
Legends of Yore - The Casual Retro Roguelike
Offline princec

JGO Kernel


Medals: 364
Projects: 3
Exp: 16 years


Eh? Who? What? ... Me?


« Reply #10 - Posted 2006-04-11 11:52:36 »

I bet the sort might even be more effective if you sort on the mean average of the vertex indices as well. Anyway - it's the fastest technique I can think of that works in realtime. Very handy if the entire scene is constructed from dynamic geometry.

Cas Smiley

Offline erikd

JGO Ninja


Medals: 16
Projects: 4
Exp: 14 years


Maximumisness


« Reply #11 - Posted 2006-04-24 15:51:39 »

I was thinking of having 2 of such sorted lists of openGL objects: One for opaque objects, and one for blended objects.
My reasoning was that you need to draw the blended objects always at last, so you only have to do a sort on depth on the blended object list and do other kinds of state sorting on the opaque object list. You can then simply draw the sorted opaque object list first, and the blended object list at last. Does this make sense?

Offline Riven
« League of Dukes »

JGO Overlord


Medals: 783
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #12 - Posted 2006-04-24 17:47:01 »

Yes, that's even pretty normal these days.

You can/should go even further, and *coarsely sort* front-to-back the objects in your opaque queue, so that overdraw is reduced (depth test).

This sorting should be like putting every object nearer than N meter in queue 1, and all others in queue 2.



Another tip:
If you have lots of small low-poly items (like grass, small rocks, etc) don't render them on a per-item basis, but use the data from the vertex-arrays and the indices to construct a new geometry with new indices containing all items of that type. Relatively easy to do, and in my scenes gave a performance-boost of factor 20. If you want controllable waving grass (no wicked vertex-shader stuff), without uploading new data to the gpu, try hardware-accelerated keyframe-interpolation, which is also pretty easy to figure out.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline oNyx

JGO Coder


Medals: 2


pixels! :x


« Reply #13 - Posted 2006-04-25 08:55:13 »

>hardware-accelerated keyframe-interpolation

Whats that? (any pointers are welcome)

弾幕 ☆ @mahonnaiseblog
Offline Riven
« League of Dukes »

JGO Overlord


Medals: 783
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #14 - Posted 2006-04-25 10:16:17 »

Use your creativity to send two vertex-arrays and two normal-arrays to the gpu, and interpolate between them in a vertex-shader. Kiss


P.S.
Knowing there *is* a solution to a problem, makes the brain more creative in finding ways to discover it.

P.P.S
Don't waste your time on google, as nobody seems to have figured it out and published it as of yet.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline oNyx

JGO Coder


Medals: 2


pixels! :x


« Reply #15 - Posted 2006-04-25 10:36:12 »

Now I feel cheated, because you wrote "no wicked vertex-shader stuff" Tongue

Using that kind of shaders would ramp up the system requirements a bit, doesnt it? Can those intel integrated shitsets (oops typo) or something like a gf2 do it?

弾幕 ☆ @mahonnaiseblog
Offline Riven
« League of Dukes »

JGO Overlord


Medals: 783
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #16 - Posted 2006-04-25 10:52:04 »

I meant 'randomly' offseting vertices,  or some sin() function that normally looks very artificial. That was the "wicked" part.

And yes, it might ramp up the requirements, but heck, you can have a fall-back mode. With such gfx-cards, poly-count would normally be so low, that a cpu might as well do the interpolation and resend the vertex-array every frame, per item. Err, we were talking about stuff like animated grass, just make it static then.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline princec

JGO Kernel


Medals: 364
Projects: 3
Exp: 16 years


Eh? Who? What? ... Me?


« Reply #17 - Posted 2006-04-25 12:57:36 »

Fallback modes = 2x the work. Far better to just stick to some nice minimum guaranteed set of specs and work to it. Well, maybe not better, just more effective use of your time. Or better still get someone at Xith / JME to do it Wink

Cas Smiley

Offline oNyx

JGO Coder


Medals: 2


pixels! :x


« Reply #18 - Posted 2006-04-25 13:27:15 »

Yea... I also think that there is very little point in making something work even faster on fast machines. Its not like anyone would care if they get 200 or 400 fps.

I just wondered about it, because Kev said that keyframe interpolation was quite the performance sucker... even if its only a handfull of lowpoly models and well, even Q2 managed to do that just fine on really lame hardware. So, I thought that there might be some better way to do it (w/o using anything fancy).

弾幕 ☆ @mahonnaiseblog
Offline Riven
« League of Dukes »

JGO Overlord


Medals: 783
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #19 - Posted 2006-04-25 13:48:53 »

Quote
So, I thought that there might be some better way to do it

Compare inefficient javacode (working on arrays / vec-objects) to lean-and-mean SIMD enabled C++ code. Performance difference is roughly factor 2-3.
Edit: this is based on actual realworld cases, using native libraries in Java.

--

Try rendering >256 keyframe animated guys/trees, each with >2000 polys. Now the CPU will just be swamped (quad-core comes in handy) with LERPs, the GPU will laugh at it, easily pumping over 100fps. At least that's what my fairly 'old' 9700pro thinks of it. It by far beats bone-based animation as the vertex-shader is only 6 lines of trival code (fetching data + 2x lerp (V+N)).


Quote
"Fallback modes = 2x the work"
It's only a few minutes to setup the fallback mode: you already have the index of frame X and frame Y, and the weight (from the original code). You lerp the 2 FloatBuffers into a 'global' 3rd and update a VBO (or upload a VA). Double the peanuts.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline erikd

JGO Ninja


Medals: 16
Projects: 4
Exp: 14 years


Maximumisness


« Reply #20 - Posted 2006-04-25 23:06:05 »

Why do fallback code for something which you need a fast GPU for anyway? Wouldn't the fallback mode just suck, performance wise?
I guess if you do a really wicked effect with the GPU, you'd need to write something completely different as a fall back ('cheat' a similar looking effect), or the fall back would be to just leave the effect out, no?

Offline Riven
« League of Dukes »

JGO Overlord


Medals: 783
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #21 - Posted 2006-04-25 23:10:04 »

If you have keyframe animation for your characters, you have to do it some way, either on the CPU or on the GPU

on the CPU it is "kinda costly", but very possible on low-poly models
on the GPU it is "almost free"

So yes, a fallback mode is - in this case - a justified feature.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline darkprophet

Senior Member




Go Go Gadget Arms


« Reply #22 - Posted 2006-04-26 00:41:18 »

Quote
Compare inefficient javacode (working on arrays / vec-objects) to lean-and-mean SIMD enabled C++ code. Performance difference is roughly factor 2-3.
Edit: this is based on actual realworld cases, using native libraries in Java.

Your forgetting JNI overhead (which at 1.6 is around 1100ns), and the perf factor grows to around 4.3ish with sqrt in the equation. The hardest part is actually getting the data to be nice and linear in the float buffers to avoid cache misses in the FIFO, this brings a whole new light to the word "complex"...

And how are you sending two vertex arrays and two normal arrays? Unless your filling the texture coordinate slots with data, then I dont see how your doing it. Ofcourse if you are filling the texture coordinates with data, then you've got a whole can of worms to deal with that makes it really really restrictive for anything more useful than "grass" (which you can use pseudo-instantiation for anyway)

DP

Friends don't let friends make MMORPGs.

Blog | Volatile-Engine
Offline kevglass

JGO Kernel


Medals: 159
Projects: 23
Exp: 18 years


Coder, Trainee Pixel Artist, Game Reviewer


« Reply #23 - Posted 2006-04-26 03:10:13 »

Quote
It's only a few minutes to setup the fallback mode: you already have the index of frame X and frame Y, and the weight (from the original code). You lerp the 2 FloatBuffers into a 'global' 3rd and update a VBO (or upload a VA). Double the peanuts.

Now I might not know as much about GL performance as you chaps but I tried this recently. First, my low end card doesn't support VBO. So, you end up pumping a VA of a full model to the card every interpolation step (most people seem to do this every frame for uber smoothness). Even if you have a VBO - if you're using something like a quake model then pretty much every vertex is changing every step so you're not actually going to save much by updating partial VBOs (given you've got to work out what to change aswell).

The problem with fallbacks is they have to at least be serviceable (i.e. look reasonable and run as fast as the "real" version). The right option to me is to choose who you're trying to hit and go for their hardware.

Kev

PS. Got my pseudo-interpolation running quick as a fall back by generating a lot of display lists at initialisation time - interpolating between the key frames read from disk - and then using these at runtime. Means you've only got a set of animations and you have to choose your interpolation interval up. This costs lots of graphics memory but this is something even low end cards tend to have ample of Smiley

Offline Riven
« League of Dukes »

JGO Overlord


Medals: 783
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #24 - Posted 2006-04-26 03:49:50 »

Do you know you can embed VAs in your displaylists, to take advantage of indexed-geometry? (saving quite some vRAM)

It can be done like this:



1  
2  
3  
4  
5  
glVertexPointer(...);

// begin list
glDrawElements(...);
// end list




Note that assigning the VA-pointers must not be done inside the list, according to the OpenGL-spec.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline darkprophet

Senior Member




Go Go Gadget Arms


« Reply #25 - Posted 2006-05-18 23:07:07 »

Not really a performance related feature, but more of a trick...

If your making an FPS, disable writing to the depth buffer when drawing your gun and draw the gun last. This saves you from the horrible effect of the gun going into the stuff around it (walls, trees...etc), it looks reasonable. Only bad point is when you have shadows, you'll see the shadow of the gun on the wall and the point of the gun on the wall with the shadow and the gun touching, giving the impression that the gun is right up to the wall. But if you keep rotating, the player would think the gun should have gone into the wall, but it doesn't.

If you keep the gameplay nice, the casual player wouldn't even notice it in the slightest.

DP Smiley

Friends don't let friends make MMORPGs.

Blog | Volatile-Engine
Pages: [1]
  ignore  |  Print  
 
 
You cannot reply to this message, because it is very, very old.

 

Add your game by posting it in the WIP section,
or publish it in Showcase.

The first screenshot will be displayed as a thumbnail.

TehJavaDev (13 views)
2014-08-28 18:26:30

CopyableCougar4 (25 views)
2014-08-22 19:31:30

atombrot (38 views)
2014-08-19 09:29:53

Tekkerue (31 views)
2014-08-16 06:45:27

Tekkerue (30 views)
2014-08-16 06:22:17

Tekkerue (19 views)
2014-08-16 06:20:21

Tekkerue (29 views)
2014-08-16 06:12:11

Rayexar (66 views)
2014-08-11 02:49:23

BurntPizza (42 views)
2014-08-09 21:09:32

BurntPizza (34 views)
2014-08-08 02:01:56
List of Learning Resources
by Longor1996
2014-08-16 10:40:00

List of Learning Resources
by SilverTiger
2014-08-05 19:33:27

Resources for WIP games
by CogWheelz
2014-08-01 16:20:17

Resources for WIP games
by CogWheelz
2014-08-01 16:19:50

List of Learning Resources
by SilverTiger
2014-07-31 16:29:50

List of Learning Resources
by SilverTiger
2014-07-31 16:26:06

List of Learning Resources
by SilverTiger
2014-07-31 11:54:12

HotSpot Options
by dleskov
2014-07-08 01:59:08
java-gaming.org is not responsible for the content posted by its members, including references to external websites, and other references that may or may not have a relation with our primarily gaming and game production oriented community. inquiries and complaints can be sent via email to the info‑account of the company managing the website of java‑gaming.org
Powered by MySQL Powered by PHP Powered by SMF 1.1.18 | SMF © 2013, Simple Machines | Managed by Enhanced Four Valid XHTML 1.0! Valid CSS!