Java-Gaming.org    
  Immediate mode rendering is dead  (Read 22692 times)
Offline xinaesthetic

« Reply #30 - Posted 2010-01-19 14:10:49 »

I can't fathom it either.  It proved quite a distraction from other things I should be doing; it was nearly 3am by the time I made my post last night, by which time I was starting to doubt the validity of my judgement... I haven't been totally rigorous but I can't see a major flaw in my methodology, and now jezek2 seems to be finding the same.

I was running 10,000 particles in some of my tests, 30,000 in others.  Either is enough to make particles easily the most expensive thing in the app; framerates around 30fps (much faster with particles off).

Interesting point about the pipeline... I've been planning to make some changes to the way my rendering works that would make implementing something along the lines of jezek2's suggestion very easy.

At some point, I might do my particle animation on the GPU instead... that really should be faster.  Also, I've certainly seen big gains going away from immediate mode in other parts of the program.
Offline jezek2
« Reply #31 - Posted 2010-01-19 14:21:22 »

For the record, I'm using it for quite a small number of vertices (much smaller than the amount xinaesthetic mentioned). It's also not primarily for particles, though I currently render the few particles I have using immediate mode too.
Offline Riven
« Reply #32 - Posted 2010-01-19 14:27:30 »

Vertex arrays: 17ms!!! (glMapBuffer was 30ms)
    javaSideBuffer.clear();

    // View the whole byte buffer as floats in native byte order.
    FloatBuffer fb = javaSideBuffer.order(ByteOrder.nativeOrder()).asFloatBuffer();

    // First half holds vertex positions, second half holds colors.
    FloatBuffer vBuffer = (FloatBuffer) fb.slice().limit(fb.capacity() >> 1);
    FloatBuffer cBuffer = (FloatBuffer) fb.slice().position(fb.capacity() >> 1);

    for (int i = 0; i < 256; i++)
    {
       // Plain vertex arrays: the pointers refer directly to the Java-side buffer.
       glVertexPointer(3, 0, vBuffer);
       glColorPointer(3, 0, cBuffer);
       glDrawArrays(GL_TRIANGLES, 0, quadCount * 3 * 2); // two triangles per quad
    }

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
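
For anyone puzzling over the slice/limit/position lines in the snippet above, here is a minimal standalone sketch of the same buffer split using only java.nio, with no OpenGL calls. The class and method names (SplitBufferDemo, splitInHalf) are made up for illustration, and it uses duplicate() instead of slice() plus a cast, which should be equivalent here because the float view starts at position 0.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

final class SplitBufferDemo {
    /**
     * Splits one byte buffer into two float views over the same memory:
     * index 0 covers the first half (vertices), index 1 is positioned at
     * the start of the second half (colors).
     */
    static FloatBuffer[] splitInHalf(ByteBuffer bytes) {
        bytes.clear();
        FloatBuffer fb = bytes.order(ByteOrder.nativeOrder()).asFloatBuffer();
        int half = fb.capacity() >> 1;

        FloatBuffer first = fb.duplicate();
        first.limit(half);        // readable range: [0, half)

        FloatBuffer second = fb.duplicate();
        second.position(half);    // readable range: [half, capacity)

        return new FloatBuffer[] { first, second };
    }

    public static void main(String[] args) {
        FloatBuffer[] halves = splitInHalf(ByteBuffer.allocateDirect(96));
        System.out.println("vertex floats: " + halves[0].remaining());
        System.out.println("color floats:  " + halves[1].remaining());
    }
}
```

Both views share the underlying memory, so nothing is copied; only the position and limit differ.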
Offline Eli Delventhal

« Reply #33 - Posted 2010-01-19 16:51:48 »

[quote author=Riven]
Vertex arrays: 17ms!!! (glMapBuffer was 30ms)
[code from Reply #32 snipped]
[/quote]

It is recommended on OpenGL's site that you use glDrawArrays instead of immediate mode, so I'm not at all surprised.

See my work:
OTC Software
Offline princec

« Reply #34 - Posted 2010-01-19 17:39:53 »

Err... you should be surprised, because using plain old-fashioned vertex arrays, he's got 2x the performance of VBOs. Which is exactly the opposite of what just happened to me in my sprite engine. Admittedly Riven's code is pretty much a microbenchmark and my sprite engine is a real-world application doing real things, so possibly my results are more relevant, though I need to test on the Mac, ATI cards, Intel cards, and various PC configurations before I draw any firm conclusions.

Cas :)

Offline DzzD
« Reply #35 - Posted 2010-01-19 18:02:34 »

[quote author=Riven]
Vertex arrays: 17ms!!! (glMapBuffer was 30ms)
[code from Reply #32 snipped]
[/quote]

Hehe, not really surprised :)  I would also be pretty confident with a compiled display list (indeed, one without redundant glBegin calls).

Offline Spasi
« Reply #36 - Posted 2010-01-19 18:10:34 »

Riven, could you try to submit 10 times (or more) the vertex data per iteration and post another comparison?

Btw, we can't really compare display lists here, this is rendering of dynamic geometry submitted to the GPU each frame (each iteration in Riven's code represents a rendered frame).
Offline DzzD
« Reply #37 - Posted 2010-01-19 18:13:54 »

[quote author=Spasi]
Riven, could you try to submit 10 times (or more) the vertex data per iteration and post another comparison?

Btw, we can't really compare display lists here, this is rendering of dynamic geometry submitted to the GPU each frame (each iteration in Riven's code represents a rendered frame).
[/quote]
Yes, that's why I suppose that for static geometry a display list would have been faster (way faster, as there is only one JNI call for thousands of draws), so everything has its own use depending on the application. NB: mixing different display lists (recompiling them on the fly and reordering them too) can also work pretty well for dynamic scene rendering.

But the comparison can still be made, in a certain manner, by drawing the same thing to the screen.

Offline Riven
« Reply #38 - Posted 2010-01-19 18:26:49 »

(16*16*2 = 512 tris) * 256 iterations
    VA (normal): 6ms
    VBO (mapped): 14ms <=
    VBO (subdata): 14ms <=

(32*32*2 = 2K tris) * 256 iterations
    VA (normal): 17ms
    VBO (mapped): 30ms
    VBO (subdata): 51ms <=

(64*64*2 = 8K tris) * 256 iterations
    VA (normal): 65ms
    VBO (mapped): 182ms <=
    VBO (subdata): 122ms

(128*128*2 = 32K tris) * 256 iterations
    VA (normal): 415ms
    VBO (mapped): 1225ms <=
    VBO (subdata): 762ms

Which one is the slowest depends on the data size.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
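
To make the table above a bit easier to compare, here is a small sketch that turns the posted timings into throughput (triangles per millisecond) and into a slowdown factor relative to plain vertex arrays. The numbers are copied from the post; the class and method names are made up for illustration, and this says nothing about other hardware or drivers.

```java
final class BenchmarkRatios {
    // Triangle counts per iteration, from the post: 16*16*2, 32*32*2, 64*64*2, 128*128*2.
    static final int[] TRIS = { 512, 2048, 8192, 32768 };
    // Timings in ms from the post, per row: { VA (normal), VBO (mapped), VBO (subdata) }.
    static final int[][] MS = {
        { 6, 14, 14 },
        { 17, 30, 51 },
        { 65, 182, 122 },
        { 415, 1225, 762 },
    };
    static final int ITERATIONS = 256;
    static final String[] MODES = { "VA (normal)", "VBO (mapped)", "VBO (subdata)" };

    /** Triangles submitted per millisecond for the given data size and mode. */
    static double trisPerMs(int row, int mode) {
        return (double) TRIS[row] * ITERATIONS / MS[row][mode];
    }

    /** How many times slower the given mode is than plain vertex arrays. */
    static double slowdownVsVa(int row, int mode) {
        return (double) MS[row][mode] / MS[row][0];
    }

    public static void main(String[] args) {
        for (int row = 0; row < TRIS.length; row++) {
            System.out.println(TRIS[row] + " tris per iteration:");
            for (int mode = 0; mode < MODES.length; mode++) {
                System.out.printf("    %-14s %10.0f tris/ms, %.2fx the VA time%n",
                        MODES[mode], trisPerMs(row, mode), slowdownVsVa(row, mode));
            }
        }
    }
}
```

In these figures, mapped VBOs go from roughly 2.3x the VA time at 512 triangles to almost 3x at 32K triangles, while subdata ends up under 2x at the largest size.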
Offline Eli Delventhal

« Reply #39 - Posted 2010-01-19 18:36:21 »

[quote author=princec]
Err... you should be surprised, because using plain old fashioned vertex arrays, he's got 2x the performance of VBOs. Which is exactly the opposite of what just happened to me in my sprite engine. Admittedly Riven's code is pretty much a microbenchmark and my sprite engine is a real-world application doing real things, so possibly my results are more relevant, though I need to test on the Mac, ATI cards, Intel cards, and various PC configurations before I draw any firm conclusions.

Cas :)
[/quote]
Yeah, but you're never ever supposed to use immediate mode anymore. So if you're using a combination of VBOs and immediate mode then it makes sense for it to be slower than glDrawArrays or glDrawElements.

See my work:
OTC Software
Offline princec

« Reply #40 - Posted 2010-01-19 19:16:36 »

Riven, you're throwing away the mapped Buffer each time. Can you try your tests passing the previous buffer into the method so that it can be re-used if possible?

Also - each iteration of the test isn't one frame, it's one object. Without a swapbuffers in there, and a few other state changes, this really isn't testing much of use.

Cas :)

Offline Riven
« Reply #41 - Posted 2010-01-19 19:29:12 »

[quote author=Eli Delventhal]
Yeah but you're never ever supposed to use immediate mode anymore. So if you're using a combination of VBOs and immediate mode then it makes sense to be slower than glDrawArrays or glDrawElements.
[/quote]

We are comparing VertexArrays <=> VBO, not glBegin/glEnd <=> VBO

(unless by 'immediate mode' you mean vertex arrays, but I'm fairly sure that is not the correct term for them)

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline Riven
« Reply #42 - Posted 2010-01-19 19:33:13 »

[quote author=princec]
Riven, you're throwing away the mapped Buffer each time. Can you try your tests passing the previous buffer into the method so that it can be re-used if possible?
[/quote]
I'm not throwing it away...
    glBufferDataARB(GL_ARRAY_BUFFER_ARB, byteCount, GL_STREAM_DRAW_ARB);

    ByteBuffer driverSideBuffer = null;

    for (int i = 0; i < 256; i++)
    {
       // The previous ByteBuffer is passed back in so that it can be reused.
       driverSideBuffer = glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB, byteCount, driverSideBuffer);
       javaSideBuffer.clear();
       driverSideBuffer.clear();
       driverSideBuffer.put(javaSideBuffer); // copy the Java-side data into the mapped buffer
       glUnmapBufferARB(GL_ARRAY_BUFFER_ARB);
    ...
    }


[quote author=princec]
Also - each iteration of the test isn't one frame, it's one object. Without a swapbuffers in there, and a few other state changes, this really isn't testing much of use.
[/quote]
Isn't this about throughput? If you really want to take everything into account, post a demo that can toggle among the 3 modes. Too much work? Well, I'm lazy too :)

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline princec

« Reply #43 - Posted 2010-01-19 20:09:25 »

Meh, what do I know anyway :) My sprite engine (well, it's more of a 2D scenegraph now I suppose) is faster, that's all I care about!

Cas :)

Offline lhkbob

« Reply #44 - Posted 2010-01-20 06:13:22 »

Perhaps the valuable lesson for now is that VAs are still valuable for very dynamic geometry, since the VBO's rendering benefits don't outweigh its slower updates.  From my personal experience VBOs offer a very consistent speed boost when they're not constantly being updated, but this is to be expected.

Also, for people who encourage the use of display lists, I've had troubling issues with them on Mac hardware.  I've seen cases where rendering with them is significantly slower than VBOs or VAs, and cases where it's much faster.  They've also been the cause of (or very coincidentally unrelated to) odd graphical glitches with the Mac windowing manager/compositor.

Offline princec

« Reply #45 - Posted 2010-01-20 10:54:53 »

The internal format of data in DLs is also very very slightly different in some cases than arrays or immediate mode, which leads to rendering artifacts. I forget where I read this - it was a long time ago - but it was the final nail in the coffin for me.

Cas :)

Offline xinaesthetic

« Reply #46 - Posted 2010-01-20 19:21:39 »

[quote author=lhkbob]
Also, for people who encourage the use of display lists, I've had troubling issues with them on Mac hardware.  I've seen cases where rendering with them is significantly slower than vbos or vas, and where it's much faster.
[/quote]
I concur: I've seen a program of mine that makes quite heavy use of display lists run much, much worse on a PowerBook with (afaik) semi-decent graphics than on a pretty basic older Windows laptop with integrated graphics... It was OK on a PowerMac with, I think, an 8600GT (as one would hope).
Offline Riven
« Reply #47 - Posted 2010-01-21 11:10:02 »

[quote author=princec]
The internal format of data in DLs is also very very slightly different in some cases than arrays or immediate mode, which leads to rendering artifacts. I forget where I read this - it was a long time ago - but it was the final nail in the coffin for me.

Cas :)
[/quote]

I bet this is only showing when rendering identical VA/VBO and DL geometry, (maybe) causing z-fighting and slightly different edges in the rasterization step. Minecraft is built entirely using DLs, and from what I see it is 'good enough'. Maybe you should ask Markus Persson what mysterious bug reports he gets from his players.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline princec

« Reply #48 - Posted 2010-01-21 12:39:35 »

If the whole thing's DLs then that's probably perfectly fine.

Cas :)

Offline Markus_Persson

« Reply #49 - Posted 2010-02-03 16:27:44 »

Strangely, Minecraft runs way slower on my new computer than it did on my old one despite the graphics card being much better.

I'm going to try implementing a pure VBO rendering path.. some day.. soon, maybe.. This thread is very informative.
I'll post results.

Play Minecraft!
Offline princec

« Reply #50 - Posted 2010-02-03 19:15:33 »

That's quite possibly because DLs are being "emulated" now rather than a first-class driver citizen. If you see what I mean.

Cas :)

Offline princec

« Reply #51 - Posted 2010-02-04 14:51:20 »

Another followup:

Current bottleneck is glCheckError() calls at 35% native time. Obviously when performance testing (and even in a released game) I don't care to check for errors - but LWJGL inconveniently and definitely wrongly forces a call to glCheckError() on every display update. Unfortunately this causes a pipeline flush for some reason (hence the unreasonably lengthy time spent in this method). I hacked it out of LWJGL, so that it now only occurs when in LWJGL debug mode.

Next bottleneck - glMapBufferARB() is making a call to the driver to get the current size of the currently mapped buffer - again causing a pipeline flush/stall. Now that's taking 35% of my native time. So I switched to the latest LWJGL nightly (and reapplied the check error hack) and used the new glMapBuffer() method that takes a size argument - why the method doesn't take the capacity() of the buffer is a bit odd but there we go, as that's the only safe argument to actually pass in at this point as the limit() can change after the mapping is made. Some small improvement in framerate is made - good. I'm on the right track here definitely.

Now glMapBufferARB() itself is the actual bottleneck. Hmm. Why should this be taking 20% of my native time? Ahh of course - because it's probably locked by the GPU. The solution is very simple - double buffer it. So I now use two identically sized VBOs, and swap them each frame. The GPU reads from one while I write to the other.

Suddenly I'm getting a 50% increase in frame rate. There may be a bit more to come if I try triple buffering the VBOs as well but I'm not quite sure if that's actually going to make any difference (even if my display is triple buffered).

Now StrictMath.floor() is the native bottleneck - grr - using a surprisingly large 5% of my native time for what I thought was a trivially intrinsified operation (turns out it's not - at least, not on my Turion). Anybody got a quickie workaround hack to avoid using floor()?

Cas :)
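
The double-buffered VBO idea described in the post above (write to one buffer while the GPU reads from the other, swap each frame) boils down to rotating through a small set of buffer ids. A minimal sketch of just that bookkeeping follows. VboRing is a made-up helper, not part of LWJGL, and the ids are assumed to come from the usual buffer-generation call; it is not Cas's actual code.

```java
final class VboRing {
    private final int[] bufferIds;
    private int writeIndex;

    /** Takes two or more buffer ids (hypothetically obtained from OpenGL beforehand). */
    VboRing(int... bufferIds) {
        if (bufferIds.length < 2) {
            throw new IllegalArgumentException("need at least two buffers");
        }
        this.bufferIds = bufferIds.clone();
    }

    /** The buffer the CPU should bind, map and fill during this frame. */
    int current() {
        return bufferIds[writeIndex];
    }

    /** Call once per frame, after issuing the draw calls that read from current(). */
    void advance() {
        writeIndex = (writeIndex + 1) % bufferIds.length;
    }

    public static void main(String[] args) {
        VboRing ring = new VboRing(7, 9); // placeholder ids for illustration
        for (int frame = 0; frame < 4; frame++) {
            System.out.println("frame " + frame + " writes to buffer " + ring.current());
            ring.advance();
        }
    }
}
```

In a real render loop you would presumably bind ring.current() with glBindBufferARB, map and fill it, unmap, draw, and then call advance(). Passing three ids instead of two gives the triple-buffered variant mentioned in the post; whether that helps is something to measure.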

Offline Riven
« Reply #52 - Posted 2010-02-04 15:00:11 »

[quote author=princec]
Now StrictMath.floor() is the native bottleneck - grr - using a surprisingly large 5% of my native time for what I thought was a trivially intrinsified operation (turns out it's not - at least, not on my Turion). Anybody got a quickie workaround hack to avoid using floor()?
[/quote]

from Ken Perlin's simplex noise:
    // This method is a *lot* faster than using (int)Math.floor(x)
    private static int fastfloor(double x) {
        return x>0 ? (int)x : (int)x-1;
    }

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline Mike

« Reply #53 - Posted 2010-02-04 15:14:02 »

[quote author=princec]
Current bottleneck is glCheckError() calls at 35% native time.
[/quote]

I believe Matzon said that the plan was to have an lwjgl and an lwjgl-debug build (one used for development and one for production).

Also, what app/add-on do you use to check the native time? I mostly debug my apps using VantageAnalyzer but that one only tells the method times inside the .class.

Mike

My current game, Minecraft meets Farmville and goes online :)
State of Fortune | Discussion thread @ JGO
Offline princec

« Reply #54 - Posted 2010-02-04 15:54:15 »

Good old Ken.

I'm using -Xprof - works well enough for my purposes (even though it does slow things down a little itself).

Cas :)

Offline pjt33
« Reply #55 - Posted 2010-02-04 16:02:08 »

[quote author=Riven]
from Ken Perlin's simplex noise:
    // This method is a *lot* faster than using (int)Math.floor(x)
    private static int fastfloor(double x) {
        return x>0 ? (int)x : (int)x-1;
    }
[/quote]

Should be >=, not >, unless you want floor(0) == -1.
Offline Markus_Persson

« Reply #56 - Posted 2010-02-04 16:27:58 »

It's still not correct. Consider fastFloor(-1)
But it is fast. :)

Play Minecraft!
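
Putting the two corrections above together (floor(0) must be 0, and floor(-1) must be -1), one way to get a truncation-based floor that agrees with Math.floor is to truncate first and only subtract one when truncation rounded up. This is a sketch, not the version from the simplex noise code. It only holds for values that fit in an int, and it makes no claim about how much faster it is on any particular JVM.

```java
final class FastFloor {
    /** The version as posted in the thread, kept here only for comparison. */
    static int postedFastFloor(double x) {
        return x > 0 ? (int) x : (int) x - 1;
    }

    /**
     * Truncation-based floor. The cast rounds toward zero, so it is already
     * correct for x >= 0 and for negative whole numbers; only negative
     * values with a fractional part need the extra -1.
     */
    static int fastFloor(double x) {
        int truncated = (int) x;
        return x < truncated ? truncated - 1 : truncated;
    }

    public static void main(String[] args) {
        double[] samples = { -1.5, -1.0, -0.5, 0.0, 0.5, 1.0 };
        for (double s : samples) {
            System.out.println(s + ": posted=" + postedFastFloor(s)
                    + " fixed=" + fastFloor(s)
                    + " Math.floor=" + (int) Math.floor(s));
        }
    }
}
```

Changing only > to >= in the posted version fixes the zero case but still returns -2 for -1.0, which is the case Markus points out.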
Offline elias4444

« Reply #57 - Posted 2010-02-04 16:32:55 »

Hey again Cas,

I went ahead and ran my benchmarker with -Xprof so I could compare it to your findings. I'm using an LWJGL nightly build from a few days ago (right after the ATI driver issue was fixed). I don't even get glCheckError() as a blip on the radar. The big one for me is MacOSXContextImplementation.nSwapBuffers (which kind of makes sense) and then glDrawArrays. Am I just missing something?

Offline Spasi
« Reply #58 - Posted 2010-02-04 16:33:48 »

[quote author=princec]
Current bottleneck is glCheckError() calls at 35% native time. Obviously when performance testing (and even in a released game) I don't care to check for errors - but LWJGL inconveniently and definitely wrongly forces a call to glCheckError() on every display update. Unfortunately this causes a pipeline flush for some reason (hence the unreasonably lengthy time spent in this method). I hacked it out of LWJGL, so that it now only occurs when in LWJGL debug mode.

Next bottleneck - glMapBufferARB() is making a call to the driver to get the current size of the currently mapped buffer - again causing a pipeline flush/stall. Now that's taking 35% of my native time. So I switched to the latest LWJGL nightly (and reapplied the check error hack)
[/quote]

That's weird, you shouldn't need any hack for that. Since 2.2.0 glCheckError() is only called during display update when org.lwjgl.util.Debug is set to true. See this post.
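
The pattern being discussed here (only query the GL error state when a debug flag is set) can be sketched in plain Java as below. GuardedErrorCheck is a made-up class, not LWJGL's implementation. The error query is passed in as an IntSupplier so the sketch runs without an OpenGL context; in real code it would presumably be GL11::glGetError, with the flag read from the org.lwjgl.util.Debug system property mentioned above.

```java
import java.util.function.IntSupplier;

final class GuardedErrorCheck {
    private static final int NO_ERROR = 0; // value of GL_NO_ERROR

    private final boolean debug;
    private final IntSupplier errorQuery;

    GuardedErrorCheck(boolean debug, IntSupplier errorQuery) {
        this.debug = debug;
        this.errorQuery = errorQuery;
    }

    /** Intended to be called once per display update. */
    void afterDisplayUpdate() {
        if (!debug) {
            return; // skip the query entirely, so it cannot stall the pipeline
        }
        int error = errorQuery.getAsInt();
        if (error != NO_ERROR) {
            throw new IllegalStateException("GL error 0x" + Integer.toHexString(error));
        }
    }

    public static void main(String[] args) {
        boolean debug = Boolean.getBoolean("org.lwjgl.util.Debug");
        // Stand-in error source for illustration; always reports "no error".
        GuardedErrorCheck check = new GuardedErrorCheck(debug, () -> NO_ERROR);
        check.afterDisplayUpdate();
        System.out.println("error checking enabled: " + debug);
    }
}
```

With the flag off, the error source is never called at all, which is the whole point when the query itself is what shows up in the profile.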
Offline princec

« Reply #59 - Posted 2010-02-04 16:43:25 »

Hm, I'm almost absolutely certain I had to put an if (LWJGLUtil.DEBUG) {} check around the call last night to stop it from checking. I will report back later when I get back from work.

@4x4: if you're blocked in swapBuffers, that just means that the GPU still has some rendering to do to finish the current frame. Triple buffering can help a bit here I think, but that's buried in the drivers/OS and beyond LWJGL's direct control.

Cas :)
