xinaesthetic
Full Member   Posts: 204 Medals: 1
|
 |
«
Reply #30 on:
2010-01-19 08:10:49 » |
|
I can't fathom it either. It proved quite a distraction from other things I should be doing; it was nearly 3am by the time I made my post last night, by which time I was starting to doubt the validity of my judgement... I haven't been totally rigorous but I can't see a major flaw in my methodology, and now jezek2 seems to be finding the same.
I was running 10,000 particles in some of my tests, 30,000 in others. Either is enough to make particles easily the most expensive thing in the app; framerates around 30fps (much faster with particles off).
Interesting point about the pipeline... I've been planning to make some changes to the way my rendering works that would make implementing something along the lines of jezek2's suggestion very easy.
At some point, I might do my particle animation on the GPU instead... that really should be faster. Also, I've certainly seen big gains going away from immediate mode in other parts of the program.
|
|
|
|
|
jezek2
Sr. Member   Posts: 354 Medals: 3
|
 |
«
Reply #31 on:
2010-01-19 08:21:22 » |
|
For the record, I'm using it for quite a small number of vertices (much smaller than the amount xinaesthetic mentioned). It's also not primarily for particles, though I currently render the few particles I have using immediate mode too.
|
|
|
|
|
Riven
« League of Dukes » JGO Kernel      Posts: 5509 Medals: 204
Hand over your head.
|
 |
«
Reply #32 on:
2010-01-19 08:27:30 » |
|
Vertex arrays: 17ms!!! (glMapBuffer was 30ms)
javaSideBuffer.clear();
FloatBuffer fb = javaSideBuffer.order(ByteOrder.nativeOrder()).asFloatBuffer();
FloatBuffer vBuffer = (FloatBuffer) fb.slice().limit(fb.capacity() >> 1);
FloatBuffer cBuffer = (FloatBuffer) fb.slice().position(fb.capacity() >> 1);
for (int i = 0; i < 256; i++) {
    glVertexPointer(3, 0, vBuffer);
    glColorPointer(3, 0, cBuffer);
    glDrawArrays(GL_TRIANGLES, 0, quadCount * 3 * 2);
}
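The slicing in this snippet produces two views over one backing buffer: the first half for vertex positions and the second half for colours. Below is a GL-free sketch of just that step, assuming the same 50/50 split as in the post. The class and method names (SplitBufferSketch, split) are made up for illustration and are not part of LWJGL.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

final class SplitBufferSketch {
    // Returns { vertexView, colourView }; both share storage with 'bytes'.
    static FloatBuffer[] split(ByteBuffer bytes) {
        bytes.clear();
        FloatBuffer fb = bytes.order(ByteOrder.nativeOrder()).asFloatBuffer();
        int half = fb.capacity() >> 1;

        FloatBuffer vertices = fb.slice();
        vertices.limit(half);      // first half: positions 0 .. half-1

        FloatBuffer colours = fb.slice();
        colours.position(half);    // second half: positions half .. capacity-1

        return new FloatBuffer[] { vertices, colours };
    }

    public static void main(String[] args) {
        ByteBuffer bytes = ByteBuffer.allocateDirect(12 * Float.BYTES);
        FloatBuffer[] views = split(bytes);
        System.out.println("vertex floats: " + views[0].remaining());
        System.out.println("colour floats: " + views[1].remaining());
    }
}
```

Calling limit() and position() as separate statements avoids the (FloatBuffer) casts, which are only needed on older Java versions where those methods return the base Buffer type.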
|
|
|
|
|
|
Eli Delventhal
« League of Dukes » JGO Kernel      Posts: 3478 Medals: 39
Game Engineer
|
 |
«
Reply #33 on:
2010-01-19 10:51:48 » |
|
Quote from Riven: Vertex arrays: 17ms!!! (glMapBuffer was 30ms)
It is recommended on OpenGL's site that you use glDrawArrays instead of immediate mode, so I'm not at all surprised.
|
See my work: OTC Software. Currently Working On: Secret project... I edit JGO in production, because I simply don't waste time writing bugs
|
|
|
princec
« League of Dukes » JGO Kernel      Posts: 7800 Medals: 77
Eh? Who? What? ... Me?
|
 |
«
Reply #34 on:
2010-01-19 11:39:53 » |
|
Err... you should be surprised, because using plain old fashioned vertex arrays, he's got 2x the performance of VBOs. Which is exactly the opposite of what just happened to me in my sprite engine. Admittedly Riven's code is pretty much a microbenchmark and my sprite engine is a real-world application doing real things, so possibly my results are more relevant, though I need to test on the Mac, ATI cards, Intel cards, and various PC configurations before I draw any firm conclusions. Cas 
|
|
|
|
DzzD
|
 |
«
Reply #35 on:
2010-01-19 12:02:34 » |
|
Quote from Riven: Vertex arrays: 17ms!!! (glMapBuffer was 30ms)
Hehe, not really surprised. I would also be pretty confident with a compiled display list (indeed, one without redundant glBegin calls).
|
|
|
|
Spasi
JGO Ninja    Posts: 566 Medals: 22
Molon Lave
|
 |
«
Reply #36 on:
2010-01-19 12:10:34 » |
|
Riven, could you try to submit 10 times (or more) the vertex data per iteration and post another comparison?
Btw, we can't really compare display lists here, this is rendering of dynamic geometry submitted to the GPU each frame (each iteration in Riven's code represents a rendered frame).
|
|
|
|
|
DzzD
|
 |
«
Reply #37 on:
2010-01-19 12:13:54 » |
|
Quote from Spasi: Btw, we can't really compare display lists here, this is rendering of dynamic geometry submitted to the GPU each frame (each iteration in Riven's code represents a rendered frame).
Yes, that's why I suppose that for static geometry a display list would have been faster (way faster, as there is only one JNI call for thousands of draws), so everything has its own usage depending on the application.
NB: mixing different display lists (recompiling them on the fly and reordering them too) can also work pretty well for dynamic scene rendering, but a comparison can still be made, in a certain manner, by drawing the same thing to the screen.
|
|
|
|
Riven
« League of Dukes » JGO Kernel      Posts: 5509 Medals: 204
Hand over your head.
|
 |
«
Reply #38 on:
2010-01-19 12:26:49 » |
|
(16*16*2 = 512 tris) * 256 iterations
VA (normal): 6ms
VBO (mapped): 14ms <=
VBO (subdata): 14ms <=
(32*32*2 = 2K tris) * 256 iterations
VA (normal): 17ms
VBO (mapped): 30ms
VBO (subdata): 51ms <=
(64*64*2 = 8K tris) * 256 iterations
VA (normal): 65ms
VBO (mapped): 182ms <=
VBO (subdata): 122ms
(128*128*2 = 32K tris) * 256 iterations
VA (normal): 415ms
VBO (mapped): 1225ms <=
VBO (subdata): 762ms
It depends on the data size which is the slowest.
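To compare rows of different sizes, the timings can be turned into triangles per millisecond. The sketch below does that arithmetic for the largest case only, using the figures posted above. The class and method names are illustrative, and it assumes each timing covers all 256 iterations.

```java
final class ThroughputSketch {
    // Triangles submitted per millisecond over the whole run.
    static double trisPerMs(int trisPerIteration, int iterations, double elapsedMs) {
        return (double) trisPerIteration * iterations / elapsedMs;
    }

    public static void main(String[] args) {
        int tris = 128 * 128 * 2;   // the "32K tris" case
        int iterations = 256;
        System.out.printf("VA (normal):   %.0f tris/ms%n", trisPerMs(tris, iterations, 415));
        System.out.printf("VBO (mapped):  %.0f tris/ms%n", trisPerMs(tris, iterations, 1225));
        System.out.printf("VBO (subdata): %.0f tris/ms%n", trisPerMs(tris, iterations, 762));
    }
}
```

For this data size the vertex-array path works out to roughly 20,000 triangles per millisecond, about three times the mapped-VBO figure.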
|
|
|
|
Eli Delventhal
« League of Dukes » JGO Kernel      Posts: 3478 Medals: 39
Game Engineer
|
 |
«
Reply #39 on:
2010-01-19 12:36:21 » |
|
Quote from princec: Err... you should be surprised, because using plain old fashioned vertex arrays, he's got 2x the performance of VBOs. Which is exactly the opposite of what just happened to me in my sprite engine. Admittedly Riven's code is pretty much a microbenchmark and my sprite engine is a real-world application doing real things, so possibly my results are more relevant, though I need to test on the Mac, ATI cards, Intel cards, and various PC configurations before I draw any firm conclusions.
Yeah, but you're never ever supposed to use immediate mode anymore. So if you're using a combination of VBOs and immediate mode, then it makes sense for it to be slower than glDrawArrays or glDrawElements.
|
|
|
|
|
|
princec
« League of Dukes » JGO Kernel      Posts: 7800 Medals: 77
Eh? Who? What? ... Me?
|
 |
«
Reply #40 on:
2010-01-19 13:16:36 » |
|
Riven, you're throwing away the mapped Buffer each time. Can you try your tests passing the previous buffer into the method so that it can be re-used if possible? Also - each iteration of the test isn't one frame, it's one object. Without a swapbuffers in there, and a few other state changes, this really isn't testing much of use. Cas 
|
|
|
|
Riven
« League of Dukes » JGO Kernel      Posts: 5509 Medals: 204
Hand over your head.
|
 |
«
Reply #41 on:
2010-01-19 13:29:12 » |
|
Quote from Eli Delventhal: Yeah but you're never ever supposed to use immediate mode anymore. So if you're using a combination of VBOs and immediate mode then it makes sense to be slower than glDrawArrays or glDrawElements.
We are comparing vertex arrays <=> VBOs, not glBegin/glEnd <=> VBOs (unless by 'immediate mode' you mean vertex arrays, but I'm fairly sure that is not correct).
|
|
|
|
Riven
« League of Dukes » JGO Kernel      Posts: 5509 Medals: 204
Hand over your head.
|
 |
«
Reply #42 on:
2010-01-19 13:33:13 » |
|
Quote from princec: Riven, you're throwing away the mapped Buffer each time. Can you try your tests passing the previous buffer into the method so that it can be re-used if possible?
I'm not throwing it away...
glBufferDataARB(GL_ARRAY_BUFFER_ARB, byteCount, GL_STREAM_DRAW_ARB);
ByteBuffer driverSideBuffer = null;
for (int i = 0; i < 256; i++) {
    driverSideBuffer = glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB, byteCount, driverSideBuffer);
    javaSideBuffer.clear();
    driverSideBuffer.clear();
    driverSideBuffer.put(javaSideBuffer);
    glUnmapBufferARB(GL_ARRAY_BUFFER_ARB);
    ...
}
Quote from princec: Also - each iteration of the test isn't one frame, it's one object. Without a swapbuffers in there, and a few other state changes, this really isn't testing much of use.
Isn't this about throughput? If you really want to take everything into account, post a demo that can toggle among the 3 modes. Too much work? Well, I'm lazy too
|
|
|
|
princec
« League of Dukes » JGO Kernel      Posts: 7800 Medals: 77
Eh? Who? What? ... Me?
|
 |
«
Reply #43 on:
2010-01-19 14:09:25 » |
|
Meh, what do I know anyway  My sprite engine (well, it's more of a 2d scenegraph now I suppose) is faster, that's all I care about! Cas 
|
|
|
|
lhkbob
JGO Neuromancer     Posts: 1111 Medals: 30
|
 |
«
Reply #44 on:
2010-01-20 00:13:22 » |
|
Perhaps the valuable lesson for now is that VAs are still valuable for very dynamic geometry since the vbo's slower update doesn't outweigh the rendering benefits. From my personal experience VBOs offer a very consistent speed boost when they're not constantly being updated, but this is to be expected.
Also, for people who encourage the use of display lists, I've had troubling issues with them on Mac hardware. I've seen cases where rendering with them is significantly slower than VBOs or VAs, and cases where it's much faster. They have also been the cause of odd graphical glitches with the Mac windowing manager/compositor (or else the timing was a remarkable coincidence).
|
|
|
|
princec
« League of Dukes » JGO Kernel      Posts: 7800 Medals: 77
Eh? Who? What? ... Me?
|
 |
«
Reply #45 on:
2010-01-20 04:54:53 » |
|
The internal format of data in DLs is also very very slightly different in some cases than arrays or immediate mode, which leads to rendering artifacts. I forget where I read this - it was a long time ago - but it was the final nail in the coffin for me. Cas 
|
|
|
|
xinaesthetic
Full Member   Posts: 204 Medals: 1
|
 |
«
Reply #46 on:
2010-01-20 13:21:39 » |
|
Quote from lhkbob: Also, for people who encourage the use of display lists, I've had troubling issues with them on Mac hardware. I've seen cases where rendering with them is significantly slower than vbos or vas, and where it's much faster.
I concur that I've seen a program of mine making quite heavy use of display lists running much much worse on a powerbook with afaik semi-decent graphics than on a pretty basic older windows laptop with integrated graphics... it was ok on a PowerMac with I think 8600GT (as one would hope).
|
|
|
|
|
Riven
« League of Dukes » JGO Kernel      Posts: 5509 Medals: 204
Hand over your head.
|
 |
«
Reply #47 on:
2010-01-21 05:10:02 » |
|
Quote from princec: The internal format of data in DLs is also very very slightly different in some cases than arrays or immediate mode, which leads to rendering artifacts. I forget where I read this - it was a long time ago - but it was the final nail in the coffin for me.
I bet this only shows up when rendering identical VA/VBO and DL geometry, (maybe) causing z-fighting and slightly different edges in the rasterization step. Minecraft is built entirely using DLs, and from what I see it is 'good enough'. Maybe you should ask Markus Persson what mysterious bug reports he gets from his players.
|
|
|
|
princec
« League of Dukes » JGO Kernel      Posts: 7800 Medals: 77
Eh? Who? What? ... Me?
|
 |
«
Reply #48 on:
2010-01-21 06:39:35 » |
|
If the whole thing's DLs then that's probably perfectly fine. Cas 
|
|
|
|
Markus_Persson
JGO Kernel      Posts: 2092 Medals: 10
Mojang Specifications
|
 |
«
Reply #49 on:
2010-02-03 10:27:44 » |
|
Strangely, Minecraft runs way slower on my new computer than it did on my old one despite the graphics card being much better.
I'm going to try implementing a pure VBO rendering path.. some day.. soon, maybe.. This thread is very informative. I'll post results.
|
|
|
|
princec
« League of Dukes » JGO Kernel      Posts: 7800 Medals: 77
Eh? Who? What? ... Me?
|
 |
«
Reply #50 on:
2010-02-03 13:15:33 » |
|
That's quite possibly because DLs are being "emulated" now rather than a first-class driver citizen. If you see what I mean. Cas 
|
|
|
|
princec
« League of Dukes » JGO Kernel      Posts: 7800 Medals: 77
Eh? Who? What? ... Me?
|
 |
«
Reply #51 on:
2010-02-04 08:51:20 » |
|
Another followup:
Current bottleneck is glCheckError() calls at 35% native time. Obviously when performance testing (and even in a released game) I don't care to check for errors - but LWJGL inconveniently and definitely wrongly forces a call to glCheckError() on every display update. Unfortunately this causes a pipeline flush for some reason (hence the unreasonably lengthy time spent in this method). I hacked it out of LWJGL, so that it now only occurs when in LWJGL debug mode.
Next bottleneck - glMapBufferARB() is making a call to the driver to get the current size of the currently mapped buffer - again causing a pipeline flush/stall. Now that's taking 35% of my native time. So I switched to the latest LWJGL nightly (and reapplied the check error hack) and used the new glMapBuffer() method that takes a size argument - why the method doesn't take the capacity() of the buffer is a bit odd but there we go, as that's the only safe argument to actually pass in at this point as the limit() can change after the mapping is made. Some small improvement in framerate is made - good. I'm on the right track here definitely.
Now glMapBufferARB() itself is the actual bottleneck. Hmm. Why should this be taking 20% of my native time? Ahh of course - because it's probably locked by the GPU. The solution is very simple - double buffer it. So I now use two identically sized VBOs, and swap them each frame. The GPU reads from one while I write to the other. Suddenly I'm getting a 50% increase in frame rate. There may be a bit more to come if I try triple buffering the VBOs as well but I'm not quite sure if that's actually going to make any difference (even if my display is triple buffered).
Now StrictMath.floor() is the native bottleneck - grr - using a surprisingly large 5% of my native time for what I thought was a trivially intrinsified operation (turns out it's not - at least, not on my Turion). Anybody got a quickie workaround hack to avoid using floor()?
Cas
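The double-buffering idea described in this post (write to one VBO while the GPU reads from the other, then swap each frame) comes down to rotating through a small set of buffer ids. A minimal sketch of that bookkeeping follows. It makes no GL calls, and it assumes the ids were created elsewhere (for example with glGenBuffers). The VboRing class and the ids 7 and 9 are placeholders, not anything from Cas's engine or from LWJGL.

```java
final class VboRing {
    private final int[] ids;   // buffer object ids, assumed to be created elsewhere
    private int index = 0;

    VboRing(int... ids) {
        if (ids.length == 0) {
            throw new IllegalArgumentException("need at least one buffer id");
        }
        this.ids = ids.clone();
    }

    // Id of the buffer to map and fill during the current frame.
    int current() {
        return ids[index];
    }

    // Call once per frame, after drawing, to move on to the next buffer.
    void advance() {
        index = (index + 1) % ids.length;
    }

    public static void main(String[] args) {
        VboRing ring = new VboRing(7, 9); // placeholder ids
        for (int frame = 0; frame < 4; frame++) {
            System.out.println("frame " + frame + " writes to buffer " + ring.current());
            ring.advance();
        }
    }
}
```

Passing three ids instead of two gives the triple-buffered variant mentioned above. Whether that helps would have to be measured.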
|
|
|
|
Riven
« League of Dukes » JGO Kernel      Posts: 5509 Medals: 204
Hand over your head.
|
 |
«
Reply #52 on:
2010-02-04 09:00:11 » |
|
Quote from princec: Now StrictMath.floor() is the native bottleneck - grr - using a surprisingly large 5% of my native time for what I thought was a trivially intrinsified operation (turns out it's not - at least, not on my Turion). Anybody got a quickie workaround hack to avoid using floor()?
From Ken Perlin's simplex noise:
private static int fastfloor(double x) {
    return x>0 ? (int)x : (int)x-1;
}
|
|
|
|
Mickelukas
JGO Ninja    Posts: 687 Medals: 24
Java guru wanabee
|
 |
«
Reply #53 on:
2010-02-04 09:14:02 » |
|
Quote from princec: Current bottleneck is glCheckError() calls at 35% native time.
I believe Matzon said that the plan was to have an lwjgl and an lwjgl-debug (where one is used for development and one for production).
Also, what app/add-on do you use to check the native time? I mostly debug my apps using VantageAnalyzer, but that one only reports the method times inside the .class.
Mike
|
|
|
|
princec
« League of Dukes » JGO Kernel      Posts: 7800 Medals: 77
Eh? Who? What? ... Me?
|
 |
«
Reply #54 on:
2010-02-04 09:54:15 » |
|
Good old Ken. I'm using -Xprof - works well enough for my purposes (even though it does slow things down a little itself). Cas 
|
|
|
|
pjt33
JGO Strike Force    Posts: 890 Medals: 17
|
 |
«
Reply #55 on:
2010-02-04 10:02:08 » |
|
Quote from Riven: from Ken Perlin's simplex noise:
private static int fastfloor(double x) {
    return x>0 ? (int)x : (int)x-1;
}
Should be >=, not >, unless you want floor(0) == -1.
|
|
|
|
|
Markus_Persson
JGO Kernel      Posts: 2092 Medals: 10
Mojang Specifications
|
 |
«
Reply #56 on:
2010-02-04 10:27:58 » |
|
It's still not correct. Consider fastFloor(-1), which returns -2. But it is fast.
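A variant that also handles zero and negative whole numbers is sketched below: truncate first, then step down by one only when truncation rounded upwards. This is not from Ken Perlin's code. The class name is made up for illustration, and the sketch assumes the input fits in the int range.

```java
final class FastFloorSketch {
    // Floor of x as an int, assuming x lies within the int range.
    static int fastFloor(double x) {
        int xi = (int) x;              // the cast truncates toward zero
        return x < xi ? xi - 1 : xi;   // only negative non-integers need the step down
    }

    public static void main(String[] args) {
        double[] samples = { 2.7, 0.0, -0.5, -1.0, -1.5 };
        for (double s : samples) {
            System.out.println(s + " -> " + fastFloor(s));
        }
    }
}
```

It costs one extra comparison against the truncated value, so whether it still beats StrictMath.floor() on a given machine would need profiling.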
|
|
|
|
elias4444
Full Member   Posts: 231
|
 |
«
Reply #57 on:
2010-02-04 10:32:55 » |
|
Hey again Cas,
I went ahead and ran my benchmarker with XProf so I could compare it to your findings. I'm using an LWJGL nightly build from a few days ago (right after the ATI driver issue was fixed). I don't even get glCheckError() as a blip on the radar. The big one for me is MacOSXContextImplementation.nSwapBuffers (which kind of makes sense) and then glDrawArrays. Am I just missing something?
|
|
|
|
Spasi
JGO Ninja    Posts: 566 Medals: 22
Molon Lave
|
 |
«
Reply #58 on:
2010-02-04 10:33:48 » |
|
Quote from princec: Current bottleneck is glCheckError() calls at 35% native time. Obviously when performance testing (and even in a released game) I don't care to check for errors - but LWJGL inconveniently and definitely wrongly forces a call to glCheckError() on every display update. Unfortunately this causes a pipeline flush for some reason (hence the unreasonably lengthy time spent in this method). I hacked it out of LWJGL, so that it now only occurs when in LWJGL debug mode. Next bottleneck - glMapBufferARB() is making a call to the driver to get the current size of the currently mapped buffer - again causing a pipeline flush/stall. Now that's taking 35% of my native time. So I switched to the latest LWJGL nightly (and reapplied the check error hack)
That's weird, you shouldn't need any hack for that. Since 2.2.0 glCheckError() is only called during display update when org.lwjgl.util.Debug is set to true. See this post.
|
|
|
|
|
princec
« League of Dukes » JGO Kernel      Posts: 7800 Medals: 77
Eh? Who? What? ... Me?
|
 |
«
Reply #59 on:
2010-02-04 10:43:25 » |
|
Hm, I'm almost absolutely certain I had to put an if (LWJGLUtil.DEBUG) {} check around the call last night to stop it from checking. I will report back later when I get back from work. @4x4: if you're blocked in swapBuffers, that just means that the GPU still has some rendering to do to finish the current frame. Triple buffering can help a bit here I think, but that's buried in the drivers/OS and beyond LWJGL's direct control. Cas 
|
|
|
|
|