FloatBuffer and Batching
ryeTech
« Posted 2014-04-06 00:47:04 »

After a few weeks, I have finally created my basic sprite batcher!
It uses a VBO (DYNAMIC usage hint) and an IBO (STATIC usage hint), where the IBO is pre-filled with the needed index data.
Whenever I need to draw a sprite, I have a method that places the vertex and texture data into a giant float array.
Then, when I reach my sprite limit or need to change textures, I shoot everything to the GPU!
Basically I call a single put method on a FloatBuffer, with my giant float array as the data source, and I draw everything needed.

    mySuperFloatBuffer.put(myGiantFloatArray);

So what's left? The only thing left... performance tests... unfortunately I don't like the numbers too much :(

I'm testing my batcher against its STATIC counterpart, meaning the VBO is set as STATIC, the vertex data (X Y Z) is pregenerated, and the FloatBuffer used by the VBO is prefilled.
In the dynamic version, I overwrite the FloatBuffer every frame with random vertex data from my giant float array using a single put call.

Now here are the numbers.
This is using 20,000 textured quads (sized 32 x 32) over an 800x480 screen space.

Dynamic (normal way): Delta Time = 0.048 seconds or 22.72 FPS
Static (testing-purpose way): Delta Time = 0.020 seconds or 50 FPS

So as you can see, that is a pretty large difference, so what gives?
I know I won't be able to get them completely matching, since one way changes the data every frame and the other is prefilled and never changes.
BUT I feel like my numbers should be a bit better.

My vertex and fragment shaders are extremely simple. The vertex shader just gets pretransformed vertex points. The fragment shader just samples the texture, with no manipulation of any kind.
I believe my problem lies on the CPU side, specifically in how I handle vertex data. This is what I am doing:

Raw vertex data goes into the giant float array, then one big copy using the put method places it in the FloatBuffer, before it is finally sent to the GPU using glBufferSubData.

    // In my draw method
    myGiantFloatArray[currentSize] = 0.0f;
    myGiantFloatArray[currentSize + 1] = 0.0f;
    myGiantFloatArray[currentSize + 2] = 0.0f;
    myGiantFloatArray[currentSize + 3] = 1.0f;
    /* other vertex data */

    //------------------------------------------------------

    // In my render (flush batch) method
    mySuperFloatBuffer.put(myGiantFloatArray);
    mySuperFloatBuffer.position(0);

    /* Other OpenGL stuff */
    glBufferSubData(/* params */);

    /* Any remaining stuff and the indexed draw call */

My real heavy hitter in the code above is the put, taking up 0.010 seconds or so of my precious time!
So finally, is there a better way to handle things on the CPU side? Any help would be greatly appreciated.
Let me know if I need to add more info on anything.

Thanks!

By the way, I'm using OpenGL ES 2.0, which does not have any glMapBuffer calls :(


princec
« Reply #1 - Posted 2014-04-06 11:57:45 »

The fundamental mistake you are making is to write your data out to a float array first. One of the whole points of VBOs is that you do not place your vertex data in an intermediary buffer. There are three reasons for this:

1. Firstly you've got to allocate twice the RAM for no reason. However this isn't as important as...
2. ... writing to that intermediary buffer just trashes your data caches. The idea of write-only VBOs is that the caches are bypassed and the data is written straight past them, leaving in the caches all the data you need to carry on computing vertex data, rather than constantly flushing it out.
3. ... and in any case you're still doing twice the memory movement, writing it out once, and then writing it out again.

Cas :)
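
A minimal sketch of the direct-write approach princec describes, assuming a direct, native-order FloatBuffer is the upload source. The class name DirectQuadWriter, the x, y, u, v vertex layout and the method names are hypothetical and are not taken from ryeTech's batcher. Whether this is actually faster than staging in a float[] depends on the VM, so treat it as something to measure rather than a recommendation.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

/** Hypothetical helper: writes quad vertices straight into a direct buffer. */
final class DirectQuadWriter {
    static final int FLOATS_PER_VERTEX = 4;               // x, y, u, v (assumed layout)
    static final int FLOATS_PER_QUAD = 4 * FLOATS_PER_VERTEX;
    static final int BYTES_PER_FLOAT = 4;

    private final FloatBuffer vertices;

    DirectQuadWriter(int maxQuads) {
        // Direct memory in native byte order, as GL bindings generally expect.
        vertices = ByteBuffer
                .allocateDirect(maxQuads * FLOATS_PER_QUAD * BYTES_PER_FLOAT)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
    }

    /** Starts a new batch: position = 0, limit = capacity. */
    void begin() {
        vertices.clear();
    }

    /** Appends one axis-aligned quad; no intermediate float[] is involved. */
    void draw(float x, float y, float w, float h) {
        vertices.put(x).put(y).put(0f).put(0f);
        vertices.put(x + w).put(y).put(1f).put(0f);
        vertices.put(x + w).put(y + h).put(1f).put(1f);
        vertices.put(x).put(y + h).put(0f).put(1f);
    }

    /** Flips the buffer so it spans exactly the floats written in this batch. */
    FloatBuffer end() {
        vertices.flip();
        return vertices;
    }

    public static void main(String[] args) {
        DirectQuadWriter writer = new DirectQuadWriter(2);
        writer.begin();
        writer.draw(10f, 20f, 32f, 32f);
        FloatBuffer ready = writer.end();
        System.out.println("floats ready for upload: " + ready.remaining());
    }
}
```

The buffer returned by end() is what would be handed to the GL upload call; that call is left out because it needs a live GL context.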

ryeTech
« Reply #2 - Posted 2014-04-07 01:24:56 »


Quote from princec:
> 1. Firstly you've got to allocate twice the RAM for no reason. However this isn't as important as...
> 2. ... writing to that intermediary buffer just trashes your data caches. The idea of write-only VBOs is that the caches are ignored and written straight past, leaving all the data in the caches that you need to carry on computing vertex data without just flushing it out constantly
> 3. ... and in any case you're still doing twice the memory movement, writing it out once, and then writing it out again.

I know I'm doing some bad things here, but I'm not sure how else to do it :(

The core reason I place everything into a giant float array first and then into the FloatBuffer is that the single put call has been the fastest.
The other way (using many put calls, e.g. an indexed put or a basic put) has limited me to 5,000 or fewer textured quads while maintaining a measly 15 - 20 FPS.

    //------[+] This is the fastest way I have found so far!

    // In my draw method
    myGiantFloatArray[currentSize] = 0.0f;
    myGiantFloatArray[currentSize + 1] = 0.0f;
    myGiantFloatArray[currentSize + 2] = 0.0f;
    myGiantFloatArray[currentSize + 3] = 1.0f;
    /* other vertex data */

    //------------------------------------------------------

    // In my render (flush batch) method
    mySuperFloatBuffer.put(myGiantFloatArray);
    mySuperFloatBuffer.position(0);

    /* Other OpenGL stuff */
    glBufferSubData(GL_ARRAY_BUFFER, 0, currentDataSize * BYTES_PER_FLOAT, mySuperFloatBuffer);

    //=======================================================================

    //---------[+] This is way slower (many basic put calls) :(

    // In my draw method
    mySuperFloatBuffer.put(0.0f);
    mySuperFloatBuffer.put(0.0f);
    mySuperFloatBuffer.put(0.0f);
    mySuperFloatBuffer.put(1.0f);
    /* other vertex data */

    //------------------------------------------------------

    // In my render (flush batch) method
    mySuperFloatBuffer.position(0);

    /* Other OpenGL stuff */
    glBufferSubData(GL_ARRAY_BUFFER, 0, currentDataSize * BYTES_PER_FLOAT, mySuperFloatBuffer);

Also, just in case, this is how it all comes together:

    // In my main render method

    batch.begin(); // Set up the batch basics (shader program, resets, etc.)

    // Call the draw method
    for (int i = 0; i < 20000; i++)
    {
        batch.draw((float) rand.nextInt(480), (float) rand.nextInt(800), texture2);
    }

    batch.end(); // Call the flush method and any other ending items

Could you explain more along the lines of what you are thinking? I'm not sure how to efficiently place data into the VBO. As I said before, unfortunately I cannot map directly into it :(
princec
« Reply #3 - Posted 2014-04-07 10:33:07 »

There were some gotchas about the fastest way to call put() on a FloatBuffer, but I can't remember exactly what they were. Personally I use .put() just fine and I'm managing an order of magnitude more sprites than you are, so maybe the problem lies elsewhere? (Have you profiled it? Try -Xprof on the command line.)

Cas :)

deathpat
« Reply #4 - Posted 2014-04-07 10:58:01 »

I had exactly the same issue as ryeTech when writing my batcher for Daedalus, and came up with the same solution: using an array next to the FloatBuffer.
I came to this after profiling the game (not an isolated test case). I noticed that the put method was very costly (CPU-wise), and that it was much faster to keep a separate array and make only one call to put(float[]). After switching to an array next to the FloatBuffer, I gained in both CPU time and FPS (because I was CPU limited).

If there is any other way to populate the FloatBuffer without degrading performance (for me, memory usage was not an issue), I'm all ears :)

Just a note: I'm talking about cases where you need to refresh most or all of the VBO data (meaning something like 10,000 put calls on the FloatBuffer in my case); otherwise I'm pretty sure that a handful of put() calls should be at least as fast as a put(float[]).
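
A minimal sketch of the "array next to the FloatBuffer" pattern that ryeTech and deathpat describe, assuming a 16-float quad layout. StagedQuadWriter and its members are hypothetical names. One deliberate difference from the snippets earlier in the thread: the bulk copy uses put(float[], offset, length) so only the floats staged for this batch are copied, rather than the whole array.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

/** Hypothetical helper: stages vertices in a float[] and copies them in one call. */
final class StagedQuadWriter {
    static final int FLOATS_PER_QUAD = 16;   // assumed: 4 vertices * (x, y, u, v)
    static final int BYTES_PER_FLOAT = 4;

    private final float[] staging;
    private final FloatBuffer vertices;
    private int count;                        // number of floats currently staged

    StagedQuadWriter(int maxQuads) {
        staging = new float[maxQuads * FLOATS_PER_QUAD];
        vertices = ByteBuffer
                .allocateDirect(staging.length * BYTES_PER_FLOAT)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
    }

    /** Writes one quad into the plain Java array (cheap array stores only). */
    void draw(float x, float y, float w, float h) {
        float[] s = staging;
        int i = count;
        s[i++] = x;     s[i++] = y;     s[i++] = 0f; s[i++] = 0f;
        s[i++] = x + w; s[i++] = y;     s[i++] = 1f; s[i++] = 0f;
        s[i++] = x + w; s[i++] = y + h; s[i++] = 1f; s[i++] = 1f;
        s[i++] = x;     s[i++] = y + h; s[i++] = 0f; s[i++] = 1f;
        count = i;
    }

    /** Copies the staged floats with a single bulk put and resets the batch. */
    FloatBuffer flush() {
        vertices.clear();
        vertices.put(staging, 0, count);      // copy only what was staged
        vertices.flip();
        count = 0;
        return vertices;
    }

    public static void main(String[] args) {
        StagedQuadWriter writer = new StagedQuadWriter(100);
        writer.draw(0f, 0f, 32f, 32f);
        writer.draw(64f, 64f, 32f, 32f);
        System.out.println("floats ready for upload: " + writer.flush().remaining());
    }
}
```

The trade-off is the one princec lists: twice the memory and an extra copy, in exchange for avoiding thousands of individual put() calls on VMs where those calls are expensive.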

work in progress : D A E D A L U S
princec
« Reply #5 - Posted 2014-04-07 11:07:01 »

The Hotspot VM is supposed to intrinsify the put() call such that it should be identical in performance to the float[] access... maybe this isn't happening for some reason?

Cas :)

Nate
« Reply #6 - Posted 2014-04-07 11:10:21 »

SpriteBatch in libgdx caches in a float[] and then flushes to a vertex array (VA). This is the fastest way to do it. See the Sprite shootout thread on JGO. This is for geometry that changes each frame. If yours doesn't, see SpriteCache, which writes to a VBO.

deathpat
« Reply #7 - Posted 2014-04-07 11:17:27 »

Quote from princec:
> The Hotspot VM is supposed to intrinsify the put() call such that it should be identical in performance to the float[] access... maybe this isn't happening for some reason?

Even on a direct FloatBuffer? I have no idea how it works internally, but I imagine that accessing a direct buffer is a bit different from accessing a non-direct one.

work in progress : D A E D A L U S
princec
« Reply #8 - Posted 2014-04-07 13:23:11 »

Theoretically yes. But then Java performance theory has always been a bit of a vague and slippery concept which seems to differ from practice rather a lot.

Cas :)

Orangy Tang
« Reply #9 - Posted 2014-04-07 15:16:08 »

Quote from princec:
> The Hotspot VM is supposed to intrinsify the put() call such that it should be identical in performance to the float[] access... maybe this isn't happening for some reason?

Last time I did this it was indeed roughly identical... on desktop VMs. The Android VM was much, much slower, and combining everything into a big float[] was way faster. That's probably why libgdx does it this way too.

[ TriangularPixels.com - Play Growth Spurt, Rescue Squad and Snowman Village ] [ Rebirth - game resource library ]
loom_weaver
« Reply #10 - Posted 2014-04-07 17:09:52 »

I've been exploring VBOs recently, but on the desktop. java.nio.FloatBuffers for vertices and textures plus ByteBuffers for colors give pretty good performance. glBufferData/glBufferSubData is giving a 1.5 msec render time for 160,000 quads. Note that I'm not put'ing every vertex every frame though.

I have no idea about ES and your platform, but have you confirmed that desktop performance is adequate?
ryeTech
« Reply #11 - Posted 2014-04-08 03:15:36 »

I'm back... with numbers!

So I have been running some numbers again. I decided to 'level' the playing field with my test cases.

Before, I was benchmarking a STATIC prefilled (X Y Z) VBO against a DYNAMIC non-prefilled one (new X Y every frame).
I felt like I was benchmarking how well the VM would generate random numbers for me, so I ran this test instead. Remember, this is using a single put with a giant float array.

My Test Case:
- X number of quads
- 32x32 size
- Textured
- Tinted blue
- Pregenerated random X & Y location
- VBO and IBO (IBO is static and prefilled)

System Info:
- Android 4.1.2 (Jelly Bean)
- CPU: dual-core 972 MHz
- GPU: Qualcomm Adreno 305
- Resolution: 480x800

Number of Quads | VBO Hint | Delta Time (seconds) | FPS
----------------+----------+----------------------+--------------
10000           | Dynamic  | 0.017 - 0.016        | 58.82 - 62.50
20000           | Dynamic  | 0.033 - 0.032        | 30.30 - 31.25
30000           | Dynamic  | 0.050 - 0.049        | 20.00 - 20.40
40000           | Dynamic  | 0.065 - 0.064        | 15.38 - 15.62
50000           | Dynamic  | 0.082 - 0.081        | 12.19 - 12.34
----------------+----------+----------------------+--------------
10000           | Stream   | 0.017 - 0.016        | 58.82 - 62.50
20000           | Stream   | 0.031 - 0.032        | 31.25 - 32.26
30000           | Stream   | 0.048                | 20.83
40000           | Stream   | 0.065 - 0.064        | 15.38 - 15.62
50000           | Stream   | 0.081 - 0.080        | 12.34 - 12.50
----------------+----------+----------------------+--------------
10000           | Static   | 0.019                | 52.63
20000           | Static   | 0.037 - 0.036        | 27.02 - 27.77
30000           | Static   | 0.056 - 0.055        | 17.85 - 18.18
40000           | Static   | 0.073 - 0.072        | 13.69 - 13.88
50000           | Static   | 0.092 - 0.090        | 10.86 - 11.11
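
The FPS figures above are the reciprocal of the delta time, and dividing the quad count by the delta time gives a throughput figure that is easier to compare across rows. A small sketch of that arithmetic follows; FrameStats and its method names are hypothetical. Plugging in the Dynamic rows gives around 600,000 quads per second at both 10,000 and 50,000 quads, which suggests the per-quad cost is roughly constant and frame time grows about linearly with quad count.

```java
/** Hypothetical helper for converting frame times into comparable rates. */
final class FrameStats {
    /** Frames per second for a given frame time in seconds. */
    static double fps(double deltaSeconds) {
        return 1.0 / deltaSeconds;
    }

    /** Quads processed per second at a given frame time. */
    static double quadsPerSecond(int quads, double deltaSeconds) {
        return quads / deltaSeconds;
    }

    public static void main(String[] args) {
        // Rows taken from the Dynamic section of the table (upper delta values).
        int[] quads = {10_000, 20_000, 30_000, 40_000, 50_000};
        double[] delta = {0.017, 0.033, 0.050, 0.065, 0.082};
        for (int i = 0; i < quads.length; i++) {
            System.out.printf("%d quads: %.2f FPS, %.0f quads/s%n",
                    quads[i], fps(delta[i]), quadsPerSecond(quads[i], delta[i]));
        }
    }
}
```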


As for setting each quad to a new location every frame (new random values, 2 randoms per draw call), I'm still bottoming out at 20,000 quads with a 0.043 second delta time, or roughly 23 FPS.


If anyone has a better way to improve my numbers, that would be awesome! I want to make sure I am truly at my limit :P
loom_weaver
« Reply #12 - Posted 2014-04-08 05:16:40 »

What are your metrics like without iterating over every quad and calling Random?

Even on my system, if I iterate over every one of the 160000 quads and change the vertices using Random, it's quite slow... i.e. +15 msec per render frame where without the iteration+Random it's 1.5 msec/frame using VBO+glBufferSubData.  This is not something I would even consider as there shouldn't be any need to visit every single quad every single frame in this manner for a normal game.

Once you take this iteration out of the picture and the calls to Random you should be able to get better metrics of the throughput of memory to the video card.  Have you tried modifying just a small subset of the quads just so that you can verify that your render loop is picking up changes?
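
A sketch of what "modifying just a small subset of the quads" can look like on the buffer side, assuming the same 16-floats-per-quad layout used for illustration elsewhere. QuadRange and its methods are hypothetical names. The GL call itself appears only in a comment, in the shape ryeTech used earlier in the thread, because it needs a live GL context.

```java
import java.nio.FloatBuffer;

/** Hypothetical helper: offsets and views for uploading a sub-range of quads. */
final class QuadRange {
    static final int FLOATS_PER_QUAD = 16;   // assumed: 4 vertices * (x, y, u, v)
    static final int BYTES_PER_FLOAT = 4;

    /** Byte offset of the given quad inside the vertex buffer object. */
    static int byteOffset(int firstQuad) {
        return firstQuad * FLOATS_PER_QUAD * BYTES_PER_FLOAT;
    }

    /** Size in bytes of a run of quads. */
    static int byteLength(int quadCount) {
        return quadCount * FLOATS_PER_QUAD * BYTES_PER_FLOAT;
    }

    /**
     * Returns a view covering only the given quads. The view shares its
     * contents with the source; the source's limit must reach the end of
     * the requested range.
     */
    static FloatBuffer view(FloatBuffer all, int firstQuad, int quadCount) {
        FloatBuffer v = all.duplicate();
        v.limit((firstQuad + quadCount) * FLOATS_PER_QUAD);
        v.position(firstQuad * FLOATS_PER_QUAD);
        return v;
    }

    public static void main(String[] args) {
        FloatBuffer all = FloatBuffer.allocate(100 * FLOATS_PER_QUAD);
        FloatBuffer dirty = view(all, 10, 5);
        // With a GL context, only this range would be uploaded, roughly as:
        // glBufferSubData(GL_ARRAY_BUFFER, byteOffset(10), byteLength(5), dirty);
        System.out.println(byteOffset(10) + " " + byteLength(5) + " " + dirty.remaining());
    }
}
```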
princec
« Reply #13 - Posted 2014-04-08 11:27:45 »

Sorry, I hadn't realised you were on Android and doing this stuff, in which case do ignore any advice I've given you and listen to Nate instead. On Android:
1. VBOs are not actually "accelerated" in any way at all; they're just system RAM and slow as hell, and
2. Buffers aren't intrinsified at all, and
3. Dalvik doesn't have any of that clever stuff with inlining and so on.

Cas :)

ryeTech
« Reply #14 - Posted 2014-04-09 02:40:25 »

Quote from loom_weaver:
> What are your metrics like without iterating over every quad and calling Random?
>
> Even on my system, if I iterate over every one of the 160000 quads and change the vertices using Random, it's quite slow... i.e. +15 msec per render frame where without the iteration+Random it's 1.5 msec/frame using VBO+glBufferSubData.  This is not something I would even consider as there shouldn't be any need to visit every single quad every single frame in this manner for a normal game.

When I rig it to use 20,000 quads and only update the first 10,000 with a new position every frame, I get a delta time of 0.036 - 0.026 (27.77 to 38.46 FPS).

So that further confirms that generating new random values each frame is the real killer, just like my metrics pointed out.
I just wish it could be faster, you know :)

Quote from princec:
> Sorry, I hadn't realised you were on Android and doing this stuff, in which case do ignore any advice I've given you and listen to Nate instead. On Android:
> 1. VBOs are not actually "accelerated" in any way at all, they're just system RAM and slow as hell and
> 2. Buffers aren't intrinsified at all and
> 3. Dalvik doesn't have any of that clever stuff with inlining and so on

Are you saying that VBOs are really doing nothing for me? Should I just use a plain old vertex array then?
Can I use an IBO then too? I can't remember if the IBO requires a VBO (I don't think it does).

Also, can you explain 2 and 3? I'm not really sure what 'intrinsified' buffers means, nor do I know what you mean by inlining.
Riven
« Reply #15 - Posted 2014-04-09 09:04:19 »

Are you sure you are measuring VBO upload performance and that you are not capped by your GPU fillrate? Make sure every sprite is 1px or less (like having all geometry outside the viewport).

Additionally, you can create a PRNG with reasonable quality output that's way faster than Random.next() / Math.random()
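
Riven does not say which generator he has in mind. As one well-known lightweight option, here is a sketch of Marsaglia's xorshift32; the class name XorShift32 and its methods are hypothetical. It is not thread-safe and not suitable for anything security related, and whether it is actually faster than java.util.Random on a given VM is something to measure.

```java
/** Xorshift32 generator (Marsaglia); a sketch, not a drop-in for java.util.Random. */
final class XorShift32 {
    private int state;

    XorShift32(int seed) {
        // The all-zero state maps to itself under xorshift, so avoid it.
        state = (seed != 0) ? seed : 0x9E3779B9;
    }

    /** Next raw 32-bit value; never returns 0. */
    int next() {
        int x = state;
        x ^= x << 13;
        x ^= x >>> 17;
        x ^= x << 5;
        state = x;
        return x;
    }

    /** Int in [0, bound), derived from the top 24 bits; bound must be positive. */
    int nextInt(int bound) {
        long top24 = next() >>> 8;                 // 0 .. 2^24 - 1
        return (int) ((top24 * bound) >>> 24);
    }

    /** Float in [0, 1) with 24 bits of precision. */
    float nextFloat() {
        return (next() >>> 8) * (1.0f / (1 << 24));
    }

    public static void main(String[] args) {
        XorShift32 rng = new XorShift32(12345);
        // Mirrors the random sprite positions used in the benchmark loop.
        float x = rng.nextInt(480);
        float y = rng.nextInt(800);
        System.out.println(x + ", " + y);
    }
}
```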

Hi, appreciate more people! Σ ♥ = ¾
princec
« Reply #16 - Posted 2014-04-09 12:51:04 »

Quote from ryeTech:
> Are you saying that VBOs are really doing nothing for me? Should I just use a plain old vertex array then?
> Can I use an IBO then too? I can't remember if the IBO requires a VBO (I don't think it does).
>
> Also, can you explain 2 and 3? I'm not really sure what 'intrinsified' buffers means, nor do I know what you mean by inlining.

You'll be needing to look up terms like this in the near future if you're going to embark upon a career as a programmer, but I'll start the ball rolling:

"Intrinsification", which may not actually be a real word outside of compiler design circles, is where the compiler detects some fairly complex high-level code eg. Math.sqrt(), FloatBuffer.put(), and replaces the function call with a single machine code instruction (or maybe several), thus making that code as fast as it is possible to be. The desktop JVMs do a lot of this, but the Dalvik VM isn't quite so clever and doesn't manage to do so much of it.

"Inlining" is where a small method, eg. public int getX() { return x; } is simply copied verbatim into the callsite - rather like cut n paste on the fly. Instead of pushing a bunch of arguments on the stack, jumping to a subroutine in a totally different area of memory, executing the code, and then popping the return value off the stack, the code is just executed in place, saving all those shenanigans from happening and providing a huge speedup. Inlining can be recursive, that is, inlined functions may themselves have inlined functions. It's tuneable with some commandline args on the desktop VMs. Again, though, the Dalvik VM doesn't appear to do much in the way of inlining, though the latest versions of Android might have improved it a bit.

Intrinsification and inlining are two of the reasons why Java has made such leaps and bounds in speed versus C++ over the last decade. There are a bunch more things that also help a lot such as bounds check elimination, escape analysis, monomorphic callsite detection, loop unrolling, lock elision, and huge advances in garbage collection and allocation strategy... again, none of which made it in to Dalvik. (You can search these very boards for discussions about all those things, and Google will provide further information).

And finally...

Yes, VBOs gain you absolutely diddly squat on any current ARM devices. The same goes for iOS as for Android. There is no discrete GPU memory, no separate DMA bus, and usually what bus there is, is a crappy 16-bit or maybe 32-bit wide one anyway. The only reasons for VBOs on ARM chipsets are that a) one day they might have these things, though this is a tenuous reason at best, and b) it makes it rather easier to port the same code between desktop and ARM devices. See libgdx. Yes, you can use an index array with plain vertex arrays.

Cas :)
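
A sketch of the pre-filled quad index data mentioned at the start of the thread, usable with either an IBO or a plain client-side index array. QuadIndices, the 0-1-2 / 2-3-0 winding and the vertex order are assumptions for illustration. One point worth hedging: core OpenGL ES 2.0 only guarantees unsigned byte and unsigned short indices, so with short indices a single draw call can address at most 65,536 vertices, which is 16,384 quads; larger batches would need to be split unless a 32-bit index extension is available.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.ShortBuffer;

/** Hypothetical helper: builds the static index data for a run of quads. */
final class QuadIndices {
    static final int MAX_QUADS = 65536 / 4;   // limit imposed by 16-bit indices
    static final int INDICES_PER_QUAD = 6;    // two triangles
    static final int BYTES_PER_SHORT = 2;

    /** Returns a direct buffer holding 6 indices per quad, ready to upload. */
    static ShortBuffer build(int quads) {
        if (quads < 0 || quads > MAX_QUADS) {
            throw new IllegalArgumentException("quads must be in 0.." + MAX_QUADS);
        }
        ShortBuffer indices = ByteBuffer
                .allocateDirect(quads * INDICES_PER_QUAD * BYTES_PER_SHORT)
                .order(ByteOrder.nativeOrder())
                .asShortBuffer();
        for (int q = 0; q < quads; q++) {
            int v = q * 4;                    // first vertex of this quad
            // The casts wrap above 32767; GL reads the same bits as unsigned.
            indices.put((short) v).put((short) (v + 1)).put((short) (v + 2));
            indices.put((short) (v + 2)).put((short) (v + 3)).put((short) v);
        }
        indices.flip();
        return indices;
    }

    public static void main(String[] args) {
        ShortBuffer indices = build(2);
        StringBuilder sb = new StringBuilder();
        while (indices.hasRemaining()) {
            sb.append(indices.get() & 0xFFFF).append(' ');
        }
        System.out.println(sb.toString().trim());
    }
}
```

Because the pattern never changes, this buffer only has to be built once, which matches the static, pre-filled IBO ryeTech describes.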

ryeTech
« Reply #17 - Posted 2014-04-10 02:27:59 »

Quote from Riven:
> Are you sure you are measuring VBO upload performance and that you are not capped by your GPU fillrate? Make sure every sprite is 1px or less (like having all geometry outside the viewport).
>
> Additionally, you can create a PRNG with reasonable quality output that's way faster than Random.next() / Math.random()

I went and tried this (using the same test case I had, except the quads were 1x1). I'm getting a DT of 0.082 - 0.084 at 85,000 quads. This is about the same as the metric for my 32x32 quad test, where I'm only pulling 50,000 quads.

So it looks like I'm GPU fillrate bound, right, since my quad count increased when they were only a pixel big?

Quote from princec:
> yes, VBOs gain you absolutely diddly squat on any current ARM devices. Same goes for iOS as Android. There is no discrete GPU memory, no separate DMA bus, and usually, what bus there is, is a crappy 16 bit or maybe 32 bit wide one anyway. The only reason for VBOs on ARM chipsets is that a) one day they might have these things though this is a tenuous reason at best and b) it makes it rather easier to port the same code between desktop and ARM devices. See libgdx. Yes, you can use an index array with plain vertex arrays.

That makes me sad... :(

Riven
« Reply #18 - Posted 2014-04-10 08:24:13 »

Yes, fillrate capped.
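
A back-of-the-envelope sketch of why the 32x32 test points at fillrate, using only figures from the thread. FillrateEstimate and its methods are hypothetical names, and the estimate assumes every quad is fully on screen and every covered pixel is actually shaded, so it is an upper bound rather than a measurement.

```java
/** Hypothetical helper: rough pixel-coverage arithmetic for the quad tests. */
final class FillrateEstimate {
    /** Pixels covered per frame if every quad is fully visible. */
    static long pixelsPerFrame(int quads, int quadWidth, int quadHeight) {
        return (long) quads * quadWidth * quadHeight;
    }

    /** How many times, on average, each screen pixel is drawn per frame. */
    static double overdraw(long pixelsPerFrame, int screenWidth, int screenHeight) {
        return (double) pixelsPerFrame / ((long) screenWidth * screenHeight);
    }

    public static void main(String[] args) {
        long big = pixelsPerFrame(50_000, 32, 32);   // the 32x32 test
        long tiny = pixelsPerFrame(85_000, 1, 1);    // the 1x1 test
        System.out.printf("32x32: %d pixels/frame, overdraw %.1f%n",
                big, overdraw(big, 480, 800));
        System.out.printf("1x1:   %d pixels/frame, overdraw %.2f%n",
                tiny, overdraw(tiny, 480, 800));
    }
}
```

Under those assumptions the 32x32 test touches about 51 million pixels per frame on a 384,000-pixel screen, while the 1x1 test touches only 85,000, yet both take roughly 0.082 seconds. The 1x1 run handles 70% more quads in the same time, which is consistent with pixel work being a limiting factor in the 32x32 case.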

Hi, appreciate more people! Σ ♥ = ¾
ryeTech
« Reply #19 - Posted 2014-04-10 17:30:08 »

Quote from Riven:
> Yes, fillrate capped.

Riven my friend, that makes me a little bit sad.

I'll need to go back over my stuff and see where I can gain more numbers, although I think I'm at my limit.
I see that I get a better DT of 0.077 secs with 50,000 quads (same test case) when I don't use VBOs.

Which makes sense, since I no longer need the glBufferSubData call, and princec says VBOs aren't doing anything for me anyway.