lsgames
Senior Newbie 
|
 |
«
Posted
2010-06-02 14:38:10 » |
|
I am working on a particle system using OpenGL on Android 2.1. To pass data to OpenGL I use a FloatBuffer, allocated like this:
buffer = ByteBuffer.allocateDirect(FLOAT_SIZE * size * 2).order(ByteOrder.nativeOrder()).asFloatBuffer();
and written to like this:
buffer.put(index, f);
I have noticed that buffer.put() takes at least 10 times as long as assigning to an ordinary float array. This has become a real bottleneck and is the limiting factor on how many particles I can have.
Has anyone else noticed this problem, or does anyone have suggestions for getting around it?
Thanks,
Martin
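For anyone who wants to reproduce the setup, here is a minimal sketch of it. It assumes two floats (x, y) per particle, which is only a guess at what the "* 2" in the allocation stands for; the class and method names are made up for illustration and are not from the original code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

final class ParticleBufferSketch {
    static final int FLOAT_SIZE = 4; // bytes per float

    // Allocates a direct, native-order FloatBuffer with two floats per particle
    // (assumption: x and y for each particle).
    static FloatBuffer allocate(int particleCount) {
        return ByteBuffer.allocateDirect(FLOAT_SIZE * particleCount * 2)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
    }

    // Writes positions one float at a time with the absolute put(index, value),
    // which is the access pattern described in the post.
    static void writePositions(FloatBuffer buffer, float[] xs, float[] ys) {
        for (int i = 0; i < xs.length; i++) {
            buffer.put(2 * i, xs[i]);
            buffer.put(2 * i + 1, ys[i]);
        }
    }

    public static void main(String[] args) {
        FloatBuffer buffer = allocate(2);
        writePositions(buffer, new float[] {1f, 3f}, new float[] {2f, 4f});
        System.out.println(buffer.get(3)); // prints 4.0
    }
}
```

Every particle costs two put() calls per frame in this layout, so the per-call overhead is multiplied by twice the particle count.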
|
|
|
|
Riven
|
 |
«
Reply #1 - Posted
2010-06-02 14:48:53 » |
|
Write everything to a float[] and use FloatBuffer.put(float[]) ?
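A rough sketch of what this suggestion could look like, assuming a staging array that is refilled each frame. The names (BulkPutSketch, upload, staging) are hypothetical, and whether the bulk put is actually faster depends on the platform's implementation.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

final class BulkPutSketch {
    // Copies the first `count` floats of the staging array into the buffer with one bulk put.
    static void upload(FloatBuffer target, float[] staging, int count) {
        target.position(0);
        target.put(staging, 0, count);
        target.position(0); // rewind so a consumer reads from the start
    }

    public static void main(String[] args) {
        float[] staging = new float[4];
        FloatBuffer target = ByteBuffer.allocateDirect(4 * staging.length)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
        // Fill the plain array first (cheap), then hand it over in one call.
        for (int i = 0; i < staging.length; i++) {
            staging[i] = i * 0.5f;
        }
        upload(target, staging, staging.length);
        System.out.println(target.get(3)); // prints 1.5
    }
}
```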
|
Hi, appreciate more people! Σ ♥ = ¾ Learn how to award medals... and work your way up the social rankings!
|
|
|
lsgames
Senior Newbie 
|
 |
«
Reply #2 - Posted
2010-06-02 16:00:46 » |
|
Yes, I tried that. It had no effect. I found the source code and noticed that the method just iterates over the array and calls put(index, float) on each element.
|
|
|
|
|
|
lsgames
Senior Newbie 
|
 |
«
Reply #3 - Posted
2010-06-02 16:04:28 » |
|
I was wondering if there could be an alternative way to construct the FloatBuffer for OpenGL. I haven't been able to think of one, though. From the source code it appears that what takes 10 times as long is the various checks and function calls. It's not much, but when applied 1000 times per frame in a particle system it really adds up.
|
|
|
|
Riven
|
 |
«
Reply #4 - Posted
2010-06-02 16:07:56 » |
|
Create one FloatBuffer and slice() it in 1000 buffers.
But ehm... why would you want 1000 buffers per frame? Can't you store all particles in the same buffer?
|
|
|
|
lsgames
Senior Newbie 
|
 |
«
Reply #5 - Posted
2010-06-02 17:34:11 » |
|
No, I meant the put is done many times per frame. There is only one FloatBuffer. But as the particles move each frame, I have to update all positions in the FloatBuffer.
|
|
|
|
Riven
|
 |
«
Reply #6 - Posted
2010-06-02 17:39:53 » |
|
Quote from lsgames: "No, I meant the put is done many times per frame. There is only one FloatBuffer. But as the particles move each frame, I have to update all positions in the FloatBuffer."
Well, you said 'construct'. According to: http://apistudios.com/hosted/marzec/badlogic/wordpress/?p=478
Heap FloatBuffers have better performance. Upon copying to a VBO, the 'driver' seems to make its own (fast) copy.
|
|
|
|
EgonOlsen
|
 |
«
Reply #7 - Posted
2010-06-02 21:04:39 » |
|
Quote from Riven: "Write everything to a float[] and use FloatBuffer.put(float[]) ?"
This actually helps tremendously on Android. I've no idea why it doesn't in your case...
|
|
|
|
lsgames
Senior Newbie 
|
 |
«
Reply #8 - Posted
2010-06-03 08:01:07 » |
|
Hmmm, which implementation of the put method you get is probably hardware-specific. I am running on a Nexus One, and after putting in breakpoints I could see that the implementation of put(float[]) I got was one that just traversed the array and called put(float) on each element.
|
|
|
|
Riven
|
 |
«
Reply #9 - Posted
2010-06-03 10:58:45 » |
|
Quote from lsgames: "Hmmm it is probably hardware specific which implementation of the put method you get. I am running on a Nexus One. And after putting in breakpoints I could see that the implementation of put(float[]) I got was one that just traversed the array and called put(float) on each element."
This wouldn't be the first time a profiler alters the optimisation of an application. It'd be safe to assume you ran it without a profiler too?
|
|
|
|
|
|
lsgames
Senior Newbie 
|
 |
«
Reply #10 - Posted
2010-06-03 12:24:04 » |
|
Yes, I wrote a small test app that times the different methods (one at a time, float[], and FloatBuffer.wrap(float[])). One at a time is fastest, then float[]; the slowest is passing a wrapped FloatBuffer.
|
|
|
|
EgonOlsen
|
 |
«
Reply #11 - Posted
2010-06-03 12:41:49 » |
|
Why not upload this test somewhere so that we can benchmark on various platforms!? My experience is that putting one float[] is between 500 and 600% faster than single puts...at least on 1.5. Maybe they have worked on single puts in 2.1 or something.
BTW: Are you sure that you are measuring direct buffer performance? You can't create a direct buffer by using wrap(...), can you?
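A quick way to check this point with the standard java.nio API: wrap(...) gives a heap buffer backed by the array, while a view of an allocateDirect(...) ByteBuffer reports itself as direct. The class name below is just for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

final class DirectCheckSketch {
    // Heap buffer: shares storage with the given array.
    static FloatBuffer wrapped(float[] data) {
        return FloatBuffer.wrap(data);
    }

    // Direct buffer: float view over natively allocated memory.
    static FloatBuffer direct(int floatCount) {
        return ByteBuffer.allocateDirect(4 * floatCount)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
    }

    public static void main(String[] args) {
        System.out.println(wrapped(new float[8]).isDirect()); // prints false
        System.out.println(direct(8).isDirect());             // prints true
    }
}
```

So a benchmark that copies from a wrapped buffer into a direct one measures a heap-to-direct copy, not two direct buffers.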
|
|
|
|
lsgames
Senior Newbie 
|
 |
«
Reply #12 - Posted
2010-06-03 13:02:38 » |
|
Good idea Egon. You can download the source here: http://games.martineriksen.net/PerformanceTest.zip
In the "bin" folder there is the APK, which you can install using adb install. Then you can run the test by starting the PerformanceTest app on your phone. The test outputs its data to LogCat. These are the interesting lines (seen on a Nexus One running 2.1):
06-03 14:56:01.445: INFO/System.out(22956): time: 247.8s >> vertex buffer single puts
06-03 14:56:01.445: INFO/System.out(22956): time: 254.2s >> vertex buffer single puts with specified positions
06-03 14:56:01.445: INFO/System.out(22956): time: 264.3s >> vertex buffer full array puts
06-03 14:56:01.445: INFO/System.out(22956): time: 0.3s >> vertex buffer wrapping
06-03 14:56:01.445: INFO/System.out(22956): time: 285.3s >> wrapped array to vertex buffer
Here is the interesting code that runs this part of the test:
FloatBuffer nativeDirectFloatBuffer = OpenGlMemoryUtil.makeFloatBuffer(FLOAT_BUFFER_SIZE);
float[] floatArray = new float[FLOAT_BUFFER_SIZE];
for (int i = 0; i < FLOAT_BUFFER_SIZE; i++) {
    floatArray[i] = 0.5f;
}

// Test 1: relative single puts
time = print("Going VertexBuffers");
for (int i = 0; i < TESTSIZE_MILLIONTH; i++) {
    nativeDirectFloatBuffer.position(0);
    for (int k = 0; k < FLOAT_BUFFER_SIZE; k++) {
        nativeDirectFloatBuffer.put(0.5f);
    }
}
time = PerfLogUtil.logTime(time, "vertex buffer single puts", logindex++, TESTSIZE_THOUSANDS);

// Test 2: absolute single puts
time = print("Going VertexBuffers");
for (int i = 0; i < TESTSIZE_MILLIONTH; i++) {
    for (int k = 0; k < FLOAT_BUFFER_SIZE; k++) {
        nativeDirectFloatBuffer.put(k, 0.5f);
    }
}
time = PerfLogUtil.logTime(time, "vertex buffer single puts with specified positions", logindex++, TESTSIZE_THOUSANDS);

// Test 3: fill a float[] and bulk put it
time = print("Going VertexBuffers");
for (int i = 0; i < TESTSIZE_MILLIONTH; i++) {
    for (int k = 0; k < FLOAT_BUFFER_SIZE; k++) {
        floatArray[k] = .5f;
    }
    nativeDirectFloatBuffer.position(0);
    nativeDirectFloatBuffer.put(floatArray);
}
time = PerfLogUtil.logTime(time, "vertex buffer full array puts", logindex++, TESTSIZE_THOUSANDS);

// Test 4: cost of wrapping the array
FloatBuffer floatBufferWrappedArray = FloatBuffer.wrap(floatArray);
time = PerfLogUtil.checkPoint();
for (int i = 0; i < TESTSIZE_MILLIONTH; i++) {
    floatBufferWrappedArray = FloatBuffer.wrap(floatArray);
}
time = PerfLogUtil.logTime(time, "vertex buffer wrapping", logindex++, TESTSIZE_THOUSANDS);

// Test 5: fill the array and put the wrapped buffer
for (int i = 0; i < TESTSIZE_MILLIONTH; i++) {
    for (int k = 0; k < FLOAT_BUFFER_SIZE; k++) {
        floatArray[k] = .5f;
    }
    nativeDirectFloatBuffer.position(0);
    floatBufferWrappedArray.position(0);
    nativeDirectFloatBuffer.put(floatBufferWrappedArray);
}
time = PerfLogUtil.logTime(time, "wrapped array to vertex buffer", logindex++, TESTSIZE_THOUSANDS);
|
|
|
|
Riven
|
 |
«
Reply #13 - Posted
2010-06-03 13:16:51 » |
|
Please put your code between [ code ] and [/ code ], otherwise [ i ] will be converted into italic styled text.
|
|
|
|
lsgames
Senior Newbie 
|
 |
«
Reply #14 - Posted
2010-06-03 14:02:35 » |
|
Done :-) Thanks for the info Riven.
|
|
|
|
EgonOlsen
|
 |
«
Reply #15 - Posted
2010-06-03 15:08:52 » |
|
Tried it on my Samsung Galaxy with Android 1.5. The results are similar (but slower, of course):
06-03 16:49:07.297: INFO/System.out(2166): time: 1137.7s >> vertex buffer single puts
06-03 16:49:07.297: INFO/System.out(2166): time: 1079.4s >> vertex buffer single puts with specified positions
06-03 16:49:07.297: INFO/System.out(2166): time: 1175.3s >> vertex buffer full array puts
06-03 16:49:07.297: INFO/System.out(2166): time: 1.4s >> vertex buffer wrapping
06-03 16:49:07.297: INFO/System.out(2166): time: 1220.6s >> wrapped array to vertex buffer
However, this changes once you use a variable instead of 0.5f, i.e. do something like this:
for (int i = 0; i < TESTSIZE_MILLIONTH; i++) {
    nativeDirectFloatBuffer.position(0);
    float val = 0;
    for (int k = 0; k < FLOAT_BUFFER_SIZE; k++) {
        nativeDirectFloatBuffer.put(val);
        val += 0.1f;
    }
}
This results in:
06-03 17:03:43.307: INFO/System.out(2782): time: 1387.5s >> vertex buffer single puts
06-03 17:03:43.307: INFO/System.out(2782): time: 1308.4s >> vertex buffer single puts with specified positions
06-03 17:03:43.307: INFO/System.out(2782): time: 1187.9s >> vertex buffer full array puts
06-03 17:03:43.317: INFO/System.out(2782): time: 1.4s >> vertex buffer wrapping
06-03 17:03:43.317: INFO/System.out(2782): time: 1192.8s >> wrapped array to vertex buffer
I'm still not sure why going with float[]s helped that much more in my code (which is a bit more complex than this simple benchmark, of course)... Dalvik is strange...slow and strange...
|
|
|
|
EgonOlsen
|
 |
«
Reply #16 - Posted
2010-06-03 19:17:53 » |
|
I've reverted my own stuff to use single puts to see what happens then...I have a loop with 6 "puts" in each iteration, filling two different buffers. With float[] instead of single puts, this is 3 times faster on my device.
|
|
|
|
lsgames
Senior Newbie 
|
 |
«
Reply #17 - Posted
2010-06-04 11:56:54 » |
|
OK - is your code in a format that you can send so I can try and replicate your test run?
Also, after instrumenting the code and analysing further, I found that Traceview had exaggerated the cost of the buffer puts relative to the whole programme's execution. Traceview reported that the buffer puts took 17% of all time spent, whereas they actually take only 3%. So, as Riven suggested, the profiler was not quite truthful. It is still the case that buffer puts take 10 times longer than array assignments, but the overall impact is lower than I thought.
Still it would be great to find a way to write the buffers faster.
/Martin
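One way to do this kind of manual instrumentation, as a sketch only: accumulate System.nanoTime() deltas around the section of interest and compare them with the total run time. SectionTimer and its methods are hypothetical names, not the code actually used for the 3% figure above.

```java
final class SectionTimer {
    private long sectionNanos; // accumulated time spent inside begin()/end() pairs
    private long start;

    void begin() {
        start = System.nanoTime();
    }

    void end() {
        sectionNanos += System.nanoTime() - start;
    }

    long sectionNanos() {
        return sectionNanos;
    }

    // Fraction of the total time that was spent in the measured section.
    static double share(long sectionNanos, long totalNanos) {
        return totalNanos == 0 ? 0.0 : (double) sectionNanos / totalNanos;
    }

    public static void main(String[] args) {
        SectionTimer puts = new SectionTimer();
        float[] data = new float[1024];
        long runStart = System.nanoTime();
        for (int frame = 0; frame < 100; frame++) {
            puts.begin();
            for (int i = 0; i < data.length; i++) {
                data[i] = frame; // stand-in for the buffer writes being measured
            }
            puts.end();
            // ... the rest of the frame's work would go here ...
        }
        long total = System.nanoTime() - runStart;
        System.out.printf("section share: %.1f%%%n",
                100 * share(puts.sectionNanos(), total));
    }
}
```

The timer calls add overhead of their own, so wrapping a whole loop rather than each individual put keeps the distortion small.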
|
|
|
|
EgonOlsen
|
 |
«
Reply #18 - Posted
2010-06-04 12:39:03 » |
|
Sure. The code looks like this:
int ix = 0;
for (c = 0; c < endII; c++) {
    vcoords[ix] = x[c];
    ncoords[ix++] = nx[c];
    vcoords[ix] = y[c];
    ncoords[ix++] = ny[c];
    vcoords[ix] = z[c];
    ncoords[ix++] = nz[c];
}
...
vertices.put(vcoords);
normals.put(ncoords);
The single-put code looks the same, except that in the loop I'm doing 3 puts into vertices and 3 into normals instead of filling the arrays.
|
|
|
|
ryanm
|
 |
«
Reply #19 - Posted
2010-09-06 17:34:22 » |
|
I've just run into the bulk-put problem, and have noticed that IntBuffers do not suffer the same fate: bulk put( int[] ) calls are very quick. I'm seeing a 10x speedup using this for 10000-element arrays, and about 2x for 10 elements.
private static int[] intArray = new int[ 0 ];

public static void put( IntBuffer buff, float[] data )
{
    // grow the scratch array if needed
    if( intArray.length < data.length )
    {
        intArray = new int[ data.length ];
    }

    // convert each float to its bit pattern
    for( int i = 0; i < data.length; i++ )
    {
        intArray[ i ] = Float.floatToIntBits( data[ i ] );
    }

    buff.put( intArray, 0, data.length );
}
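To show how the written ints can still be read as floats, here is a sketch under the assumption that the IntBuffer and a FloatBuffer are both views of the same native-order direct ByteBuffer. The class IntViewSketch is hypothetical, and it uses Float.floatToRawIntBits rather than floatToIntBits so that NaN bit patterns are copied unchanged.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.nio.IntBuffer;

final class IntViewSketch {
    final IntBuffer ints;     // view used for the bulk writes
    final FloatBuffer floats; // view over the same bytes, e.g. for handing to GL
    private int[] scratch = new int[0];

    IntViewSketch(int floatCount) {
        ByteBuffer bytes = ByteBuffer.allocateDirect(4 * floatCount)
                .order(ByteOrder.nativeOrder());
        ints = bytes.asIntBuffer();
        floats = bytes.asFloatBuffer();
    }

    // Converts the floats to their bit patterns and bulk-puts them through the int view.
    void put(float[] data) {
        if (scratch.length < data.length) {
            scratch = new int[data.length];
        }
        for (int i = 0; i < data.length; i++) {
            scratch[i] = Float.floatToRawIntBits(data[i]);
        }
        ints.position(0);
        ints.put(scratch, 0, data.length);
    }

    public static void main(String[] args) {
        IntViewSketch buffer = new IntViewSketch(3);
        buffer.put(new float[] {1.5f, -2.25f, 0f});
        System.out.println(buffer.floats.get(1)); // prints -2.25
    }
}
```

Both views share the underlying bytes and byte order, so nothing is copied a second time; the float view just reinterprets what the int view wrote.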
|
|
|
|
Riven
|
 |
«
Reply #20 - Posted
2010-09-06 20:51:31 » |
|
Wow. Nice find.
Maybe time to write a bug-report?
|
|
|
|
ryanm
|
 |
«
Reply #21 - Posted
2010-09-07 10:58:37 » |
|
I reckon so.  Note the logarithmic scale. IntBuffer bulk puts are essentially free, and I can't see any reason why FloatBuffers can't do the same. Benchmark code here. Can anyone spot any problems with this?
|
|
|
|
Riven
|
 |
«
Reply #22 - Posted
2010-09-07 13:04:51 » |
|
Quote from ryanm: "Benchmark code here. Can anyone spot any problems with this?"
Looks good enough. Seems like floats are 266x slower than ints. It probably goes through the FPU instead of a memcpy.
|
|
|
|
ryanm
|
 |
«
Reply #23 - Posted
2010-09-07 17:47:50 » |
|
Add your stars to Issue 11078.
Edit: actually, don't bother; it has apparently been fixed in Gingerbread already.
Also, there's a more-or-less drop-in replacement for FloatBuffer over here. It automatically converts float arrays that you give it, and also lets you pass in pre-converted int arrays.
|
|
|
|
EgonOlsen
|
 |
«
Reply #24 - Posted
2010-09-07 21:02:59 » |
|
Quote from ryanm: "edit: actually don't bother, it's been fixed in Gingerbread already apparently"
Too bad that 3.0 has some hefty hardware requirements and most likely won't make it to a lot of current phones...
|
|
|
|
badlogicgames
|
 |
«
Reply #25 - Posted
2010-09-13 03:11:46 » |
|
I'm late to the party, but I wanted to follow up on this. First off: thanks, Ryan, for posting that bug report. I went with bulk puts and never thought about testing single puts (old nio habit...). I wrote a quick JNI method which is even faster than your int[] array trick. You can find more info at http://apistudios.com/hosted/marzec/badlogic/wordpress/?p=904.
|
|
|
|
|