Java-Gaming.org    
OpenGL lightning fast (managed) VBO mapping
Offline Riven
« League of Dukes »

JGO Overlord


Medals: 743
Projects: 4
Exp: 16 years


Hand over your head.


« Posted 2012-12-21 18:21:45 »

For those not interested in an elaborate background story, I'll sum up the functionality of the code below:
  • It uses unsynchronized mapped VBOs
  • To make this guaranteed to be safe, it has 6 VBOs (worst case - see below)
  • For every frame, it picks the next VBO (round robin)
  • When it reuses a VBO, it is so 'old' (6 frames old) that it is guaranteed to be no longer in use by the GPU

Here is a straightforward code dump:
import java.nio.ByteBuffer;

import org.lwjgl.opengl.ARBMapBufferRange;
import org.lwjgl.opengl.GL15;
import org.lwjgl.opengl.GL30;
import org.lwjgl.opengl.GLContext;

import static org.lwjgl.opengl.GL15.*;

public class Unsync {
   // triple buffering in stereo mode is rather rare though..
   private static final int MAX_FRAMEBUFFER_COUNT = 2 * 3;

   private final int glTarget, glUsage;
   private final int[] bufferHandles, requestedSizes, allocatedSizes;
   private int currentBufferIndex;

   public Unsync(int glTarget, int glUsage) {
      this.glTarget = glTarget; // GL_ARRAY_BUFFER, GL_ELEMENT_ARRAY_BUFFER
      this.glUsage = glUsage; // GL_STATIC_DRAW, GL_STREAM_DRAW

      requestedSizes = new int[MAX_FRAMEBUFFER_COUNT];
      allocatedSizes = new int[MAX_FRAMEBUFFER_COUNT];

      bufferHandles = new int[MAX_FRAMEBUFFER_COUNT];
      for (int i = 0; i < this.bufferHandles.length; i++) {
         bufferHandles[i] = glGenBuffers();
      }

      currentBufferIndex = -1;
   }

   public void nextFrame() {
      currentBufferIndex = (currentBufferIndex + 1) % MAX_FRAMEBUFFER_COUNT;
   }

   public void bind() {
      glBindBuffer(glTarget, currentBufferHandle());
   }

   public int currentBufferHandle() {
      return bufferHandles[currentBufferIndex];
   }

   public void ensureSize(int size) {
      assert size > 0;

      requestedSizes[currentBufferIndex] = size;
      if (size > allocatedSizes[currentBufferIndex]) {
         glBufferData(glTarget, size, glUsage);
         allocatedSizes[currentBufferIndex] = size;
      }
   }

   public void trimToSize() {
      if (requestedSizes[currentBufferIndex] != allocatedSizes[currentBufferIndex]) {
         glBufferData(glTarget, requestedSizes[currentBufferIndex], glUsage);
         allocatedSizes[currentBufferIndex] = requestedSizes[currentBufferIndex];
      }
   }

   public ByteBuffer map() {
      long offset = 0;
      long length = requestedSizes[currentBufferIndex];

      if (GLContext.getCapabilities().OpenGL30) {
         int flags = GL30.GL_MAP_WRITE_BIT | GL30.GL_MAP_UNSYNCHRONIZED_BIT;
         return GL30.glMapBufferRange(glTarget, offset, length, flags, null);
      }

      if (GLContext.getCapabilities().GL_ARB_map_buffer_range) {
         int flags = ARBMapBufferRange.GL_MAP_WRITE_BIT | ARBMapBufferRange.GL_MAP_UNSYNCHRONIZED_BIT;
         return ARBMapBufferRange.glMapBufferRange(glTarget, offset, length, flags, null);
      }

      return GL15.glMapBuffer(glTarget, GL15.GL_WRITE_ONLY, null);
   }

   public void unmap() {
      glUnmapBuffer(glTarget);
   }

   public void deleteAll() {
      for (int i = 0; i < this.bufferHandles.length; i++) {
         glDeleteBuffers(this.bufferHandles[i]);
         this.bufferHandles[i] = -1;
      }
   }
}

Unsync vbo = new Unsync(...);
while (true) { // render loop
   vbo.nextFrame(); // mandatory

   vbo.bind();
   vbo.ensureSize(bytesInVBO);
   ByteBuffer mapped = vbo.map();
   // fill it
   vbo.unmap();

   // render things

   // swap buffers
}






OpenGL VBO performance is incredibly hard to optimize. Once you have managed to fill the VBO data as fast as possible, you're pretty much relying on the performance of glMapBuffer(..) and glUnmapBuffer() to pump the data over to the graphics card.

It turns out that these calls have significant overhead, because the driver has to verify that the memory block it is about to return is not currently in use by the GPU. Especially when doing many of these calls per frame for small batches of geometry, you'll see the framerate drop below 60Hz quickly: on my particular (low-end) graphics card, I render 64 batches of 4 tiny triangles per frame and end up with an abysmal 45fps.

      // what not to do...
      while (!Display.isCloseRequested()) {
         glClearColor(0, 0, 0, 1);
         glClear(GL_COLOR_BUFFER_BIT);

         for (int x = 0; x < drawCalls; x++) {
           
            glVertexPointer(2, GL_FLOAT, stride, 0 << 2);
            glColorPointer(4, GL_UNSIGNED_BYTE, stride, 2 << 2);

            FloatBuffer fb = glMapBuffer(...).asFloatBuffer();
            for (int y = 0, i = 0; y < trisPerDrawCall; y++) {
               fb.position((i++) * floatStride);
               fb.put(x * 3 + 16).put(y * 3 + 16);
               fb.put(packRGBA(0xFF, 0x00, 0x00, 0xFF));

               fb.position((i++) * floatStride);
               fb.put(x * 3 + 32).put(y * 3 + 16);
               fb.put(packRGBA(0x00, 0xFF, 0x00, 0xFF));

               fb.position((i++) * floatStride);
               fb.put(x * 3 + 16).put(y * 3 + 32);
               fb.put(packRGBA(0x00, 0x00, 0xFF, 0xFF));
            }
            glUnmapBuffer();

            glDrawArrays(GL_TRIANGLES, 0, 3 * trisPerDrawCall);
         }
         //

         Display.update();
      }


Taking full advantage of the GPU's performance means we have to keep the driver from doing these costly verifications. At first I thought that guaranteeing that the VBOs are no longer used in rendering, by using a pool of VBOs, would be enough, but the driver still has to perform a lot of checks to prove what the application already knows. We somehow have to make the driver trust our input and disable every check.

Fortunately, we have glMapBufferRange( ... | GL_MAP_UNSYNCHRONIZED_BIT) to do exactly that! But it leaves us with a problem: now we have to ensure that all VBO mapping we do is done on memory guaranteed not to be in use by the GPU. As the GPU is fully asynchronous, that's no easy feat.

But let's first check performance by simply using 1 VBO and using glMapBufferRange(...) with GL_MAP_UNSYNCHRONIZED_BIT instead of glMapBuffer(...). I got ~1450fps, an improvement of over a factor of 32! Awesome! The downside is that the results of rendering are, as the spec says, 'undefined', and I indeed see a lot of garbled rendering in the framebuffer.

            // what not to do either...
            glVertexPointer(2, GL_FLOAT, stride, 0 << 2);
            glColorPointer(4, GL_UNSIGNED_BYTE, stride, 2 << 2);

-           FloatBuffer fb = glMapBuffer(..., GL_WRITE_ONLY, ...).asFloatBuffer();
+           FloatBuffer fb = glMapBufferRange(..., GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT, ...).asFloatBuffer();
            for (int y = 0, i = 0; y < trisPerDrawCall; y++) {
               fb.position((i++) * floatStride);


After a bit of messing around, I found a way to guarantee that the VBOs we map are no longer in use by the GPU. On a (common) double-buffered framebuffer, we can assume that 1 frame is being rendered into while the other is displayed. On a triple-buffered setup, 2 frames are being rendered into while the last is displayed. On a stereo triple-buffered setup, 4 frames are being rendered into while the last pair of frames is displayed. This means that there can be up to 6 frames active in any game!

So if we create an array (of length 6) of lists of VBOs, we can pick a list of VBOs each frame that was last used 6 frames ago, and is therefore guaranteed not to be used in any rendering. For every frame, we reuse and/or allocate as many VBOs as we need, trusting that we will encounter these VBOs again after 6 frames.
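
To make the reuse pattern concrete, here is a minimal GL-free sketch of the ring of per-frame pools. It is only an illustration: the class name FramePoolRing and its methods are hypothetical, and plain integer ids stand in for real VBO handles so the reuse pattern can be checked without an OpenGL context.

```java
import java.util.ArrayList;
import java.util.List;

/** GL-free sketch: a ring of per-frame pools; integer ids stand in for VBO handles. */
final class FramePoolRing {
    private final List<List<Integer>> pools = new ArrayList<>();
    private int frameIndex = -1; // no frame started yet
    private int cursor;          // next slot to hand out within the current frame
    private int nextId;          // fake handle generator (assumption: glGenBuffers() in real code)

    FramePoolRing(int framesInFlight) {
        for (int i = 0; i < framesInFlight; i++) {
            pools.add(new ArrayList<>());
        }
    }

    /** Advance to the pool that was last used framesInFlight frames ago. */
    void nextFrame() {
        frameIndex = (frameIndex + 1) % pools.size();
        cursor = 0;
    }

    /** Reuse a buffer from the current frame's pool, or allocate one if the pool is exhausted. */
    int nextBuffer() {
        if (frameIndex < 0) {
            throw new IllegalStateException("not in a frame");
        }
        List<Integer> pool = pools.get(frameIndex);
        if (cursor == pool.size()) {
            pool.add(nextId++);
        }
        return pool.get(cursor++);
    }
}
```

With 6 frames in flight, a buffer handed out in frame N is not handed out again before frame N + 6, which is the property the unsynchronized mapping relies on.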



It wouldn't be a 'shared code' post if I didn't dump the full code, so you can take advantage of the performance boost of doing the driver's verification work yourself:

import static org.lwjgl.opengl.GL11.*;
import static org.lwjgl.opengl.GL15.*;

import java.nio.*;

import org.lwjgl.*;
import org.lwjgl.opengl.*;

public class MappedVBOTest {
   private static float packRGBA(int r, int g, int b, int a) {
      return Float.intBitsToFloat((r << 0) | (g << 8) | (b << 16) | (a << 24));
   }

   public static void main(String[] main) throws LWJGLException {
      Display.setDisplayMode(new DisplayMode(800, 600));
      Display.create();

      {
         glMatrixMode(GL_PROJECTION);
         glLoadIdentity();
         glOrtho(0, 800, 600, 0, -1, +1);

         glMatrixMode(GL_MODELVIEW);
         glLoadIdentity();
      }

      boolean isUnsynchronized = true;
      MappedVertexBufferObjectProvider provider;
      provider = new MappedVertexBufferObjectProvider(GL_ARRAY_BUFFER, GL_STATIC_DRAW, isUnsynchronized);

      glEnableClientState(GL_VERTEX_ARRAY);
      glEnableClientState(GL_COLOR_ARRAY);

      int stride = (2 + 1) << 2;
      {
         // round up to multiple of 16 (for SIMD)
         stride += 16 - 1;
         stride /= 16;
         stride *= 16;
      }

      int strideFloat = stride >> 2;

      int drawCalls = 64;
      int trisPerDrawCall = 4;

      long lastSecond = System.nanoTime();
      int frameCount = 0;

      while (!Display.isCloseRequested()) {
         provider.nextFrame();

         glClearColor(0, 0, 0, 1);
         glClear(GL_COLOR_BUFFER_BIT);

         for (int x = 0; x < drawCalls; x++) {
            MappedVertexBufferObject vbo = provider.nextVBO();
            vbo.ensureSize(trisPerDrawCall * 3 * stride);

            glVertexPointer(2, GL_FLOAT, stride, 0 << 2);
            glColorPointer(4, GL_UNSIGNED_BYTE, stride, 2 << 2);

            FloatBuffer fb = vbo.map().asFloatBuffer();
            for (int y = 0, i = 0; y < trisPerDrawCall; y++) {
               fb.position((i++) * strideFloat);
               fb.put(x * 3 + 16).put(y * 3 + 16);
               fb.put(packRGBA(0xFF, 0x00, 0x00, 0xFF));

               fb.position((i++) * strideFloat);
               fb.put(x * 3 + 32).put(y * 3 + 16);
               fb.put(packRGBA(0x00, 0xFF, 0x00, 0xFF));

               fb.position((i++) * strideFloat);
               fb.put(x * 3 + 16).put(y * 3 + 32);
               fb.put(packRGBA(0x00, 0x00, 0xFF, 0xFF));
            }
            vbo.unmap();

            glDrawArrays(GL_TRIANGLES, 0, 3 * trisPerDrawCall);
         }
         //

         Display.update();

         frameCount++;
         if (System.nanoTime() > lastSecond + 1_000_000_000L) {
            lastSecond += 1_000_000_000L;
            Display.setTitle(frameCount + "fps / " + (1000.0f / frameCount) + "ms");
            frameCount = 0;
         }
      }

      Display.destroy();
   }
}


Set isUnsynchronized = false, and you'll see the framerate drop by anything in the realm of a factor of 30 to 60 (!).

AMD Radeon 5500: 1450fps vs 45fps
AMD Radeon 5870: 5250fps vs 88fps


import java.util.*;

public class MappedVertexBufferObjectProvider {
   // triple buffering in stereo mode is rather rare though..
   private static final int MAX_WINDOW_BUFFER_COUNT = 2 * 3;

   private final int glTarget;
   private final int glUsage;
   private final boolean unsync;

   @SuppressWarnings("unchecked")
   public MappedVertexBufferObjectProvider(int glTarget, int glUsage, boolean unsync) {
      this.glTarget = glTarget; // GL_ARRAY_BUFFER, GL_ELEMENT_ARRAY_BUFFER
      this.glUsage = glUsage; // GL_STATIC_DRAW, GL_STREAM_DRAW
      this.unsync = unsync;

      frameToBufferObjects = new ArrayList[MAX_WINDOW_BUFFER_COUNT];
      for (int i = 0; i < frameToBufferObjects.length; i++) {
         frameToBufferObjects[i] = new ArrayList<>();
      }
   }

   final List<MappedVertexBufferObject>[] frameToBufferObjects;

   private int frameIndex = -1;
   private int vboIndex = -1;

   public void nextFrame() {
      frameIndex += 1;
      frameIndex %= frameToBufferObjects.length;

      vboIndex = -1;
   }

   public MappedVertexBufferObject nextVBO() {
      if (frameIndex == -1) {
         throw new IllegalStateException("not in a frame");
      }
      vboIndex += 1;

      List<MappedVertexBufferObject> vbos = frameToBufferObjects[frameIndex];
      if (vboIndex == vbos.size()) {
         vbos.add(new MappedVertexBufferObject(glTarget, glUsage, unsync));
      }

      MappedVertexBufferObject object = vbos.get(vboIndex);
      object.bind();
      return object;
   }

   public void orphanAll() {
      for (List<MappedVertexBufferObject> vbos : frameToBufferObjects) {
         for (MappedVertexBufferObject object : vbos) {
            object.orphan();
         }
      }
   }

   public void trimAllToSize() {
      for (List<MappedVertexBufferObject> vbos : frameToBufferObjects) {
         for (MappedVertexBufferObject object : vbos) {
            object.trimToSize();
         }
      }
   }

   public void delete() {
      for (List<MappedVertexBufferObject> vbos : frameToBufferObjects) {
         for (MappedVertexBufferObject object : vbos) {
            object.delete();
         }
      }
   }

   @Override
   public String toString() {
      int[] vboCounts = new int[frameToBufferObjects.length];
      for (int i = 0; i < vboCounts.length; i++) {
         vboCounts[i] = frameToBufferObjects[i].size();
      }
      return this.getClass().getSimpleName() + "[" + Arrays.toString(vboCounts) + "]";
   }
}


import static org.lwjgl.opengl.GL15.*;

import java.nio.*;

import org.lwjgl.opengl.*;

public class MappedVertexBufferObject {
   private final int glTarget, glUsage;
   private final int handle;
   private int requestedSize, allocatedSize;
   private boolean isMapped;
   private final boolean unsync;

   private static MappedVertexBufferObject bound;

   public MappedVertexBufferObject(int glTarget, int glUsage, boolean unsync) {
      this.glTarget = glTarget;
      this.glUsage = glUsage;
      this.unsync = unsync;
      this.handle = glGenBuffers();
   }

   public void bind() {
      if (bound == this) {
         throw new IllegalStateException("already bound");
      }
      bound = this;

      glBindBuffer(glTarget, this.handle);
   }

   public void ensureSize(int size) {
      assert size > 0;
      if (bound != this) {
         throw new IllegalStateException("not bound");
      }

      requestedSize = size;
      if (size > allocatedSize) {
         glBufferData(glTarget, size, glUsage);
         allocatedSize = size;
      }
   }

   public void trimToSize() {
      if (bound != this) {
         throw new IllegalStateException("not bound");
      }

      if (requestedSize != allocatedSize) {
         glBufferData(glTarget, requestedSize, glUsage);
         allocatedSize = requestedSize;
      }
   }

   public void orphan() {
      if (bound != this) {
         throw new IllegalStateException("not bound");
      }

      glBufferData(glTarget, 0, glUsage);
      allocatedSize = requestedSize = 0;
   }

   public ByteBuffer map() {
      if (bound != this) {
         throw new IllegalStateException("not bound");
      }
      if (requestedSize == 0) {
         throw new IllegalStateException("no data");
      }
      if (isMapped) {
         throw new IllegalStateException("already mapped");
      }
      isMapped = true;

      long offset = 0;
      long length = requestedSize;

      if (GLContext.getCapabilities().OpenGL30) {
         int access = GL30.GL_MAP_WRITE_BIT;
         if (unsync) {
            access |= GL30.GL_MAP_UNSYNCHRONIZED_BIT;
         }
         return GL30.glMapBufferRange(glTarget, offset, length, access, null);
      }

      if (GLContext.getCapabilities().GL_ARB_map_buffer_range) {
         int access = ARBMapBufferRange.GL_MAP_WRITE_BIT;
         if (unsync) {
            access |= ARBMapBufferRange.GL_MAP_UNSYNCHRONIZED_BIT;
         }
         return ARBMapBufferRange.glMapBufferRange(glTarget, offset, length, access, null);
      }

      int access = GL_WRITE_ONLY;
      return glMapBuffer(glTarget, access, null);
   }

   public void unmap() {
      if (bound != this) {
         throw new IllegalStateException("not bound");
      }
      if (!isMapped) {
         throw new IllegalStateException("not mapped");
      }
      isMapped = false;

      glUnmapBuffer(glTarget);
   }

   public void delete() {
      if (bound == this) {
         throw new IllegalStateException("still bound");
      }
      if (isMapped) {
         throw new IllegalStateException("still mapped");
      }

      glDeleteBuffers(handle);
   }

   public static void unbind(int glTarget) {
      if (bound == null) {
         throw new IllegalStateException("none bound");
      }

      glBindBuffer(glTarget, 0);
      bound = null;
   }
}

Offline theagentd
« Reply #1 - Posted 2012-12-21 18:30:09 »

I wouldn't recommend that. The memory usage will skyrocket for heavy geometry. What's wrong with just dumping all the data into a single large VBO instead?

Myomyomyo.
Offline Riven
« Reply #2 - Posted 2012-12-21 18:32:08 »

Sure, it trades verification overhead for vRAM. I thought that was rather obvious...?

Offline Riven
« Reply #3 - Posted 2012-12-21 18:50:19 »

Another problem with streaming data using 1 large VBO is that your CPU is generating the data for long periods, during which the GPU will not receive any instructions. So eventually the GPU will be idling, waiting for the glUnmapBuffer() or the next glDrawElements(..) call. It's best to chunk the data you send to the GPU to keep both the CPU and GPU busy at all times. The code I provided helps tremendously. Keep in mind that the usual glMapBuffer(...) call has such a big overhead that even with only 64 calls per frame, you are likely to get under 100fps on the latest hardware, even when vbo-toggling / pooling.

Offline davedes
« Reply #4 - Posted 2012-12-21 18:54:03 »

I'd be interested to see how it compares to vertex arrays. :)

Offline Riven
« Reply #5 - Posted 2012-12-21 18:59:40 »

Quote from davedes: "I'd be interested to see how it compares to vertex arrays. :)"


Rendering 16 small triangles, 64x per frame

Type  Method                                                    Framerate
VA    gl*Pointer(*Buffer)                                       880fps
VBO   glMapBuffer(WRITE)                                        45fps
VBO   glMapBufferRange(WRITE)                                   45fps
VBO   glMapBufferRange(WRITE | INVALIDATE_BUFFER)               86fps
VBO   glMapBufferRange(WRITE | INVALIDATE_BUFFER) + orphaning   135fps
VBO   glMapBufferRange(WRITE | UNSYNCHRONIZED)                  1450fps
VBO   glMapBufferRange(WRITE | UNSYNCHRONIZED) + orphaning      1310fps
VBO   glBufferSubData()                                         780fps
VBO   glBufferData()                                            740fps

Offline lhkbob

JGO Knight


Medals: 32



« Reply #6 - Posted 2012-12-21 19:35:37 »

I remember reading that you could effectively detach data from a VBO by calling glBufferData(target, null).  The GPU would continue using the previous block of data to complete rendering, but any data manipulation commands would use the new block assigned with the glBufferData call.

Have you tried this at all?

Offline Riven
« Reply #7 - Posted 2012-12-21 19:37:11 »

Quote from lhkbob: "I remember reading that you could effectively detach data from a VBO by calling glBufferData(target, null).  The GPU would continue using the previous block of data to complete rendering, but any data manipulation commands would use the new block assigned with the glBufferData call. Have you tried this at all?"

Yup, it's called orphaning, and it isn't remotely as fast as UNSYNC. I'd have to benchmark it for absolute numbers though.


Update
Added orphaning performance to stats in previous post.

Offline theagentd
« Reply #8 - Posted 2012-12-21 19:37:56 »

Also, have you seen how Cas' sprite renderer works? It uses multiple VBOs too.

Myomyomyo.
Offline Riven
« Reply #9 - Posted 2012-12-21 19:38:33 »

Quote from theagentd: "Also, have you seen how Cas' sprite renderer works? It uses multiple VBOs too."
Yeah I know all about it, I've been optimizing the SpriteEngine.

Offline theagentd
« Reply #10 - Posted 2012-12-21 19:40:52 »

Quote from Riven: "Another problem with streaming data using 1 large VBO is that your CPU is generating the data for long periods, at which time the GPU will not receive any instructions. So eventually the GPU will be idling, waiting for the glUnmapBuffer()."
I doubt that your GPU will starve unless you freeze up for more than a frame or have some kind of synchronization (clarification: reading back data from the GPU). Are you sure this is really a problem? You can check GPU usage with GPU-Z.

Myomyomyo.
Offline Riven
« Reply #11 - Posted 2012-12-21 19:43:40 »

Whether it's a problem depends on your usage of OpenGL. It's very rarely not a problem. In tech-demos it's not likely to be a problem. In, say, your particle engine code, the bottleneck is clearly the GPU, so you just feed it some data (if any) and it will chug along. In a real game, which is a lot more complex than that, a bunch of geometry (between state changes) is much less likely to keep the GPU active for long, so you try not to batch up your uploads, to keep the GPU busy while you're calculating/generating more dynamic data.

Offline theagentd
« Reply #12 - Posted 2012-12-21 19:53:05 »

o_O Are you really sure? I'm not questioning using multiple VBOs since there's clear evidence that mapping a buffer is costly, just your statement that mapping a buffer for a long time causes stalling. For example, in my CPU-based particle system I map almost 12MBs of data each frame. I find it hard to believe that I would gain performance (it's CPU limited) by splitting up this uploading and drawing it in batches.

Myomyomyo.
Offline Riven
« Reply #13 - Posted 2012-12-21 19:56:47 »

If you do 1 glMapBuffer(...) per frame, then there is nothing (well, little*) to be gained. As I said earlier, your tech demo is simple: it doesn't have state changes, you just shove geometry to the GPU and let it work on it. If you have N different types of particles that all have their own state, and you need depth-sorting at the same time... things become vastly more complex.


* If I vastly simplify the numbers, I get 64 glMapBuffer(...) calls at 45fps, which means that each glMapBuffer call has an overhead of about 0.35ms (1000ms / 45 / 64), which can be reduced by a factor of 30-60 by using glMapBufferRange(..., UNSYNC, ...)
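
The back-of-the-envelope estimate can be reproduced with a few lines of Java. This is only a sketch of the simplification described above: it attributes the entire frame time to the map calls, so it gives an upper bound per call. The class and method names are made up for illustration.

```java
/** Sketch of the simplified per-call overhead estimate (hypothetical helper). */
final class MapOverhead {
    /** Frame time in milliseconds for a given frame rate. */
    static double frameMillis(double fps) {
        return 1000.0 / fps;
    }

    /** Upper bound on the cost of one call, attributing the whole frame time to the calls. */
    static double perCallMillis(double fps, int callsPerFrame) {
        return frameMillis(fps) / callsPerFrame;
    }

    public static void main(String[] args) {
        // 64 map calls per frame, using the frame rates measured in this thread
        System.out.println("synchronized:   " + perCallMillis(45, 64) + " ms per map");
        System.out.println("unsynchronized: " + perCallMillis(1450, 64) + " ms per map");
    }
}
```

Since both runs use the same number of calls per frame, the ratio of the two results is simply the ratio of the frame rates.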

Offline Riven
« Reply #14 - Posted 2012-12-21 20:13:33 »

Quote from theagentd: "You can check GPU usage with GPU-Z."

glMapBuffer(...) @45fps -> GPU-Z says 41% GPU load
glMapBufferRange(...) @1450fps -> GPU-Z says 99% GPU load

Clearly GPU-Z's GPU Load statistic is not a good indicator of GPU efficiency (maybe it too is in a busy loop, waiting for data from the CPU), as with a load of 41%, you'd expect to get ~600fps, not 45fps.

Offline theagentd
« Reply #15 - Posted 2012-12-21 20:47:16 »

Yes, but what's stopping me from just putting all those particles into a single VBO and drawing parts of that VBO with multiple draw calls, just like Cas' sprite renderer is doing now? The actual uploading would be exactly the same as my particle test, but the draw calls would be different. Wouldn't that make the performance of glMapBuffer() irrelevant since it's only called once per frame? The draw calls should stay the same, but you'll also have to switch VBOs between each draw call with your approach.
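
As a rough sketch of what I mean (not taken from Cas' renderer; the class and method names are hypothetical), the bookkeeping for one shared VBO is just a running offset per batch. All batches are written during a single map/unmap, and each one is then drawn from its own range:

```java
/** Sketch: per-batch vertex ranges inside one shared buffer (hypothetical helper). */
final class BatchRanges {
    /** Returns first[i], the index of the first vertex of batch i inside the shared buffer. */
    static int[] firstVertices(int[] vertexCounts) {
        int[] first = new int[vertexCounts.length];
        int running = 0;
        for (int i = 0; i < vertexCounts.length; i++) {
            first[i] = running;
            running += vertexCounts[i];
        }
        return first;
    }

    /** Total number of bytes the shared buffer needs to hold all batches. */
    static int totalBytes(int[] vertexCounts, int strideBytes) {
        int totalVertices = 0;
        for (int count : vertexCounts) {
            totalVertices += count;
        }
        return totalVertices * strideBytes;
    }
}
```

Each batch i would then be drawn with glDrawArrays(GL_TRIANGLES, first[i], vertexCounts[i]), with state changes in between as needed.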

Quote from Riven: "glMapBuffer(...) @45fps -> GPU-Z says 41% GPU load; glMapBufferRange(...) @1450fps -> GPU-Z says 99% GPU load. Clearly GPU-Z's GPU Load statistic is not a good indicator of GPU efficiency (maybe it too is in a busy loop, waiting for data from the CPU), as with a load of 41%, you'd expect to get ~600fps, not 45fps."

A GPU is way too complex to measure load accurately, but the relative load can be very informative. If GPU load isn't close to 100%, you have a bottleneck somewhere else. It could be a VRAM bandwidth bottleneck, a CPU bottleneck or one of lots of different other things. The only thing it tells you is that the stream processors/CUDA cores aren't processing at 100% capacity.

EDIT: I'm also very interested in the actual results on the sprite rendering engine too since I once tried to optimize it too.

Myomyomyo.
Offline pitbuller
« Reply #16 - Posted 2012-12-21 20:54:17 »

In memory-limited situations, are vertex arrays the best choice then?
Offline Riven
« Reply #17 - Posted 2012-12-21 20:56:02 »

Quote from pitbuller: "In memory-limited situations, are vertex arrays the best choice then?"
It depends. But for dynamic content, it's a quick and dirty solution.

Offline Roquen
« Reply #18 - Posted 2012-12-21 20:58:46 »

Very minor point but my personal opinion would be to convert all the contract checks into asserts as this isn't targeting general purpose black-box usage.
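
A minimal sketch of what that could look like, assuming the same bound/mapped bookkeeping as the posted class. AssertCheckedBuffer is a made-up name and the GL calls are left out so only the checks remain:

```java
/** Sketch: contract checks expressed as asserts instead of thrown exceptions. */
final class AssertCheckedBuffer {
    private static AssertCheckedBuffer bound; // mirrors the static 'bound' field of the posted class
    private boolean mapped;

    void bind() {
        assert bound != this : "already bound";
        bound = this;
    }

    void map() {
        assert bound == this : "not bound";
        assert !mapped : "already mapped";
        mapped = true;
    }

    void unmap() {
        assert bound == this : "not bound";
        assert mapped : "not mapped";
        mapped = false;
    }

    boolean isMapped() {
        return mapped;
    }
}
```

The checks only run when the JVM is started with -ea; without that flag they are skipped, so a release build does not pay for them.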
Offline PaulCunningham

Junior Member


Medals: 2



« Reply #19 - Posted 2012-12-21 21:40:16 »

Quote from theagentd: "Also, have you seen how Cas' sprite renderer works? It uses multiple VBOs too."
Quote from Riven: "Yeah I know all about it, I've been optimizing the SpriteEngine."

Can you optimise it for the dodgy jitter on the XBox while you're at it as well please :)
Offline Riven
« Reply #20 - Posted 2012-12-21 21:43:39 »

I don't want to raise unrealistic expectations. I more than doubled/tripled the speed of the sprite sorter in the SpriteEngine by packing multiple ordering-arrays into one/two and feeding that into the radix sort, after having unsuccessfully fiddled with the rendering code for hours. It has yet to be back-ported to SPGL1.
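
To give an idea of the kind of packing involved, here is a simplified sketch. It is not the SpriteEngine code: the key layout (two 16-bit ordering values in one int) and all names are assumptions chosen for illustration. Sorting one packed key replaces sorting by several separate ordering arrays.

```java
/** Sketch: sort sprite indices by one packed key using an LSD radix sort. */
final class PackedKeySort {
    /** Packs two 16-bit ordering values into one key; 'major' dominates 'minor'. */
    static int packKey(int major, int minor) {
        return ((major & 0xFFFF) << 16) | (minor & 0xFFFF);
    }

    /** Returns sprite indices ordered by ascending (unsigned) key, one byte per pass. */
    static int[] sortedOrder(int[] keys) {
        int n = keys.length;
        int[] order = new int[n];
        int[] scratch = new int[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        for (int shift = 0; shift < 32; shift += 8) {
            int[] count = new int[257];
            // histogram of the current byte
            for (int i = 0; i < n; i++) {
                count[((keys[order[i]] >>> shift) & 0xFF) + 1]++;
            }
            // prefix sum: count[b] becomes the first output slot of bucket b
            for (int b = 0; b < 256; b++) {
                count[b + 1] += count[b];
            }
            // stable scatter into the scratch array
            for (int i = 0; i < n; i++) {
                int bucket = (keys[order[i]] >>> shift) & 0xFF;
                scratch[count[bucket]++] = order[i];
            }
            int[] tmp = order;
            order = scratch;
            scratch = tmp;
        }
        return order;
    }
}
```

Because every pass is stable, four byte-wide passes over the packed key give the same order as sorting by 'major' first and 'minor' second.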

Offline PaulCunningham

« Reply #21 - Posted 2012-12-21 21:48:58 »

It's something I'll probably dodge with Titan but I had to remove all the function calls from the Droid Assault demo - saved about 1.5 milliseconds IIRC (but I was only joking).

PS - nice work with the LOS shader too.
Offline gouessej
« Reply #22 - Posted 2012-12-21 23:37:52 »

Hi

Maybe my question is silly... Why do you use glMapBuffer instead of glBufferSubData? Thank you for this code. I didn't even know about glMapBufferRange before reading your source code. Is your example useful for static VBOs?

Offline Riven
« Reply #23 - Posted 2012-12-21 23:56:05 »

The graphics driver also has to do verification for glBufferData(...) and glBufferSubData(...).

I get ~780fps with glBufferSubData( )
I get ~740fps with glBufferData( )

Quite good, but nowhere near as good as 1450fps.

Offline Riven
« Reply #24 - Posted 2012-12-21 23:58:47 »

import static org.lwjgl.opengl.GL11.*;
import static org.lwjgl.opengl.GL15.*;

import java.nio.*;

import org.lwjgl.*;
import org.lwjgl.opengl.*;

public class MappedVertexBufferObjectTest {
   private static float packRGBA(int r, int g, int b, int a) {
      return Float.intBitsToFloat((r << 0) | (g << 8) | (b << 16) | (a << 24));
   }

   public static void main(String[] args) throws LWJGLException {
      Display.setDisplayMode(new DisplayMode(800, 600));
      Display.create();

      {
         glMatrixMode(GL_PROJECTION);
         glLoadIdentity();
         glOrtho(0, 800, 600, 0, -1, +1);

         glMatrixMode(GL_MODELVIEW);
         glLoadIdentity();
      }

      final int VA = 1;
      final int VBO_DATA = 2;
      final int VBO_SUBDATA = 3;
      final int VBO_MAPPED = 4;

      int renderStrategy = VBO_SUBDATA;

      boolean isOrphaning = false;
      boolean isUnsynchronized = true;
      MappedVertexBufferObjectProvider provider;
      provider = new MappedVertexBufferObjectProvider(GL_ARRAY_BUFFER, GL_STREAM_DRAW, isUnsynchronized);

      glEnableClientState(GL_VERTEX_ARRAY);
      glEnableClientState(GL_COLOR_ARRAY);

      // 2 position floats + 1 float carrying the packed RGBA bytes, 4 bytes each
      int stride = (2 + 1) << 2;
      {
         // round up to multiple of 16 (for SIMD)
         stride += 16 - 1;
         stride /= 16;
         stride *= 16;
      }

      int strideFloat = stride >> 2;

      int drawCalls = 64;
      int trisPerDrawCall = 4;

      long lastSecond = System.nanoTime();
      int frameCount = 0;

      ByteBuffer bb = BufferUtils.createByteBuffer(trisPerDrawCall * 3 * stride);
      FloatBuffer fb = bb.asFloatBuffer();
      bb.position(2 << 2);

      while (!Display.isCloseRequested()) {
         provider.nextFrame();

         glClearColor(0, 0, 0, 1);
         glClear(GL_COLOR_BUFFER_BIT);

         for (int x = 0; x < drawCalls; x++) {
            MappedVertexBufferObject vbo = null;
            if (renderStrategy == VBO_DATA || renderStrategy == VBO_SUBDATA || renderStrategy == VBO_MAPPED) {
               vbo = provider.nextVBO();
               vbo.ensureSize(trisPerDrawCall * 3 * stride);

               glVertexPointer(2, GL_FLOAT, stride, 0 << 2);
               glColorPointer(4, GL_UNSIGNED_BYTE, stride, 2 << 2);

               if (renderStrategy == VBO_MAPPED) {
                  fb = vbo.map().asFloatBuffer();
               }
            }
            if (renderStrategy != VBO_MAPPED) {
               fb.clear();
            }

            for (int y = 0, i = 0; y < trisPerDrawCall; y++) {
               fb.position((i++) * strideFloat);
               fb.put(x * 3 + 16).put(y * 3 + 16);
               fb.put(packRGBA(0xFF, 0x00, 0x00, 0xFF));

               fb.position((i++) * strideFloat);
               fb.put(x * 3 + 32).put(y * 3 + 16);
               fb.put(packRGBA(0x00, 0xFF, 0x00, 0xFF));

               fb.position((i++) * strideFloat);
               fb.put(x * 3 + 16).put(y * 3 + 32);
               fb.put(packRGBA(0x00, 0x00, 0xFF, 0xFF));
            }

            if (renderStrategy == VBO_MAPPED) {
               vbo.unmap();
            } else {
               fb.flip();

               if (renderStrategy == VBO_DATA) {
                  glBufferData(GL_ARRAY_BUFFER, fb, GL_STREAM_DRAW);
               }
               if (renderStrategy == VBO_SUBDATA) {
                  glBufferSubData(GL_ARRAY_BUFFER, 0, fb);
               }
               if (renderStrategy == VA) {
                  glVertexPointer(2, stride, fb);
                  glColorPointer(4, true, stride, bb);
               }
            }

            //  

            glDrawArrays(GL_TRIANGLES, 0, 3 * trisPerDrawCall);

            if (renderStrategy == VBO_DATA || renderStrategy == VBO_SUBDATA || renderStrategy == VBO_MAPPED) {
               if (isOrphaning) {
                  vbo.orphan();
               }
            }
         }
         //

         Display.update();

         frameCount++;
         if (System.nanoTime() > lastSecond + 1_000_000_000L) {
            lastSecond += 1_000_000_000L;
            Display.setTitle(frameCount + "fps / " + (1000.0f / frameCount) + "ms");
            frameCount = 0;
         }
      }

      Display.destroy();
   }
}
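
The vertex layout arithmetic in the test above can be checked without a GL context. The following is a small standalone sketch; the class and method names are made up for illustration, only the formulas are taken from the test (12 raw bytes per vertex, padded to 16, and the RGBA bytes packed into the bits of one float).

```java
class VertexLayoutSketch {
    // Same bit layout as packRGBA in the test, kept as an int so it can be inspected.
    static int packRGBABits(int r, int g, int b, int a) {
        return (r << 0) | (g << 8) | (b << 16) | (a << 24);
    }

    // Round a byte count up to the next multiple of 'alignment'.
    static int roundUp(int bytes, int alignment) {
        return (bytes + alignment - 1) / alignment * alignment;
    }

    public static void main(String[] args) {
        int rawStride = (2 + 1) << 2;         // 2 position floats + 1 colour float
        int stride = roundUp(rawStride, 16);  // padded stride in bytes
        int strideFloat = stride >> 2;        // padded stride in floats
        System.out.println(rawStride + " -> " + stride + " bytes, " + strideFloat + " floats");

        int bits = packRGBABits(0xFF, 0x00, 0x00, 0xFF);
        float asFloat = Float.intBitsToFloat(bits);
        // floatToRawIntBits reads the bits back without collapsing NaN patterns.
        System.out.println(Integer.toHexString(Float.floatToRawIntBits(asFloat)));
    }
}
```

One caveat with the float trick: an alpha of 0x7F or 0xFF combined with blue >= 0x80 sets all exponent bits, which can form a NaN bit pattern, and the Javadoc of Float.intBitsToFloat warns that some platforms may alter signaling-NaN patterns. Writing the colour through an int view of the same buffer sidesteps that.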

Offline CommanderKeith
« Reply #25 - Posted 2012-12-22 01:55:51 »

This is cool, thanks for sharing.
I notice that it uses GL1.5 and 1.1. Is that because you are trying to maintain backwards compatibility with older video cards? Would you use these methods with more recent video cards?

Offline Riven
« Reply #26 - Posted 2012-12-22 02:01:47 »

glEnableClientState and glVertexPointer are indeed deprecated, but to use glVertexAttribPointer, I'm forced to use shaders. Not that there's anything wrong with that, but it would make this code dump rather verbose.

Offline Riven
« Reply #27 - Posted 2012-12-22 19:25:53 »

Yes, but what's stopping me from just putting all those particles into a single VBO and drawing parts of that VBO with multiple draw calls [...] The draw calls should stay the same, but you'll also have to switch VBOs between each draw call with your approach.


1. Putting data into the VBO and sending a batch of draw-calls later:
VSYNC                                            | <-- GPU starts late in frame   VSYNC
| [               generate data                 ][        draw data          ]    |


2. Putting data into the different VBOs, sending draw-calls immediately:
VSYNC         | <-- GPU        | <-- GPU        | <-- GPU                         VSYNC
| [ gen.data ]     [ gen.data ]     [ gen.data ]                                  |
              [ draw data ]    [ draw data ]    [ draw data ]


So... the goal is to reduce the odds of dropping a frame because the draw calls reached the GPU (too) late in the vsync time slot.
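
The two timelines can be put into numbers with a toy model. This is only a sketch under loud assumptions: time units are arbitrary, the GPU starts a draw the moment its data has been submitted and it is idle, and driver-side buffering of commands or frames is ignored entirely. All names are hypothetical.

```java
class SubmissionTimeline {
    // Strategy 1: generate everything first, then draw everything.
    // The GPU cannot start until the last chunk of data exists.
    static int finishSingleBatch(int[] genTimes, int[] drawTimes) {
        int total = 0;
        for (int g : genTimes) total += g;
        for (int d : drawTimes) total += d;
        return total;
    }

    // Strategy 2: submit a draw call right after each chunk is generated.
    // The GPU draws chunk i as soon as its data is ready and the GPU is idle.
    static int finishInterleaved(int[] genTimes, int[] drawTimes) {
        int cpuClock = 0; // when the CPU finishes generating the current chunk
        int gpuFree = 0;  // when the GPU finishes its current draw
        for (int i = 0; i < genTimes.length; i++) {
            cpuClock += genTimes[i];
            int start = Math.max(cpuClock, gpuFree);
            gpuFree = start + drawTimes[i];
        }
        return gpuFree;
    }

    public static void main(String[] args) {
        int[] gen = {4, 4, 4};
        int[] draw = {3, 3, 3};
        System.out.println("single batch finishes at " + finishSingleBatch(gen, draw));
        System.out.println("interleaved finishes at  " + finishInterleaved(gen, draw));
    }
}
```

With three chunks costing 4 units to generate and 3 units to draw, the single batch finishes at 21 and the interleaved submission at 15 in this model. Whether a real driver behaves like this is exactly what needs measuring.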

Offline theagentd
« Reply #28 - Posted 2012-12-22 21:17:40 »

By default, the Nvidia driver will buffer up to 3 frames. The whole point of pipelining rendering is to avoid dropping a frame. I seriously doubt that using a single VBO will affect the FPS or cause a frame drop. The actual frame won't be ready until you do Display.update(). If you were to do an extreme number of OpenGL commands, the command buffer might get full and the command blocks until there's space in the command buffer. If these commands are very simple, the command buffer will be drained extremely quickly and the GPU will stall. That's called a CPU bottleneck and is completely unrelated to the problem you're trying to solve (glMapBuffer() performance).

I know some of the theory too, where's the numbers? =S

Myomyomyo.
Offline Riven
« Reply #29 - Posted 2012-12-22 22:55:26 »

I know some of the theory too, where's the numbers? =S

Rendering 8K trees (5 tris each), about 50x50 fragments.

data                       batches            framerate
glBufferData()             1x glDrawArrays    57fps
glBufferData()             16x glDrawArrays   53fps
glMapBuffer()              1x glDrawArrays    52fps
glMapBuffer()              16x glDrawArrays   40fps
glMapBufferRange(UNSYNC)   1x glDrawArrays    61fps
glMapBufferRange(UNSYNC)   16x glDrawArrays   64fps
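
As a reading aid for these figures: fps differences are not linear in time, so it can help to convert them to milliseconds per frame. A small sketch using only the fps values listed above (the class name is made up; nothing here is a new measurement):

```java
import java.util.Locale;

class FrameTimes {
    // Frame time in milliseconds for a given frame rate.
    static double millisPerFrame(double fps) {
        return 1000.0 / fps;
    }

    public static void main(String[] args) {
        String[] labels = {
            "glBufferData() 1x", "glBufferData() 16x",
            "glMapBuffer() 1x", "glMapBuffer() 16x",
            "glMapBufferRange(UNSYNC) 1x", "glMapBufferRange(UNSYNC) 16x"
        };
        int[] fps = {57, 53, 52, 40, 61, 64};
        for (int i = 0; i < fps.length; i++) {
            System.out.println(String.format(Locale.ROOT, "%-30s %2dfps = %.2f ms/frame",
                    labels[i], fps[i], millisPerFrame(fps[i])));
        }
    }
}
```

For example, the 16-batch case goes from 25 ms per frame with glMapBuffer() (40fps) to about 15.6 ms with the unsynchronized mapping (64fps).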
