Maximizing VBO upload performance!
theagentd

« Posted 2014-02-20 20:50:37 »

Relevant previous threads

 - OpenGL slides detailing the technique I'm using
 - Riven's fast buffer mapping approach



Introduction

When Riven posted his cool new buffer mapping approach a year ago, I realized how important uploading data to the GPU is for performance, and I've been using that approach ever since. Recently Nvidia published some slides on reducing OpenGL overhead using the latest OpenGL extensions, and one thing they talk about is how to improve buffer mapping performance. While trying to implement this I ran into some funny interactions with driver multithreading and instancing, but I believe I have finally come out on top! Sadly, the required functionality has only been implemented by Nvidia so far.



What's wrong with unsynchronized mapped buffers?

When I was first looking through the slides I was genuinely surprised to see that they claimed that mapping buffers was slow, so I decided to do some heavy profiling in the graphics engine I'm working on. I created a scene with a massive amount of shadow-casting point lights. The scene then has to be culled and rendered for each light 6 times to generate shadow maps for those lights, which is an extremely CPU intensive task. The most expensive OpenGL calls were glDrawElementsInstanced() at 27.7% (!) and glMapBufferRange() at 11.1% of the CPU time. Sure, buffer mapping is a real performance hog, but it's not much compared to frustum culling at around 25-30%. The test was rendered at an extremely low resolution, so the GPU load was at around 70%, indicating a CPU bottleneck.

What got me thinking was Nvidia's claim that mapping a buffer forces the driver's server and client threads to synchronize. What server thread? In the Nvidia control panel there is a setting called "Threaded optimization" which controls driver multithreading. I have been keeping this at "Off", since the default setting "Auto" can completely ruin performance in some of my programs. Forcing it to "On" caused performance to drop by 40%, just as I remembered, but the profiling results were completely different. glMapBufferRange() and glUnmapBuffer() now account for 51.1% and 10.5% of the CPU time respectively. Holy shit! Another surprising entry is glGetError(), which I call just twice per frame but which takes 9.4% of the CPU time. These 3 OpenGL functions take up 71.0% of my CPU time! glDrawElementsInstanced(), on the other hand, is now down to 0.2% of the CPU time. What the hell is going on here?!

From what I can see, Threaded optimization adds an extra driver thread which all OpenGL commands are offloaded to. This extra thread complicates the OpenGL pipeline even further:

Before:
Game thread owning the OpenGL context --> OpenGL command buffer --> GPU

Now:
Game thread owning the OpenGL context --> Driver server thread ---> OpenGL command buffer --> GPU

Something you learn very quickly when working with OpenGL is to avoid certain functions like glReadPixels() (without using a PBO of course) that force the CPU to wait for the GPU to finish working since it forces the CPU and GPU to synchronize. Similarly, mapping a buffer forces the game thread to wait for the server thread to finish any pending operations, stalling the game thread until the buffer can be mapped. What we're seeing is not glMapBufferRange() becoming more expensive; we're seeing a driver stall! Using unsynchronized VBOs eliminates the synchronization with the GPU (to ensure the data is no longer in use), but the internal driver thread synchronization cannot be avoided this way.
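As a rough analogy only (this is not how the Nvidia driver is actually implemented), the stall can be reproduced with plain Java threads. In the sketch below a single-threaded executor stands in for the driver server thread, and submitAsync() and callSync() are hypothetical names made up for the illustration. Commands that return nothing are merely queued, while a call that has to return a value must wait for everything queued before it. Assuming glGetError() needs the same round trip, that would also explain why it shows up in the profile.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Toy model only: a single-threaded executor plays the role of the driver's
// server thread. All names are hypothetical, not real driver or LWJGL API.
class DriverThreadModel {

    private final ExecutorService serverThread = Executors.newSingleThreadExecutor();
    // Only ever touched from the server thread.
    private final List<String> executed = new ArrayList<>();

    // Fire-and-forget command (think of a draw call): the game thread only
    // enqueues it and returns immediately.
    void submitAsync(String command) {
        serverThread.execute(() -> executed.add(command));
    }

    // A call that returns a value (think of mapping a buffer): the game
    // thread blocks until every earlier command has been executed.
    int callSync(String command) {
        try {
            return serverThread.submit(() -> {
                executed.add(command);
                return executed.size();
            }).get(); // this get() is the stall
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    void shutdown() {
        serverThread.shutdown();
    }

    public static void main(String[] args) {
        DriverThreadModel driver = new DriverThreadModel();
        for (int i = 0; i < 1000; i++) {
            driver.submitAsync("draw " + i); // cheap for the game thread
        }
        // Returns only after all queued draws and the map itself have run.
        int count = driver.callSync("map buffer");
        System.out.println("Commands executed before map returned: " + count);
        driver.shutdown();
    }
}
```

In this model the cost of the queued draw calls is charged to whichever synchronous call comes next, which matches the profile above: the draw call looks nearly free and the mapping call looks expensive.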



The solution

The solution is actually ridiculously simple. Don't map buffers, or rather, don't map buffers every frame. The OpenGL extension ARB_buffer_storage introduces two new buffer mapping bits, GL_MAP_PERSISTENT_BIT and GL_MAP_COHERENT_BIT. Before ARB_buffer_storage it was impossible to use a buffer on the GPU while it was mapped. GL_MAP_PERSISTENT_BIT gets rid of this restriction, allowing us to simply keep the buffer mapped indefinitely. GL_MAP_COHERENT_BIT ensures that the data we write to the mapped pointer becomes visible to the GPU without having to manually flush anything. Used in an unsynchronized, rotating manner, just like with Riven's approach, the two bits together eliminate all buffer mapping except for the initial mapping when the game is first started (and whenever a buffer has to grow).

Since the new method is so similar to Riven's approach, it's easy to create a generic interface which can be implemented both with unsynchronized VBOs and persistent VBOs. In the following code, map() is supposed to expand the VBO in case it is not large enough to satisfy the requested capacity. Buffer rotation is handled elsewhere.

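import java.nio.ByteBuffer;

public interface MappedVBO {

   // Returns a writable buffer of at least requestedCapacity bytes, growing the VBO if needed.
   public ByteBuffer map(int requestedCapacity);
   // Called when writing is done, before the data is used for rendering.
   public void unmap();
   // Binds the VBO to its target.
   public void bind();
   // Deletes the underlying OpenGL buffer.
   public void dispose();

}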


Here's an implementation using unsynchronized VBOs. The buffer is simply mapped each frame using GL_MAP_UNSYNCHRONIZED_BIT.
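import java.nio.ByteBuffer;

// Assumes LWJGL 2 static imports (buffer objects in GL15, glMapBufferRange in GL30).
import static org.lwjgl.opengl.GL15.*;
import static org.lwjgl.opengl.GL30.*;

public class UnsynchronizedMappedVBO implements MappedVBO {

   protected int target;

   protected int vbo;
   protected int capacity;

   protected ByteBuffer mappedBuffer;

   public UnsynchronizedMappedVBO(int target) {

      this.target = target;

      // Start with an empty data store; the first map() call allocates real storage.
      capacity = 0;
      vbo = glGenBuffers();
      glBindBuffer(target, vbo);
      glBufferData(target, capacity, GL_STREAM_DRAW);
   }

   @Override
   public ByteBuffer map(int requestedCapacity) {
      glBindBuffer(target, vbo);
      if (capacity < requestedCapacity) {
         setCapacity(requestedCapacity);
      }

      // GL_MAP_UNSYNCHRONIZED_BIT skips the wait for the GPU. Making sure the GPU is
      // done with this buffer is the caller's job (buffer rotation).
      // The old ByteBuffer is passed in so LWJGL can reuse the instance when possible.
      mappedBuffer = glMapBufferRange(target, 0, requestedCapacity, GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT, mappedBuffer);
      mappedBuffer.clear();
      return mappedBuffer;
   }

   void setCapacity(int requestedCapacity) {
      // Allocate twice the requested size to avoid reallocating on every small increase.
      capacity = requestedCapacity * 2;
      glBufferData(target, capacity, GL_STREAM_DRAW);
   }

   @Override
   public void unmap() {
      // Expects the VBO to still be bound to target from map().
      glUnmapBuffer(target);
   }

   @Override
   public void bind() {
      glBindBuffer(target, vbo);
   }

   @Override
   public void dispose() {
      glDeleteBuffers(vbo);
   }

}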


With persistent VBOs, map() no longer needs to actually map the buffer! We map the buffer once when it's created and then simply return the same ByteBuffer instance whenever map() is called! If the buffer is too small, we need to reallocate and remap it though.
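import java.nio.ByteBuffer;

// Assumes LWJGL 2.9.1 or later, where GL44 exposes glBufferStorage and the new
// mapping bits. Requires ARB_buffer_storage (OpenGL 4.4) support in the driver.
import static org.lwjgl.opengl.GL15.*;
import static org.lwjgl.opengl.GL30.*;
import static org.lwjgl.opengl.GL44.*;

public class PersistentMappedVBO implements MappedVBO {

   protected int target;

   protected int vbo;
   protected int capacity;

   protected ByteBuffer mappedBuffer;

   public PersistentMappedVBO(int target) {
      this.target = target;
      // Nothing is allocated until the first map() call with a non-zero capacity.
      capacity = 0;
   }

   @Override
   public ByteBuffer map(int requestedCapacity) {

      glBindBuffer(target, vbo);
      if (requestedCapacity > capacity) {
         setCapacity(requestedCapacity);
      }

      // No OpenGL mapping call here: the buffer has been mapped since it was created.
      mappedBuffer.clear();
      return mappedBuffer;
   }

   protected void setCapacity(int requestedCapacity) {

      capacity = requestedCapacity * 2;

      // Storage created with glBufferStorage is immutable, so growing means
      // throwing away the old buffer and creating a new one.
      if (mappedBuffer != null) {
         glUnmapBuffer(target);
         glDeleteBuffers(vbo);
      }

      vbo = glGenBuffers();
      glBindBuffer(target, vbo);
      // The persistent and coherent bits must be given both at allocation and at mapping.
      glBufferStorage(target, capacity, GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
      mappedBuffer = glMapBufferRange(target, 0, capacity, GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT, mappedBuffer);
   }

   @Override
   public void unmap() {
      // Intentionally empty: the buffer stays mapped while the GPU reads from it.
   }

   @Override
   public void bind() {
      glBindBuffer(target, vbo);
   }

   @Override
   public void dispose() {
      // Deleting a buffer also unmaps it.
      glDeleteBuffers(vbo);
   }

}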
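
The post leaves buffer rotation "handled elsewhere", so the following is only a guess at what such a wrapper might look like, not the author's code. RotatingVBO, nextFrame() and the HeapBuffer stand-in are hypothetical names; HeapBuffer replaces the real OpenGL-backed classes with a plain heap ByteBuffer so the sketch runs without a GL context. The buffer count of 3 is an assumption too. With unsynchronized or persistent mapping nothing guarantees the GPU has finished with a buffer after N frames, so a robust version would guard each buffer with a fence (glFenceSync / glClientWaitSync).

```java
import java.nio.ByteBuffer;

// Hypothetical rotation wrapper: each frame writes to a different buffer so
// the CPU does not overwrite data the GPU may still be reading.
class RotatingVBO {

    // Same shape as the MappedVBO interface above, repeated so the sketch
    // compiles on its own.
    interface MappedVBO {
        ByteBuffer map(int requestedCapacity);
        void unmap();
        void bind();
        void dispose();
    }

    private final MappedVBO[] buffers;
    private int current = 0;

    RotatingVBO(MappedVBO... buffers) {
        if (buffers.length == 0) {
            throw new IllegalArgumentException("need at least one buffer");
        }
        this.buffers = buffers.clone();
    }

    // Call once per frame before mapping: advances to the next buffer.
    void nextFrame() {
        current = (current + 1) % buffers.length;
    }

    int currentIndex() { return current; }

    ByteBuffer map(int requestedCapacity) { return buffers[current].map(requestedCapacity); }
    void unmap() { buffers[current].unmap(); }
    void bind() { buffers[current].bind(); }

    void dispose() {
        for (MappedVBO buffer : buffers) {
            buffer.dispose();
        }
    }

    // Stand-in for a real VBO: "maps" a heap buffer and grows it on demand.
    static class HeapBuffer implements MappedVBO {
        private ByteBuffer data = ByteBuffer.allocate(0);

        public ByteBuffer map(int requestedCapacity) {
            if (data.capacity() < requestedCapacity) {
                data = ByteBuffer.allocate(requestedCapacity * 2);
            }
            data.clear();
            return data;
        }
        public void unmap() { }
        public void bind() { }
        public void dispose() { }
    }

    public static void main(String[] args) {
        RotatingVBO vbo = new RotatingVBO(new HeapBuffer(), new HeapBuffer(), new HeapBuffer());
        for (int frame = 0; frame < 6; frame++) {
            vbo.nextFrame();
            ByteBuffer data = vbo.map(64);
            data.putInt(frame); // write this frame's vertex data here
            vbo.unmap();
            vbo.bind();         // a real renderer would issue its draw calls now
            System.out.println("frame " + frame + " used buffer " + vbo.currentIndex());
        }
        vbo.dispose();
    }
}
```

Because the wrapper only talks to the interface, the same rotation logic would work with either UnsynchronizedMappedVBO or PersistentMappedVBO behind it.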




Instancing?!

For a long time I've been trying to get Nvidia to fix their broken instancing performance. The whole point of instancing is to reduce the CPU overhead of drawing multiple identical objects in different locations, but currently instancing is much slower than simply batching together a few "instances" into a single VBO and rendering 64 at a time using a normal glDrawElements() call. Instancing is so slow that I actually implemented an Nvidia specific renderer for parts of my engine to avoid this problem. I'm happy to report that for one reason or another, instancing performance is through the roof when using persistent mapped buffers, matching or surpassing that of my Nvidia specific renderer!



Aaaaand finally some numbers


Threaded optimization off, using unsynchronized VBOs
45 FPS, 22.222 ms/frame

Threaded optimization off, using unsynchronized VBOs, using Nvidia specific renderer (no instancing, avoids most buffer mapping)
75 FPS, 13.333 ms/frame

Threaded optimization on, using persistent VBOs
88 FPS, 11.363 ms/frame
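
For anyone who wants to check the conversion between the two units, here is a small sketch. FrameTimes and its method names are made up for the example; the only inputs are the three FPS figures quoted above.

```java
// Hypothetical helper for converting between FPS and frame time.
class FrameTimes {

    // Milliseconds spent per frame at the given frame rate.
    static double millisPerFrame(double fps) {
        return 1000.0 / fps;
    }

    // How many times faster newFps is compared to baselineFps.
    static double speedup(double baselineFps, double newFps) {
        return newFps / baselineFps;
    }

    public static void main(String[] args) {
        double[] results = {45, 75, 88};
        for (double fps : results) {
            System.out.printf("%.0f FPS = %.3f ms/frame%n", fps, millisPerFrame(fps));
        }
        System.out.printf("Persistent vs. unsynchronized: %.2fx%n", speedup(45, 88));
    }
}
```

Note that printf rounds, so 88 FPS comes out as 11.364 ms while the figure above is the same value truncated to three decimals.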



Conclusion

Although persistent buffers are a bit faster when it comes to raw speed, their other advantages are much more interesting. By allowing efficient use of threaded optimization (which gets enabled automatically when left at the default "Auto") the game's thread is free to do other things. In fact, around 25% of the frame time with persistent VBOs is due to Display.update() stalling due to the extra driver thread still being busy, meaning that I could add quite a bit of game logic in the main thread without affecting my game's frame rate. The massive improvement to instancing is also great, and responsible for most of the performance improvement. That means that I no longer have to maintain two separate renderers.

Myomyomyo.
BurntPizza

« Reply #1 - Posted 2014-02-20 21:25:43 »

Nicely done! That's a 2x increase in speed :o

I'm not so much into this particular performance arena, but I can relate to trying to optimize everything, and you sir are a crazy person. That's a compliment.

Now just need to wait for more vendor support.
theagentd

« Reply #2 - Posted 2014-02-20 22:32:32 »

Quote from: BurntPizza
> Nicely done! That's a 2x increase in speed :o
>
> I'm not so much into this particular performance arena, but I can relate to trying to optimize everything, and you sir are a crazy person. That's a compliment.
>
> Now just need to wait for more vendor support.

It's worth noting that AMD GPUs do not suffer from the instancing performance problem, so the performance increase will not be that dramatic.

princec

« Reply #3 - Posted 2014-02-21 11:27:13 »

It is indeed a massive embuggerance that ARB_buffer_storage is such a new extension because it's basically unavailable on the majority of machines out there at this time... which means I still have to code for the lowest common denominator as there's no point getting it to be "fast enough" on the latest hardware if it just runs poo on everyone else's machine. Even my main development machine doesn't have the extension :( (nvidia gtx280)

It'd make my sprite engine vastly simpler as well. Bah.

Cas :)

theagentd

« Reply #4 - Posted 2014-02-24 20:39:34 »

To put more emphasis on the gains of eliminating the driver synchronization overhead by getting rid of buffer mapping every frame, I decided to time exactly how long my OpenGL calls took to execute, not just the resulting FPS.

Test                             Total render() time   Time spent by OpenGL thread   Scene FPS
Unsynchronized single-threaded   29.5 - 31.0 ms        26.5 - 27.0 ms                31 FPS
Unsynchronized multi-threaded    48 - 55 ms            45 - 50 ms                    19 FPS
Persistent single-threaded       23.8 - 24.3 ms        20.0 - 20.6 ms                39 FPS
Persistent multi-threaded        8.9 - 9.3 ms          3.5 - 3.7 ms                  41 FPS

In essence, the persistent buffer handling cut the total CPU time of my rendering routine to less than 1/3rd, or 3.34x better performance. Even more impressive, the time spent by the thread owning the OpenGL context (my engine is multithreaded) had a massive CPU time reduction to less than 1/7th, or 7.3x better performance.
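
Since the measurements are ranges, the speedup is a range as well. The sketch below (SpeedupRange is a hypothetical helper, not part of the engine) computes the worst and best case ratio between "Unsynchronized single-threaded" and "Persistent multi-threaded" from the numbers in the table, so the quoted factors can be compared against it.

```java
// Hypothetical helper: bounds on the speedup between two measured time ranges.
class SpeedupRange {

    // Returns {worst case, best case} for going from the slow to the fast range.
    // Worst case: fastest slow run vs. slowest fast run; best case: the opposite.
    static double[] range(double slowMin, double slowMax, double fastMin, double fastMax) {
        return new double[] { slowMin / fastMax, slowMax / fastMin };
    }

    public static void main(String[] args) {
        // Total render() time in ms, taken from the table.
        double[] total = range(29.5, 31.0, 8.9, 9.3);
        // Time spent by the OpenGL thread in ms, taken from the table.
        double[] glThread = range(26.5, 27.0, 3.5, 3.7);

        System.out.printf("Total render(): %.2fx to %.2fx%n", total[0], total[1]);
        System.out.printf("OpenGL thread:  %.2fx to %.2fx%n", glThread[0], glThread[1]);
    }
}
```

Both quoted factors, 3.34x and 7.3x, fall inside the ranges this produces.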



Profiling pictures!


Top left: frame time graph, the red line is 16.667ms = 60 FPS.
Bottom right: FPS counter.

The colored bars show how the engine worker threads spent their time. The top thread which does nothing at the beginning is the thread owning the OpenGL context (AKA rendering thread). The other 8 threads (running on hyper-threaded quad-core) are normal worker threads. (Fun fact: I also do stutter free texture streaming from yet another thread using a separate context.)

Task colors:
 - Yellow: Light culling (needs to be multithreaded).
 - Pink-red: Terrain culling for camera
 - Green: Model culling for camera and all shadow maps (the barely visible white dots are fully skeleton animated models, all casting shadows).
 - Red: Skeleton animation matrix generation.
 - Pink (small): Skeleton bone matrix uploading
 - Brown: Rendering from camera's perspective.
 - Gray blue (small): Particle updating and rendering (no particles visible in screenshots).
 - Cyan: Terrain culling for shadow maps.
 - White: Rendering of shadow maps and lights interleaved.
 - Black (small): Postprocessing.


The test scene consists of 500 lights of which 414 lights are visible (some are outside the bottom of the screen). The engine renders 2484 shadow maps each frame. These low-resolution shadow maps are packed into 7 shadow map passes to minimize the number of FBO switches required.
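
The shadow map count follows directly from the light count. A minimal sketch of the arithmetic, assuming the 6 renders per light mentioned earlier correspond to one shadow map per cube map face (ShadowMapCount is a hypothetical name):

```java
// Hypothetical helper: number of shadow maps rendered per frame.
class ShadowMapCount {

    // Assumption: each point light needs one shadow map per cube map face.
    static final int FACES_PER_POINT_LIGHT = 6;

    static int shadowMaps(int visibleLights) {
        return visibleLights * FACES_PER_POINT_LIGHT;
    }

    public static void main(String[] args) {
        // 414 visible lights, as in the test scene.
        System.out.println(shadowMaps(414) + " shadow maps per frame");
    }
}
```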


[Screenshot: Unsynchronized single-threaded]

[Screenshot: Persistent multi-threaded]

basil_

« Reply #5 - Posted 2014-07-03 19:04:28 »

*edit* D'oh, should have read about Riven's fast mapped buffers first :P