OpenGL: efficient MSAA resolve?
basil_

« Posted 2014-08-14 21:49:37 »

Not really Java related, but I think people around here are smarter than me. Maybe you can help me get my head around something. This is just a naive approach to the topic, to get a better understanding of it.

I cannot find any useful information about "efficient deferred MSAA resolve" on the web, but then I guess I'm just asking the wrong questions - or I'm doing it all wrong.

I'm successfully rendering triangles into an FBO with multisampled textures attached to it: depth and color.
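(For context, roughly the kind of FBO setup I mean - a minimal LWJGL sketch; the formats, sample count and names here are placeholder assumptions, not my exact code:)

int samples = 4, width = 1920, height = 1080;

// multisampled color texture
int colorTex = GL11.glGenTextures();
GL11.glBindTexture(GL32.GL_TEXTURE_2D_MULTISAMPLE, colorTex);
GL32.glTexImage2DMultisample(GL32.GL_TEXTURE_2D_MULTISAMPLE, samples, GL11.GL_RGBA8, width, height, true);

// multisampled depth texture
int depthTex = GL11.glGenTextures();
GL11.glBindTexture(GL32.GL_TEXTURE_2D_MULTISAMPLE, depthTex);
GL32.glTexImage2DMultisample(GL32.GL_TEXTURE_2D_MULTISAMPLE, samples, GL14.GL_DEPTH_COMPONENT24, width, height, true);

// attach both to an FBO
int fbo = GL30.glGenFramebuffers();
GL30.glBindFramebuffer(GL30.GL_FRAMEBUFFER, fbo);
GL30.glFramebufferTexture2D(GL30.GL_FRAMEBUFFER, GL30.GL_COLOR_ATTACHMENT0, GL32.GL_TEXTURE_2D_MULTISAMPLE, colorTex, 0);
GL30.glFramebufferTexture2D(GL30.GL_FRAMEBUFFER, GL30.GL_DEPTH_ATTACHMENT, GL32.GL_TEXTURE_2D_MULTISAMPLE, depthTex, 0);
if (GL30.glCheckFramebufferStatus(GL30.GL_FRAMEBUFFER) != GL30.GL_FRAMEBUFFER_COMPLETE)
    throw new IllegalStateException("MSAA FBO incomplete");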

Now, going into postprocessing .. say, HDR tone mapping (or depth-darkening, or depth-of-field blurring), one would render a fullscreen quad into a non-MSAA FBO (like the default framebuffer) using a shader which accesses the multisample textures generated in the first step. Since we know how many samples those textures contain, we can do a simple resolve like this:

(for the sake of completeness - this is a trivial fragment shader used to resolve an MSAA depth buffer into a linear non-MSAA buffer)
#version 150

out float frag0_r;     // writing to a single-chan float-texture
in  vec2 st0;

uniform sampler2DMS aa_tex;  // aa-depth-buffer
uniform vec2        dim;     // buffer-dimensions, could be replaced with textureSize(aa_tex,0)
uniform int         samples = 4;

uniform float znear;
uniform float zfar;
uniform float zrange;

float linear0(in float depth) { return (znear * zfar) / (zfar - depth * zrange); }    

void main()
{
  ivec2 coord = ivec2(dim * st0);
  float d     = 0.0;
  for(int i   = 0; i < samples; i++) d += linear0(texelFetch(aa_tex, coord, i).r);
  frag0_r = d / samples ;
}


But what I'm talking about is really just this:
for(int i = 0; i < samples; i++) sum += texelFetch(aa_tex, coord, i);
This could also be a color texture - the point is that it executes for every sample, even if all samples are identical. Clearly a bad idea. It works fine for simple tasks like the shader above (linear0()), but once we do heavy per-sample computation, performance drops. Not to mention the amount of precious GPU power and memory wasted at this point (.. and generating another texture which tests all samples and emits a single value would just be used to source yet another sample loop).

If I understand the OpenGL pipeline correctly, then during the triangle-rendering pass, by default, not all samples are evaluated by the fragment shader - only when the fragment in question partially covers geometry (I know about GL_SAMPLE_SHADING and the pitfalls which force GL to run the shader for all samples). Samples that are not evaluated separately are just copies.
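(As an aside, per-sample shading can also be forced explicitly from the application - a minimal sketch, assuming GL4.0 / ARB_sample_shading is available; this is exactly the per-sample cost I want to avoid:)

// force the fragment shader to run once per covered sample instead of once per fragment
GL11.glEnable(GL40.GL_SAMPLE_SHADING);
GL40.glMinSampleShading(1.0f); // 1.0 = shade every sample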

This leads me to https://www.opengl.org/sdk/docs/man/html/gl_SampleMaskIn.xhtml and the idea of using this information during the postprocessing resolve pass. If one knew how many distinct samples a texel contains, summing them up would be much more efficient.

Again, maybe I'm just stuck on this path. Of course, you could just render all triangles again with a different shader, grab the gl_SampleMaskIn data plus gl_SampleID, fetch the MSAA texel using gl_FragCoord, and perform the resolve efficiently thanks to the default behaviour of not executing all samples per fragment. But let us assume we don't have the triangles accessible during this pass anymore (don't ask) .. or maybe just because it's nice to see how far one can get with "just using textures to assemble a high-quality image in reasonable time".

Personally I'm coming from offline rendering, Cinema 4D and such, so I like things to be visible .. in textures/buffers .. and layers. Step by step, not too many different things at once. Coding realtime graphics is just a hobby. Here's the idea, generated in my favourite offline renderer:

A simple plane with no AA (scaled 200%):

Same thing with 4x AA:

And the samples per fragment visualised (blue = 1 sample, green = 2, red = 4):


Is it possible to create something like the third image with plain OpenGL (GL4.5) using MSAA textures and the sample mask?

Is it possible to store the sample mask in a non-MSAA texture and use it as a lookup later at all?
I know it is not allowed to mix multisample and regular rendertargets: MRT requires all targets to have the same dimensions and sample configuration.

Why doesn't OpenGL just store the mask with the texture itself and make my life easy?

I know that the "usual" blitting is fast:
GL30.glBindFramebuffer(GL30.GL_READ_FRAMEBUFFER,source_id);
GL30.glBindFramebuffer(GL30.GL_DRAW_FRAMEBUFFER,target_id);
GL30.glBlitFramebuffer(0,0,width,height,0,0,width,height,GL11.GL_COLOR_BUFFER_BIT,GL11.GL_NEAREST);
but it does not allow per-sample operations. (http://mynameismjp.wordpress.com/2012/10/24/msaa-overview/ - "Working with HDR and tone mapping" nicely shows the bad effect of postprocessing after the resolve.)

Is MSAA the wrong approach? Is it worth going into MLAA/FXAA? To me the pixel quality of MSAA is still the best, isn't it?

Should I just rework my rendering path?

What am I missing?
I was so happy when they gave me MSAA rendertargets; now I hate them.

I'm quite sure I'm not the first person to run into this issue. Mind sharing your experience with it? (Sorry for my bad English, I'm not a native speaker.)
basil_

« Reply #1 - Posted 2014-08-16 16:19:00 »

I've got something going, pretty straightforward and not too complicated. Not very optimised, though.

First I tried to carry the mask through the stencil buffer, but AFAIK it is not possible to write the stencil reference value from the shader unless https://www.opengl.org/registry/specs/ARB/shader_stencil_export.txt is supported.

Another FBO attachment works, though:

- set up the MSAA FBO
- attach a color texture and whatnot
- attach an extra single-channel low-precision MSAA texture (say, attachment-1) - see the sketch below
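(A minimal LWJGL sketch of that attachment setup, assuming one RGBA color target plus the mask at attachment-1; the texture name and the GL_R8 format are placeholders, not my exact code:)

// extra low-precision single-channel MSAA mask texture
int maskTex = GL11.glGenTextures();
GL11.glBindTexture(GL32.GL_TEXTURE_2D_MULTISAMPLE, maskTex);
GL32.glTexImage2DMultisample(GL32.GL_TEXTURE_2D_MULTISAMPLE, samples, GL30.GL_R8, width, height, true);
GL30.glFramebufferTexture2D(GL30.GL_FRAMEBUFFER, GL30.GL_COLOR_ATTACHMENT1, GL32.GL_TEXTURE_2D_MULTISAMPLE, maskTex, 0);

// enable both color outputs for MRT rendering
IntBuffer bufs = BufferUtils.createIntBuffer(2);
bufs.put(GL30.GL_COLOR_ATTACHMENT0).put(GL30.GL_COLOR_ATTACHMENT1).flip();
GL20.glDrawBuffers(bufs);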

- during forward rendering, write the mask MRT-style:
#version 400

layout(location = 0) out vec4 frag0;
layout(location = 1) out vec4 frag1;

// layout(location = 1) out float frag1_r; // blending may zero the output if not using a vec4

[...]

// no extra sampling = 0, some extra samples = 1
frag1 = vec4(gl_SampleMaskIn[0] == pow(2,gl_NumSamples)-1 ? 0.0 : 1.0, 0.0,0.0,1.0);


- blit into non-MSAA textures like this (showing just the attachment-1 blit):
GL30.glBindFramebuffer(GL30.GL_READ_FRAMEBUFFER, msaaFboId);
GL11.glReadBuffer(GL30.GL_COLOR_ATTACHMENT1);      // attachment-1 is the single-channel texture
GL30.glBindFramebuffer(GL30.GL_DRAW_FRAMEBUFFER, nonMsaaFboId);
GL11.glDrawBuffer(GL30.GL_COLOR_ATTACHMENT1);      // the target FBO has the same attachment setup, just not MSAA
GL30.glBlitFramebuffer(0,0,width,height,0,0,width,height,GL11.GL_COLOR_BUFFER_BIT,GL11.GL_NEAREST);


- use the mask texture during the AA resolve like this:
#version 150

out float frag0_r;
in  vec2 st0;

uniform sampler2D   aamask; // non-msaa forward-rendering-frag1 output
uniform sampler2DMS aa_tex;

uniform vec2        dim;
uniform int         samples = 0;

uniform float znear;
uniform float zfar;
uniform float zrange;

// comment to see the difference
#define usemaskedresolve

float linear0(in float depth) { return (znear * zfar) / (zfar - depth * zrange); }    

void main()
{
  ivec2 coord = ivec2(dim * st0);
  #ifdef usemaskedresolve
   
    if ( texelFetch(aamask, coord, 0).r != 0.0 )
    {
      float d     = 0.0;
      for(int i   = 0; i < samples; i++) d += linear0(texelFetch(aa_tex, coord, i).r);
      frag0_r = d / samples ;
    }
    else
    {
      frag0_r = linear0(texelFetch(aa_tex, coord, 0).r); // just grab first sample since it's expected to be the same value for all samples
    }
  #else
 
    float d     = 0.0;
    for(int i   = 0; i < samples; i++) d += linear0(texelFetch(aa_tex, coord, i).r);
    frag0_r = d / samples ;
 
  #endif
}


- finally, use the (pretty fast) resolved MSAA buffer.

Clearly it could be optimised a bit more, but as a first step it works pretty well already. The performance gain depends on scene complexity, but in a usual real-world scenario I got a 120 to 160% increase (1080p on a GeForce 560 Ti). The mask is basically just a boolean; it could be more efficient to carry the whole mask and reduce the all-samples loop to just the distinct samples, but then I have no clue how GL distributes the duplicated samples.

Here's a test image rendered like that (8x MSAA). It is no different from the output of a brute-force resolve, just more efficient.


As a result of that, a very overdriven depth-darkening effect (400%) rendered at 50% resolution with 8x MSAA.
Using a non-MSAA depth buffer (mind the jaggy edges around dark shadows):


and with AA depth (mind the smooth edges around dark shadows):


Other things to consider:

- a bool texture?
- better to write to an SSBO?
- and of course, per-sample computation during the resolve; in this example quality is still not perfect when just using an antialiased depth buffer for depth-darkening.

o/
theagentd
« Reply #2 - Posted 2014-08-17 01:56:03 »

I'll take a look at this tonight once I get back!!!

theagentd
« Reply #3 - Posted 2014-08-17 09:41:10 »

Okay, your optimized version is still far from optimal, but obviously better than supersampling your whole post processing.

You're still supersampling too much, but it's not very noticeable with the simple geometry you have. Detecting triangle edges like you do results in redundant supersampling on internal edges of 3D models. A more tessellated 3D model would have triangle edges everywhere, but only a small fraction of those edges actually need more than one sample shaded. A much better approach is to get rid of the extra buffer you use for edge detection and instead run a fullscreen pass after rendering your scene. In this pass, you'd analyze the scene's depth and normals (if you don't have the normals available, you may need to write them to a second MSAA texture during scene rendering), check if there's a significant spread in depth or if the normal varies a lot, and write out a mask to a non-MSAA texture. This mask texture can then be used during post processing and in other places.

Branching like that in the resolve shader is a bad idea. GPUs shade pixels in larger groups, between 8x8 and 16x16 pixels at a time (depending on your graphics card). If even a single pixel in this group requires per-sample postprocessing, the whole group has to wait while that single pixel does its per-sample computations so that the group can stay in sync. The gain you're seeing right now most likely comes from the saved bandwidth of not having to read all samples for most pixels, but you should get an even higher increase if you modify how you do the per-sample computations. A nice OpenGL 3 trick is to compute a (non-MSAA) stencil mask of which pixels need per-sample shading (the above edge detection can be modified to generate a stencil texture instead). You then do the resolve twice: the first time with the stencil test set to only process pixels that don't need MSAA, and the second pass so that it processes the pixels that DO need MSAA. This prevents the above problem as there's no branching in the shader (you have two shaders instead), and in the second pass the GPU can pack pixels that need per-sample resolving together into pixel groups.

An OpenGL 4 alternative is to use a compute shader and reschedule pixels that need MSAA to a second phase. This is a bit complicated to describe, but you'd essentially postprocess the first sample of all pixels, and build a list per work group, using an atomic counter and a shared array, of the pixels that need the rest of their samples shaded. After the first sample is shaded, you switch to processing the rest of the samples that need shading using all shader invocations.

More information can be found here: http://dice.se/wp-content/uploads/GDC11_DX11inBF3_Public.pdf - this describes the second technique, which they use for tile-based deferred shading, but the same technique applies to your use case as well.

basil_

« Reply #4 - Posted 2014-08-17 12:13:07 »

I was hoping one could skip the mask blitting (the attachment-1 blit in my example):

- gl_SampleMaskIn "should" be the same for all samples of a fragment (right?)
- naively, I tried to read just the first sample of the non-blitted MSAA mask

.. but it makes sense: not all samples are processed by the fragment shader (during forward rendering). So even if the output is the same for all processed samples, one cannot tell in which sample the information ends up being stored. Possibly sample zero is never written at all.

gl_SampleID or gl_SamplePosition could be used to work around this, but reading those causes the entire fragment shader to be evaluated per-sample rather than per-fragment, which defeats the purpose of all this.

Anyway, that would just be a minor optimisation.

---

Thanks for the feedback, theagentd!

You're still supersampling too much, but it's not very noticeable with the simple geometry you have. Detecting triangle edges like you do results in redundant supersampling on internal edges of 3D models. A more tessellated 3D model would have triangle edges everywhere, but only a small fraction of those edges actually need more than one sample shaded.

Yes, that's far from optimal .. interestingly, the effect gets worse when lowering the resolution. Here's an example (white = many samples, black = 1 sample):

I was hoping https://www.opengl.org/sdk/docs/man2/xhtml/glEdgeFlag.xml would help GL reduce the amount of sampling, but it only affects the drawing of lines and points.

A much better approach is to get rid of the extra buffer you use for edge detection and instead run a fullscreen pass after rendering your scene. In this pass, you'd analyze the scene's depth and normals (if you don't have the normals available, you may need to write them to a second MSAA texture during scene rendering), check if there's a significant spread in depth or if the normal varies a lot, and write out a mask to a non-MSAA texture. This mask texture can then be used during post processing and in other places.

Sounds like the next step.

Branching like that in the resolve shader is a bad idea. [...] The gain you're seeing right now most likely comes from the saved bandwidth of not having to read all samples for most pixels, but you should get an even higher increase if you modify how you do the per-sample computations.

Yes, I'm very aware of that. The performance hog sits mainly in the bandwidth. It's incredible when you think about how many texture fetches are required to process all samples of an 8x 1080p display.

I'm just trying to keep everything more readable than perfect. Which branching are you referring to - the "if" or the "for" loop?

A nice OpenGL 3 trick is to compute a (non-MSAA) stencil mask of which pixels need per-sample shading (the above edge detection can be modified to generate a stencil texture instead).
That pretty much describes what I was thinking about when this topic popped up.

My first attempt at writing the stencil failed pretty badly. How would one create such a map without https://www.opengl.org/registry/specs/ARB/shader_stencil_export.txt available?

I don't fully understand what you mean by modifying the edge detection .. oh wait .. you mean, setting the pipeline to write only into the stencil buffer ..
glColorMask(false, false, false, false);
glDepthMask(false);
and redrawing all triangles with a shader discarding samples ... now I get lost: the stencil buffer is multisampled at this point. I get stuck on this part every time.

.. oh wait. Not redrawing the triangles - just discarding (instead of processing) pixels (based on the MSAA gl_SampleMaskIn output) in a fullscreen quad to generate the stencil, and then rendering the 2-pass stencil trick (thanks for the pointer, very neat!) with the heavy computation? I guess in my example the computation is not heavy enough to see a gain, but I can see how it would work.

An OpenGL 4 alternative is to use a compute shader and reschedule pixels that need MSAA to a second phase. This is a bit complicated to describe, but you'd essentially postprocess the first sample of all pixels, and build a list per work group, using an atomic counter and a shared array, of the pixels that need the rest of their samples shaded. After the first sample is shaded, you switch to processing the rest of the samples that need shading using all shader invocations.

I've read about this technique in not-quite-related contexts (OIT, raytracing, etc). Not sure I really understand it yet, but it seems to be about ..

- building queues of elements to process
- processing those in a way that utilises the available computing power as well as possible, by avoiding waiting and automatic synchronisation.

Getting close?

Right now I think just relying on the driver to schedule everything is a bad idea, but also good enough. Sure, we waste precious computing power.

My experiments with raytracing in compute shaders vs. OpenCL showed something funny: while the code was almost the same, OpenCL outperformed the compute shader by far. I think it's related to the different compiler and OpenCL's finer control over memory usage and workgroup layout. It looks like OpenCL code gets compiled and optimised much better and therefore executes faster.

So my guess is that, over time, compute shaders/GLSL will profit from OpenCL, adopt more features, and hopefully end up being able to schedule everything better than people can think of .. in my dreams.

Anyway, I'm trying to approach this MSAA topic from the side of having the least amount of application code and exploiting the existing pipeline (letting GL and the driver handle most of it). The stencil trick makes me think this can actually end up being useful after all.

Also, I think one should be careful to separate two things: resolving MSAA textures is one thing, doing postprocessing is another - postprocessing before or after the resolve.

On the other hand, compute shaders look more and more useful the deeper I dig into them. Especially
More information can be found here: http://dice.se/wp-content/uploads/GDC11_DX11inBF3_Public.pdf - this describes the second technique, which they use for tile-based deferred shading, but the same technique applies to your use case as well.
when it comes to tile-based deferred rendering, which is AFAIK the state of the art at the moment.

Thanks for your help!
theagentd
« Reply #5 - Posted 2014-08-17 14:24:53 »

I was hoping one could skip the mask blitting (the attachment-1 blit in my example):

- gl_SampleMaskIn "should" be the same for all samples of a fragment (right?)
- naively, I tried to read just the first sample of the non-blitted MSAA mask

.. but it makes sense: not all samples are processed by the fragment shader (during forward rendering). So even if the output is the same for all processed samples, one cannot tell in which sample the information ends up being stored. Possibly sample zero is never written at all.

gl_SampleID or gl_SamplePosition could be used to work around this, but reading those causes the entire fragment shader to be evaluated per-sample rather than per-fragment, which defeats the purpose of all this.

Anyway, that would just be a minor optimisation.

I'm not sure what you're trying to achieve here, since detecting triangle edges like that results in supersampling for edges that don't need it, as you can see on the sphere, but anyway... The easiest way of detecting a triangle edge in the shader is to check whether gl_FragCoord.xy is at the center of the pixel:

if(fract(gl_FragCoord.xy) != vec2(0.5)){
    //edge!
}else{
    //not edge!
}

Of course this only works during scene rendering, not during the postprocessing fullscreen pass (that's just a fullscreen quad).


I'm just trying to keep everything more readable than perfect. Which branching are you referring to - the "if" or the "for" loop?

I'm referring to this code:
    if ( texelFetch(aamask, coord, 0).r != 0.0 )
    {
      float d     = 0.0;
      for(int i   = 0; i < samples; i++) d += linear0(texelFetch(aa_tex, coord, i).r);
      frag0_r = d / samples ;
    }
    else
    {
      frag0_r = linear0(texelFetch(aa_tex, coord, 0).r); // just grab first sample since it's expected to be the same value for all samples
    }

You have two main problems:
1. As I said before, if just a single pixel enters the if() statement (= requires per-sample processing), the whole work group has to wait for that pixel to finish.
2. You should make the "samples" variable a constant or a #define so that the for loop can be unrolled by the GLSL compiler and 1/samples can be precomputed for the division.

To improve the branching performance you can write your if statement like this instead:
//Inject this into the shader code before compiling the shader
#define SAMPLES 4

...


frag0_r = linear0(texelFetch(aa_tex, coord, 0).r);

if(texelFetch(aamask, coord, 0).r != 0.0 ){
    for(int i = 1; i < SAMPLES; i++){
        frag0_r += linear0(texelFetch(aa_tex, coord, i).r);
    }
    frag0_r /= SAMPLES;
}


With your original shader it's essentially like this:


if(at least one pixel does NOT require MSAA){
    those pixels sample 1 sample, the rest runs no-ops while waiting
}
if(at least one pixel requires MSAA){
    those pixels sample 4 samples, the rest runs no-ops while waiting
}


In most cases (a tile has both MSAA and non-MSAA pixels) you're effectively waiting for it to sample 5 samples. With my modified version you're instead doing this:


sample 1 sample
if(at least one pixel requires MSAA){
    those pixels sample the remaining 3 samples, the rest runs no-ops
}


At worst, this samples 4 instead of 5 samples. It doesn't really matter much here since you're bandwidth limited, but it's a good trick which is applicable in a large number of cases.


A nice OpenGL 3 trick is to compute a (non-MSAA) stencil mask of which pixels need per-sample shading (the above edge detection can be modified to generate a stencil texture instead).
That pretty much describes what I was thinking about when this topic popped up.

My first attempt at writing the stencil failed pretty badly. How would one create such a map without https://www.opengl.org/registry/specs/ARB/shader_stencil_export.txt available?

I don't fully understand what you mean by modifying the edge detection .. oh wait .. you mean, setting the pipeline to write only into the stencil buffer ..
glColorMask(false, false, false, false);
glDepthMask(false);
and redrawing all triangles with a shader discarding samples ... now I get lost: the stencil buffer is multisampled at this point. I get stuck on this part every time.

.. oh wait. Not redrawing the triangles - just discarding (instead of processing) pixels (based on the MSAA gl_SampleMaskIn output) in a fullscreen quad to generate the stencil, and then rendering the 2-pass stencil trick (thanks for the pointer, very neat!) with the heavy computation? I guess in my example the computation is not heavy enough to see a gain, but I can see how it would work.

First you should create a (non-MSAA) renderbuffer and attach it to an FBO. I strongly recommend using GL_DEPTH24_STENCIL8 instead of GL_STENCIL_INDEX8, as the latter is only guaranteed to be supported in GL4.3 and later. You'd attach this renderbuffer as a GL_DEPTH_STENCIL_ATTACHMENT to the FBO. There is no need to disable color writes or depth writes, as we have no color attachments and we won't enable the depth test in the first place.

To generate a stencil mask, you'd first clear the stencil buffer to all 0. Then you'd render a fullscreen quad to the FBO with the stencil test enabled like this:
1  
2  
3  
glEnable(GL_STENCIL_TEST); //Also enables stencil writes
glStencilFunc(GL_ALWAYS, 1, 0xFF); //Always succeed, ref = 1, modify all bits (not necessary, but standard)
glStencilOp(GL_REPLACE, GL_REPLACE, GL_REPLACE); //Stencil test cannot fail, depth test cannot fail, if both succeed replace stencil value at pixel with ref (=1)

The problem is that this code will mark everything as an edge, since the fullscreen quad covers all pixels. The solution is to create a custom shader which checks the MSAA samples of each pixel and discards the fragment if MSAA isn't necessary; discard; will prevent the stencil write.

So we have our stencil mask! We then attach the same stencil renderbuffer to the postprocessing FBO so it can be used by the stencil test when doing postprocessing. We set the stencil func to GL_EQUAL with ref 0, and set glStencilOp to GL_KEEP for all cases so we don't modify the stencil values. This means that only the pixels with stencil=0 get processed. Similarly, you then change ref to 1 and only stencil=1 (= needs MSAA) gets processed.
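(To make the sequence concrete, a rough LWJGL sketch under the assumption that the edge-detection and the two resolve shaders already exist; drawFullscreenQuad and the shader names are hypothetical helpers, and the FBO binds between passes are omitted:)

// one-time setup: a non-MSAA depth-stencil renderbuffer shared by the mask FBO and the postprocessing FBO
int rbo = GL30.glGenRenderbuffers();
GL30.glBindRenderbuffer(GL30.GL_RENDERBUFFER, rbo);
GL30.glRenderbufferStorage(GL30.GL_RENDERBUFFER, GL30.GL_DEPTH24_STENCIL8, width, height);
GL30.glFramebufferRenderbuffer(GL30.GL_FRAMEBUFFER, GL30.GL_DEPTH_STENCIL_ATTACHMENT, GL30.GL_RENDERBUFFER, rbo);

// pass 0: build the mask - fullscreen quad, the shader discards pixels that don't need MSAA
GL11.glClearStencil(0);
GL11.glClear(GL11.GL_STENCIL_BUFFER_BIT);
GL11.glEnable(GL11.GL_STENCIL_TEST);
GL11.glStencilFunc(GL11.GL_ALWAYS, 1, 0xFF);
GL11.glStencilOp(GL11.GL_REPLACE, GL11.GL_REPLACE, GL11.GL_REPLACE);
drawFullscreenQuad(edgeDetectShader);

// pass 1: process pixels that do NOT need MSAA (stencil == 0) with the cheap one-sample shader
GL11.glStencilFunc(GL11.GL_EQUAL, 0, 0xFF);
GL11.glStencilOp(GL11.GL_KEEP, GL11.GL_KEEP, GL11.GL_KEEP);
drawFullscreenQuad(simpleResolveShader);

// pass 2: process pixels that DO need MSAA (stencil == 1) with the per-sample shader
GL11.glStencilFunc(GL11.GL_EQUAL, 1, 0xFF);
drawFullscreenQuad(perSampleResolveShader);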





I'm starting to question the gains from doing this, though. I'm coming from deferred shading, where that stencil mask would be used for lighting as well as postprocessing, so the cost of generating the mask (which requires sampling the depth and normals of all samples) is offset by avoiding a lot of expensive lighting. From what I can see, your postprocessing shader is so cheap that it's actually surprising you're getting any performance improvement at all, considering you're increasing the bandwidth required during scene rendering with the additional render target and also resolving that extra render target. Unless the cost of generating the mask is less than the work it saves, it's better to just brute-force it. I have a feeling that the main reason you're seeing a performance increase is that a large part of the scene is sky. In a more realistic scene I think your current method could actually be slower than simply brute-forcing it: if a larger number of pixels require MSAA, you'd essentially be brute-forcing it anyway, but also computing and resolving the mask.

basil_

« Reply #6 - Posted 2014-08-17 15:21:18 »

I'm not sure what you're trying to achieve here, since detecting triangle edges like that results in supersampling for edges that don't need it, as you can see on the sphere, but anyway... The easiest way of detecting a triangle edge in the shader is to check whether gl_FragCoord.xy is at the center of the pixel:

if(fract(gl_FragCoord.xy) != vec2(0.5)){
    //edge!
}else{
    //not edge!
}

Of course this only works during scene rendering, not during the postprocessing fullscreen pass (that's just a fullscreen quad).
Aye, exactly that! Thanks again - I'd never tried fract of the frag coords before. This is a better way than using the gl_SampleMaskIn input.

frag2 = vec4(gl_SampleMaskIn[0] == pow(2,gl_NumSamples)-1 ? 0.0 : 1.0, 0.0,0.0,1.0);
generates exactly the same as
frag2 = vec4(fract(gl_FragCoord.xy) != vec2(0.5) ? 1.0 : 0.0 ,0.0,0.0,1.0);
but does not require
#version 400
which is great!

What I had in mind: this generates an MSAA buffer. Right now I'm blitting it into a non-MSAA buffer and then carrying on with the rest. I would like to skip the blitting, if it were possible to access the "edge or no edge" information from the first sample. But it might be stored in any sample, and sample zero might not be written at all. Maybe I'm just way off, though.

I'm just trying to keep everything more readable than perfect. Which branching are you referring to - the "if" or the "for" loop?
[...]
To improve the branching performance you can write your if statement like this instead:
//Inject this into the shader code before compiling the shader
#define SAMPLES 4

...

frag0_r = linear0(texelFetch(aa_tex, coord, 0).r);

if(texelFetch(aamask, coord, 0).r != 0.0 ){
    for(int i = 1; i < SAMPLES; i++){
        frag0_r += linear0(texelFetch(aa_tex, coord, i).r);
    }
    frag0_r /= SAMPLES;
}


[...]

In most cases (a tile has both MSAA and non-MSAA pixels) you're effectively waiting for it to sample 5 samples. With my modified version you're instead doing this:

sample 1 sample
if(at least one pixel requires MSAA){
    those pixels sample the remaining 3 samples, the rest runs no-ops
}

Aye. I changed it to:
frag0_r = linear0(texelFetch(aa_tex, coord, 0).r);
   
if ( texelFetch(aamask, coord, 0).r != 0.0 )
{
  for(int i = samples; i-- != 1;) frag0_r += linear0(texelFetch(aa_tex, coord, i).r);
  frag0_r /= samples ;
}


and even

int numsamples = max(1,int(sign(texelFetch(aamask, coord, 0).r) * samples));
       
for(int i   = 0; i < numsamples; i++) frag0_r += linear0(texelFetch(aa_tex, coord, i).r);
frag0_r /= numsamples ;


but all versions run at about the same speed on my GPU. Your suggestion makes the most sense, though.

First you should create a (non-MSAA) renderbuffer and attach it to an FBO. I strongly recommend using GL_DEPTH24_STENCIL8 instead of GL_STENCIL_INDEX8, as the latter is only guaranteed to be supported in GL4.3 and later. You'd attach this renderbuffer as a GL_DEPTH_STENCIL_ATTACHMENT to the FBO. There is no need to disable color writes or depth writes, as we have no color attachments and we won't enable the depth test in the first place.

To generate a stencil mask, you'd first clear the stencil buffer to all 0. Then you'd render a fullscreen quad to the FBO with the stencil test enabled like this:
glEnable(GL_STENCIL_TEST); //Also enables stencil writes
glStencilFunc(GL_ALWAYS, 1, 0xFF); //Always succeed, ref = 1, modify all bits (not necessary, but standard)
glStencilOp(GL_REPLACE, GL_REPLACE, GL_REPLACE); //Stencil test cannot fail, depth test cannot fail, if both succeed replace stencil value at pixel with ref (=1)

The problem is that this code will mark everything as an edge, since the fullscreen quad covers all pixels. The solution is to create a custom shader which checks the MSAA samples of each pixel and discards the fragment if MSAA isn't necessary; discard; will prevent the stencil write.

So we have our stencil mask! We then attach the same stencil renderbuffer to the postprocessing FBO so it can be used by the stencil test when doing postprocessing. We set the stencil func to GL_EQUAL with ref 0, and set glStencilOp to GL_KEEP for all cases so we don't modify the stencil values. This means that only the pixels with stencil=0 get processed. Similarly, you then change ref to 1 and only stencil=1 (= needs MSAA) gets processed.

Cheers. I did understand you then!

I've added an extra FBO fullscreen-quad pass to reduce the edges. So far it turns something like this:

into:

This is still using just a single-channel texture; I will change it to stencil. Thanks for the heads-up.

I'm starting to question the gains from doing this, though. I'm coming from deferred shading, where that stencil mask would be used for lighting as well as postprocessing, so the cost of generating the mask (which requires sampling the depth and normals of all samples) is offset by avoiding a lot of expensive lighting. From what I can see, your postprocessing shader is so cheap that it's actually surprising you're getting any performance improvement at all, considering you're increasing the bandwidth required during scene rendering with the additional render target and also resolving that extra render target. Unless the cost of generating the mask is less than the work it saves, it's better to just brute-force it. I have a feeling that the main reason you're seeing a performance increase is that a large part of the scene is sky. In a more realistic scene I think your current method could actually be slower than simply brute-forcing it: if a larger number of pixels require MSAA, you'd essentially be brute-forcing it anyway, but also computing and resolving the mask.

I do clear the AA mask every frame with 0.0, so looking into the sky is as good as no samples.

Yet you're totally right: right now, with just a depth-linearise pass, I'm not getting more bang.

Using the very simple edge detection speeds it up by 120 to 160%, though. The gain is higher at higher resolutions. That's a win already.

Running the normal+depth test over the edges drops performance by ~5%, so clearly calculating such a mask just for the linear depth is a bad idea.

It makes sense in the long run, though. I'd like to replace all standard blitting with this technique, move depth-darkening to proper per-sample computation, add depth of field, and finally do the tonemapping before the AA resolve. With the stencil trick I think that will work out pretty well.
basil_

« Reply #7 - Posted 2014-08-19 19:45:33 »

There you go:

https://software.intel.com/sites/default/files/m/d/4/1/d/8/lauritzen_deferred_shading_siggraph_2010.pdf
Spasi
« Reply #8 - Posted 2014-08-19 21:46:59 »

You might want to have a look at the latest AA techniques presented at Siggraph 2014:

High Quality Temporal Supersampling (there's a link to the presentation in "About")
Hybrid Reconstruction AA

Lots of info on the technical difficulties and various trade-offs in there. Temporal AA easily beats anything you can do in post, so current research focuses on how to make it work better and combine it with other techniques. I believe it's a must, especially for deferred rendering.
basil_

« Reply #9 - Posted 2014-08-24 23:32:47 »

Full-length clip of that UE4 demo: https://www.youtube.com/watch?v=kr2oHPSJ0m8

Just a random update:

With stencil masking I end up with something like this ..

rendering:
...

#define out_color 0
#define out_normal 1
#define out_aa_mask 2
...
in vec4 color;
in vec3 pos;     // eye space
in vec3 normal;  // eye space

layout(location = out_color) out vec4 frag0;
layout(location = out_normal) out vec4 frag1;
layout(location = out_aa_mask) out float frag2;

void main()
{
  vec4 c = color;
  vec3 _normal  = normalize(normal);
  [...]
  frag0 = c;
  frag1 = vec4(_normal,-pos.z);
  frag2 = fract(gl_FragCoord.xy) != vec2(0.5) ? 1.0 : 0.0;
}


converting into the stencil map:
#version 150

uniform sampler2D aamask;  // blitted frag2 output
uniform sampler2DMS aanormaldistance; // ms frag1

#define samples 8
uniform vec2 dim;

in  vec2  st0;

#define edgeThreshold 0.9

void main(void)
{
  ivec2 coords = ivec2(st0 * dim);
 
  if(texelFetch(aamask,coords,0).r != 0.0)
  {
    vec4 normaldistance = texelFetch(aanormaldistance,coords,0);
    vec3  n   = normaldistance.xyz;    // normal
    float d   = normaldistance.w;      // distance
    float dth = max(1.0,d*d) * 0.0005; // magic number
    for(int i = samples; i-- != 1; )
    {
      vec4 s = texelFetch(aanormaldistance,coords,i);  // sample
      float nt = dot(n,s.xyz);
      if(nt == 0.0 || nt < edgeThreshold) return; // test normal
      if( abs(d - s.w) > dth ) return;            // test depth
    }
  }
 
  discard; // 0.0
}


using the map in the 2-pass stencil test:
#version 400

subroutine void depthsourceType ( in ivec2 coords );

subroutine uniform depthsourceType depthsource;

out float frag0_r;
in  vec2 st0;

uniform sampler2D   tex;
uniform sampler2DMS aa_tex;

uniform vec2        dim;
#define samples 8

uniform float znear;
uniform float zfar;
uniform float zrange;

float linear0(in float depth) { return (znear * zfar) / (zfar - depth * zrange); }  

subroutine (depthsourceType) void nonaa(in ivec2 coords) { frag0_r = linear0(texelFetch(tex,coords,0).r); }
subroutine (depthsourceType) void aafirst(in ivec2 coords) { frag0_r = linear0(texelFetch(aa_tex,coords,0).r); }
subroutine (depthsourceType) void aasum(in ivec2 coords)
{
  frag0_r = 0.0; // out variables are undefined until written, so initialise before accumulating
  for(int i = samples; i-- != 0;) frag0_r += linear0(texelFetch(aa_tex, coords, i).r);
  frag0_r /= samples;
}

void main()
{
  depthsource(ivec2(dim * st0));
}
This made a difference on slower hardware.
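(For completeness, a sketch of how the subroutine can be selected per pass from the application side - assuming LWJGL 3-style GL40 bindings; the program handle is a placeholder:)

// query the subroutine indices once after linking
int idxNonAA = GL40.glGetSubroutineIndex(program, GL20.GL_FRAGMENT_SHADER, "nonaa");
int idxAASum = GL40.glGetSubroutineIndex(program, GL20.GL_FRAGMENT_SHADER, "aasum");

// before each stencil pass, select the matching depth source
IntBuffer sub = BufferUtils.createIntBuffer(1);
sub.put(0, idxNonAA); // or idxAASum for the per-sample pass
GL40.glUniformSubroutinesui(GL20.GL_FRAGMENT_SHADER, sub);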

Now, with all shaders playing along and writing normal and linear depth, the linearise-depth example becomes obsolete.

Some test renderings:




o/
basil_

« Reply #10 - Posted 2014-08-27 22:18:46 »

Tone mapping applied after the resolve:

applied before the resolve:

basically gamma-corrected MSAA.
basil_

« Reply #11 - Posted 2014-10-08 21:42:57 »

Not sure if I should start a new topic, but I'm continuing down this path of MSAA resolving and I've hit another wall.

Using the mentioned edge detection, the AA resolve map works super fine ..
float edge = fract(gl_FragCoord.xy) != vec2(0.5) ? 1.0 : 0.0;
.. feeding this into a stencil buffer to fetch subpixels and "single" pixels efficiently in two passes.

This new issue I really cannot get my head around. It makes sense, then it doesn't.

Here's a picture which does not use the map but gathers all samples brute-force. This is basically what I'd like to achieve:

nice smooth pixels, using 8x MSAA at 25% resolution.

Now with the map: the triangles in this sample are overlapping, which causes the edge detection to fail:
The pink-marked edges are overlapping triangles which are not detected and render jaggy; the yellow-marked edges are fine.

- I don't really understand why fract(gl_FragCoord.xy) != vec2(0.5) fails for the whole rendering. I mean, for the first drawn triangle it's clearly not an edge. But for the second drawn triangle, which intersects, shouldn't the fragment shader catch at least that? Then again, that makes no sense either - it is not a triangle edge in the end.

- Is there a neat way to detect this case? I mean, without sampling neighbour pixels, which would eat much GPU time.

- Should I just preprocess the mesh, split triangles along edges where they intersect other triangles, and forget about this issue?

- What am I missing?

- Did you guys run into this too? theagentd?

o/

PS: here are some of the UE4 demos we can run - and then check how much a new GPU would cost:
http://www.techpowerup.com/downloads/Games/Demos/
http://www.neogaf.com/forum/showthread.php?t=809236
http://www.extremetech.com/gaming/181608-download-and-run-the-unreal-engine-4-elemental-benchmark-demo-on-your-pc
theagentd
« Reply #12 - Posted 2014-10-09 00:34:17 »

It's because it's not a logical triangle edge. The samples there simply didn't pass the depth test; gl_FragCoord is still centered. This is another reason why you should analyze the image instead of depending on geometry edge detection.

EDIT:

To expand on this: according to the OpenGL spec, the depth test logically happens after the fragment shader is run. However, all GPUs try to do the depth test before the fragment shader to avoid shading pixels that just end up failing the depth test anyway. They're only allowed to do this when the result is exactly identical to running the depth test after the shader, which means that for alpha-tested geometry or shaders that contain a discard; command, the depth test is done after the fragment shader. Anyway, the point here is that fract(gl_FragCoord.xy) (which uses centroid interpolation to center the interpolation on the part of the pixel that was covered) will always produce 0.5 here, as the fragment shader has no information about the result of the pending depth test (as I said above, technically it does, but it's not allowed access to this information). In short, the depth and stencil tests logically run after the fragment shader, so the pixel is always treated as 100% covered by them. Centroid sampling only takes the triangle edge into consideration.

This is another reason why you should analyze the image instead of depending on geometry edge detection. If this is not an option, there is a very new extension called EXT_post_depth_coverage that (together with forced early depth tests) gives the shader access to the post-depth-test coverage mask. This extension was released less than a month ago and is only supported by Nvidia so far (as far as I know). Hardware support is expected on all OGL4-capable GPUs, but I doubt Intel will implement it in their drivers.

basil_

« Reply #13 - Posted 2014-10-09 09:48:17 »

Thanks for the heads-up on the depth test order. That extension looks very promising.

After thinking about your suggestion, I went a few steps back and realised ..

maybe writing the edge-detection mask in the forward render pass is obsolete altogether, since I still have to test for subpixel similarities/discontinuities anyway (for instance to handle quad-crossing edges). So the approach ..

- forward rendering: write edges, normal and linear depth
- fullscreen pass: process edges, reducing the map; test for discontinuity in normal+depth only on edges; write into the stencil buffer
- use this stencil buffer to resolve samples

.. is wrong - too much and too early optimisation on my mind!

Processing the whole screen, testing all pixels for discontinuities regardless of the edge-detection output, is slower - but not by much, and it solves the whole issue. (.. and it was just one line in a fragment shader to delete.)

It's still not based on neighbour pixels, just testing discontinuities within subpixels, which is great. And now that the edge rendertarget is no longer required, it should speed things up again a bit.

Thanks agentd, sometimes things get much simpler when rethinking.
theagentd
« Reply #14 - Posted 2014-10-09 16:09:47 »

It's still not based on neighbour pixels, just testing discontinuities within subpixels, which is great.
You don't have to test neighboring pixels. Only test whether the subpixels within that pixel differ too much.
