Hi !
Featured games (90)
games approved by the League of Dukes
Games in Showcase (739)
Games in Android Showcase (224)
games submitted by our members
Games in WIP (820)
games currently in development
News: Read the Java Gaming Resources, or peek at the official Java tutorials
   Home   Help   Search   Login   Register   
  Show Posts
Pages: [1] 2 3 ... 123
1  Discussions / Miscellaneous Topics / Re: What I did today on: 2017-10-21 23:28:53
Life's been tough on me the last few weeks, especially the last few days, so I decided to do some extra fun coding this weekend.

3-4 years ago I made some threads about an extreme scale space game with realistic Newtonian physics. The game would require simulating a huge number of objects affected by gravity, with extreme speed collision detection. I am talking 100k+ ships, each orbiting a planet at 10km/second, with accurate collision detection. The technical challenges are enormous. After some spitballing here on JGO, I ended up implementing a test program using fixed-precision values (64-bit longs) to represent positions and velocities to get a constant amount of precision regardless of distance from origin. Simple circle-based collision detection was handled by sorting the ships along the X-axis, then checking collisions only for ships that overlap along the X-axis. The whole thing was completely multi-threaded, and I even tried out Riven's mapped struct library to help with cache locality. Even sorting was multithreaded using my home-made parallel insertion sort algorithm, tailor-made for almost-sorted data sets (the order along the X-axis did not change very quickly). It scaled well with more cores, but was still very heavy for my poor i7.

I realized that the only way to get decent performance for this problem on a server would be to run the physics simulation on the GPU. With a magnitude higher performance and bandwidth, the GPU should be able to easily beat this result as long as the right algorithms are used. The physics simulation is easy enough as it's an embarrassingly parallel problem and fits perfectly for the GPU. The collision detection (sorting + neighbor check) is a whole different game. GPU sorting is NOT a fun topic, at least if you ask me. The go-to algorithm for this is a parallel GPU radix sort, but with 64-bit keys that's very expensive. Just like my parallel insertion sort algorithm took advantage of the almost-sorted nature of the sorting, I needed something like that that could run on the GPU as well. That's when I stumbled upon a simple GPU selection sort algorithm.

The idea is simple. For each element, loop over the entire array of elements to sort. Calculate how many elements that should be in front of this element. You now know the new index of your element, so move it directly to that index. Obviously, this is O(n^2), so it doesn't scale too well. However, the raw power of the GPU can compensate for that to some extent. 45*1024 = 46 080 elements can be sorted in ~60FPS, regardless of how sorted the array is. By using shared memory as a cache, performance almost triples to 160 FPS, allowing me to sort 80*1024 = 81 920 elements at 60 FPS. Still not fast enough. Anything above 200k elements runs a big risk of causing the driver to time out and restart...

Enter block-based selection sort for almost sorted data-sets! The idea is to split the list up into blocks of 256 elements, then calculate the bounds of the values of each block. This allows us to skip entire blocks of 256 values if the block doesn't intersect with the current block we're processing. Most likely, only the blocks in the immediate vicinity of each block needs to be taken into consideration when sorting, while the rest of the blocks can be skimmed over. Obviously, this makes the data dependent, and the worst case is still the same as vanilla GPU selection sort if all blocks intersect with each other (which is essentially guaranteed for a list of completely random values). However, for almost sorted data sets this is magnitudes faster!

To simulate an almost sorted data-set, an array is filled with elements like this:
for(int i = 0; i < NUM_KEYS; i++){
   data.putLong(i*8, i + r.nextInt(1000));

This gives us an almost sorted array with quite a lot of elements with the exact same value, to test the robustness of the sort. The block-based selection sort algorithm is able to sort a 2048*1024 = 2 097 152 element list... at 75 FPS, way over the target of 100 000. It's time to implement a real physics simulation based on this!

Let's define the test scenario. 1024*1024 = 1 048 576 ships are in perfect circular orbits around the earth. The orbit heights range from low earth orbit (International Space Station height) to geosynchronous orbit. Approximately half of the ships are orbiting clockwise, the other half counterclockwise. The size of the earth, the mass, the gravity calculations, etc are physically accurate and based on real-life measurements.

Going back to my original threaded CPU implementation, it really can't handle one million ships very well. Just the physics simulation of the ships takes 20.43ms, and sorting another 18.75ms. Collision detection then takes another 10.16ms.

The compute shader implementation is a LOT faster. Physics calculations take only 0.27ms, calculating block bounds another 0.1ms and finally sorting takes 2.07ms. I have not yet implemented the final collision detection pass, but I have no reason to expect it to be inefficient on the GPU, so I'm optimistic about the final performance of the GPU implementation.

Each ship is drawn as a point. The color depends on the current index in the list of the ship, so the perfect gradient means that the list is perfectly sorted along the X-axis. 303 FPS, with rendering taking up 0.61ms, 370 FPS without rendering.
2  Discussions / Miscellaneous Topics / Re: What I did today on: 2017-10-16 14:53:13
I got fiber installed a couple of days ago.


3  Java Game APIs & Engines / Engines, Libraries and Tools / Re: SSAO in LibGDX sans Deferred Rendering? on: 2017-09-30 01:12:59
I've been super busy, sorry.

I didn't realize the random vectors essentially filled the same purpose as the random rotations. You can drop the rotation matrix I gave you and just use the random vector texture you had. Please post how you sample from it. I recommend a simple texelFetch() with the coordinates &-ed to keep them in range.
4  Java Game APIs & Engines / Engines, Libraries and Tools / Re: SSAO in LibGDX sans Deferred Rendering? on: 2017-09-27 14:23:37
I think the reason why you're getting wrong results is because you do the matrix multiplication the wrong way around. Remember that matA*matB != matB*matA. However, I've been thinking about this, and I think it's possible to simplify this.

What we really want to do is rotated the samples around the Z-axis. If we look at the raw sample offsets, this just means rotating the XY coordinates separately, leaving the Z intact. Such a rotation matrix should be much easier to construct:
   float angle = rand(texCoords) * PI2;
   float s = sin(angle);
   float c = cos(angle);
   mat3 rotation = mat3(
      c, -s, 0,
      s,  c, 0,
      0, 0, 1
   //We want to do kernelMatrix * (rotation * samplePosition) = (kernelMatrix * rotation) * samplePosition
   mat3 finalRotation = kernelMatrix * rotation;

This should be faster and easier to get right!
5  Java Game APIs & Engines / Engines, Libraries and Tools / Re: SSAO in LibGDX sans Deferred Rendering? on: 2017-09-26 18:29:47
A couple of tips:

 - The code you have is using samples distributed over a half sphere. Your best bet is a modified version of best candidate sampling over a half sphere, which would require some modification of the JOML code to get.

 - I'd ditch the rotation texture if I were you. Just generate a random angle using this snippet that everyone is using, then use that angle to create a rotation matrix around the normal (You can check the JOML source code on how to generate such a rotation matrix that rotates around a vector). You can then premultiply the matrix you already have with this rotation matrix, keeping the code in the sample loop the exact same.

 - To avoid processing the background, enable the depth test, set depth func to GL_LESS and draw your fullscreen SSAO quad at depth = 1.0. It is MUCH more efficient to cull pixels with the depth test than an if-statement in the shader. With an if-statement, the fragment shader has to be run for every single pixel, and if just one pixel in a workgroup enters the if-statement the entire workgroup has to run it. By using the depth test, the GPU can avoid running the fragment shader completely for pixels that the test fails for, and patch together full workgroups from the pixels that do pass the depth test. This massively improves the culling performance.

 - You can use smoothstep() to get a smoother depth range test of each sample at a rather small cost.

 - It seems like you're storing your normals in a GL_RGB8 texture, which means that you have to transform it from (0.0 - 1.0) to (-1.0 - +1.0). I recommend using a GL_RGB8_SNORM which can stores each value as a normalized signed byte, allowing you to write out the normal in the -1.0 to +1.0 range and sample it like that too. Not a huge deal of course, but gives you better precision and a little bit better performance.
6  Java Game APIs & Engines / Engines, Libraries and Tools / Re: SSAO in LibGDX sans Deferred Rendering? on: 2017-09-26 04:05:01
I'm not sure what your "kernel" is. Are those the sample locations for your SSAO? I'd recommend precomputing some good sample positions instead of randomly generating them, as you're gonna get clusters and inefficiencies from a purely random distribution. JOML has some sample generation classes in the org.joml.sampling package that may or may not be of use to you.

It doesn't look like you're using your noise texture correctly. A simple way of randomly rotating the samples is to place random normalized 3D vectors in your noise texture, then reflect() each sample against that vector. I'm not sure how you're using your random texture right now, but it doesn't look right at all. If you let me take a look at your GLSL code for that, I can help you fix it.
7  Java Game APIs & Engines / Engines, Libraries and Tools / Re: SSAO in LibGDX sans Deferred Rendering? on: 2017-09-25 14:11:52
To fix the SSAO going too far up along the cube's edges, you need to reduce the depth threshold.

I can also see some banding in your SSAO. If you randomly rotate the sample locations per pixel, you can trade that banding for noise instead, which is much less jarring to the human eye.
8  Java Game APIs & Engines / Engines, Libraries and Tools / Re: SSAO in LibGDX sans Deferred Rendering? on: 2017-09-24 18:18:23
Renderbuffers are a bit of a legacy feature. They are meant for exposing formats the GPU can render to but can't be read in a shader (read: multisampled stuff). The thing is that multisampled textures are supported by all OGL3 GPUs so they no longer fill any real purpose anymore. If you do the FBO setup yourself, you can attach a GL_DEPTH_COMPONENT24 texture as depth attachment and read it in a shader.
9  Java Game APIs & Engines / Engines, Libraries and Tools / Re: SSAO in LibGDX sans Deferred Rendering? on: 2017-09-24 13:14:04
Is LibGDX using a renderbuffer?
10  Java Game APIs & Engines / Engines, Libraries and Tools / Re: SSAO in LibGDX sans Deferred Rendering? on: 2017-09-23 14:07:54
You do not need to store depth in a color texture. You can simply bind the depth texture you use as depth buffer and bind that as any other texture. The depth value between 0.0 and 1.0 is returned in the first color channel (red channel) when you sample the texture with texture() or texelFetch().
11  Java Game APIs & Engines / Engines, Libraries and Tools / Re: SSAO in LibGDX sans Deferred Rendering? on: 2017-09-23 12:59:11
The traditional purpose of doing a depth pre-pass is to avoid shading pixels twice. By rendering the depth first, the actual shading can be done with GL_EQUAL depth testing, meaning each pixel is only shaded once. The depth pre-pass also rasterize at twice the speed as GPUs have optimized depth-only rendering for shadow maps, so by adding a cheap pre-pass you can eliminate overdraw in the shading.

To also output normals, you need to have a color buffer during the depth pre-pass, meaning you'll lose the double speed rasterization, but that shouldn't be a huge deal. You can store normal XYZ in the color, while depth can be read from the depth buffer itself and doesn't need to be explicitly stored.

If you have a lot of vertices, rendering the scene twice can be very expensive. In that case, it's possible for you to do semi-deferred rendering where you do lighting as you currently do but also output the data you need to do SSAO afterwards. This require using an FBO with multiple render targets, but it's not that complicated. The optimal strategy depends on the scene you're trying to render.
12  Java Game APIs & Engines / Engines, Libraries and Tools / Re: SSAO in LibGDX sans Deferred Rendering? on: 2017-09-23 07:33:36
Traditional SSAO doesn't require anything but a depth buffer. However, normals help quite a bit in improving quality/performance. You should be able to output normals from your forward pass into a second render target. It is also possible to reconstruct normals by analyzing the depth buffer, but this can be inaccurate if you got lots of depth discontinuities (like foliage).

EDIT: Technically, SSAO is <ambient> occlusion, meaning it should only be applied to the ambient term of the lighting equation. The only way to get "correct" SSAO is therefore to do a depth prepass (preferably output normal too), compute SSAO, then render the scene again with GL_EQUAL depth testing while reading SSAO from the current pixel. If you already do a depth prepass, this should essentially be free. If not, maybe you should! It could improve your performance.
13  Game Development / Game Play & Game Design / Re: Graphics Backend Abstraction on: 2017-09-06 11:57:31
<--- You can't contain this! =P
14  Java Game APIs & Engines / Java 2D / Re: Writing Java2D BufferedImage Data to Opengl Texture Buffer on: 2017-08-21 02:03:28
Am I insane? -> yes. But is it really crazier then drawing 2D stuff in OpenGL?
Yeah, it is. OpenGL is used for 2D stuff all the time. GPUs can really only draw 2D stuff in the first place; you project your 3D triangles to 2D triangles, so it makes a lot of sense to use the GPU for accelerating 2D stuff as well.
15  Java Game APIs & Engines / OpenGL Development / Re: [LWJGL] [JOML] Memory Behavior on: 2017-08-12 13:32:40
You can release the memory allocated on the native heap when you no longer need a direct NIO buffer by calling the cleaner yourself, I assume that it's what theagentd meant in his second suggestion.
Nope, I'm not a big fan of the Cleaner "hack". You'll generally get much better performance by managing memory yourself with malloc/free. In LWJGL, that can be done with MemoryUtil.memAlloc() and MemoryUtil.memFree(), which under the hood uses the best library for the job (usually Jemalloc).
16  Java Game APIs & Engines / OpenGL Development / Re: [LWJGL] [JOML] Memory Behavior on: 2017-08-10 01:30:55
This comes from you allocating a lot of native memory buffers using NIO, possibly using BufferUtils.create***Buffer() calls. All those objects are related to managing native memory, which lies outside the Java memory heap. You should:
 - try to reuse native memory buffers
 - manage native memory yourself instead to avoid GC overhead.
17  Java Game APIs & Engines / Engines, Libraries and Tools / Re: LibGDX How do I get the depth buffer from FrameBuffer? on: 2017-08-05 19:47:02
Ok, I will try that, but first I want to clarify something:
If my clipping range is between 1.0f and 1000.0f, then the normalised depth value from the shader (0.5 in this case) means 500f?
Anyway, thanks for help.
No, that is not the case. The depth value you get is calculated in a specific way to give more precision closer to the camera. This means that 0.5 will refer to something very close to the far plane, something like 3.0 (very rough estimate). How this is calculated depends on your projection matrix (the clipping range you mentioned). It is possible to "linearize" the depth value if you know the far and near planes.
18  Java Game APIs & Engines / Engines, Libraries and Tools / Re: LibGDX How do I get the depth buffer from FrameBuffer? on: 2017-08-05 14:32:43
the framebuffer should be configured with a depth-attachment attached to it, which is a texture after all.

internal-format should be GL_DEPTH_STENCIL or GL_DEPTH_COMPONENT and format should go like GL_DEPTH24_STENCIL8 or GL_DEPTH_COMPONENT24.

now when you render the color-attachment in a fullscreen-quad/triangle pass, just throw in the depth-texture like you do with any other texture.

next, sampling the depth-texture - i cannot rember the "proper" way.
You got the internal format and format switched around. Internal format (the format the GPU stores the data on the GPU in) should be GL_DEPTH_COMPONENT16, GL_DEPTH_COMPONENT24 or GL_DEPTH_COMPONENT32F, or if you also need a stencil buffer, GL_DEPTH24_STENCIL8 or GL_DEPTH32F_STENCIL8. Format (the format of the data you pass in to initialize the texture in glTexImage2D()) shouldn't really matter as you're probably just passing in null there and clearing the depth buffer for the first use anyway, but you need to pass in a valid enum even if you pass null, so GL_DEPTH_COMPONENT is a good choice.

To give you a list of steps:

1. Create a depth texture with one of the internal formats listed above.
2. Attach the depth texture to the FBO using glFramebufferTexture2D() to attachment GL_DEPTH_ATTACHMENT or GL_DEPTH_STENCIL_ATTACHMENT if your depth texture also has stencil bits.
3. Render to the FBO with GL_DEPTH_TEST enabled. If it's not enabled, the depth buffer will be completely ignored (neither read nor written to). If you don't need the depth "test" and just want to write depth to the depth buffer, you still need to enable GL_DEPTH_TEST and set glDepthFunc() to GL_ALWAYS.
4. Once you're done rendering to your FBO, you bind the texture to a texture unit and add a uniform sampler2D to your shader for the extra depth buffer.
5. Sample the depth texture like a color texture in your shader. A depth texture is treated as a single-component texture, meaning that the depth value is returned in the red channel (float depthValue = texture(myDepthSampler, texCoords).r;). The depth is returned as a normalized float value from 0.0 to 1.0, where 0.0.
19  Discussions / Miscellaneous Topics / Re: What I did today on: 2017-07-25 17:30:09
TL;DR: Compute shaders are frigging awesome!

Today I ported my blur shader to a compute shader. It does everything exactly the same, but it's significantly faster!

My blur shader is optimized using a kind of tile classification system. For each 16x16 tile on the screen, I calculate:
 - the dominant motion vector (DMV) direction and length of the tile
 - the highest circle of confusion (CoC = depth of field blur radius) of the tile
 - some additional data.
These tiles are then dilated based on their "reach". The point is to allow objects to blur over their borders by making sure the neighboring tiles are doing the blur calculations as well. Then, based on the dilated DMV length and max CoC of the tile, I check if the tile can be optimized. A sample count is calculated based on the blur area, so that for smaller blurs I use a lower sample count. In addition, I pick a "path" for the blur shader per tile as I've mentioned before, but here's a recap:
 - If the max CoC and DMV length is less than 0.5, the tile is completely sharp, so the blur shader can early out.
 - If the CoC and the DMV length doesn't vary much in the tile, a fast path is used instead as we don't need any fancy blending just to get an even blur for that.
 - Otherwise, the slow path is used as the tile contains complex geometry that needs to be blurred carefully.

The reason for doing this classification per tile is that it's cheaper to do, and that shaders are executed in large groups anyway. Hence, if pixel 1 in a group picks the fast path, pixel 2 picks the slow path and the rest want to early out, the entire shader group will need to run all 3 paths for all pixels as they all have to run in lockstep. This is actually slower than just running the slow path for all pixels. Hence, the idea is that by using 16x16 tiles, entire groups of pixels should be able to run only one path. To test this out, I created a specific scenario: The entire scene will be placed in focus with no motion (meaning the shader can early out immediately), but tiles in a checkerboard pattern will be forced to the slow path (in other words, have optimizations disabled). If the way the GPU groups up pixels into tiles is not EXACTLY aligned with the 16x16 tiles, it would need to execute the slow path for the entire screen, and the performance will reflect that. The key here is that compute shaders allow me to manually define the work group size, so I can force the workgroup size to perfectly align with the tiles. Let's see how this performs:

 - Fragment shader, all tiles using slow path: 3.94 ms
 - Fragment shader, checkerboard early out/slow: 3.94 ms (ouch!)
 - Compute shader, all tiles using slow path: 3.94 ms
 - Compute shader, checkerboard early out/slow: 2.07 ms (yes!)

In other words, the 16x16 tiles did NOT line up with the way the GPU placed pixels into groups, but with a compute shader they obviously do! This should give a nice (but obviously smaller) performance boost in real world scenarios!

In addition, I made some surprising findings while just playing around with it! Here's the performance of the compute shader compared to the fragment shader version when the entire screen is completely out of focus (max blur radius for every single pixel):
 - Fast path: 50% faster (15.0 ms ---> 10.0 ms)
 - Slow path: 17% faster (22.5 ms ---> 19.2 ms)

In other words, for some reason the raw performance of the shader is significantly better. This doesn't seem to be due to better ALU performance, but rather better texture sampling performance. There's no incoherent branching going on here either; every single pixel/invocation is executing the same thing (either the fast or the slow path). I guess since compute shaders bypass a lot of hardware in the GPU compared to a fullscreen triangle, maybe using a compute shader simply freed up cache space or changed the way the pixels are grouped together. I'm afraid I don't have a very good explanation for this, but damn is it awesome! =P

For a scene with mixed {early out/fast path/slow path} with dynamic sample count based on blur area (in other words, a typical scene in a normal game with all optimizations on):
 - 74% faster (7.5ms ---> 4.3 ms)

This is generating the exact same image with the same code, just executed as a compute shader instead of a fragment shader. The only difference is that I use gl_FragCoord.xy in the fragment shader version, which I need to calculate manually in the compute shader as (vec2(gl_GlobalInvocationID.xy) + 0.5). Also, this is BEFORE adding the packing of the data before. I expect that optimization to almost double the performance of both the fast and the slow path in the worst case scenario above (fast path 10 --> 5ms, slow path 20 --> 10ms), for a tiny performance cost in other scenarios.

EDIT: Here's a debug image showing the tile classification. Red = slow path, green = fast path, blue = early out. In addition, the color intensity represents the sample count; bright = more samples (up to 128).
20  Discussions / Miscellaneous Topics / Re: What I did today on: 2017-07-22 20:44:29
Stumbled upon some interesting results while optimizing the depth of field/motion blur shader.

Basically, there are three relevant things that can cause a bottleneck in the blur shader:

1. Too many ALU instructions, AKA too much mathz!!!11
2. Too many texture instructions. Each instruction, regardless of what it actually fetches, has a minimum cost.
3. Bad texture sample spatial cache locality. This is worsened if you're sampling a format which takes up a lot of memory in the texture cache.

The blur shader has three different paths:
a. The most expensive path is needed when there is a big difference between the circles-of-confusion or the motion vectors inside a tile. This path needs two texture samples per sample (1 for color, 1 for depth+CoC+motion vector length), and has quite a lot of math.
b. If the blur is uniform for the entire tile (and neighboring tiles), there's no point in doing all the fancy depth sorting and layer blending, so in that case I can just run a simple blur on the color on the scene directly. This version just reads the color texture and requires less math.
c. If I tile is completely in focus and has no major motion, the shader early outs as no blurring is needed. The shader is obviously incredibly fast if this path is used, so for the tests I will be disabling this path.

To rule out texture cache problems and get a baseline performance read, I ran path A and path B with zero CoCs and zero motion vectors. This causes all the texture samples to end up at the exact same place, so the texture cache should be able to do its work perfectly. Testing this, I got around 4.5ms for the expensive path and 2.2 ms for the cheaper path. Sadly, it's hard to tell what the bottleneck is alone from this data. The fact that performance is half as fast could mean that the bottleneck is either the sheer number of texture samples (as performance doubled for half as many samples), or it could just mean that the cheaper path has half as many math instructions. Next, I tested the shader with a 16 pixel radius CoC for every pixel on the screen, meaning that the samples are distributed over a 33x33 pixel area (over 1000 possible pixels). This caused the fast path to increase from 2.2 ms to 8.2. Even worse, the expensive path went from 4.5 ms all the way up to 21.5! Ouch!

As the blur size increases, the samples get less and less spatially coherent. However, there is no performance loss whatsoever as long as the samples fit in the texture cache. After that point, performance quickly gets a lot worse very quickly. That threshold also depends on the size of the texture format we're sampling, as a bigger format takes up more space in the texture cache ---> fewer texels can fit in it ---> the cache gets "shorter memory". In my case, both the color texture and the depth+CoC+motion vector length textures are GL_RGBA16F textures, meaning they should be taking up 8 bytes per sample each. As we can see, the fast path suffered a lot less from the worse cache coherency, only taking around 3.75x more time, but the slow path took around 4.75x as much time! This is because the fast path only needs to sample the color, and hence has less data competing for space in the cache.

Here's where it gets interesting: For a first test, I tried changing the texture format of the color texture from GL_RGBA16F (8 bits per texel) to GL_R11F_G11F_B10F (4 bits per texel). I expected this to possibly even double the performance of the fast path as it'd halve the size of each texel, but... Absolutely nothing happen. It performed exactly the same as GL_RGBA16F. The explanation for this is that the texture cache always stores data "uncompressed". For example, for DXT1-compressed textures, the texture colors of each 4x4 block are compressed to 2-bit indices used to interpolate between two 5-6-5-bit colors. The result of this interpolation does not result in exact 8-bit values, but the GPU will round the values to the closest 8-bit values and store the result in the texture cache with 8 bits of precision per color channel. On the other hand, RGTC-compressed textures work similarly to DXT1-compressed textures, but instead use 3-bit indices and 8-bit colors. This gives RGTC textures up to ~10-11 bits of precision in practice, and uncompressed RGTC is actually stored at 16-bit precision in the texture cache to allow you to take advantage of this. Hence, it makes sense that the GPU cannot store 11/10 bit floats in the texture cache directly, and instead has to store them as 16-bit floats instead. In addition, the texture cache can't handle texels with 6 bytes, so they're then padded to 8 bytes, giving us the same cache footprint as GL_RGBA16F! The bandwidth used isn't the issue here; we're simply suffering from cache misses, and the latency for fetching that data is killing us, regardless of how much data is being fetched! Testing with GL_RG16F and GL_R32F instead, performance of the fast path improved from 8.2 ms to 6.0! The slow path only improved from 21.5 to 19.4 ms, as the other texture is still GL_RGBA16F.

So, to improve performance I really want to reduce the size of each texel. Since GL_R11F_G11F_B10F is out of the question for the color texture, I'll need to do some manual packing. Luckily, I don't need any filtering for these textures, so sacrificing filterability is not a problem. My current plan is to emulate something similar to GL_RGB9_E5 (as that format can't be rendered to) by storing the color in RGB, and an exponent in the alpha channel. This should allow me to store a HDR color with just 32 bits of data. The depth+CoC+MV texture is simpler; just store depth as a 16-bit float, and the CoC and motion vector length values can be stored at 8 bits precision each no problem. Since I'll need to do manual packing using bit manipulation, storing these values in 32-bit uint textures is the easiest. This means that I can easily pack all this data into a single GL_RG32UI texture. This will halve the number of texture samples AND halve the texel size for the slow path, but it'd also force the fast path to fetch the extra blur data it doesn't need. Hence, I will probably output just the color to a second GL_R32UI texture that the fast path can use to halve the size of each texel for that one too. I already do a fullscreen packing pass to generate the depth+CoC+MV GL_RGBA16F texture which only takes 0.08ms right now, so adding another texture (increasing bandwidth by 50%) shouldn't cause any significant slowdowns compared to the gains the blur pass will see.

I will also port the blur shader to a compute shader. Currently, in my fragment shader I pick which path to take based on the tile the pixel falls into. This assumes that the fragments are processed in blocks that align up with these tiles. If this isn't the case, this could have severe performance impacts as some pixels may end up having to execute multiple paths. By using a compute shader, I can get ensure that the compute shader work groups align perfectly with the classification tiles, so each work group would only ever need to execute one of the 3 paths.
21  Discussions / Miscellaneous Topics / Re: What I did today on: 2017-07-21 05:56:05
Depth of field + motion blur video! Sorry about the Skype notification sound(s) in the video! >___< The video can be streamed from Drive using the Youtube video player, but I recommend downloading the video for maximum 60 FPS quality.

 - There's some aliasing of sharp (and blurry) objects in the scene. This is because my antialiasing is not compatible with the DoF/motion blur at the moment, but I will probably fix that one way or another.
 - The motion vectors go crazy when I move the camera during slow-motion, as the camera believes it's rotating at an extremely high speed. It's not a problem with the algorithm.
22  Discussions / Miscellaneous Topics / Re: What I did today on: 2017-07-18 02:10:28
Jesus H. Christ, I've been in constant cold sweat / depression over finding out that there was a case which my combined DoF/MB implementation couldn't handle. Lied sleepness at night, almost walked into shit while going to the store. x___X

If the entire scene is out of focus, the result should be approaching a simple circle blur (think box blur, but a circle shape instead), but my triple-classification system didn't approach that for blurry scenes. This resulted in seeing weird outlines of the out-of-focus objects on the screen. I could tweak the range of the different layers, but this essentially resulted in disabling the third layer, which made the original issues come back. The problem was simply inherent to using a third classification, and therefore unfixable. Hence, I went back to the basics again and reimplemented the original depth of field from the paper, with the additions I made to support motion blur. I tried using the minimum depth of each tile again, but just confirmed the original issue I discovered there:

The per-tile depth had to go. There was no way that was ever going to work. Each pixel needed to be classified relative to the current pixel. Simply changing the depth I did the classification against to be the depth of the center pixel, and this just amplified the issue further: objects could no longer blur over their edges at all if the background was sharp. Well, shit. OK, I'll just change the center pixel to be classified as background too. Wait a minute, that actually looks... correct? With just two classification types? Then I enabled motion blur, and was immediately shot down as massive dark strips appeared around the edge of motion blurred objects. Crap. But... how can that even be possible? Motion blur and DoF are essentially the same thing. The only difference is the shape of the blur area and how the blur radius/length is calculated. I banged my head and finally found it: A really stupid bug in the background weight calculation of motion blurred pixels. Arghhhh!!! Shocked

I fixed it, and I'm now back to only two classifications, and it seems to handle every case I expect it to handle. It suffers from the inherent limitations of working in screen space as a postprocessing effect. A blurry object essentially becomes see-through around the edges, which should reveal the background, but that information simply isn't there in the original sharply rendered screen. Hence, the background is reconstructed by taking the colors of nearby pixels and averaging them together. You can actually see that in effect in the image above. You can see how the reconstructed background around the edge of the blurry robot looks good in most cases, but particularly at the horizon as well as in some other minor cases, the reconstruction simply doesn't look correct. There's no way of fixing that without actually feeding more data into the blurring algorith, which would require generating that additional data as well using something like depth peeling (and then lighting and processing all layers). That, or raytracing, and I'm pretty sure neither of those two will happen anytime soon.

In addition, there are some limitations in how the motion blur works. It still suffers from the same limitations as the original algorithm, as it only tries to blur in one direction at a time, the dominant motion vector of each tile. This can cause issues with overlapping motions around edges, which are more apparent the higher the maxMotionBlurDistance and maxDofRadius is. "Luckily", higher distances/radii cause such huge performance hits that this will be limited. =P

I'll be making a video after doing some more testing and reimplementing some of the optimizations I made, but I'm gonna take it easy today and go to bed early (and maybe even get some proper sleep).
23  Game Development / Newbie & Debugging Questions / Re: Game "running out of memory" with 32-bit JRE's on: 2017-07-17 07:20:57
Is there a specific reason why you want to support 32-bit JVMs?
24  Discussions / Miscellaneous Topics / Re: What I did today on: 2017-07-16 20:43:51
Incredible @theagentd. The combination of the blur passes is a really neat concept, I'm definitely going to play around with that (if you don't mind!). As always, amazing work.

@theagentd video or it didn't happen  Tongue Pointing

Seriously. Great stuff. If you come around captioning a vid, would be great and much appreciated!
Thanks! What do you mean by "captioning a video"? A video with explanations...?
25  Discussions / Miscellaneous Topics / Re: What I did today on: 2017-07-16 02:41:44
OK, I feel ready to show this shit off. Spent essentially the entirety of today on it too. =___=

Basically, I've been spending quite a lot of time trying to get a good quality motion blur and depth of field working. We really needed motion blur for WSW, so I tried very hard to implement the (then very new) algorithm described in A Reconstruction Filter for Plausible Motion Blur, an excellent algorithm. We used this algorithm for quite some time with great success, but it wasn't perfect. Some nice improvements and optimizations were floating around on the internet which I implemented as I found them, but there was always one glaring issue of that algorithm: It had a really bad discontinuity at the edges of objects, which can clearly be seen in the bottom picture of Figure 5 in the original paper.

Next Generation Post Processing in Call of Duty: Advanced Warfare was the next big source of information for me. I STRONGLY recommend everyone interested in postprocessing to take a look at that presentation, as it contains a huge amount of really great images, videos, explanations and even shader code. For motion blur, they added a weight renormalization pass to fix the depth discontinuity, making the motion blur much more convincing around the edges of moving objects. However, the really amazing thing in that presentation was their depth of field implementation. It had extremely good quality and worked similar to the motion blur. Basically, it split up the scene into a background and a foreground, defined per tile. It looked really great in the screenshots and the example video they had in the slides, but... I just couldn't get it to look good.

The foreground/background classification didn't seem to work well at all. As the classification was done based on the minimum depth of a tile, a tile with low minimum depth could cause neighboring tiles to classify the entirety of their contents as background. This caused tile-sized snaps and shifts as the scene moved, which was just unacceptable to me. Instead, I tried doing the classification per pixel instead, which solved some problems, but now an out-of-focus foreground object couldn't blur out over other in-focus objects. Murr murr...

It took me a very long time to find a solution, but two days ago I finally did. The problem with the foreground/background system is that the pixel you're looking at needs to go into either the foreground or the background. If it goes into the foreground, things that are in front of the pixel can't blur over it. Similarly, if it's in the background that also causes issues in certain cases. The solution was to add a third class, the "focus", which involves things at around the same depth as the pixel we're looking at. This fixed the problem in all cases!

No DoF:

DoF, focus on head:

DoF, focus further away:

There was just one major issue. The DoF didn't work together with the motion blur algorithm! No matter which one you applied first, as soon as you've blurred the image, the depth buffer becomes useless for classification! This means color bleeding over edges when both motion blur and depth of field are both applied to the same place.

Now, the story above isn't entirely chronological. Some months ago, I theorized if would be possible to combine both DoF and motion blur into a single postprocessing past. They're both blurs; I just need to figure out how to get them to work together. How hard can it be? ...... Don't get me started, it was f**king hard, but two days ago I managed to get it right. DoF's classification system is essentially a more advanced version of what the original motion blur algorithm does, so using it for motion blur is fine. I had a crazy amount of issues with the alpha calculations between the layers, and even more issues with the sampling patterns, but in the end I actually managed to get it working. Behold, the results of my work.

Combined DoF and motion blur.

I'm... exhausted as hell. But it was worth it I guess. Now there's just a massive amount of work left to optimize this. If anyone cares enough to want to implement this themselves, I can write an article on how the DoF+MB works.

26  Discussions / Miscellaneous Topics / Re: What I did today on: 2017-07-15 08:03:32
Spent approximately 20 hours split between today and yesterday working on........ something I'm not going to reveal right now. It's AFAIK something new that hasn't been done before with the quality I'm getting. I'll be writing one of my super long posts on this tomorrow, but right now it's like 10 in the morning and I REALLY need to sleep.

EDIT: Dammit, Archive, you prematurely made me reach 1000 medals! Thanks! xD
27  Game Development / Game Mechanics / The overly complicated tile rendering of Robot Farm on: 2017-07-04 01:14:13
Hello, everyone!

I thought I'd write a small long-ass post about the tile rendering system I've developed for Robot Farm. The tile rendering in RF ended up being quite complex, as we needed to keep memory usage down and performance up, so this post will chronicle some of the improvements that were made over time. Let's dive right into it!

Memory management

Our worlds can be quite huge. Our biggest worlds are up to 7680x5120 (= ~40 million) tiles big, and each tile in the world has a terrain tile and a detail tile. In addition, we also need to store pathfinding data for tiles. This can end up being quite a huge amount of data. In the beginning, simply tried to brute-force it by keeping everything in memory. 7680x5120 x (two ints + pathfinding Node object) = over 300MBs just for the tile data, and another gigabyte for the pathfinding data. NOPE, we can't have that. The first attempt to minimize memory usage was to add chunking to the data. We split up the world up into 32x32 chunks, and only kept a certain number of chunks in memory; the rest were written out to compressed files to the harddrive. This caused a number of issues though. The chunk management suddenly became very complex, and now we needed to involve the harddrive if we needed to load some data. With some unlucky harddrive response times, we could get significant spikes that were very noticeable when travelling quickly. The technique was also pretty hard on the harddrive itself, because we ended up having thousands upon thousands of tiny files for the chunks saved to the harddrive. You could almost feel the lifetime of your harddrive being drained each time you ran the game. >___>

In an attempt to alleviate this, I added "super chunks". Super chunks were a bundle of 4x4 normal chunks that were loaded and unloaded together. This cut the number of files down drastically and helped the harddrive cope with the constant reading and reduced the amount of stuttering we got due to more data being read at a time. However, it also had some very big drawbacks. The stuttering happened less often, but when it happen the freeze was noticeably longer as more data was read at the same time. In addition, the new system didn't handle random access very well. Even if just a single tile was needed, an entire super chunk had to be read from disk. We had several world configurations where we blew through the super chunk budget due to NPCs needing tile data for certain operations all around the world, causing thrashing and literally 0.1 FPS cases. Not a fun time. There was also quite a lot of overhead when you just needed to access a tile. Get the super chunk for that tile, get the chunk for that tile, get the tile. In the end, if we wanted to eliminate stuttering completely, we needed to eliminate the harddrive from the equation altogether, and preferably store chunks individually as well.

This is where I hatched the idea of compressing the chunk tile data to RAM instead of to the harddrive. Simply DEFLATE-ing using Deflater or GZIPOutputStream on the data made it much smaller as the terrain tiles are very continuous and the detail tiles are mostly 0 (= nothing there). So, instead of compressing the data and writing the result to disk, I simply compressed the data to a byte[] and kept the byte[] array in memory. To access the data, I need to run it through Inflater again to get the data back, but with no disk access required, this only took around 0.1ms per chunk! Most of the time, this reduced the size of the tile data by a factor of 10x to 20x, making the tile data memory usage trivial. And with the harddrive access gone, there was no reason to keep the super chunks around anymore, so they could simple be removed.

Pathfinding data was another big problem here, but I actually already have a post for that! If you're interesting in how that was solved, look here:;topic=38207.0. TL;DR: I got rid of the Node object and managed to barely pack all the pathfinding data of a node into the bits of a 32-bit integer. I then added a separate chunking system to it so that int "nodes" were only allocated (and reused) for areas that the pathfinder explored. This meant that less than 1/4th of the world needed pathfinding data at any single time, and cut the memory usage from over 1GB to just 50MB on average.

Tile transitions

In a tile-based game, tile transitions are almost always a must-have. Take a look at these two image from this very popular tile transition tutorial:

The left one just draws the tiles as they are, while the right one has tile transitions added to the rendering. The tile transitions add, well, transitions between different terrain types so that the world doesn't look like it's restricted as much to a grid anymore. However, they're quite complicated to both make and to render. The transitions to add to a tile depends on all the 8 neighbors of the tile. This leads to 2^8=256 possible combinations, but this can be reduced by combining multiple non-overlapping transitions to cover all cases. Still, a large number of transitions need to be drawn by our pixel artists for every single terrain type. Also, as you can see in the image above, the transitions have to be very thin to not completely occlude the neighboring tile. This limits the "effectiveness" and variation of the transitions.

We already had a transition system in place when I started working on Robot Farm, but it had some very serious problems. First of all, it required some very complicated neighbor checks for multiple terrain types that actually were so slow that we had big CPU performance issues from them when rendering to high resolution screens (as they could see more tiles). I added a couple of hacks to improve performance by caching neighbor tiles between checks, and I even added a full-world BitSet with one bit per tile just to track which tiles needed the transition checks. Even so, we had big problems with overlapping tile transitions each other as the transition tiles we already had were often too big to work well with that system. Hence, I started working with our pixel artist to improve the tile rendering. The project had two goals: Minimize the number of transitions needed (and hence the amount of work for our pixel artist) and maximize rendering performance of those transitions.

To reduce the number of transitions needed, we needed to reduce the number of tiles that can affect the look of the transitions. Ignoring the diagonal neighbors seemed like a promising idea, but it fell flat as it could not detect certain cases correctly. Instead, I shifted my focus to a brand new idea... Enter half-offset tile transitions!

The idea is pretty simple. Instead of drawing one tile image per tile in the world, we offset the tile rendering by half a tile as the image shows. Suddenly, the rendered tile is no longer a function of 8 neighbors (plus itself), but just the 2x2=4 tiles that it intersects! That's only 2^8 combination, of which one is nothing (none of the 4 tiles correspond to the terrain type) and one is fully covered, so only 14 transition tiles are needed per terrain type, a big improvement! I haven't been able to come up with another technique that provide the same quality AND simplicity as this technique!

Even better, this technique is perfect for being rendered on the GPU. First of all, the by cleverly ordering the transitions tiles in the tileset, you can calculate exactly which transition to pick using bitwise operations. Each of the four neighbors can be mapped to the lowest four bits of an integer, giving us the tile transition index we need without any complicated special cases, transition merging, etc! On top of that, we have a clear upper bound on the number of overlapping transitions: four, which happens when every single tile is different in a 2x2 area. This is a GREAT feature because it means that we at most only need to store four tile indices for rendering to be able to handle all cases, while the 8-neighbors version would require up to eight tile transitions.


Well, it wouldn't be like me to not tack on an over-engineered rendering system on top of these two systems now, would it? =P

The old tile rendering system of RF was very simple. Each tile was drawn as four vertices forming two triangles using an index buffer. Very simple, straightforward stuff. Although the biggest bottleneck lied in the rendering of the tile transitions, the generation of the vertex data became a significant cost when zoomed out. Now, this wasn't a significant issue in the actual game, but it was a huge issue in the tools we had as it made it impossible to zoom out enough to get a decent overview of the world for debugging and development. This lead me back to something I coded a long time ago.

Once upon a time I made this thread here about drawing tiles efficiently using shaders, but the image links are dead now. Anyway, the idea is simply to upload the tile indices to a texture, then use a shader to figure out which tile a pixel belongs to, then which tile index that tile has, then to sample the actual tile. The advantage of this technique is that an entire tile map can be drawn with a single fullscreen quad instead of individual quads for every tile, which is a massive win when the tiles are small and many (possibly in the millions). By using modern OpenGL, this even allows you do have tile mipmaps and even bilinear filtering without any bleeding between tiles too.

To handle the half-offset tile transitions, we need to store up to four tile indices per tile. Hence, a GL_RGBA16UI texture is perfect for the job, giving us up to 65536 possible tiles and fitting all four values in a single texture. To store the the tileset itself, we just use a 2D texture array, where each layer contains a single tile. This ensures that no filtering can happen between tiles as they're all isolated in their own layers. In the end, a rather short shader can be used to render all tiles at once:

#version 330

uniform usampler2D tileIndices; //"world map" containing tile indices (GL_RGBA16UI)
uniform sampler2DArray tileSet; //tile set, one layer per tile (GL_SRGB8_ALPHA8)

in vec2 tileCoords; //tile coordinates, integers correspond to tile edges

out vec4 fragColor;

void main(){
   vec2 halfOffsetCoords = tileCoords + 0.5; //offset tiles by half a tile
   vec2 flooredTileCoords = floor(halfOffsetCoords);
   uvec4 tileData = texelFetch(tileIndices, ivec2(flooredTileCoords), 0);
   vec2 texCoords = halfOffsetCoords - flooredTileCoords;

   //texCoords is not continuous at tile edges and causes incorrect derivatives to be calculated
   //We can manually calculate correct ones from the original tile coordinates for mipmapping.
   vec2 dx = dFdx(tileCoords);
   vec2 dy = dFdy(tileCoords);
   vec4 result = vec4(0);
   for(int i = 0; i < 4; i++){
      uint tile = tileData[i];
      //use textureGrad() to supply manual derivatives
      vec4 color = textureGrad(tileSet, vec3(texCoords, tile), dx, dy);
      result += color * (1.0 - result.a); //front-to-back blending
   fragColor = result;

That's it! Of course, I just had to go one step further though...

Well, our world is too big to be stored entirely in VRAM too. We need to shuffle tile data in and out of VRAM similarly to what we do in RAM by compressing the tile data. This leads to some more complexity. I ended up going completely nuts here and implemented software virtual/sparse texturing. Grin A sparse texture is a texture where not all parts of the texture actually has backing memory. For example, we can split up a 1024x1024 texture into 128x128 texel "pages", then use a 8x8 "page mapping texture" to map each part of the texture to a different place in memory. As we have fine control over which 128x128 pages are actually available at any given time, we don't need to have the whole thing in memory permanently. In our case, we have a page size of 128x128 tiles. So for a world that is 1024x1024 tiles big, we split it up into 8x8=64 pages. Let's say we only have a budget 8 simultaneously loaded pages at a time, which means that our page texture is a 128x128 2D texture array with 8 layers. We then have an 8x8 texture which maps parts of the original texture to pages. This is probably a bit confusing so let me explain it in a different way:

1. We want to sample tile (200, 300).
2. We calculate the page offset by dividing by the page size: (200, 300)/128 = (1, 2) (rounded down).
3. We go to the 8x8 page-mapping texture and sample point (1, 2) to get the layer index for that page. Let's say the value sampled is 4, meaning the page we're looking for is in layer 4.
4. We calculate the local tile position inside the page, which is (200, 300) - (1*128, 2*128) = (72, 44).
5. We sample the page texture array at (72, 44) on layer (4) to get the tile index we need!

The advantage here is that not all of the 64 pages need to be loaded at any given time. The camera frustum is pretty coherent in its motion and won't move around much, so if the player is in the top left corner of the map, it's fine if we only have the top four pages loaded, while the rest simply don't have any data in them. Hence, we can render the world as if the entire world was loaded all the time, as long as we keep remapping pages to cover the parts of the texture that we actually use at any given time. Some of you may know about the game RAGE which used a much more advanced version of this system to stream in extreme amounts of heavily compressed texture data on demand from a 131 072 x 131 072 virtual texture (128k x 128k), storing parts of this texture (and its mipmaps) into 128x128 pages. In other words, they had a 1024x1024 page-mapping texture (also with mipmaps) which mapped the virtual texture location to individual 128x128 pages. It was fun doing something simple but at least comparable, although probably a bit overkill in this case. =P


1. Page mapping (shows which layer in the page texture array the data for that part of the world is in.

2. Page-local positions (local tile coordinates inside each page)

3. Tile texture coordinates (the local texture coordinates inside each tile)

4. Final result

Extra transition test image
28  Discussions / General Discussions / Re: Compressing chunk of tile data? on: 2017-06-23 18:43:18
That's actually a pretty good idea! I'm currently in Italy on vacation though, but I'll post a proper challenge when I get back!
29  Game Development / Shared Code / Re: "Better" LinkedList implementation on: 2017-06-23 17:46:11
A pooled Node version would probably be slightly worse performance-wise when looping over the list due to the Node object and the element object being in two different places in memory. Also, without exposing the pointers, it becomes a bit annoying to loop through the list, but that's no real concern I guess.

Interesting, it'd be cool to see a.comparison between that and my implementation.

Hmm, how exactly would that work?
30  Discussions / General Discussions / Compressing chunk of tile data? on: 2017-06-23 17:34:01
I need to compress a 32x32 chunk of tile data. The data consists of 1 byte for the terrain type, and 1 short for a detail tile (second layer essentially). The terrain tiles are very continuous, while most detail tiles are 0 (null). What would a good compression method?

Here's my current idea:

1. Reorder data into z-curve order or Hilbert curve order to improve spatial coherency

2. Run length encode data using a special repeat byte/short value (say Byte/Short.MAX_VALUE) followed by number of repetitions, like this:
 baaaaab ---> ba*5b

3. Compress data using a precomputed Hilbert code. The idea is to precompute several Hilbert trees from statistical data from a lot of worlds based on the two most common tiles in the chunk, so there'll be one tree for each biome essentially. The chunk encodes a byte for which tree to use, then the encoded data. The RLE symbol is encoded in the tree, but the RLE count could be encoded using a different huffman code tree specifically for RLE counts.

My question is: Would this beat the DEFLATE compression I'm currently using? Is it even worth trying to implement this? I feel like DEFLATE should have problems compressing small chunks of data like this as it tries to learn about the data as it goes. Also, would it make sense to use delta encoding here?
Pages: [1] 2 3 ... 123
Ecumene (54 views)
2017-09-30 02:57:34

theagentd (77 views)
2017-09-26 18:23:31

cybrmynd (185 views)
2017-08-02 12:28:51

cybrmynd (183 views)
2017-08-02 12:19:43

cybrmynd (190 views)
2017-08-02 12:18:09

Sralse (200 views)
2017-07-25 17:13:48

Archive (756 views)
2017-04-27 17:45:51

buddyBro (890 views)
2017-04-05 03:38:00

CopyableCougar4 (1439 views)
2017-03-24 15:39:42

theagentd (1322 views)
2017-03-24 15:32:08
List of Learning Resources
by elect
2017-03-13 14:05:44

List of Learning Resources
by elect
2017-03-13 14:04:45

SF/X Libraries
by philfrei
2017-03-02 08:45:19

SF/X Libraries
by philfrei
2017-03-02 08:44:05

SF/X Libraries
by SkyAphid
2017-03-02 06:38:56

SF/X Libraries
by SkyAphid
2017-03-02 06:38:32

SF/X Libraries
by SkyAphid
2017-03-02 06:38:05

SF/X Libraries
by SkyAphid
2017-03-02 06:37:51 is not responsible for the content posted by its members, including references to external websites, and other references that may or may not have a relation with our primarily gaming and game production oriented community. inquiries and complaints can be sent via email to the info‑account of the company managing the website of java‑
Powered by MySQL Powered by PHP Powered by SMF 1.1.18 | SMF © 2013, Simple Machines | Managed by Enhanced Four Valid XHTML 1.0! Valid CSS!