  Show Posts
1  Java Game APIs & Engines / Java 2D / Re: Writing Java2D BufferedImage Data to Opengl Texture Buffer on: 2017-08-21 02:03:28
Am I insane? -> yes. But is it really crazier than drawing 2D stuff in OpenGL?
Yeah, it is. OpenGL is used for 2D stuff all the time. GPUs can really only draw 2D stuff in the first place; you project your 3D triangles to 2D triangles, so it makes a lot of sense to use the GPU for accelerating 2D stuff as well.
2  Java Game APIs & Engines / OpenGL Development / Re: [LWJGL] [JOML] Memory Behavior on: 2017-08-12 13:32:40
You can release the memory allocated on the native heap when you no longer need a direct NIO buffer by calling the cleaner yourself, I assume that it's what theagentd meant in his second suggestion.
Nope, I'm not a big fan of the Cleaner "hack". You'll generally get much better performance by managing memory yourself with malloc/free. In LWJGL, that can be done with MemoryUtil.memAlloc() and MemoryUtil.memFree(), which under the hood use the best library for the job (usually jemalloc).
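
For anyone unfamiliar with it, a minimal sketch of that usage (the class name and buffer size are made up for illustration):

import org.lwjgl.system.MemoryUtil;
import java.nio.ByteBuffer;

public class NativeAllocExample {
    public static void main(String[] args) {
        // Allocate 1 KB on the native heap; this memory is not touched by the GC.
        ByteBuffer buffer = MemoryUtil.memAlloc(1024);
        try {
            buffer.putFloat(0, 42.0f); // use it like any other direct NIO buffer
        } finally {
            MemoryUtil.memFree(buffer); // nothing frees this for you
        }
    }
}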
3  Java Game APIs & Engines / OpenGL Development / Re: [LWJGL] [JOML] Memory Behavior on: 2017-08-10 01:30:55
This comes from you allocating a lot of native memory buffers using NIO, possibly using BufferUtils.create***Buffer() calls. All those objects are related to managing native memory, which lies outside the Java memory heap. You should:
 - try to reuse native memory buffers
 - manage native memory yourself instead to avoid GC overhead.
4  Java Game APIs & Engines / Engines, Libraries and Tools / Re: LibGDX How do I get the depth buffer from FrameBuffer? on: 2017-08-05 19:47:02
Ok, I will try that, but first I want to clarify something:
If my clipping range is between 1.0f and 1000.0f, then the normalised depth value from the shader (0.5 in this case) means 500f?
Anyway, thanks for the help.
No, that is not the case. The depth value you get is calculated in a specific way to give more precision closer to the camera. This means that 0.5 will refer to something very close to the near plane, something like 2 or 3 units away (very rough estimate). How this is calculated depends on your projection matrix (the clipping range you mentioned). It is possible to "linearize" the depth value if you know the far and near planes.
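
For completeness, a small sketch of that linearization (not from the thread itself), assuming a standard perspective projection and the default depth range:

// Converts a depth buffer value in [0, 1] back to a linear eye-space distance.
static float linearizeDepth(float depth, float near, float far) {
    float ndc = depth * 2.0f - 1.0f; // back to NDC z in [-1, 1]
    return (2.0f * near * far) / (far + near - ndc * (far - near));
}
// Example: near = 1, far = 1000 -> a stored depth of 0.5 linearizes to roughly 2.0 units.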
5  Java Game APIs & Engines / Engines, Libraries and Tools / Re: LibGDX How do I get the depth buffer from FrameBuffer? on: 2017-08-05 14:32:43
the framebuffer should be configured with a depth-attachment attached to it, which is a texture after all.

internal-format should be GL_DEPTH_STENCIL or GL_DEPTH_COMPONENT and format should go like GL_DEPTH24_STENCIL8 or GL_DEPTH_COMPONENT24.

now when you render the color-attachment in a fullscreen-quad/triangle pass, just throw in the depth-texture like you do with any other texture.

next, sampling the depth-texture - i cannot remember the "proper" way.
You got the internal format and format switched around. Internal format (the format the GPU stores the data in) should be GL_DEPTH_COMPONENT16, GL_DEPTH_COMPONENT24 or GL_DEPTH_COMPONENT32F, or if you also need a stencil buffer, GL_DEPTH24_STENCIL8 or GL_DEPTH32F_STENCIL8. Format (the format of the data you pass in to initialize the texture in glTexImage2D()) shouldn't really matter, as you're probably just passing in null there and clearing the depth buffer before the first use anyway, but you need to pass in a valid enum even if you pass null, so GL_DEPTH_COMPONENT is a good choice.

To give you a list of steps:

1. Create a depth texture with one of the internal formats listed above.
2. Attach the depth texture to the FBO using glFramebufferTexture2D() to attachment GL_DEPTH_ATTACHMENT or GL_DEPTH_STENCIL_ATTACHMENT if your depth texture also has stencil bits.
3. Render to the FBO with GL_DEPTH_TEST enabled. If it's not enabled, the depth buffer will be completely ignored (neither read nor written to). If you don't need the depth "test" and just want to write depth to the depth buffer, you still need to enable GL_DEPTH_TEST and set glDepthFunc() to GL_ALWAYS.
4. Once you're done rendering to your FBO, you bind the texture to a texture unit and add a uniform sampler2D to your shader for the extra depth buffer.
5. Sample the depth texture like a color texture in your shader. A depth texture is treated as a single-component texture, meaning that the depth value is returned in the red channel (float depthValue = texture(myDepthSampler, texCoords).r;). The depth is returned as a normalized float value from 0.0 to 1.0, where 0.0 is at the near plane and 1.0 is at the far plane (with the default depth range).
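
To make steps 1-3 concrete, here's a rough sketch using plain LWJGL-style GL calls (not LibGDX's FrameBuffer class; width/height and the color attachments are left as assumptions):

import static org.lwjgl.opengl.GL11.*;
import static org.lwjgl.opengl.GL14.*;
import static org.lwjgl.opengl.GL30.*;

import java.nio.ByteBuffer;

public final class DepthFboSketch {

    /** Steps 1-2: create a sampleable 24-bit depth texture and attach it to a new FBO. */
    public static int createDepthFbo(int width, int height) {
        int depthTexture = glGenTextures();
        glBindTexture(GL_TEXTURE_2D, depthTexture);
        // The internal format is what matters; the data pointer is null, so the generic
        // format/type (GL_DEPTH_COMPONENT / GL_UNSIGNED_INT) just need to be valid enums.
        glTexImage2D(GL_TEXTURE_2D, 0, GL_DEPTH_COMPONENT24, width, height, 0,
                GL_DEPTH_COMPONENT, GL_UNSIGNED_INT, (ByteBuffer) null);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);

        int fbo = glGenFramebuffers();
        glBindFramebuffer(GL_FRAMEBUFFER, fbo);
        glFramebufferTexture2D(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT, GL_TEXTURE_2D, depthTexture, 0);
        // Attach your color texture(s) here, check glCheckFramebufferStatus(GL_FRAMEBUFFER),
        // then render with GL_DEPTH_TEST enabled (step 3).
        glBindFramebuffer(GL_FRAMEBUFFER, 0);
        return fbo;
    }
}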
6  Discussions / Miscellaneous Topics / Re: What I did today on: 2017-07-25 17:30:09
TL;DR: Compute shaders are frigging awesome!

Today I ported my blur shader to a compute shader. It does everything exactly the same, but it's significantly faster!



My blur shader is optimized using a kind of tile classification system. For each 16x16 tile on the screen, I calculate:
 - the dominant motion vector (DMV) direction and length of the tile
 - the highest circle of confusion (CoC = depth of field blur radius) of the tile
 - some additional data.
These tiles are then dilated based on their "reach". The point is to allow objects to blur over their borders by making sure the neighboring tiles are doing the blur calculations as well. Then, based on the dilated DMV length and max CoC of the tile, I check if the tile can be optimized. A sample count is calculated based on the blur area, so that for smaller blurs I use a lower sample count. In addition, I pick a "path" for the blur shader per tile as I've mentioned before, but here's a recap:
 - If the max CoC and DMV length are less than 0.5, the tile is completely sharp, so the blur shader can early out.
 - If the CoC and the DMV length don't vary much within the tile, a fast path is used instead, as we don't need any fancy blending to get an even blur there.
 - Otherwise, the slow path is used as the tile contains complex geometry that needs to be blurred carefully.

The reason for doing this classification per tile is that it's cheaper to do, and that shaders are executed in large groups anyway. Hence, if pixel 1 in a group picks the fast path, pixel 2 picks the slow path and the rest want to early out, the entire shader group will need to run all 3 paths for all pixels as they all have to run in lockstep. This is actually slower than just running the slow path for all pixels. Hence, the idea is that by using 16x16 tiles, entire groups of pixels should be able to run only one path. To test this out, I created a specific scenario: The entire scene will be placed in focus with no motion (meaning the shader can early out immediately), but tiles in a checkerboard pattern will be forced to the slow path (in other words, have optimizations disabled). If the way the GPU groups up pixels into tiles is not EXACTLY aligned with the 16x16 tiles, it would need to execute the slow path for the entire screen, and the performance will reflect that. The key here is that compute shaders allow me to manually define the work group size, so I can force the workgroup size to perfectly align with the tiles. Let's see how this performs:

 - Fragment shader, all tiles using slow path: 3.94 ms
 - Fragment shader, checkerboard early out/slow: 3.94 ms (ouch!)
 - Compute shader, all tiles using slow path: 3.94 ms
 - Compute shader, checkerboard early out/slow: 2.07 ms (yes!)

In other words, the 16x16 tiles did NOT line up with the way the GPU placed pixels into groups, but with a compute shader they obviously do! This should give a nice (but obviously smaller) performance boost in real world scenarios!



In addition, I made some surprising findings while just playing around with it! Here's the performance of the compute shader compared to the fragment shader version when the entire screen is completely out of focus (max blur radius for every single pixel):
 - Fast path: 50% faster (15.0 ms ---> 10.0 ms)
 - Slow path: 17% faster (22.5 ms ---> 19.2 ms)

In other words, for some reason the raw performance of the shader is significantly better. This doesn't seem to be due to better ALU performance, but rather better texture sampling performance. There's no incoherent branching going on here either; every single pixel/invocation is executing the same thing (either the fast or the slow path). I guess since compute shaders bypass a lot of hardware in the GPU compared to a fullscreen triangle, maybe using a compute shader simply freed up cache space or changed the way the pixels are grouped together. I'm afraid I don't have a very good explanation for this, but damn is it awesome! =P



For a scene with mixed {early out/fast path/slow path} with dynamic sample count based on blur area (in other words, a typical scene in a normal game with all optimizations on):
 - 74% faster (7.5ms ---> 4.3 ms)

This is generating the exact same image with the same code, just executed as a compute shader instead of a fragment shader. The only difference is that I use gl_FragCoord.xy in the fragment shader version, which I need to calculate manually in the compute shader as (vec2(gl_GlobalInvocationID.xy) + 0.5). Also, this is BEFORE adding the data packing I mentioned before. I expect that optimization to almost double the performance of both the fast and the slow path in the worst case scenario above (fast path 10 --> 5 ms, slow path 20 --> 10 ms), for a tiny performance cost in other scenarios.

EDIT: Here's a debug image showing the tile classification. Red = slow path, green = fast path, blue = early out. In addition, the color intensity represents the sample count; bright = more samples (up to 128).
7  Discussions / Miscellaneous Topics / Re: What I did today on: 2017-07-22 20:44:29
Stumbled upon some interesting results while optimizing the depth of field/motion blur shader.

Basically, there are three relevant things that can cause a bottleneck in the blur shader:

1. Too many ALU instructions, AKA too much mathz!!!11
2. Too many texture instructions. Each instruction, regardless of what it actually fetches, has a minimum cost.
3. Bad texture sample spatial cache locality. This is worsened if you're sampling a format which takes up a lot of memory in the texture cache.

The blur shader has three different paths:
a. The most expensive path is needed when there is a big difference between the circles of confusion or the motion vectors inside a tile. This path needs two texture samples per sample (1 for color, 1 for depth+CoC+motion vector length), and has quite a lot of math.
b. If the blur is uniform for the entire tile (and neighboring tiles), there's no point in doing all the fancy depth sorting and layer blending, so in that case I can just run a simple blur on the scene color directly. This version only reads the color texture and requires less math.
c. If a tile is completely in focus and has no major motion, the shader early-outs as no blurring is needed. The shader is obviously incredibly fast if this path is used, so for these tests I will be disabling this path.

To rule out texture cache problems and get a baseline performance reading, I ran path A and path B with zero CoCs and zero motion vectors. This causes all the texture samples to end up at the exact same place, so the texture cache should be able to do its work perfectly. Testing this, I got around 4.5 ms for the expensive path and 2.2 ms for the cheaper path. Sadly, it's hard to tell what the bottleneck is from this data alone. The fact that the cheaper path is roughly twice as fast could mean that the bottleneck is the sheer number of texture samples (half as many samples, twice the speed), or it could just mean that the cheaper path has half as many math instructions. Next, I tested the shader with a 16 pixel radius CoC for every pixel on the screen, meaning that the samples are distributed over a 33x33 pixel area (over 1000 possible pixels). This caused the fast path to increase from 2.2 ms to 8.2 ms. Even worse, the expensive path went from 4.5 ms all the way up to 21.5 ms! Ouch!

As the blur size increases, the samples get less and less spatially coherent. However, there is no performance loss whatsoever as long as the samples fit in the texture cache. Past that point, performance gets a lot worse very quickly. That threshold also depends on the size of the texture format we're sampling, as a bigger format takes up more space in the texture cache ---> fewer texels can fit in it ---> the cache gets "shorter memory". In my case, both the color texture and the depth+CoC+motion vector length textures are GL_RGBA16F textures, meaning they should be taking up 8 bytes per sample each. As we can see, the fast path suffered a lot less from the worse cache coherency, only taking around 3.75x more time, but the slow path took around 4.75x as much time! This is because the fast path only needs to sample the color, and hence has less data competing for space in the cache.

Here's where it gets interesting: For a first test, I tried changing the texture format of the color texture from GL_RGBA16F (8 bytes per texel) to GL_R11F_G11F_B10F (4 bytes per texel). I expected this to possibly even double the performance of the fast path as it'd halve the size of each texel, but... Absolutely nothing happened. It performed exactly the same as GL_RGBA16F. The explanation for this is that the texture cache always stores data "uncompressed". For example, for DXT1-compressed textures, the texture colors of each 4x4 block are compressed to 2-bit indices used to interpolate between two 5-6-5-bit colors. The result of this interpolation does not produce exact 8-bit values, but the GPU will round the values to the closest 8-bit values and store the result in the texture cache with 8 bits of precision per color channel. On the other hand, RGTC-compressed textures work similarly to DXT1-compressed textures, but instead use 3-bit indices and 8-bit colors. This gives RGTC textures up to ~10-11 bits of precision in practice, and decompressed RGTC data is actually stored at 16-bit precision in the texture cache to allow you to take advantage of this. Hence, it makes sense that the GPU cannot store 11/10-bit floats in the texture cache directly, and instead has to store them as 16-bit floats. In addition, the texture cache can't handle 6-byte texels, so they're padded to 8 bytes, giving us the same cache footprint as GL_RGBA16F! The bandwidth used isn't the issue here; we're simply suffering from cache misses, and the latency for fetching that data is killing us, regardless of how much data is being fetched! Testing with GL_RG16F and GL_R32F instead, performance of the fast path improved from 8.2 ms to 6.0 ms! The slow path only improved from 21.5 to 19.4 ms, as the other texture is still GL_RGBA16F.

So, to improve performance I really want to reduce the size of each texel. Since GL_R11F_G11F_B10F is out of the question for the color texture, I'll need to do some manual packing. Luckily, I don't need any filtering for these textures, so sacrificing filterability is not a problem. My current plan is to emulate something similar to GL_RGB9_E5 (as that format can't be rendered to) by storing the color in RGB, and an exponent in the alpha channel. This should allow me to store a HDR color with just 32 bits of data. The depth+CoC+MV texture is simpler; just store depth as a 16-bit float, and the CoC and motion vector length values can be stored at 8 bits precision each no problem. Since I'll need to do manual packing using bit manipulation, storing these values in 32-bit uint textures is the easiest. This means that I can easily pack all this data into a single GL_RG32UI texture. This will halve the number of texture samples AND halve the texel size for the slow path, but it'd also force the fast path to fetch the extra blur data it doesn't need. Hence, I will probably output just the color to a second GL_R32UI texture that the fast path can use to halve the size of each texel for that one too. I already do a fullscreen packing pass to generate the depth+CoC+MV GL_RGBA16F texture which only takes 0.08ms right now, so adding another texture (increasing bandwidth by 50%) shouldn't cause any significant slowdowns compared to the gains the blur pass will see.
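
Purely to illustrate the shared-exponent idea, here's a sketch of packing a non-negative HDR color into 32 bits with 8-bit mantissas and an 8-bit shared exponent. This is not the actual packing scheme from the post, just the general concept:

// Illustrative only: pack an HDR RGB color (non-negative components) into a single int.
static int packSharedExponent(float r, float g, float b) {
    float max = Math.max(r, Math.max(g, b));
    if (max <= 0.0f) return 0;
    // Choose e so that max / 2^e falls into [0.5, 1); clamp to the 8-bit biased range.
    int e = Math.max(-64, Math.min(63, Math.getExponent(max) + 1));
    float scale = (float) Math.pow(2.0, -e);
    int ri = Math.min(255, Math.round(r * scale * 255.0f));
    int gi = Math.min(255, Math.round(g * scale * 255.0f));
    int bi = Math.min(255, Math.round(b * scale * 255.0f));
    return (ri << 24) | (gi << 16) | (bi << 8) | ((e + 64) & 0xFF); // biased exponent in the low byte
}

static float[] unpackSharedExponent(int packed) {
    int e = (packed & 0xFF) - 64;
    float scale = (float) Math.pow(2.0, e) / 255.0f;
    return new float[] {
        ((packed >>> 24) & 0xFF) * scale,
        ((packed >>> 16) & 0xFF) * scale,
        ((packed >>> 8)  & 0xFF) * scale
    };
}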

I will also port the blur shader to a compute shader. Currently, in my fragment shader I pick which path to take based on the tile the pixel falls into. This assumes that the fragments are processed in blocks that line up with these tiles. If this isn't the case, it could have severe performance impacts as some pixels may end up having to execute multiple paths. By using a compute shader, I can ensure that the compute shader work groups align perfectly with the classification tiles, so each work group would only ever need to execute one of the 3 paths.
8  Discussions / Miscellaneous Topics / Re: What I did today on: 2017-07-21 05:56:05
Depth of field + motion blur video! Sorry about the Skype notification sound(s) in the video! >___< The video can be streamed from Drive using the Youtube video player, but I recommend downloading the video for maximum 60 FPS quality.

https://drive.google.com/open?id=0B0dJlB1tP0QZZHNOTHdiN3hCTms

Notes:
 - There's some aliasing of sharp (and blurry) objects in the scene. This is because my antialiasing is not compatible with the DoF/motion blur at the moment, but I will probably fix that one way or another.
 - The motion vectors go crazy when I move the camera during slow-motion, as the camera believes it's rotating at an extremely high speed. It's not a problem with the algorithm.
9  Discussions / Miscellaneous Topics / Re: What I did today on: 2017-07-18 02:10:28
Jesus H. Christ, I've been in constant cold sweat / depression over finding out that there was a case which my combined DoF/MB implementation couldn't handle. Lay sleepless at night, almost walked into shit while going to the store. x___X

If the entire scene is out of focus, the result should be approaching a simple circle blur (think box blur, but a circle shape instead), but my triple-classification system didn't approach that for blurry scenes. This resulted in seeing weird outlines of the out-of-focus objects on the screen. I could tweak the range of the different layers, but this essentially resulted in disabling the third layer, which made the original issues come back. The problem was simply inherent to using a third classification, and therefore unfixable. Hence, I went back to the basics again and reimplemented the original depth of field from the paper, with the additions I made to support motion blur. I tried using the minimum depth of each tile again, but just confirmed the original issue I discovered there:



The per-tile depth had to go. There was no way that was ever going to work. Each pixel needed to be classified relative to the current pixel. So I simply changed the depth I did the classification against to be the depth of the center pixel, and this just amplified the issue further: objects could no longer blur over their edges at all if the background was sharp. Well, shit. OK, I'll just change the center pixel to be classified as background too. Wait a minute, that actually looks... correct? With just two classification types? Then I enabled motion blur, and was immediately shot down as massive dark strips appeared around the edges of motion blurred objects. Crap. But... how can that even be possible? Motion blur and DoF are essentially the same thing. The only difference is the shape of the blur area and how the blur radius/length is calculated. I banged my head against it and finally found it: a really stupid bug in the background weight calculation of motion blurred pixels. Arghhhh!!! Shocked

I fixed it, and I'm now back to only two classifications, and it seems to handle every case I expect it to handle. It suffers from the inherent limitations of working in screen space as a postprocessing effect. A blurry object essentially becomes see-through around the edges, which should reveal the background, but that information simply isn't there in the original sharply rendered screen. Hence, the background is reconstructed by taking the colors of nearby pixels and averaging them together. You can actually see that in effect in the image above. The reconstructed background around the edge of the blurry robot looks good in most cases, but particularly at the horizon, as well as in some other minor cases, the reconstruction simply doesn't look correct. There's no way of fixing that without actually feeding more data into the blurring algorithm, which would require generating that additional data as well using something like depth peeling (and then lighting and processing all layers). That, or raytracing, and I'm pretty sure neither of those two will happen anytime soon.

In addition, there are some limitations in how the motion blur works. It still suffers from the same limitations as the original algorithm, as it only tries to blur in one direction at a time, the dominant motion vector of each tile. This can cause issues with overlapping motions around edges, which are more apparent the higher the maxMotionBlurDistance and maxDofRadius are. "Luckily", higher distances/radii cause such huge performance hits that they will be limited anyway. =P

I'll be making a video after doing some more testing and reimplementing some of the optimizations I made, but I'm gonna take it easy today and go to bed early (and maybe even get some proper sleep).
10  Game Development / Newbie & Debugging Questions / Re: Game "running out of memory" with 32-bit JRE's on: 2017-07-17 07:20:57
Is there a specific reason why you want to support 32-bit JVMs?
11  Discussions / Miscellaneous Topics / Re: What I did today on: 2017-07-16 20:43:51
Incredible @theagentd. The combination of the blur passes is a really neat concept, I'm definitely going to play around with that (if you don't mind!). As always, amazing work.
Thanks!

@theagentd video or it didn't happen  Tongue Pointing

Seriously. Great stuff. If you come around captioning a vid, would be great and much appreciated!
Thanks! What do you mean by "captioning a video"? A video with explanations...?
12  Discussions / Miscellaneous Topics / Re: What I did today on: 2017-07-16 02:41:44
OK, I feel ready to show this shit off. Spent essentially the entirety of today on it too. =___=

Basically, I've been spending quite a lot of time trying to get good quality motion blur and depth of field working. We really needed motion blur for WSW, so I tried very hard to implement the (then very new) algorithm described in A Reconstruction Filter for Plausible Motion Blur, an excellent paper. We used this algorithm for quite some time with great success, but it wasn't perfect. Some nice improvements and optimizations were floating around on the internet, which I implemented as I found them, but there was always one glaring issue with that algorithm: it had a really bad discontinuity at the edges of objects, which can clearly be seen in the bottom picture of Figure 5 in the original paper.

Next Generation Post Processing in Call of Duty: Advanced Warfare was the next big source of information for me. I STRONGLY recommend everyone interested in postprocessing to take a look at that presentation, as it contains a huge amount of really great images, videos, explanations and even shader code. For motion blur, they added a weight renormalization pass to fix the depth discontinuity, making the motion blur much more convincing around the edges of moving objects. However, the really amazing thing in that presentation was their depth of field implementation. It had extremely good quality and worked similarly to the motion blur. Basically, it split up the scene into a background and a foreground, defined per tile. It looked really great in the screenshots and the example video they had in the slides, but... I just couldn't get it to look good.

The foreground/background classification didn't seem to work well at all. As the classification was done based on the minimum depth of a tile, a tile with a low minimum depth could cause neighboring tiles to classify the entirety of their contents as background. This caused tile-sized snaps and shifts as the scene moved, which was just unacceptable to me. Instead, I tried doing the classification per pixel, which solved some problems, but now an out-of-focus foreground object couldn't blur out over other in-focus objects. Murr murr...

It took me a very long time to find a solution, but two days ago I finally did. The problem with the foreground/background system is that the pixel you're looking at needs to go into either the foreground or the background. If it goes into the foreground, things that are in front of the pixel can't blur over it. Similarly, if it's in the background that also causes issues in certain cases. The solution was to add a third class, the "focus", which involves things at around the same depth as the pixel we're looking at. This fixed the problem in all cases!

No DoF:


DoF, focus on head:


DoF, focus further away:


There was just one major issue. The DoF didn't work together with the motion blur algorithm! No matter which one you applied first, as soon as you've blurred the image, the depth buffer becomes useless for classification! This means color bleeding over edges when motion blur and depth of field are both applied to the same place.

Now, the story above isn't entirely chronological. Some months ago, I theorized that it would be possible to combine both DoF and motion blur into a single postprocessing pass. They're both blurs; I just need to figure out how to get them to work together. How hard can it be? ...... Don't get me started, it was f**king hard, but two days ago I managed to get it right. DoF's classification system is essentially a more advanced version of what the original motion blur algorithm does, so using it for motion blur is fine. I had a crazy amount of issues with the alpha calculations between the layers, and even more issues with the sampling patterns, but in the end I actually managed to get it working. Behold, the results of my work.

Combined DoF and motion blur.



I'm... exhausted as hell. But it was worth it I guess. Now there's just a massive amount of work left to optimize this. If anyone cares enough to want to implement this themselves, I can write an article on how the DoF+MB works.

13  Discussions / Miscellaneous Topics / Re: What I did today on: 2017-07-15 08:03:32
Spent approximately 20 hours split between today and yesterday working on........ something I'm not going to reveal right now. It's AFAIK something new that hasn't been done before with the quality I'm getting. I'll be writing one of my super long posts on this tomorrow, but right now it's like 10 in the morning and I REALLY need to sleep.

EDIT: Dammit, Archive, you prematurely made me reach 1000 medals! Thanks! xD
14  Game Development / Game Mechanics / The overly complicated tile rendering of Robot Farm on: 2017-07-04 01:14:13
Hello, everyone!

I thought I'd write a small long-ass post about the tile rendering system I've developed for Robot Farm. The tile rendering in RF ended up being quite complex, as we needed to keep memory usage down and performance up, so this post will chronicle some of the improvements that were made over time. Let's dive right into it!



Memory management

Our worlds can be quite huge. Our biggest worlds are up to 7680x5120 (= ~40 million) tiles big, and each tile in the world has a terrain tile and a detail tile. In addition, we also need to store pathfinding data for tiles. This can end up being quite a huge amount of data. In the beginning, we simply tried to brute-force it by keeping everything in memory. 7680x5120 x (two ints + pathfinding Node object) = over 300MB just for the tile data, and another gigabyte for the pathfinding data. NOPE, we can't have that. The first attempt to minimize memory usage was to add chunking to the data. We split the world up into 32x32 chunks, and only kept a certain number of chunks in memory; the rest were written out to compressed files on the harddrive. This caused a number of issues though. The chunk management suddenly became very complex, and now we needed to involve the harddrive if we needed to load some data. With some unlucky harddrive response times, we could get significant spikes that were very noticeable when travelling quickly. The technique was also pretty hard on the harddrive itself, because we ended up having thousands upon thousands of tiny files for the chunks saved to the harddrive. You could almost feel the lifetime of your harddrive being drained each time you ran the game. >___>

In an attempt to alleviate this, I added "super chunks". Super chunks were a bundle of 4x4 normal chunks that were loaded and unloaded together. This cut the number of files down drastically, helped the harddrive cope with the constant reading and reduced the amount of stuttering we got, as more data was read at a time. However, it also had some very big drawbacks. The stuttering happened less often, but when it happened the freeze was noticeably longer as more data was read at the same time. In addition, the new system didn't handle random access very well. Even if just a single tile was needed, an entire super chunk had to be read from disk. We had several world configurations where we blew through the super chunk budget due to NPCs needing tile data for certain operations all around the world, causing thrashing and literally 0.1 FPS cases. Not a fun time. There was also quite a lot of overhead when you just needed to access a tile: get the super chunk for that tile, get the chunk for that tile, get the tile. In the end, if we wanted to eliminate stuttering completely, we needed to eliminate the harddrive from the equation altogether, and preferably store chunks individually as well.

This is where I hatched the idea of compressing the chunk tile data to RAM instead of to the harddrive. Simply DEFLATE-ing the data using Deflater or GZIPOutputStream made it much smaller, as the terrain tiles are very continuous and the detail tiles are mostly 0 (= nothing there). So, instead of compressing the data and writing the result to disk, I simply compressed the data to a byte[] and kept that byte[] in memory. To access the data, I need to run it through Inflater again to get the data back, but with no disk access required, this only took around 0.1 ms per chunk! Most of the time, this reduced the size of the tile data by a factor of 10x to 20x, making the tile data memory usage trivial. And with the harddrive access gone, there was no reason to keep the super chunks around anymore, so they could simply be removed.
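
A minimal sketch of that idea using java.util.zip (not the actual Robot Farm code; the real implementation presumably reuses buffers instead of allocating new ones every time):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public final class CompressedChunk {
    private final byte[] compressed;

    // Compress the chunk's raw tile data into a byte[] that stays in RAM.
    public CompressedChunk(byte[] tileData) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream deflate = new DeflaterOutputStream(out)) {
            deflate.write(tileData);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        this.compressed = out.toByteArray();
    }

    // Inflate the data again on access; no disk involved.
    public byte[] decompress() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InflaterInputStream inflate = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = inflate.read(buffer)) > 0) {
                out.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return out.toByteArray();
    }
}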

Pathfinding data was another big problem here, but I actually already have a post for that! If you're interested in how that was solved, look here: http://www.java-gaming.org/index.php?;topic=38207.0. TL;DR: I got rid of the Node object and managed to barely pack all the pathfinding data of a node into the bits of a 32-bit integer. I then added a separate chunking system to it so that int "nodes" were only allocated (and reused) for areas that the pathfinder explored. This meant that less than 1/4th of the world needed pathfinding data at any single time, and cut the memory usage from over 1GB to just 50MB on average.



Tile transitions

In a tile-based game, tile transitions are almost always a must-have. Take a look at these two images from this very popular tile transition tutorial:
   

The left one just draws the tiles as they are, while the right one has tile transitions added to the rendering. The tile transitions add, well, transitions between different terrain types so that the world doesn't look as restricted to a grid anymore. However, they're quite complicated both to make and to render. The transitions to add to a tile depend on all 8 neighbors of the tile. This leads to 2^8=256 possible combinations, but this can be reduced by combining multiple non-overlapping transitions to cover all cases. Still, a large number of transitions need to be drawn by our pixel artists for every single terrain type. Also, as you can see in the image above, the transitions have to be very thin to not completely occlude the neighboring tile. This limits the "effectiveness" and variation of the transitions.

We already had a transition system in place when I started working on Robot Farm, but it had some very serious problems. First of all, it required some very complicated neighbor checks for multiple terrain types that were actually so slow that we had big CPU performance issues from them when rendering to high resolution screens (as they could see more tiles). I added a couple of hacks to improve performance by caching neighbor tiles between checks, and I even added a full-world BitSet with one bit per tile just to track which tiles needed the transition checks. Even so, we had big problems with tile transitions overlapping each other, as the transition tiles we already had were often too big to work well with that system. Hence, I started working with our pixel artist to improve the tile rendering. The project had two goals: minimize the number of transitions needed (and hence the amount of work for our pixel artist) and maximize rendering performance of those transitions.

To reduce the number of transitions needed, we needed to reduce the number of tiles that can affect the look of the transitions. Ignoring the diagonal neighbors seemed like a promising idea, but it fell flat as it could not detect certain cases correctly. Instead, I shifted my focus to a brand new idea... Enter half-offset tile transitions!

The idea is pretty simple. Instead of drawing one tile image per tile in the world, we offset the tile rendering by half a tile as the image shows. Suddenly, the rendered tile is no longer a function of 8 neighbors (plus itself), but just the 2x2=4 tiles that it intersects! That's only 2^4=16 combinations, of which one is nothing (none of the 4 tiles correspond to the terrain type) and one is fully covered, so only 14 transition tiles are needed per terrain type, a big improvement! I haven't been able to come up with another technique that provides the same quality AND simplicity as this technique!

Even better, this technique is perfect for being rendered on the GPU. First of all, by cleverly ordering the transition tiles in the tileset, you can calculate exactly which transition to pick using bitwise operations. Each of the four overlapped tiles can be mapped to one of the lowest four bits of an integer, giving us the tile transition index we need without any complicated special cases, transition merging, etc! On top of that, we have a clear upper bound on the number of overlapping transitions: four, which happens when every single tile is different in a 2x2 area. This is a GREAT feature because it means that we only need to store at most four tile indices for rendering to be able to handle all cases, while the 8-neighbors version would require up to eight tile transitions.
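
As a sketch of that bitwise mapping (hypothetical names, not the actual Robot Farm code):

// For a half-offset render cell overlapping the 2x2 world tiles (x, y)..(x+1, y+1),
// build a 4-bit mask of which of those four tiles match a given terrain type.
// With the 16 transition images ordered to match this bit layout, the mask is
// directly the index of the transition tile to draw (0 = nothing, 15 = fully covered).
static int transitionIndex(int[][] terrain, int x, int y, int terrainType) {
    int mask = 0;
    if (terrain[x    ][y    ] == terrainType) mask |= 1; // top-left
    if (terrain[x + 1][y    ] == terrainType) mask |= 2; // top-right
    if (terrain[x    ][y + 1] == terrainType) mask |= 4; // bottom-left
    if (terrain[x + 1][y + 1] == terrainType) mask |= 8; // bottom-right
    return mask;
}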



Rendering

Well, it wouldn't be like me to not tack on an over-engineered rendering system on top of these two systems now, would it? =P

The old tile rendering system of RF was very simple. Each tile was drawn as four vertices forming two triangles using an index buffer. Very simple, straightforward stuff. Although the biggest bottleneck lay in the rendering of the tile transitions, the generation of the vertex data became a significant cost when zoomed out. Now, this wasn't a significant issue in the actual game, but it was a huge issue in the tools we had, as it made it impossible to zoom out enough to get a decent overview of the world for debugging and development. This led me back to something I coded a long time ago.

Once upon a time I made this thread here about drawing tiles efficiently using shaders, but the image links are dead now. Anyway, the idea is simply to upload the tile indices to a texture, then use a shader to figure out which tile a pixel belongs to, then which tile index that tile has, then to sample the actual tile. The advantage of this technique is that an entire tile map can be drawn with a single fullscreen quad instead of individual quads for every tile, which is a massive win when the tiles are small and many (possibly in the millions). By using modern OpenGL, this even allows you to have tile mipmaps and even bilinear filtering without any bleeding between tiles.

To handle the half-offset tile transitions, we need to store up to four tile indices per tile. Hence, a GL_RGBA16UI texture is perfect for the job, giving us up to 65536 possible tiles and fitting all four values in a single texture. To store the tileset itself, we just use a 2D texture array, where each layer contains a single tile. This ensures that no filtering can happen between tiles as they're all isolated in their own layers. In the end, a rather short shader can be used to render all tiles at once:

#version 330

uniform usampler2D tileIndices; //"world map" containing tile indices (GL_RGBA16UI)
uniform sampler2DArray tileSet; //tile set, one layer per tile (GL_SRGB8_ALPHA8)

in vec2 tileCoords; //tile coordinates, integers correspond to tile edges

out vec4 fragColor;

void main(){
   vec2 halfOffsetCoords = tileCoords + 0.5; //offset tiles by half a tile
   vec2 flooredTileCoords = floor(halfOffsetCoords);
   uvec4 tileData = texelFetch(tileIndices, ivec2(flooredTileCoords), 0);
   vec2 texCoords = halfOffsetCoords - flooredTileCoords;

   //texCoords is not continuous at tile edges and causes incorrect derivatives to be calculated
   //We can manually calculate correct ones from the original tile coordinates for mipmapping.
   vec2 dx = dFdx(tileCoords);
   vec2 dy = dFdy(tileCoords);
   
   vec4 result = vec4(0);
   
   for(int i = 0; i < 4; i++){
      uint tile = tileData[i];
      //use textureGrad() to supply manual derivatives
      vec4 color = textureGrad(tileSet, vec3(texCoords, tile), dx, dy);
      result += color * (1.0 - result.a); //front-to-back blending
   }
   
   fragColor = result;
}

That's it! Of course, I just had to go one step further though...

Well, our world is too big to be stored entirely in VRAM too. We need to shuffle tile data in and out of VRAM, similarly to what we do in RAM by compressing the tile data. This leads to some more complexity. I ended up going completely nuts here and implemented software virtual/sparse texturing. Grin A sparse texture is a texture where not all parts of the texture actually have backing memory. For example, we can split up a 1024x1024 texture into 128x128 texel "pages", then use an 8x8 "page mapping texture" to map each part of the texture to a different place in memory. As we have fine control over which 128x128 pages are actually available at any given time, we don't need to have the whole thing in memory permanently. In our case, we have a page size of 128x128 tiles. So for a world that is 1024x1024 tiles big, we split it up into 8x8=64 pages. Let's say we only have a budget of 8 simultaneously loaded pages at a time, which means that our page texture is a 128x128 2D texture array with 8 layers. We then have an 8x8 texture which maps parts of the original texture to pages. This is probably a bit confusing, so let me explain it in a different way:

1. We want to sample tile (200, 300).
2. We calculate the page offset by dividing by the page size: (200, 300)/128 = (1, 2) (rounded down).
3. We go to the 8x8 page-mapping texture and sample point (1, 2) to get the layer index for that page. Let's say the value sampled is 4, meaning the page we're looking for is in layer 4.
4. We calculate the local tile position inside the page, which is (200, 300) - (1*128, 2*128) = (72, 44).
5. We sample the page texture array at (72, 44) on layer (4) to get the tile index we need!
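
The same lookup as a small Java sketch (hypothetical names; page size hardcoded to 128):

// Returns {layer, localX, localY}: which layer of the page texture array holds the
// page, and the tile's position inside that page.
static int[] lookUpTile(int tileX, int tileY, int[][] pageMapping) {
    final int PAGE_SIZE = 128;
    int pageX = tileX / PAGE_SIZE;             // step 2: which page the tile falls in
    int pageY = tileY / PAGE_SIZE;
    int layer = pageMapping[pageX][pageY];     // step 3: page -> layer in the page texture array
    int localX = tileX - pageX * PAGE_SIZE;    // step 4: position inside the page
    int localY = tileY - pageY * PAGE_SIZE;
    return new int[] { layer, localX, localY };// step 5: sample the array at (localX, localY, layer)
}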

The advantage here is that not all of the 64 pages need to be loaded at any given time. The camera frustum is pretty coherent in its motion and won't move around much, so if the player is in the top left corner of the map, it's fine if we only have the top four pages loaded, while the rest simply don't have any data in them. Hence, we can render the world as if the entire world was loaded all the time, as long as we keep remapping pages to cover the parts of the texture that we actually use at any given time. Some of you may know about the game RAGE which used a much more advanced version of this system to stream in extreme amounts of heavily compressed texture data on demand from a 131 072 x 131 072 virtual texture (128k x 128k), storing parts of this texture (and its mipmaps) into 128x128 pages. In other words, they had a 1024x1024 page-mapping texture (also with mipmaps) which mapped the virtual texture location to individual 128x128 pages. It was fun doing something simple but at least comparable, although probably a bit overkill in this case. =P



Images

1. Page mapping (shows which layer in the page texture array the data for that part of the world is in)


2. Page-local positions (local tile coordinates inside each page)


3. Tile texture coordinates (the local texture coordinates inside each tile)


4. Final result




Extra transition test image
15  Discussions / General Discussions / Re: Compressing chunk of tile data? on: 2017-06-23 18:43:18
That's actually a pretty good idea! I'm currently in Italy on vacation though, but I'll post a proper challenge when I get back!
16  Game Development / Shared Code / Re: "Better" LinkedList implementation on: 2017-06-23 17:46:11
@SHC
A pooled Node version would probably be slightly worse performance-wise when looping over the list due to the Node object and the element object being in two different places in memory. Also, without exposing the pointers, it becomes a bit annoying to loop through the list, but that's no real concern I guess.

@jono
Interesting, it'd be cool to see a comparison between that and my implementation.

@Riven
Hmm, how exactly would that work?
17  Discussions / General Discussions / Compressing chunk of tile data? on: 2017-06-23 17:34:01
I need to compress a 32x32 chunk of tile data. The data consists of 1 byte for the terrain type, and 1 short for a detail tile (second layer essentially). The terrain tiles are very continuous, while most detail tiles are 0 (null). What would be a good compression method?

Here's my current idea:

1. Reorder data into z-curve order or Hilbert curve order to improve spatial coherency

2. Run length encode data using a special repeat byte/short value (say Byte/Short.MAX_VALUE) followed by number of repetitions, like this:
 baaaaab ---> ba*5b

3. Compress data using a precomputed Huffman code. The idea is to precompute several Huffman trees from statistical data from a lot of worlds, based on the two most common tiles in the chunk, so there'll be essentially one tree per biome. The chunk encodes a byte for which tree to use, followed by the encoded data. The RLE symbol is encoded in the tree, but the RLE count could be encoded using a different Huffman code tree specifically for RLE counts.
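
For illustration, a minimal sketch of the RLE idea in step 2 above; the exact byte layout here is just one possible choice:

// Runs of 4 or more equal bytes (and any literal ESCAPE byte) are written as
// {ESCAPE, value, runLength}; everything else is copied through unchanged.
static byte[] runLengthEncode(byte[] data) {
    final int ESCAPE = Byte.MAX_VALUE;
    java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
    int i = 0;
    while (i < data.length) {
        int run = 1;
        while (i + run < data.length && data[i + run] == data[i] && run < 255) {
            run++;
        }
        if (run >= 4 || data[i] == ESCAPE) {
            out.write(ESCAPE);
            out.write(data[i]);
            out.write(run);
        } else {
            for (int j = 0; j < run; j++) {
                out.write(data[i]);
            }
        }
        i += run;
    }
    return out.toByteArray();
}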

My question is: Would this beat the DEFLATE compression I'm currently using? Is it even worth trying to implement this? I feel like DEFLATE should have problems compressing small chunks of data like this as it tries to learn about the data as it goes. Also, would it make sense to use delta encoding here?
18  Game Development / Shared Code / Re: "Better" LinkedList implementation on: 2017-06-20 13:01:46
The removeFirst() and removeLast() functions don't check for empty lists yet.  Pointing  Grin

I like your solution, esp. because it's an intrusive list - the node pointers are members of the list element.
NOTHING in this list checks for errors. I should've mentioned that in the original post I guess. xd

>Add already existing element again ---> Corrupt link references.
>Add element from another list ---> Corrupt other list.
>moveToHead() on object not in the list ---> head is set to object not in the list.
>remove() object not in list ---> corrupts size counter and may corrupt link references.
>removeFirst/Last() on empty list ---> nuppo

I keep track of whether an element is in the list outside of the list itself for the cache, so such testing wasn't necessary (and IMO not the job of the list in the first place).

I'd be inclined to have a reference in the Element, to the List that owns it.

It'd allow fail-fast protection against multiple logic errors (adding an Element to multiple lists/the same list multiple times, removing/moving an element from a List that it's not a member of, etc), making the code far less fragile.
Yeah, that'd be how you add testing. Add a reference to the list the element currently is on, check this reference in add*(), moveToFirst() and remove(). Null out the reference on removal. Also add an isEmpty() check for the removeFirst/Last() functions.

Actually you could make use of this. Just don't move the object to the top of the list if it's in the upper half (or above some other threshold) already. If anything, just swap it with an object a couple of indices up in the buffer. This way you would only produce holes for the less used objects. Since you don't need additional references, you can easily make up for the lost cache entries by increasing the array size.

Or you only move the head pointer for adding new objects and just swap existing objects up in the buffer. This would not produce any holes and while it is technically not an LRU list anymore, it might provide good enough heuristics to function as a cache.
I'm worried about experimenting too much with my use case. The cache is critical for performance, and if I were to start thrashing, the game would essentially freeze up. I could try some kind of system where you don't move an accessed element to the head of the list, but just "upgrade" it a couple of elements by swapping it with an element x indices up in the list. If x is randomly chosen for each swap, then the risk of hitting a pathological case would be low. Like you said, this wouldn't be an LRU cache anymore, but some kind of heuristic priority cache I guess. It's an interesting idea, but I'm afraid I don't have much more time to spend on this right now. =/
19  Game Development / Shared Code / Re: "Better" LinkedList implementation on: 2017-06-18 22:15:01
Are you thinking about a concurrency-safe version by chance, or is that definitely not an issue in your situation?
Not really, it's not a requirement for my use case. It'd just get messy if I needed to support concurrency in the system I'm working on, really. You can always just synchronize on the whole class if you really wanted to.
20  Game Development / Shared Code / Re: "Better" LinkedList implementation on: 2017-06-18 13:06:41
The problem with a simple ring buffer is that it's essentially a FIFO queue. When an element is accessed, I need to move it to the start of the queue again to make sure it's the most recently used again. I don't think that's doable in an array-based ring buffer. Removing an element would leave a null hole in the array. If you want to fill the hole, you need to shift half the elements to fill the hole (either the ones before or after, so 0 to size/2 elements). If you just leave the hole there, you've essentially reduced the size of the cache until the head reaches that point and can fill the null hole, right? I imagine a pathological (but very common) case where the cache is full of objects, and then two elements are accessed repeatedly, causing them to be moved to the head of the list repeatedly. If each movement leaves a null hole, then one element would be evicted from the cache each time an element is moved to the front of the list. =/ If I've misunderstood or missed something, please let me know.
21  Game Development / Shared Code / "Better" LinkedList implementation on: 2017-06-18 01:43:53
Hey, everyone.

Today I found myself in need of a least-recently-used (LRU) cache system. Essentially, I wanted a cache of the last 10 000 used objects. When a new object was to be loaded, I wanted to unload the oldest element in the cache. Before, the cache was only 96 elements, at which point looping through the list to find the least recently used element was perfectly fine for the load I had, but with 10 000 objects that situation changed. I found that LinkedHashMap could be used as a makeshift cache, as it can be set to update the internal order of elements when they are accessed, but it just seemed inefficient and weird for what I actually wanted to accomplish here. Some research made me realize that a simple linked list was a good fit for this, as a doubly linked list can easily be maintained in LRU order. Basically, whenever an element in the cache is used, it is moved to the beginning of the list (an O(1) operation on a linked list), meaning that the cache stores objects sorted from most recently used to least recently used. Then, to remove the least recently used object, I just remove the tail of the list, again an O(1) operation.

Now, I hope we all know how bad Java's LinkedList class is. Not only does it generate garbage every time an object is added to it, but moving an object to the beginning of a list is an O(n) operation, as I'd need to use list.remove(object), which has to scan through the list until it finds the object in question before it can remove it. A proper linked list implementation would store the previous and next references in the object itself, meaning that finding the previous and next objects becomes an O(1) operation, and the element can be removed without going through the entire list. So... I wrote a minimal linked list that could do what I needed and not much more.

As mentioned, the main difference is that my linked list class does not allocate Node objects to store the next and previous references, but instead stores those references in the elements themselves. This leads to a couple of peculiarities.

 - All elements must implement an interface that allows the list to set the previous and next pointers (or alternatively extend a class which implements those functions for it).
 - You can't place the same element in a linked list twice, as each element can only track one previous and next neighbor.
 - For the same reason, you can't even use the same element in two DIFFERENT lists at the same time.

In my case, none of these were an issue, so I went ahead and implemented it. I even added a special method for moving an element in the list to the start of the list for maximum performance in that sense. The class provides the following functions, all of which are O(1):

 - addFirst(e)
 - addLast(e)
 - moveToFirst(e)
 - remove(e)
 - removeFirst()
 - removeLast()

There is no random access function. The correct way of looping through the list is to simply call element.getNext() until it returns null (the current size of the list is tracked though). The somewhat complicated usage of generics is there to allow for type safety, both when extending the Element class and when working with FastLinkedList.

Usage example:
private class MyElement extends FastLinkedList.SimpleElement<MyElement>{ //Note this definition here, it's particularly clever =P
    ...
}


FastLinkedList<MyElement> list = new FastLinkedList<>();
list.addFirst(...);
...



Performance test on 100 000 randomly shuffled elements:

Adding 100 000 objects to the list:
    LinkedList: 1.617 ms
    FastLinkedList: 0.627 ms
~3x faster

Move 2000 random elements from their current position to the start of the list:
    LinkedList: 203.315 ms
    FastLinkedList: 0.118 ms
Insanely much faster (O(n)--->O(1))

Remove 2000 random elements from the list:
    LinkedList: 175.541 ms
    FastLinkedList: 0.094 ms
Insanely much faster (O(n)--->O(1))

Remove remaining 98 000 objects using removeFirst():
    LinkedList: 0.298 ms
    FastLinkedList: 2.78 ms
Noticeably slower in this test due to worse cache coherency. FastLinkedList needs to access each element's data to find the next and previous node in the list. As the elements are shuffled, this ends up jumping around randomly in memory. Java's LinkedList instead creates Node objects for each element, which end up sequentially in memory as they're recreated each iteration. This is not an issue with real-life data, and is the cost of going garbage-less.




Code:
package engine.util;

import engine.util.FastLinkedList.Element; //this import is required

public class FastLinkedList<E extends Element<E>> {
   
   private E head, tail;
   private int size;
   
   
   public FastLinkedList() {}

   
   public void addFirst(E element){
      if(size == 0){
         head = element;
         tail = element;
      }else{
         head.setPrevious(element);
         element.setNext(head);
         head = element;
      }
      size++;
   }
   
   public void addLast(E element){
      if(size == 0){
         head = element;
         tail = element;
      }else{
         tail.setNext(element);
         element.setPrevious(tail);
         tail = element;
      }
      size++;
   }
   
   public void moveToFirst(E element){
     
      if(element == head){
         return;
      }
     
      E prev = element.getPrevious();
      E next = element.getNext();
     
      prev.setNext(next); //prev cannot be null thanks to if-statement above
      if(next != null){
         next.setPrevious(prev);
      }else{
         //element was tail, update the tail.
         tail = prev;
      }
     
      element.setPrevious(null);
      element.setNext(head);
      head.setPrevious(element);
      head = element;
   }
   
   public void remove(E element){
      E prev = element.getPrevious();
      E next = element.getNext();
     
      if(prev != null){
         prev.setNext(next);
      }else{
         //prev == null means element was head.
         head = next;
      }
     
      if(next != null){
         next.setPrevious(prev);
      }else{
         //next == null means element was tail.
         tail = prev;
      }
     
      element.setPrevious(null);
      element.setNext(null);
     
      size--;
   }
   
   public E removeFirst(){
     
      E h = head;
     
     
      E next = h.getNext();
      if(next != null){
         next.setPrevious(null);
      }else{
         //next and prev are null, list is now empty
         tail = null;
      }

      h.setNext(null);
     
      head = next;

      size--;
     
      return h;
   }
   
   public E removeLast(){
     
      E t = tail;
     
      E prev = t.getPrevious();
      if(prev != null){
         prev.setNext(null);
      }else{
         //next and prev are null, list is now empty
         head = null;
      }
      t.setPrevious(null);
     
      tail = prev;
     
      size--;
     
      return t;
   }
   
   public E getFirst(){
      return head;
   }
   
   public E getLast(){
      return tail;
   }
   
   public int size() {
      return size;
   }
   
   public boolean isEmpty(){
      return size == 0;
   }
   
   public String toString() {
      StringBuilder b = new StringBuilder("[");
      E e = head;
      for(int i = 0; i < size; i++){
         b.append(e.toString());
         if(i != size-1){
            b.append(", ");
         }
         e = e.getNext();
      }
      return b.append(']').toString();
   }
   
   public static interface Element<E extends Element<E>>{
     
      public void setNext(E next);
      public void setPrevious(E previous);
     
      public E getNext();
      public E getPrevious();
   }

   public static class SimpleElement<E extends SimpleElement<E>> implements Element<E>{
     
      private E next, previous;

      @Override
      public void setNext(E next) {
         this.next = next;
      }

      @Override
      public void setPrevious(E previous) {
         this.previous = previous;
      }

      @Override
      public E getNext() {
         return next;
      }

      @Override
      public E getPrevious() {
         return previous;
      }
   }
}
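
In case anyone wants to try it, here's a minimal usage sketch. Particle is just a made-up example type; any class that extends SimpleElement of itself works the same way.

import engine.util.FastLinkedList;

//Hypothetical element type, for illustration only.
public class Particle extends FastLinkedList.SimpleElement<Particle> {

   private final int id;

   public Particle(int id) {
      this.id = id;
   }

   @Override
   public String toString() {
      return "p" + id;
   }

   public static void main(String[] args) {
      FastLinkedList<Particle> list = new FastLinkedList<>();
      Particle a = new Particle(1), b = new Particle(2), c = new Particle(3);
      list.addLast(a);
      list.addLast(b);
      list.addLast(c);      //[p1, p2, p3]
      list.moveToFirst(c);  //[p3, p1, p2], O(1) since c carries its own links
      list.remove(a);       //[p3, p2], no search through the list needed
      System.out.println(list + ", size = " + list.size());
   }
}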
22  Game Development / Shared Code / Re: Compute view/proj matrices for head tracking rendering on: 2017-06-14 14:50:50
I may have something to show at some point. =P
23  Java Game APIs & Engines / OpenGL Development / Re: Optimizing Performance on: 2017-06-12 12:57:48
Are you on mobile? I've heard about that being a major issue on mobile, but never on desktop. Mobile likes to prefetch texture data before the shader starts, which isn't possible if you need to run the shader to figure out the texture coordinates.
24  Java Game APIs & Engines / OpenGL Development / Re: Optimizing Performance on: 2017-06-12 03:28:38
Cool, let me know if you have any questions. I'd love to hear about your results when you're ready, too. =P
25  Game Development / Newbie & Debugging Questions / Re: (LibGDX) Memory Leak/Issue With ShaderProgram on: 2017-06-12 01:41:10
I'm not familiar with that program, but are you sure you're interpreting this correctly? Isn't it saying that there are a crapload of ShaderProgram objects, and they're all stored in a single Object[]? Wouldn't that simply mean that you may be loading shaders, throwing them on an ArrayList somewhere and forgetting about them, preventing them from being garbage collected?
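If that's what's happening, the fix is usually just to dispose shaders you replace or stop using and drop the references. A rough sketch of the pattern; ShaderCache and the source strings are made up here, the dispose calls are the point:

import com.badlogic.gdx.graphics.glutils.ShaderProgram;

public class ShaderCache {

   private ShaderProgram shader;

   //vertexSource/fragmentSource stand in for however you load your sources
   public void load(String vertexSource, String fragmentSource) {
      if (shader != null) {
         shader.dispose(); //free the old GL program before replacing it
      }
      shader = new ShaderProgram(vertexSource, fragmentSource);
      if (!shader.isCompiled()) {
         throw new RuntimeException("Shader compile failed: " + shader.getLog());
      }
   }

   public void dispose() {
      if (shader != null) {
         shader.dispose(); //also drop the reference so the Java object can be garbage collected
         shader = null;
      }
   }
}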
26  Discussions / Miscellaneous Topics / Re: What I did today on: 2017-06-11 19:52:35
Today I reworked a large part of RF's rendering pipeline. I finally added proper sRGB support throughout the pipeline, and the fake bloom just looks so much better with it. We'll need to re-tweak a lot of colors as all non-texture colors (e.g. hardcoded float color values used as vertex colors or uniform color values) will look significantly brighter now, but it'll just be a tiny amount of work to fix.

A very common problem I've encountered is essentially wanting to scale something nicely without aliasing, but still retaining the pixely look of GL_NEAREST filtering. Essentially, I want antialiased edges between the different texels, but the actual interior of each texel to be just a single color from the texture. I found the solution to this: I enable linear filtering and then use a shader to "sharpen" the texture coordinates before sampling. This led to very good results.
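
For illustration, one common formulation of this kind of coordinate sharpening looks roughly like the following, written out as plain Java math. In practice it's a couple of lines in the fragment shader with fwidth() supplying texelsPerPixel, and this is not necessarily the exact formula I use; the function and parameter names are made up.

//One possible texture-coordinate "sharpening" step. uv is the interpolated
//texture coordinate in [0, 1], texSize the texture size in texels along that
//axis, and texelsPerPixel how many texels one screen pixel covers along it.
static float sharpenCoord(float uv, float texSize, float texelsPerPixel) {
   float texel = uv * texSize;                     //position in texel space
   float seam  = (float) Math.floor(texel + 0.5f); //nearest texel boundary
   //Let the linear filter blend only within one screen pixel of the boundary;
   //further away, the coordinate is pinned so the texel interior stays flat.
   float offset = (texel - seam) / texelsPerPixel;
   offset = Math.max(-0.5f, Math.min(0.5f, offset));
   return (seam + offset) / texSize;               //back to normalized coordinates
}

Sampling with linear filtering at the sharpened coordinate then gives flat texel interiors with a one-pixel-wide antialiased transition at every texel edge.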

What I actually wanted to use this for was to upscale the entire rendered scene in Robot Farm to screen resolution. Our tiles are 16x16 pixels big, which meant that we needed to render those tiles (and everything else) at a scale that was a multiple of 16. We then realized that having a higher screen resolution or simply lowering the tile scale meant that the player would be able to see much further than intended, which gave the player a big advantage when exploring. To keep the view distance fixed, we needed to support arbitrary scaling of tiles without them looking like absolute crap. For example, here are our 16x16 tiles drawn as 17x17 quads.

The issue here is that with nearest neighbor filtering, the pixels from the original tiles end up covering either 1 or 2 pixels, which causes heavy distortion of objects depending on where they are on the screen. This is especially visible during motion, but the anti-aliased text drawn at tile resolution shows the issue very well in a still image. The numbers look choppy and have weird thickness here and there. When using the new shader to upscale, here's what I get:

In this case, the new shader essentially degenerates into bilinear filtering of the screen, since the upscaled resolution is so similar to the original resolution, but it is at least a tiny bit sharper than plain bilinear filtering. Not too impressive, actually...



Looking at a more zoomed-in result where we upscale from 16x16 to 34x34, we get the following result with nearest neighbor filtering:

Ugly again: the fact that each pixel is upscaled to either 2 or 3 pixels causes visible unevenness and artifacts in the text. Let's try bilinear filtering:

Well, it's soft and all, but it's also very blurry. How about the new shader?

Perfect! The image is as sharp as it can be without introducing aliasing, but still filtered to the point where the pixels in the text look perfectly even. Try switching between this and the unfiltered image and see the massive difference without a significant loss of sharpness.

Essentially, what the new shader does is give you the output of extremely super-sampled nearest neighbor filtering, without requiring rendering at a higher resolution. It looks great and costs pretty much nothing! =D


27  Java Game APIs & Engines / OpenGL Development / Re: Optimizing Performance on: 2017-06-10 10:59:42
I strongly recommend that you first of all try to optimize the terrain rendering. It's by far the slowest part, so optimizing it should be the priority, and the first step is figuring out what's making it slow. Either processing the geometry is the bottleneck (too many triangles and/or an expensive vertex shader), or processing the pixels is. You can diagnose this by changing the resolution you render at: if you halve the resolution (width/2, height/2), does the timing of the terrain rendering stay the same?
> Performance is ~4x faster ---> The slow part is either the fragment shader and/or the writing of the data to the G-buffer textures, so we should take a look at the fragment shader of the terrain to investigate further.
> Performance roughly stays the same ---> The slow part is the sheer number of vertices and/or the vertex shader. Take a look at the vertex shader and consider adding a LOD system to reduce the number of vertices of distant terrain, if you don't already have that.



Concerning your lighting shader...

 - Be careful! Some compilers don't accept automatic int-->float conversion. Lines 30 and 33 seem to cause issues on at least some AMD hardware. Make sure those are float literals.

 - I'd recommend having different shaders for different types of lights. If you want, you can inject #defines into the source code to specialize the shader for different lights instead of relying on runtime branching on uniform variables. Although branching is not usually very expensive anymore (especially branching on uniforms as that means all shader invocations will take the same branch), it still forces the GPU to allocate enough registers for the worst-case branch for all invocations, which can negatively impact texture read performance. (There's a quick sketch of the #define injection below this list.)

 - Consider using a signed texture format for your normalTexture (GL_RGB8_SNORM for example). They'll be more accurate and are automatically normalized to (-1, +1) instead of (0, 1), so you won't have to do that conversion yourself.

 - You seem to be doing lighting in world space instead of view space, which is more common. Doing it in view space has the advantage of placing the camera at (0, 0, 0), which simplifies some of the math you have.

 - Even if you choose to do the lighting in world space, precompute the inverse view-projection matrix and do a single matrix multiply. Currently, around half of the assembly instructions in your shader come from this single line:
vec4 homogenousLocation = invViewMatrix * invProjectionMatrix * clipSpaceLocation;

This line first calculates a mat4*mat4 operation, which you might recognize requires computing a 4D dot product for every single element of the resulting matrix. This requires 4 operations per element, so that's 64 operations right there. The resulting matrix is then used to do a mat4*vec4 operation, which is much cheaper; this only requires 4 dot products = 16 operations. That means that changing it to the following code is 60% faster as it avoids the mat4*mat4 operation:
vec4 homogenousLocation = invViewMatrix * (invProjectionMatrix * clipSpaceLocation);

but the fastest will always be
vec4 homogenousLocation = invViewProjectionMatrix * clipSpaceLocation;

which will make the entire shader around 80% faster in total. In other words, doing that instead gets rid of 44% of the instructions in your entire shader. As your lighting shader is ALU-bound (lots of math instructions), you're likely to see a very significant performance increase from that optimization alone.
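
Computing that combined matrix once per frame on the CPU is essentially free. A quick sketch with JOML, assuming projection and view are your JOML matrices; how you upload the uniform depends on your setup, and invViewProjectionMatrix just matches the name used in the GLSL above:

import org.joml.Matrix4f;

//Once per frame, on the CPU:
Matrix4f invViewProjection = new Matrix4f(projection) //copy so projection isn't modified
      .mul(view)     //projection * view
      .invert();     //(projection * view)^-1 == invView * invProjection
//...then upload invViewProjection as the invViewProjectionMatrix uniform.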

Together, all the optimizations above (signed normal texture + view space lighting + precomputed matrix) should theoretically yield an 87% increase in performance of the lighting.
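
As for the #define injection mentioned in the list above, a rough sketch of what that can look like at shader-load time; the define names and the loadShader helper are made up for illustration, and this assumes the fragment source body doesn't declare its own #version line:

//Prepend defines to the fragment shader source before compiling, so one file
//can be specialized into one shader per light type with no runtime branching.
String pointLightSource = "#version 330 core\n#define POINT_LIGHT\n" + fragmentSourceBody;
String spotLightSource  = "#version 330 core\n#define SPOT_LIGHT\n"  + fragmentSourceBody;

int pointLightShader = loadShader(vertexSource, pointLightSource); //hypothetical helper
int spotLightShader  = loadShader(vertexSource, spotLightSource);

The shader itself then wraps the per-light-type code in #ifdef POINT_LIGHT ... #endif blocks.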



In addition:

 - Doing a fullscreen pass just for the ambient light is extremely inefficient. It requires your GPU to read in the entire diffuse texture and blend with the entire lighting buffer, which means gigabytes of memory moved around and millions of pixels filled just to add ambientLight*diffuseColor, three math instructions per pixel. I can see that you're doing bloom/HDR right after the lighting. See if you can pack the ambient light calculation into one of those shaders instead. Adding three math instructions to a different shader is always going to be faster than doing an entire fullscreen pass.

 - I get the impression that fog could be merged into another shader as well to save the overhead of a fullscreen pass (like the DoF).

 - Your FXAA shader looks a bit expensive for some reason. Are you using a custom one?

 - You shouldn't be doing fullscreen passes for local lights either. There are a number of different techniques for making sure you're not rendering too many unnecessary pixels.
 > Render an actual sphere and only compute lighting for the pixels covered by the sphere (pretty simple to implement).
 > Use the scissor test and the depth bounds test to only process pixels in a rectangular area around the sphere that are within the depth bounds of the sphere (best and fastest; conceptually simple, but the math to set everything up is a bit involved).
Consider implementing one of those.

28  Java Game APIs & Engines / OpenGL Development / Re: Optimizing Performance on: 2017-06-09 09:37:42
Your first step should be getting basic GPU profiling working. See this thread: http://www.java-gaming.org/index.php?topic=33135.0. Using that, you can get the exact time your different render passes and postprocessing effects take, which will allow you to see where you should focus your efforts.
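
For reference, the core of it is just GL timer queries. A minimal sketch, assuming LWJGL 3 and a GL 3.3 context; the linked thread covers a proper pipelined version that doesn't stall:

import static org.lwjgl.opengl.GL15.*;
import static org.lwjgl.opengl.GL33.*;

int query = glGenQueries();

glBeginQuery(GL_TIME_ELAPSED, query);
//...issue the draw calls for the pass you want to measure...
glEndQuery(GL_TIME_ELAPSED);

//Reading the result immediately forces a sync; in practice you read it a frame or two later.
long nanos = glGetQueryObjecti64(query, GL_QUERY_RESULT);
System.out.println("Pass took " + nanos / 1_000_000.0 + " ms");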

Once you've figured out your bottleneck, you can start optimizing it. If you find anything that stands out, I can help you diagnose what's making that particular part slow.
29  Game Development / Newbie & Debugging Questions / Re: LibGDX Box2D Applying Horizontal Linear Impulse Causes Object To Glide on: 2017-06-07 03:32:37
From what little I remember from playing with Box2D, I think I know what's going on. You're applying too much force and have too low a maximum velocity for your player. What happens is that the horizontal motion completely overwhelms the vertical motion, so when the velocity is capped to the max velocity, the vertical motion pretty much disappears.

Example: Max velocity 10, velocity (100, 1). Velocity is calculated to ~100.005, which is capped to the max velocity. The end result is that velocity is set to around (10, 0.1). If you keep pumping horizontal impulse into the entity, he'll never be able to pick up speed and fall due to the max velocity clamping continually reducing the vertical velocity.
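
One common way around this, if you do want a speed cap, is to clamp the horizontal and vertical components separately instead of clamping the length of the whole velocity vector. A hypothetical sketch using the libGDX Box2D wrapper; MAX_RUN_SPEED and MAX_FALL_SPEED are made-up constants:

import com.badlogic.gdx.math.Vector2;
import com.badlogic.gdx.physics.box2d.Body;

//Clamp each axis on its own so horizontal speed can't crush the vertical component.
Vector2 v = body.getLinearVelocity();
float vx = Math.max(-MAX_RUN_SPEED,  Math.min(MAX_RUN_SPEED,  v.x));
float vy = Math.max(-MAX_FALL_SPEED, Math.min(MAX_FALL_SPEED, v.y));
body.setLinearVelocity(vx, vy);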
30  Discussions / Miscellaneous Topics / Re: Silly Programming Mistakes on: 2017-06-04 21:30:47
Did this one today:
BufferedImage tilesetImage = ImageIO.read(new File("test resources\tilesets\tileset.png"));