1  Java Game APIs & Engines / OpenGL Development / Re: Antialiased rendering to Framebuffer object on: 2018-03-23 11:51:53
To add a bit, MSAA is not something that you just "enable" on a framebuffer. MSAA requires storing multiple samples for each pixel, meaning it needs more memory. A framebuffer object does not contain any actual data; it just renders to textures. If you attach an MSAA texture to the FBO, rendering to it will happen with MSAA. To actually display an MSAA image, you need to resolve it, which usually means averaging the samples of each pixel together. All of this happens automatically when you request an MSAA default framebuffer but needs to be manually taken care of when using FBOs.
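To make the resolve step concrete, here is a minimal CPU-side sketch of what "averaging the samples of each pixel" means. It is only an illustration: on the GPU you would normally let the driver do this (for example by blitting the multisampled FBO to a single-sampled one). The class name, method name and the one-float-per-sample layout are assumptions made for the example.

```java
class MsaaResolveSketch {
    /**
     * Averages the samples of each pixel into one value per pixel.
     * Assumed layout: samplesPerPixel consecutive values for each pixel.
     */
    static float[] resolve(float[] samples, int samplesPerPixel) {
        float[] resolved = new float[samples.length / samplesPerPixel];
        for (int pixel = 0; pixel < resolved.length; pixel++) {
            float sum = 0f;
            for (int s = 0; s < samplesPerPixel; s++) {
                sum += samples[pixel * samplesPerPixel + s];
            }
            resolved[pixel] = sum / samplesPerPixel;
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Two pixels with 4 samples each: one fully covered by a white triangle, one half covered.
        float[] samples = {1f, 1f, 1f, 1f,   1f, 1f, 0f, 0f};
        float[] pixels = resolve(samples, 4);
        System.out.println(pixels[0] + " " + pixels[1]); // prints "1.0 0.5"
    }
}
```

The 4x memory cost mentioned above is visible here too: the input array is four times the size of the resolved output.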
2  Discussions / Miscellaneous Topics / Re: What I did today on: 2018-03-18 01:50:25
Also, how many terms are in your Taylor expansion? And do you just evaluate it with a time since launch or some other variable?
I used 8 coefficients there, but even 3 can sometimes suffice; it depends on the parameters of the simulation. Yeah, I calculate x = (1 - currentTime/totalTime), use x, x^2, x^3, etc with the Taylor coefficients and finally do angle = (1.0 - sum) * Math.PI/2 to get the thrust angle. Since my only real requirement is that (1.0 - sum) is 0 at the start and 1 at the end, I just make sure that the coefficients sum up to 1.0 and that handles it.
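A minimal sketch of that evaluation, following the formulas exactly as written in the post. The coefficient values and the total time (taken here as the sum of the two burn times) are made up for illustration. Note that with x = 1 at launch the polynomial evaluates to the sum of the coefficients, so (1.0 - sum) runs from 0 at launch to 1 at the end.

```java
class ThrustAngleSketch {
    /**
     * Evaluates angle = (1 - sum) * PI/2, where sum = c[0]*x + c[1]*x^2 + ...
     * and x = 1 - currentTime/totalTime.
     */
    static double thrustAngle(double[] coefficients, double currentTime, double totalTime) {
        double x = 1.0 - currentTime / totalTime;
        double sum = 0.0;
        double power = x; // x^1, then x^2, x^3, ...
        for (double c : coefficients) {
            sum += c * power;
            power *= x;
        }
        return (1.0 - sum) * Math.PI / 2.0;
    }

    public static void main(String[] args) {
        double[] coefficients = {0.5, 0.3, 0.2}; // hypothetical values that sum to 1.0
        double totalTime = 559.0;                // assumption: 162 s + 397 s of burn time
        for (double t = 0; t <= totalTime; t += totalTime / 4) {
            double degrees = Math.toDegrees(thrustAngle(coefficients, t, totalTime));
            System.out.printf("t=%.0f s, angle=%.1f deg%n", t, degrees);
        }
    }
}
```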
3  Discussions / Miscellaneous Topics / Re: What I did today on: 2018-03-18 01:05:15
Had my last uni exam today. Now I only have one more left this summer and I'm done with uni.

Decided to have some fun and coded a super simple (ok, maybe not that simple) Falcon 9 FT rocket launch simulator in 2D. I've inputted the following data in SI units:

 - Earth's radius and mass
 - Earth's rotation and the "bonus" horizontal velocity gotten by launching from Cape Canaveral
 - Earth's atmospheric density approximated with an exponential function of height over ground

 - Target orbit altitude (400 000m Low Earth Orbit, just under the ISS) and the exact velocity needed to maintain that orbit

 - Rocket drag area (circle of radius 2.6315m) and approximated drag coefficient (0.25)

 - The following parameters for the two Falcon 9 FT stages:
    - Weight without fuel (22 tons, 4 tons)
    - Weight of fuel (410.9 tons, 107.5 tons)
    - Thrust (~8 000 000 N (actually 7 607 000 to 8 227 000 depending on altitude), 934 000 N)
    - Burn time (162 sec, 397 sec)
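For reference, the derived quantities in the list above can be computed roughly like this. This is a sketch with textbook formulas, not the simulator's actual code: the sea-level density, the scale height of the exponential atmosphere and the launch latitude of 28.5 degrees are assumed values, and all names are invented for the example.

```java
class LaunchPhysicsSketch {
    static final double G = 6.674e-11;              // gravitational constant, m^3 kg^-1 s^-2
    static final double EARTH_MASS = 5.972e24;      // kg
    static final double EARTH_RADIUS = 6_371_000;   // m (mean radius)
    static final double SIDEREAL_DAY = 86_164.1;    // s
    static final double SEA_LEVEL_DENSITY = 1.225;  // kg/m^3 (assumed)
    static final double SCALE_HEIGHT = 8_500;       // m (assumed)

    /** Speed needed to stay in a circular orbit at the given altitude above ground. */
    static double circularOrbitSpeed(double altitude) {
        return Math.sqrt(G * EARTH_MASS / (EARTH_RADIUS + altitude));
    }

    /** Horizontal speed of the ground due to Earth's rotation at a latitude. */
    static double surfaceSpeed(double latitudeDegrees) {
        return 2 * Math.PI * EARTH_RADIUS * Math.cos(Math.toRadians(latitudeDegrees)) / SIDEREAL_DAY;
    }

    /** Exponential approximation of air density as a function of height over ground. */
    static double airDensity(double altitude) {
        return SEA_LEVEL_DENSITY * Math.exp(-altitude / SCALE_HEIGHT);
    }

    /** Standard drag equation; airSpeed is the speed relative to the (rotating) atmosphere. */
    static double dragForce(double altitude, double airSpeed, double dragCoefficient, double area) {
        return 0.5 * airDensity(altitude) * airSpeed * airSpeed * dragCoefficient * area;
    }

    public static void main(String[] args) {
        double area = Math.PI * 2.6315 * 2.6315; // drag area from the post
        System.out.println("orbit speed at 400 km: " + circularOrbitSpeed(400_000) + " m/s");
        System.out.println("surface speed at 28.5 deg: " + surfaceSpeed(28.5) + " m/s");
        System.out.println("drag at 10 km, 300 m/s: " + dragForce(10_000, 300, 0.25, area) + " N");
    }
}
```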

The rocket is then assumed to rotate from 0 to 90 degrees over the course of the flight (you never want to point a rocket downwards or you will not go to space today), and this rotation is controlled using a Taylor series. This allows the rocket to rotate at different rates at different parts of the flight. Basically, this allows the first stage to go more straight up to quickly get out of the thickest part of the atmosphere, and then quickly rotate over to the side to gain orbital speed as efficiently as possible. The Taylor constants are decided using a simple learning algorithm that tries to get rid of the error in altitude and orbital speed/direction. However, if the thrust is simply too high, the rocket will end up with too high an orbital velocity, so I also added the payload weight as a parameter, so if the rocket overshoots, I just add more weight to it. Also, the rotation of the Earth is taken into consideration, and the drag is calculated (approximately) based on the idea that the atmosphere rotates with the planet (so the rocket starts out travelling at around 408m/s together with the surface, but has zero drag since the air is rotating too).

The learning algorithm just tries random modifications to the Taylor constants and the payload weight and checks if the result is a better end orbit than before (= lower error in altitude and orbital speed/direction), and given time it converges on the perfect settings for launching the rocket. After running the program for around 10 minutes and changing the learning rate as it went, I got the calculated error down:
 - -9.909272E-7 meter error in altitude
 - 6.1378614E-8 m/s error in velocity away from the Earth
 - -1.3683348E-8 m/s error in orbital velocity

In other words, the resulting orbit is off by less than a micrometer, and the velocity is off by even less than that. The maximum payload was determined to be 21843.23 kg, which is a bit lower than the payload the Falcon 9 FT is rated for, 22 800kg payload (possibly plus 1 700 kg, not sure if those 1.7 tons are included in the 22 800 number), but at least close enough to be in the same ballpark. Considering it's a 2D simulation I made with information I could Google in a couple of minutes, it's pretty cool.
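The "try a random modification, keep it if the error got lower" loop described above is essentially random-search hill climbing. Here is a minimal sketch of the idea, with a stand-in error function instead of a flight simulation. All names, the step size, the seed and the one-parameter-at-a-time perturbation are assumptions for the example, not details from the original program.

```java
import java.util.Random;
import java.util.function.ToDoubleFunction;

class RandomSearchSketch {
    /** Repeatedly perturbs one random parameter and keeps the change only if the error drops. */
    static double[] optimize(double[] start, ToDoubleFunction<double[]> error,
                             double stepSize, int iterations, long seed) {
        Random random = new Random(seed);
        double[] best = start.clone();
        double bestError = error.applyAsDouble(best);
        for (int i = 0; i < iterations; i++) {
            double[] candidate = best.clone();
            int index = random.nextInt(candidate.length);
            candidate[index] += random.nextGaussian() * stepSize;
            double candidateError = error.applyAsDouble(candidate);
            if (candidateError < bestError) { // better result: keep the modification
                best = candidate;
                bestError = candidateError;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Stand-in error: squared distance to a known target instead of the orbit error.
        double[] target = {1.0, 2.0, 3.0};
        ToDoubleFunction<double[]> error = p -> {
            double sum = 0;
            for (int i = 0; i < p.length; i++) sum += (p[i] - target[i]) * (p[i] - target[i]);
            return sum;
        };
        double[] result = optimize(new double[3], error, 0.1, 20_000, 42L);
        System.out.println("remaining error: " + error.applyAsDouble(result));
    }
}
```

Lowering stepSize as the error shrinks (the "changing the learning rate as it went" part) helps, since large random steps are rarely improvements once you are close to the optimum.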

Image of the trajectory:

The curve at the start comes from the fact that the planet is rotating, so the rocket starts with a high speed to the left but accelerates upwards. The color of the trajectory shows the two stages and how they deplete their fuel (which of course causes the rocket to get MUCH lighter over the course of the flight). Once the first stage is depleted, it instantly switches to stage 2, at which point stage 1 is jettisoned and no longer adds weight to the rocket. It would be possible to add a zero-thrust stage in between to add some time for the stage separation. No, the first stage doesn't fly back and land. =P

Some people go out drinking to celebrate, I write rocket simulations...
4  Discussions / Miscellaneous Topics / Re: What I did today on: 2018-03-11 15:19:56
So far, the most glaring missing feature is uniform buffers, instead using the old uniform system which has much worse performance than uniform buffers.

See #1231.
From that GitHub issue:
GL UBOs are conceptually the same as constant buffers in D3D. Internally bgfx does use constant buffers, but it assigns only used constants. shaderc does strip unused constants and makes more compact constant buffer. This behavior matches what old-GL style does it. So from user point of view this is better/more desirable behavior, where you just don't care about packing things. In the future I'll add per view and per frame uniforms, which will be set at different frequency than per draw uniforms. Some GL implementations implement UBO (can't find blog post about this) as driver internally calls bunch of glUniform. I haven't tested this myself, but having tested some other GL features that were also promising huge gains, I believe it. Anyhow, bgfx is not 1:1 wrapper to lower level APIs and there will be always some differences how things are handled on high-level.

Imagine you want to draw 300 3D models in your game. The vertex and index data are all in shared buffers and they all use the same shader and same textures, but a single uniform vec3 for the position offset of each model is changed in between each draw call to place each model at the right position. You also have a large number of uniforms for controlling lighting (essentially arrays of light colors, positions, radii, shadow map matrices, etc), but these are of course set up once. You essentially have code that looks like this:
//Upload lighting data
glUniform3fv(lightColorsLocation, 100, ...);
glUniform3fv(lightPositionsLocation, 100, ...);
glUniform1fv(lightRadiiLocation, 100, ...);
glUniformMatrix4fv(shadowMatricesLocation, 100, ...);

for(Model m : models){
    glUniform3f(positionOffsetLocation, m.getX(), m.getY(), m.getZ());
    //...draw call for this model goes here...
}

This will perform so badly it's not even funny, and it's easy to explain why. GPUs do not have support for "uniform variables". They source uniform data from buffers, either in RAM or VRAM. This means that the driver will create a uniform buffer layout for us based on the defined uniforms in our shader and place our data in a buffer. Great, no problem. We end up with a buffer that has the light colors, positions, radii and shadow matrices... and then the position offset. Then we change the position offset in between each draw call. The problem is that the GPU hasn't actually executed those commands yet, so we can't write over our previous uniform data. Because of that, the driver will create a full copy of the entire uniform buffer, including all the light data, giving each draw call its own version of the uniform buffer, with only the position offset actually differing between them. This leads to a lot of hidden memory usage within the driver and horrendous CPU performance (spiking on either glUniform3f() as it allocates a copy each time, or on buffer swapping if a multithreaded driver is used). The fact that all the uniform variables are placed in the same buffer makes it impossible to change a few of them without making an entire copy.

This is what the Nvidia driver does, and the exact problem I got working on Voxoid for Cas... except in my case, I already had my massive lighting data in a (manually created separate) uniform buffer. The only things being copied around were the per-scene/per-camera attributes like fog uniforms and camera matrices, and even then, that copying was enough to completely kill CPU performance. The above example would probably drop below 60 FPS at around 100 draw calls or less.

Sure, bgfx could try to add some heuristics to this whole thing to try to figure out the update rate of different uniforms and assign them to different uniform buffers that can be swapped individually... aaaaand before you know it you have an inefficient as f**k behemoth that uses heuristics for which uniforms to place in which uniform buffers, more heuristics to figure out whether to place the buffer in RAM or VRAM, and even more heuristics to detect arrays of the same size and group them into structs for memory locality, and now the user has to train the heuristics to not bork out and max out the RAM of your computer for anything more complex than the most trivial possible use of uniforms, the whole engine crashes when it tries to use more than the maximum number of uniform buffers, etc. You've just created an unholy mess that has much worse performance due to the overhead of running the heuristics calculations even for well-written code, and you got crazy amounts of bugs and spaghetti interaction between seemingly unrelated things. Hmm, what does that remind me of...? Right, OpenGL. None of these are exaggerations either, by the way. Do anything at all unconventional with buffer object mapping/updating on the Nvidia driver and you'll start to see all kinds of cool colors and shapes floating in front of you as you feel the sweet LSD overdose kick in which you took trying to deal with the fact that your buffer object test program performs completely differently depending on which order you test the different methods in, and one order semi-reliably causes a massive memory leak in the driver, but I digress. (disclaimer: don't do drugs)

All this just to avoid having the user lay out the data and upload it to a buffer him/herself.

EDIT: In the above case, I solved it by completely getting rid of the glUniform3f() call by placing the data in a per-instance vertex attribute buffer, and using base instance to pick a value for each draw call. This would of course have solved the issue for bgfx as well in this case, but even then, the uniform buffer approach is vastly superior. If each of those models needed a different shader, you'd have to call the uniform setters again after each shader switch, and OpenGL/bgfx would still need to duplicate the light data for each shader. With uniform buffers, you could just leave the uniform buffer (or the Vulkan descriptor set) bound and it all just works out with absolutely no additional cost.

EDIT2: An additional benefit of uniform buffers is the ability to do arrays-of-structs instead of structs-of-arrays, which should have better memory locality.
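As an illustration of laying out the data yourself, here is a sketch that packs lights as an array of structs into a ByteBuffer using std140 rules. The Light struct, its members and the class name are hypothetical. The layout assumes a GLSL block declared with layout(std140), where a vec3 is aligned to 16 bytes, a following float fits into the remaining 4 bytes, and the struct size is rounded up to a multiple of 16.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class LightBufferSketch {
    // Assumed GLSL-side struct inside a std140 uniform block:
    //   struct Light { vec3 position; float radius; vec3 color; };
    // Offsets: position = 0, radius = 12, color = 16, padded to 32 bytes per light.
    static final int LIGHT_STRIDE = 32;

    /** Packs all lights into one buffer that could be uploaded with a single buffer update. */
    static ByteBuffer packLights(float[][] positions, float[] radii, float[][] colors) {
        ByteBuffer buffer = ByteBuffer.allocateDirect(positions.length * LIGHT_STRIDE)
                                      .order(ByteOrder.nativeOrder());
        for (int i = 0; i < positions.length; i++) {
            int base = i * LIGHT_STRIDE;
            buffer.putFloat(base,      positions[i][0]);
            buffer.putFloat(base + 4,  positions[i][1]);
            buffer.putFloat(base + 8,  positions[i][2]);
            buffer.putFloat(base + 12, radii[i]);
            buffer.putFloat(base + 16, colors[i][0]);
            buffer.putFloat(base + 20, colors[i][1]);
            buffer.putFloat(base + 24, colors[i][2]);
            // bytes 28..31 are padding
        }
        return buffer; // absolute puts leave position at 0, so the buffer is ready to upload
    }

    public static void main(String[] args) {
        ByteBuffer lights = packLights(
                new float[][] {{0, 1, 2}, {3, 4, 5}},
                new float[] {10f, 20f},
                new float[][] {{1, 0, 0}, {0, 1, 0}});
        System.out.println(lights.capacity() + " bytes for 2 lights");
    }
}
```

All the data for one light ends up in one 32-byte chunk, which is the memory locality benefit compared to four separate uniform arrays.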

5  Discussions / Miscellaneous Topics / Re: What I did today on: 2018-03-07 22:11:20
This graphics abstraction sounds pretty similar to the goals of BGFX (plus supports many more backends/platforms and has been thoroughly tested and used in high profile projects) and considering LWJGL3 now has bindings for the same, does it still make sense to roll you own? or are there other reasons for doing so?
I was aware of BGFX's existence to the extent that I knew that LWJGL had a binding for it, but I wasn't entirely sure what its purpose was just from checking the Javadocs in LWJGL. I've looked it up a bit now and it does indeed seem to have the same purpose as the abstraction I've started on. I can't make any conclusive statements about it; I'll need to look into it more, but it doesn't seem like a perfect match from what I can tell.

So far, the most glaring missing feature is uniform buffers, instead using the old uniform system which has much worse performance than uniform buffers. For example, with forward+ shading I'll be uploading a large amount of light data to uniform buffers which are then read in every single rendering shader. With uniform buffers, I can lay out and upload the data much more efficiently using buffers, bind it once and leave it bound for the entire render pass. Without them, I'll have to reupload all the uniforms for each shader that needs them with one call for each variable.

Secondly, it's unclear to what extent BGFX supports multithreading. It claims that the rendering commands are submitted from a dedicated rendering thread (so am I) to improve performance, but this is essentially what the OpenGL driver already does. I can't tell from the research I did if it's possible to construct efficient command buffers from multiple threads with BGFX. I suspect it does support that, but if it doesn't that'd obviously hold back performance quite a bit.

The biggest problem is just the entire approach that BGFX has. The level of abstraction is at the level of OpenGL, meaning that a lot of the gritty details are hidden. When I started learning Vulkan and trying to find common points between Vulkan and OpenGL, I quickly came to the conclusion that emulating OpenGL on Vulkan is completely pointless. If you try to abstract away the gritty details like image layouts, descriptor sets, multithreading, uniform buffers, etc like OpenGL does, you're setting yourself up for essentially writing a complete OpenGL driver. The problem is that it's going to be a generic OpenGL driver that has to manage everything in suboptimal ways, instead of an actual dedicated driver for the specific hardware you have, and seriously, you're not gonna be able to write a better driver than Nvidia for example.

The key thing I noticed was that it's a really bad idea to emulate OpenGL on top of Vulkan, but emulating Vulkan on top of OpenGL is actually super easy. Emulating descriptor sets on OpenGL is super easy and you can keep a lot of the benefits of descriptor sets. Image layout transitions can just be completely ignored by the abstraction (the OpenGL driver will of course handle that for us under the hood). We can emulate command buffers to at least get some benefit by optimizing them on the thread that compiles them. In other words, writing a Vulkan implementation that just delegates to OpenGL is easy. Writing an OpenGL implementation that delegates to Vulkan, oh, have fun. Hence, if you expose an API that requires the user to provide all the data needed to run the API on Vulkan/DirectX12/Metal, adding support for anything older like OpenGL is trivial.
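To give a rough idea of what emulating command buffers on top of OpenGL can look like, here is a deliberately tiny sketch: commands are recorded as Runnables (which can happen on any thread, since nothing touches the GL context yet) and replayed in order on the thread that owns the context. The class and method names are invented for illustration and are not taken from any real abstraction layer.

```java
import java.util.ArrayList;
import java.util.List;

class EmulatedCommandBufferSketch {
    private final List<Runnable> commands = new ArrayList<>();

    /** Records a command; nothing is executed yet, so this can run on a worker thread. */
    void record(Runnable command) {
        commands.add(command);
    }

    /** Replays the recorded commands in order; with OpenGL this must run on the context thread. */
    void submit() {
        for (Runnable command : commands) {
            command.run();
        }
    }

    public static void main(String[] args) {
        EmulatedCommandBufferSketch cmd = new EmulatedCommandBufferSketch();
        // Stand-ins for real GL calls such as binding state and issuing a draw.
        cmd.record(() -> System.out.println("bind pipeline"));
        cmd.record(() -> System.out.println("bind descriptor set"));
        cmd.record(() -> System.out.println("draw"));
        cmd.submit();
    }
}
```

A real implementation would more likely record compact command structs than Runnables, so that redundant state changes can be filtered out while the buffer is being compiled.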
6  Discussions / Miscellaneous Topics / Re: What I did today on: 2018-03-06 20:58:29
@theagentd you are smart as f**k.  I have no clue what most of that meant at all lol.
Thanks, but you can see that as me being bad at explaining too. =P If you have any specific questions, I'd be glad to answer them.
7  Discussions / Miscellaneous Topics / Re: What I did today on: 2018-03-06 20:29:20

I won't go too deep into detail here, as 1. it'd be way too much for one post and 2. some of it may change drastically before we would consider releasing it, so keep that in mind.

2D is mostly just simple stuff like sprite batching and some simple postprocessing (simple bloom, dithering, MSAA resolving), so not too much to brag about.

NNGINE has a powerful task-based threading library which allows the user to define "task trees" to run. A task can either be split into subtasks that are processed on multiple threads (example: split up frustum culling into N subtasks) or be a single non-threadable task that runs in parallel with other tasks. The task tree additionally contains dependencies between tasks, forcing certain tasks to finish before others can run. The system has the potential to use any number of CPU cores, and NNGINE could hit 3.5x scaling on a quad core, held back by OpenGL.
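This is not NNGINE's API, but the general idea of tasks with dependencies can be sketched with plain java.util.concurrent. All names below are hypothetical, and the sketch assumes that dependencies are added before the tasks that depend on them.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class TaskTreeSketch {
    private final Map<String, CompletableFuture<Void>> tasks = new HashMap<>();
    private final ExecutorService executor;

    TaskTreeSketch(ExecutorService executor) {
        this.executor = executor;
    }

    /** Schedules a task that starts once all named dependencies (added earlier) have finished. */
    void addTask(String name, Runnable work, String... dependencies) {
        CompletableFuture<?>[] deps = new CompletableFuture<?>[dependencies.length];
        for (int i = 0; i < dependencies.length; i++) {
            deps[i] = tasks.get(dependencies[i]);
        }
        tasks.put(name, CompletableFuture.allOf(deps).thenRunAsync(work, executor));
    }

    /** Blocks until every task that was added has finished. */
    void awaitAll() {
        CompletableFuture.allOf(tasks.values().toArray(new CompletableFuture<?>[0])).join();
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        TaskTreeSketch tree = new TaskTreeSketch(pool);
        // Two culling subtasks may run in parallel; rendering waits for both.
        tree.addTask("cullA", () -> System.out.println("culling first half"));
        tree.addTask("cullB", () -> System.out.println("culling second half"));
        tree.addTask("render", () -> System.out.println("rendering"), "cullA", "cullB");
        tree.awaitAll();
        pool.shutdown();
    }
}
```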

The 3D engine is where most of the fancy stuff is happening. The engine is currently using classic deferred rendering, but this will be changed to clustered deferred, with an optional forward clustered rendering mode (either for the entire scene, or just for lighting particles/transparent effects etc). The G-buffer data for deferred will be customized with a storage/read shader to allow games to customize the G-buffer layout to contain the data they need.

Rendering of geometry is done using a minimal Renderer interface. The interface pretty much only has a render() function which is called by the engine for each view that needs to be rendered. However, the Renderers are allowed to inject their own tasks and dependencies into the main NNGINE task tree, allowing renderers to run frustum culling in parallel, etc. This is used extensively in the current Renderers we have implemented.

USING Renderers is pretty easy. Here's the complete code for adding the ocean I made that you may have seen screenshots of here:
      oceanRenderer = new OceanRenderer();

Obviously, the complexity of the renderer depends on the actual implementation. Here's the code for setting up a static 3D model:

      modelRenderer = new ModelRenderer();
      modelRenderer.addStaticModel("terminus", ModelUtils.readModel(new File("terminus.wam")));

      Instance in = modelRenderer.createStaticInstance();
      in.matrix.translation(x, y, z).scale(0.5f);

      //once you don't want the instance
      in.dispose(); //don't forget it or it lingers forevaaa

You also need to create the NNGINEGraphics object which is the main 3D engine object (the "graphics" variable in the above code). This is 20 lines or so, including setting default graphics settings and setting up a camera.

Model textures are streamed in automatically on demand on a background thread by the engine by checking the materials of visible models. Gonna make model vertex data (optionally) streamed too, as it's annoying to manage right now.

Lighting includes shadowed and unshadowed point, cone and directional lights (with configurable cascades). Lighting is physically based and produces very high quality reflections, and should be close to the result seen in Substance Painter when designing materials for NNGINE. A planned feature is changing the BRDF to allow for lower quality lighting, cel shading, etc.

Postprocessing is currently not heavily customizable. Every single possible setting can be changed while the game is running and updated (except GPU count and texture resolution, last one for now only, planned feature), but you cannot currently add your own postprocessing passes. This is due to the heavy integration of the passes to minimize the number of fullscreen passes and how optimized the whole thing is. Adding individual postprocessing "passes" will never perform well, so I'll probably introduce the concept of a complete "postprocessing pipeline" which you can completely swap out instead. Current effects include: HQ SSAO, motion blur and depth of field (extremely high quality, writing my master's thesis on it), screenspace volumetric lighting, HDR tone mapping (with camera exposure etc), HQ bloom, temporal SRAA, air distortions, and probably a few I forgot about. =P

Transparency rendering is a tough one. When designing NNGINE I went for a 100% order independent transparency system, but this wasn't entirely optimal due to performance and memory usage. I'm not sure what to do here, but regardless of the sorting system we'll allow for lit particles in the future thanks to the wonders of clustered shading. Regardless, transparency is structured similarly to the opaque Renderers, with different renderers. Particles are a specific renderer and are in turn split into Systems for culling. This stuff is very likely to change though, so not much point in posting code. It won't be that far off from the above ModelRenderer example though.

Finally, the entire engine will be ported over from OpenGL to a graphics abstraction system I'm working on that will allow the engine to run on pretty much any rendering API, with the primary targets being Vulkan and OpenGL variants. The abstraction is pretty much done... in my head at least. I have some minor code and proofs of concept (read: a single triangle), but it's been on ice for some time since that. The abstraction will include a cross-API shader language (or something along those lines). Bear in mind, it will NOT be easy to work with the abstraction directly (which you would have to do to some extent when writing your own Renderers); it's essentially raw Vulkan with only very minor simplifications. That being said, the actual code you'd write in that case would be surprisingly similar to OpenGL. When running on OpenGL, command buffers are emulated (still can bring great performance benefits actually). Vulkan support will allow for significantly reduced driver overhead, threaded rendering (finally linear scaling with cores!) and advanced features like manual multi-GPU support.

We got an OpenAL-based sound system, controller support, multi-window support, etc as well. We got a Blender plugin for exporting models and animations from Blender directly to our own format. We got an entire toolset under construction for converting textures to streamable NNGINE textures.

I feel like I forgot half the shit I wanted to write, but TL;DR: When released, it will be a completely multithreaded, cross platform, cross API (with Vulkan support), highly customizable engine.
8  Discussions / Miscellaneous Topics / Re: What I did today on: 2018-03-06 12:18:32
Is NNGINE a fork of NGINX?  Grin
Yes, the engine is just the front-end and is just a webpage. It just connects to our server farm in China.

No, it is a TERRIBLE DISTRACTION from the awesome VOXOID engine he's also making Tongue

Cas Smiley
... which is also just a distraction from my schoolwork and Master's thesis. =P
9  Discussions / Miscellaneous Topics / Re: What I did today on: 2018-03-06 08:48:01
I'll post some examples to give you a generic idea of how NNGINE works as soon as I can (either tonight or tomorrow night depending on school).
10  Java Game APIs & Engines / OpenGL Development / Re: Seams rendering tilemap on: 2017-12-25 13:22:11
You'll always have a risk of bleeding when using a tilemap.

 - Linear filtering is inherently problematic, as it means the edge around each tile will read the neighboring tiles.
 - Mipmaps are problematic even without linear filtering if your tile size isn't a power of two (or if you've set the max mipmap level incorrectly).
 - MSAA can cause extrapolation of texture coordinates when the center of a pixel is outside the triangle but one or more of the samples are still covered, which causes bleeding.
 - Even if all the above are disabled, you can still get rare bleeding due to floating point errors. These are more common when your tilemap texture isn't a power-of-two, but can always occur.

All of the above can be 100% fixed by using a GL_TEXTURE_2D_ARRAY texture instead, with each tile in its own layer. Then you can use linear filtering, mipmaps and MSAA without any issues at all as no filtering will ever occur between texture array layers. If you for some reason don't want to use 2D texture arrays (there's really no reason not to use them though; they're pretty much meant for these cases) and you're willing to sacrifice linear filtering and mipmaps, simply modifying the texture coordinates to read a slightly smaller area of the texture than the tile can work. So for example, if your tiles are 32x32 pixels big, you can use the texture coordinates (0.1, 0.1) to (31.9, 31.9) or something like that (divided by the texture size of course). You can prevent extrapolation when using MSAA by enabling centroid interpolation of the texture coordinates in your shader.
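The inset-coordinates workaround can be sketched as a small helper. The method name and the assumption of a square atlas with square tiles are made up for the example; the 0.1 pixel inset is the value from the text above.

```java
class TileUvSketch {
    /**
     * Returns {u0, v0, u1, v1} for the tile at (tileX, tileY) in a square atlas,
     * shrunk by `inset` pixels on every side to stay clear of neighboring tiles.
     */
    static float[] insetTileUv(int tileX, int tileY, int tileSize, int textureSize, float inset) {
        float u0 = (tileX * tileSize + inset) / textureSize;
        float v0 = (tileY * tileSize + inset) / textureSize;
        float u1 = ((tileX + 1) * tileSize - inset) / textureSize;
        float v1 = ((tileY + 1) * tileSize - inset) / textureSize;
        return new float[] {u0, v0, u1, v1};
    }

    public static void main(String[] args) {
        // 32x32 tiles in a 256x256 atlas: pixel range (0.1, 0.1) to (31.9, 31.9) for tile (0, 0).
        float[] uv = insetTileUv(0, 0, 32, 256, 0.1f);
        System.out.println(uv[0] * 256 + " .. " + uv[2] * 256);
    }
}
```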

If you have further questions or want more information about any of this stuff, just ask.
11  Games Center / WIP games, tools & toy projects / Re: Generated Graph optimized A* Pathfinding on: 2017-12-18 13:26:37
Light falloff values are VERY dependent on you doing correct gamma correction. If you just output linear 8-bit color values without any modification, your monitor will essentially run those through approximately x^2.2. I may be mistaken, but the fact that you think a linear falloff looks the best seems to indicate that you're not doing gamma correction.

A lack of gamma correction is especially noticeable when adding together light values. If you write out 0.2 as your light value, that gets mashed to 0.2^2.2 = ~0.029. However, if you have two lights that add up, both with the same intensity, you get 0.2+0.2=0.4 ---> 0.4^2.2 = ~0.133. In other words, adding together two lights makes the result roughly 4.6x as bright, not 2x.

It is imperative that you get gamma correction working before you start tweaking things like this.
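The arithmetic above is easy to check in code. This sketch uses the plain x^2.2 approximation from the post (real sRGB uses a slightly different piecewise curve); the class and method names are invented for the example.

```java
class GammaSketch {
    static final double GAMMA = 2.2;

    /** Approximates what the monitor does to an output value in [0, 1]. */
    static double displayed(double value) {
        return Math.pow(value, GAMMA);
    }

    /** Gamma-corrects a linear light value so the monitor's response cancels out. */
    static double encode(double linear) {
        return Math.pow(linear, 1.0 / GAMMA);
    }

    public static void main(String[] args) {
        // Without correction: doubling the written value multiplies brightness by 2^2.2.
        System.out.println("one light:  " + displayed(0.2));
        System.out.println("two lights: " + displayed(0.4));
        System.out.println("ratio:      " + displayed(0.4) / displayed(0.2));
        // With correction: sum the lights in linear space, then encode once at the end.
        System.out.println("corrected ratio: " + displayed(encode(0.4)) / displayed(encode(0.2)));
    }
}
```

With the encode step in place, two equal lights come out twice as bright as one, which is what makes falloff curves behave the way you'd expect.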
12  Games Center / WIP games, tools & toy projects / Re: Generated Graph optimized A* Pathfinding on: 2017-12-13 17:22:58
My open areas were not rectangular, but organic. Try a big diamond-shaped room and then let my grand children know the results.  Wink
13  Games Center / WIP games, tools & toy projects / Re: Generated Graph optimized A* Pathfinding on: 2017-12-13 16:59:45
I've actually tried out this exact pathfinding system myself. It's REALLY nice and clean and is a LOT faster for stuff like long corridors. It pretty much generates completely optimal paths in contrast with a grid-based pathfinder. However, it really can't handle open spaces well, as you end up with all corners connecting to all other corners over extreme distances. These checks are expensive and slow. In my case I needed to be able to modify the terrain efficiently and there was a very big risk of large open rooms with massive amounts of edges, so I ended up dropping it.
14  Java Game APIs & Engines / Engines, Libraries and Tools / LibGDX jars not part of the generated project and instead deep in .gradle folder on: 2017-11-07 22:13:05
So I'm trying to lay the foundation for a small school project for my Game Design class. I said that we could use LibGDX for Android compatibility, and now I'm in charge of setting it up of course. The problem is that I just cannot fathom what I'm supposed to do to get this project pushed completely to GitHub so we can actually collaborate. The LibGDX project creator generated dependencies like this:

Is there any way I can make LibGDX generate sane library dependencies or should I just clone my harddrive and mail it to my teammates?
15  Discussions / Miscellaneous Topics / Re: What I did today on: 2017-10-22 21:57:08

Ah, so the scenario is essentially 1 source of gravity where the precise positions of the ships/particles affected by it comes to play? But then why is the calculations so taxing? What makes this so different than anything else? The precision? The magnitude/amount of bodies? Why wouldn't e.g. a QuadTree based solution work?
Simulating the physics takes 0.27 ms for ~1 million ships, and this is GPU bandwidth limited, so I can have up to 8 sources of gravity before I get any drop in performance. If it's just the simulation, it can easily be done for over 10 million ships. The problem is the collision detection really. Hierarchical data structures are usually not very efficient on the GPU, and constructing them on the CPU would require reading the data back, constructing the quad tree, then uploading it again to the GPU, which is gonna be too slow. In addition, actually querying the quad tree on the GPU will be very slow as well; GPUs can't do recursion and computations happen in lockstep in workgroups, so any kind of branching or uneven looping will be very inefficient. It's generally a better idea to use a more fixed data structure, like a grid instead, but that's a bad match in this case. The large scale of the world, the extremely fast speed of the ships and the fact that ships will very likely be very clumped up into fleets means that even a uniform grid will be way too slow.

The idea of sorting the ships along one axis and checking for overlap of their swept positions (basically treating each ship as a line from its previous position to its current position) was inspired by Box2D's broadphase actually. I concluded that sorting was a simpler problem to solve than creating and maintaining a spatial data structure (especially on the GPU), but after testing it out more I'm not sure it's a good solution in this case. For a fleet orbiting in close formation, there's a huge spike in sorting cost when the fleet reaches the leftmost and rightmost edges of the orbit, where the order of the entire fleet reverses. There are also problems when two large fleets (one moving left and the other right) cross each other, again due to the two fleets first intermixing and then swapping positions in the list once they've crossed... Finally, there's a huge problem with just fleets travelling around together. A fleet of 10 000 ships moving very quickly together will have overlapping swept positions, so all 10 000 ships will be collision tested against each other.
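A single-threaded CPU sketch of the sort-and-sweep idea described above: each ship becomes an interval along X covering its previous and current position, the intervals are sorted by their left edge, and only overlapping intervals become candidate pairs. The names, the padding by a radius and the use of long coordinates are assumptions for the example, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SweepSketch {
    /** A ship's swept extent along X: previous to current position, padded by its radius. */
    static final class Swept {
        final int id;
        final long minX, maxX;

        Swept(int id, long previousX, long currentX, long radius) {
            this.id = id;
            this.minX = Math.min(previousX, currentX) - radius;
            this.maxX = Math.max(previousX, currentX) + radius;
        }
    }

    /** Returns the id pairs whose swept X intervals overlap; only these need a precise test. */
    static List<int[]> candidatePairs(Swept[] ships) {
        Swept[] sorted = ships.clone();
        Arrays.sort(sorted, Comparator.comparingLong((Swept s) -> s.minX));
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < sorted.length; i++) {
            // Later ships start further right; stop as soon as one starts beyond our right edge.
            for (int j = i + 1; j < sorted.length && sorted[j].minX <= sorted[i].maxX; j++) {
                pairs.add(new int[] {sorted[i].id, sorted[j].id});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        Swept[] ships = {
            new Swept(0, 0, 10, 1),
            new Swept(1, 5, 8, 1),
            new Swept(2, 100, 120, 1),
        };
        for (int[] pair : candidatePairs(ships)) {
            System.out.println(pair[0] + " may collide with " + pair[1]); // prints "0 may collide with 1"
        }
    }
}
```

The fleet problem is visible in the inner loop: if 10 000 ships all have overlapping intervals, the early exit never triggers and the pass degrades to testing every pair.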

I got a lot of thoughts on this problem, so if you want to have more of a discussion about this, I'd love to exchange ideas and thoughts on this through some kind of chat instead.
16  Discussions / Miscellaneous Topics / Re: What I did today on: 2017-10-22 19:28:54

It's not an N-body simulation. Celestial bodies (stars, planets, moons) affect each other, but ships are only pulled by celestial bodies. Ships don't pull each other either.
17  Discussions / Miscellaneous Topics / Re: What I did today on: 2017-10-21 23:28:53
Life's been tough on me the last few weeks, especially the last few days, so I decided to do some extra fun coding this weekend.

3-4 years ago I made some threads about an extreme scale space game with realistic Newtonian physics. The game would require simulating a huge number of objects affected by gravity, with extreme speed collision detection. I am talking 100k+ ships, each orbiting a planet at 10km/second, with accurate collision detection. The technical challenges are enormous. After some spitballing here on JGO, I ended up implementing a test program using fixed-precision values (64-bit longs) to represent positions and velocities to get a constant amount of precision regardless of distance from origin. Simple circle-based collision detection was handled by sorting the ships along the X-axis, then checking collisions only for ships that overlap along the X-axis. The whole thing was completely multi-threaded, and I even tried out Riven's mapped struct library to help with cache locality. Even sorting was multithreaded using my home-made parallel insertion sort algorithm, tailor-made for almost-sorted data sets (the order along the X-axis did not change very quickly). It scaled well with more cores, but was still very heavy for my poor i7.

I realized that the only way to get decent performance for this problem on a server would be to run the physics simulation on the GPU. With an order of magnitude higher performance and bandwidth, the GPU should be able to easily beat this result as long as the right algorithms are used. The physics simulation is easy enough as it's an embarrassingly parallel problem and fits perfectly for the GPU. The collision detection (sorting + neighbor check) is a whole different game. GPU sorting is NOT a fun topic, at least if you ask me. The go-to algorithm for this is a parallel GPU radix sort, but with 64-bit keys that's very expensive. Just like my parallel insertion sort algorithm took advantage of the almost-sorted nature of the data, I needed something similar that could run on the GPU as well. That's when I stumbled upon a simple GPU selection sort algorithm.

The idea is simple. For each element, loop over the entire array of elements to sort and count how many elements should be in front of this element. You now know the new index of your element, so move it directly to that index. Obviously, this is O(n^2), so it doesn't scale too well. However, the raw power of the GPU can compensate for that to some extent. 45*1024 = 46 080 elements can be sorted at ~60 FPS, regardless of how sorted the array is. By using shared memory as a cache, performance almost triples to 160 FPS, allowing me to sort 80*1024 = 81 920 elements at 60 FPS. Still not fast enough. Anything above 200k elements runs a big risk of causing the driver to time out and restart...
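
A CPU-side sketch of that idea in Java, just to show the logic (the real thing runs as a compute shader, which isn't shown here). The class name is made up, and breaking ties by original index is my assumption for how to give equal keys distinct output slots.

```java
// Hypothetical sketch of the O(n^2) "count how many go before me" sort.
final class RankSort {
    static long[] sort(long[] in) {
        long[] out = new long[in.length];
        // On the GPU, each iteration of this outer loop would be one thread.
        for (int i = 0; i < in.length; i++) {
            int rank = 0;
            for (int j = 0; j < in.length; j++) {
                // Ties are broken by original index so equal keys never
                // end up with the same output index.
                if (in[j] < in[i] || (in[j] == in[i] && j < i)) {
                    rank++;
                }
            }
            out[rank] = in[i]; // every element gets a unique slot
        }
        return out;
    }

    public static void main(String[] args) {
        long[] sorted = sort(new long[] { 5, 3, 5, 1, 4 });
        System.out.println(java.util.Arrays.toString(sorted));
    }
}
```

Since no thread ever writes to the same index as another, there's no need for synchronization between elements, which is what makes this so GPU friendly despite the O(n^2) cost.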

Enter block-based selection sort for almost sorted data-sets! The idea is to split the list up into blocks of 256 elements, then calculate the bounds of the values of each block. This allows us to skip entire blocks of 256 values if the block doesn't intersect with the current block we're processing. Most likely, only the blocks in the immediate vicinity of each block need to be taken into consideration when sorting, while the rest of the blocks can be skimmed over. Obviously, this makes the performance data dependent, and the worst case is still the same as vanilla GPU selection sort if all blocks intersect with each other (which is essentially guaranteed for a list of completely random values). However, for almost sorted data sets this is orders of magnitude faster!
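
Here's a simplified CPU sketch of the block-skipping trick. Note the assumptions: it tests each key against the block bounds instead of testing block against block like the GPU version does, and the names are invented for the example.

```java
// Hypothetical sketch: rank sort that skips whole blocks using min/max bounds.
final class BlockRankSort {
    static final int BLOCK = 256;

    static long[] sort(long[] in) {
        int n = in.length;
        int blocks = (n + BLOCK - 1) / BLOCK;
        long[] lo = new long[blocks];
        long[] hi = new long[blocks];
        // Pass 1: bounds of each block.
        for (int b = 0; b < blocks; b++) {
            int start = b * BLOCK, end = Math.min(n, start + BLOCK);
            lo[b] = Long.MAX_VALUE;
            hi[b] = Long.MIN_VALUE;
            for (int j = start; j < end; j++) {
                lo[b] = Math.min(lo[b], in[j]);
                hi[b] = Math.max(hi[b], in[j]);
            }
        }
        // Pass 2: rank each key, only scanning blocks whose bounds contain it.
        long[] out = new long[n];
        for (int i = 0; i < n; i++) {
            long key = in[i];
            int rank = 0;
            for (int b = 0; b < blocks; b++) {
                int start = b * BLOCK, end = Math.min(n, start + BLOCK);
                if (hi[b] < key) { rank += end - start; continue; } // all smaller
                if (lo[b] > key) continue;                          // all larger
                for (int j = start; j < end; j++) {
                    if (in[j] < key || (in[j] == key && j < i)) rank++;
                }
            }
            out[rank] = key;
        }
        return out;
    }

    public static void main(String[] args) {
        java.util.Random r = new java.util.Random(1);
        long[] data = new long[4096];
        for (int i = 0; i < data.length; i++) data[i] = i + r.nextInt(1000);
        long[] sorted = sort(data);
        System.out.println(sorted[0] + " ... " + sorted[sorted.length - 1]);
    }
}
```

A block that lies entirely below the key just adds its size to the rank, and a block entirely above adds nothing, so only blocks whose bounds straddle the key get scanned element by element.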

To simulate an almost sorted data-set, an array is filled with elements like this:
// Each key is its own index plus a random offset in [0, 1000)
for(int i = 0; i < NUM_KEYS; i++){
   data.putLong(i*8, i + r.nextInt(1000));
}

This gives us an almost sorted array with quite a lot of elements with the exact same value, to test the robustness of the sort. The block-based selection sort algorithm is able to sort a 2048*1024 = 2 097 152 element list... at 75 FPS, way over the target of 100 000. It's time to implement a real physics simulation based on this!

Let's define the test scenario. 1024*1024 = 1 048 576 ships are in perfect circular orbits around the earth. The orbit heights range from low earth orbit (International Space Station height) to geosynchronous orbit. Approximately half of the ships are orbiting clockwise, the other half counterclockwise. The size of the earth, the mass, the gravity calculations, etc are physically accurate and based on real-life measurements.
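
For reference, the speed of a perfect circular orbit follows from v = sqrt(GM/r). Here's a small sketch of setting that up; the constants are standard published values, but the altitudes (about 400 km for the ISS, about 35 786 km for geosynchronous orbit) and the names are assumptions for the example and not taken from my test program.

```java
// Hypothetical sketch: speed and period of a circular orbit around the earth.
final class CircularOrbit {
    static final double GM_EARTH = 3.986004418e14; // m^3/s^2
    static final double EARTH_RADIUS = 6.371e6;    // mean radius in m

    /** Orbital speed in m/s for a circular orbit at the given altitude. */
    static double speed(double altitudeMeters) {
        return Math.sqrt(GM_EARTH / (EARTH_RADIUS + altitudeMeters));
    }

    /** Orbital period in seconds: circumference divided by speed. */
    static double period(double altitudeMeters) {
        double r = EARTH_RADIUS + altitudeMeters;
        return 2.0 * Math.PI * r / speed(altitudeMeters);
    }

    public static void main(String[] args) {
        double iss = 400e3;    // assumed ISS altitude
        double geo = 35_786e3; // assumed geosynchronous altitude
        System.out.printf("ISS: %.0f m/s, period %.0f s%n", speed(iss), period(iss));
        System.out.printf("GEO: %.0f m/s, period %.0f s%n", speed(geo), period(geo));
    }
}
```

The initial velocity just needs to be perpendicular to the position vector, with the sign flipped for the ships orbiting in the other direction.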

Going back to my original threaded CPU implementation, it really can't handle one million ships very well. Just the physics simulation of the ships takes 20.43ms, and sorting another 18.75ms. Collision detection then takes another 10.16ms.

The compute shader implementation is a LOT faster. Physics calculations take only 0.27ms, calculating block bounds another 0.1ms and finally sorting takes 2.07ms. I have not yet implemented the final collision detection pass, but I have no reason to expect it to be inefficient on the GPU, so I'm optimistic about the final performance of the GPU implementation.

Each ship is drawn as a point. The color depends on the current index in the list of the ship, so the perfect gradient means that the list is perfectly sorted along the X-axis. The whole thing runs at 303 FPS with rendering (which takes up 0.61ms) and at 370 FPS without rendering.
18  Discussions / Miscellaneous Topics / Re: What I did today on: 2017-10-16 14:53:13
I got fiber installed a couple of days ago.


19  Java Game APIs & Engines / Engines, Libraries and Tools / Re: SSAO in LibGDX sans Deferred Rendering? on: 2017-09-30 01:12:59
I've been super busy, sorry.

I didn't realize the random vectors essentially filled the same purpose as the random rotations. You can drop the rotation matrix I gave you and just use the random vector texture you had. Please post how you sample from it. I recommend a simple texelFetch() with the coordinates &-ed to keep them in range.
20  Java Game APIs & Engines / Engines, Libraries and Tools / Re: SSAO in LibGDX sans Deferred Rendering? on: 2017-09-27 14:23:37
I think the reason why you're getting wrong results is because you do the matrix multiplication the wrong way around. Remember that matA*matB != matB*matA. However, I've been thinking about this, and I think it's possible to simplify this.

What we really want to do is rotate the samples around the Z-axis. If we look at the raw sample offsets, this just means rotating the XY coordinates separately, leaving the Z intact. Such a rotation matrix should be much easier to construct:
   float angle = rand(texCoords) * PI2;
   float s = sin(angle);
   float c = cos(angle);
   mat3 rotation = mat3(
      c, -s, 0,
      s,  c, 0,
      0, 0, 1
   );
   //We want to do kernelMatrix * (rotation * samplePosition) = (kernelMatrix * rotation) * samplePosition
   mat3 finalRotation = kernelMatrix * rotation;

This should be faster and easier to get right!
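
If you want to convince yourself that premultiplying is safe, here's a quick CPU check in Java. Plain arrays stand in for mat3/vec3, the kernelMatrix values are made up, and all names are just for illustration.

```java
// Hypothetical helper class; plain arrays stand in for GLSL's mat3/vec3.
final class KernelRotation {
    /** Row-major 3x3 rotation around the Z-axis by the given angle in radians. */
    static double[][] rotationZ(double angle) {
        double s = Math.sin(angle), c = Math.cos(angle);
        return new double[][] { { c, -s, 0 }, { s, c, 0 }, { 0, 0, 1 } };
    }

    /** Matrix * matrix. */
    static double[][] mul(double[][] a, double[][] b) {
        double[][] r = new double[3][3];
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                for (int k = 0; k < 3; k++)
                    r[i][j] += a[i][k] * b[k][j];
        return r;
    }

    /** Matrix * vector. */
    static double[] mul(double[][] m, double[] v) {
        double[] r = new double[3];
        for (int i = 0; i < 3; i++)
            for (int k = 0; k < 3; k++)
                r[i] += m[i][k] * v[k];
        return r;
    }

    public static void main(String[] args) {
        double[][] kernelMatrix = { { 0, 0, 1 }, { 1, 0, 0 }, { 0, 1, 0 } }; // made-up basis
        double[][] rotation = rotationZ(Math.PI / 3);
        double[] sample = { 0.3, 0.1, 0.8 };
        // Rotating every sample and then transforming it...
        double[] perSample = mul(kernelMatrix, mul(rotation, sample));
        // ...matches transforming with one premultiplied matrix.
        double[] premultiplied = mul(mul(kernelMatrix, rotation), sample);
        System.out.println(java.util.Arrays.toString(perSample));
        System.out.println(java.util.Arrays.toString(premultiplied));
    }
}
```

One thing to keep in mind: GLSL's mat3 constructor takes its arguments column by column, so the matrix as written in the shader snippet is the transpose of the row-major one here, i.e. a rotation by -angle. Since the angle is random anyway, that makes no difference for SSAO.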
21  Java Game APIs & Engines / Engines, Libraries and Tools / Re: SSAO in LibGDX sans Deferred Rendering? on: 2017-09-26 18:29:47
A couple of tips:

 - The code you have is using samples distributed over a half sphere. Your best bet is a modified version of best candidate sampling over a half sphere, which would require some modification of the JOML code to get.

 - I'd ditch the rotation texture if I were you. Just generate a random angle using this snippet that everyone is using, then use that angle to create a rotation matrix around the normal (You can check the JOML source code on how to generate such a rotation matrix that rotates around a vector). You can then premultiply the matrix you already have with this rotation matrix, keeping the code in the sample loop the exact same.

 - To avoid processing the background, enable the depth test, set depth func to GL_LESS and draw your fullscreen SSAO quad at depth = 1.0. It is MUCH more efficient to cull pixels with the depth test than an if-statement in the shader. With an if-statement, the fragment shader has to be run for every single pixel, and if just one pixel in a workgroup enters the if-statement the entire workgroup has to run it. By using the depth test, the GPU can avoid running the fragment shader completely for pixels that the test fails for, and patch together full workgroups from the pixels that do pass the depth test. This massively improves the culling performance.

 - You can use smoothstep() to get a smoother depth range test of each sample at a rather small cost.

 - It seems like you're storing your normals in a GL_RGB8 texture, which means that you have to transform them from (0.0 - 1.0) to (-1.0 - +1.0). I recommend using a GL_RGB8_SNORM texture, which stores each value as a normalized signed byte, allowing you to write out the normal in the -1.0 to +1.0 range and sample it like that too. Not a huge deal of course, but it gives you better precision and a little bit better performance.
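
To illustrate the last tip, here's a small CPU sketch of what the two 8-bit encodings do to a single normal component. The class and method names are made up; the SNORM decode follows the max(c / 127, -1) conversion from recent OpenGL versions, and the rounding is my assumption of what a typical GPU does.

```java
// Hypothetical sketch comparing biased GL_RGB8 storage with GL_RGB8_SNORM.
final class NormalEncoding {
    // GL_RGB8 path: bias [-1, 1] into [0, 1] in the shader, store as unsigned byte.
    static int encodeUnorm8(double n) {
        return (int) Math.round((n * 0.5 + 0.5) * 255.0);
    }

    static double decodeUnorm8(int c) {
        return c / 255.0 * 2.0 - 1.0;
    }

    // GL_RGB8_SNORM path: store the value directly as a signed byte.
    static int encodeSnorm8(double n) {
        double clamped = Math.max(-1.0, Math.min(1.0, n));
        return (int) Math.round(clamped * 127.0);
    }

    static double decodeSnorm8(int c) {
        return Math.max(c / 127.0, -1.0);
    }

    public static void main(String[] args) {
        // A component of exactly 0.0 survives SNORM but not the biased encoding.
        System.out.println("unorm8: " + decodeUnorm8(encodeUnorm8(0.0)));
        System.out.println("snorm8: " + decodeSnorm8(encodeSnorm8(0.0)));
    }
}
```

The small win here is that axis-aligned normals such as (0, 0, 1) round-trip exactly with SNORM, and the shader skips the bias and unbias math.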
22  Java Game APIs & Engines / Engines, Libraries and Tools / Re: SSAO in LibGDX sans Deferred Rendering? on: 2017-09-26 04:05:01
I'm not sure what your "kernel" is. Are those the sample locations for your SSAO? I'd recommend precomputing some good sample positions instead of randomly generating them, as you're gonna get clusters and inefficiencies from a purely random distribution. JOML has some sample generation classes in the org.joml.sampling package that may or may not be of use to you.

It doesn't look like you're using your noise texture correctly. A simple way of randomly rotating the samples is to place random normalized 3D vectors in your noise texture, then reflect() each sample against that vector. I'm not sure how you're using your random texture right now, but it doesn't look right at all. If you let me take a look at your GLSL code for that, I can help you fix it.
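
For reference, this is the math behind reflect() written out in Java. The formula I - 2 * dot(N, I) * N is the one GLSL documents for reflect(I, N); the class name and the test vectors are just made up for the example.

```java
// Hypothetical sketch of GLSL's reflect(I, N) on the CPU.
final class Reflect {
    static double dot(double[] a, double[] b) {
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
    }

    /** I - 2 * dot(N, I) * N, where n is expected to be normalized. */
    static double[] reflect(double[] i, double[] n) {
        double d = 2.0 * dot(n, i);
        return new double[] { i[0] - d * n[0], i[1] - d * n[1], i[2] - d * n[2] };
    }

    public static void main(String[] args) {
        double[] sample = { 0.3, 0.1, 0.8 };      // one SSAO sample offset
        double[] randomVector = { 0.6, 0.8, 0.0 }; // normalized vector from the noise texture
        double[] reflected = reflect(sample, randomVector);
        // The reflection changes direction but keeps the sample's length.
        System.out.println(Math.sqrt(dot(sample, sample)) + " vs " + Math.sqrt(dot(reflected, reflected)));
    }
}
```

Since a reflection preserves length, the samples keep their distance from the center and only their direction gets scrambled per pixel.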
23  Java Game APIs & Engines / Engines, Libraries and Tools / Re: SSAO in LibGDX sans Deferred Rendering? on: 2017-09-25 14:11:52
To fix the SSAO going too far up along the cube's edges, you need to reduce the depth threshold.

I can also see some banding in your SSAO. If you randomly rotate the sample locations per pixel, you can trade that banding for noise instead, which is much less jarring to the human eye.
24  Java Game APIs & Engines / Engines, Libraries and Tools / Re: SSAO in LibGDX sans Deferred Rendering? on: 2017-09-24 18:18:23
Renderbuffers are a bit of a legacy feature. They are meant for exposing formats the GPU can render to but can't read in a shader (read: multisampled stuff). The thing is that multisampled textures are supported by all OGL3 GPUs, so renderbuffers no longer serve any real purpose. If you do the FBO setup yourself, you can attach a GL_DEPTH_COMPONENT24 texture as depth attachment and read it in a shader.
25  Java Game APIs & Engines / Engines, Libraries and Tools / Re: SSAO in LibGDX sans Deferred Rendering? on: 2017-09-24 13:14:04
Is LibGDX using a renderbuffer?
26  Java Game APIs & Engines / Engines, Libraries and Tools / Re: SSAO in LibGDX sans Deferred Rendering? on: 2017-09-23 14:07:54
You do not need to store depth in a color texture. You can simply take the depth texture you use as depth buffer and bind it like any other texture. The depth value between 0.0 and 1.0 is returned in the first color channel (red channel) when you sample the texture with texture() or texelFetch().
27  Java Game APIs & Engines / Engines, Libraries and Tools / Re: SSAO in LibGDX sans Deferred Rendering? on: 2017-09-23 12:59:11
The traditional purpose of doing a depth pre-pass is to avoid shading pixels twice. By rendering the depth first, the actual shading can be done with GL_EQUAL depth testing, meaning each pixel is only shaded once. The depth pre-pass also rasterizes at twice the speed as GPUs have optimized depth-only rendering for shadow maps, so by adding a cheap pre-pass you can eliminate overdraw in the shading.

To also output normals, you need to have a color buffer during the depth pre-pass, meaning you'll lose the double speed rasterization, but that shouldn't be a huge deal. You can store normal XYZ in the color, while depth can be read from the depth buffer itself and doesn't need to be explicitly stored.

If you have a lot of vertices, rendering the scene twice can be very expensive. In that case, it's possible for you to do semi-deferred rendering where you do lighting as you currently do but also output the data you need to do SSAO afterwards. This requires using an FBO with multiple render targets, but it's not that complicated. The optimal strategy depends on the scene you're trying to render.
28  Java Game APIs & Engines / Engines, Libraries and Tools / Re: SSAO in LibGDX sans Deferred Rendering? on: 2017-09-23 07:33:36
Traditional SSAO doesn't require anything but a depth buffer. However, normals help quite a bit in improving quality/performance. You should be able to output normals from your forward pass into a second render target. It is also possible to reconstruct normals by analyzing the depth buffer, but this can be inaccurate if you have lots of depth discontinuities (like foliage).

EDIT: Technically, SSAO is AMBIENT occlusion, meaning it should only be applied to the ambient term of the lighting equation. The only way to get "correct" SSAO is therefore to do a depth prepass (preferably outputting normals too), compute SSAO, then render the scene again with GL_EQUAL depth testing while reading SSAO from the current pixel. If you already do a depth prepass, this should essentially be free. If not, maybe you should! It could improve your performance.
29  Game Development / Game Play & Game Design / Re: Graphics Backend Abstraction on: 2017-09-06 11:57:31
<--- You can't contain this! =P
30  Java Game APIs & Engines / Java 2D / Re: Writing Java2D BufferedImage Data to Opengl Texture Buffer on: 2017-08-21 02:03:28
Am I insane? -> yes. But is it really crazier than drawing 2D stuff in OpenGL?
Yeah, it is. OpenGL is used for 2D stuff all the time. GPUs can really only draw 2D stuff in the first place; you project your 3D triangles to 2D triangles, so it makes a lot of sense to use the GPU for accelerating 2D stuff as well.