  Show Posts
1  Java Game APIs & Engines / Engines, Libraries and Tools / Re: LWJGL 3.1.2 on: 2017-05-15 19:31:05
Oh, man, this is awesome! I've never heard of MoltenVK, maybe that'll save MacOS for now. I also didn't know that stb_dxt existed, gotta try that out! Also, stb_image now supports 16-bit PNGs, which are perfect for normal maps!
2  Game Development / Newbie & Debugging Questions / Re: Switching Shaders - Painter's Algorithm on: 2017-05-11 13:26:55
 - Orthographic projection matrices still calculate a depth value, which is configured using the near and far values when setting up your orthographic matrix.

 - Z-buffers do not work well with transparency, as a semi-transparent sprite will update the depth buffer, preventing things from being drawn behind it. If you're OK with binary alpha (either fully opaque or fully transparent), you can use alpha testing / discard; in your shader to prevent depth writes for the fully transparent parts (this is a very common technique used for foliage in 3D games).

 - If-statements are not slow per se. They're only slow if 1. the shader cores in a group take different branches, as that forces all cores to execute all branches, and/or 2. the worst path of your shader uses a lot of registers, as the GPU needs to allocate registers for the worst-case branch regardless of whether that branch executes (this is very rarely a problem). If you have if-statements where all pixels of a triangle take the same path, the overhead is essentially zero. In addition, 2D games are often CPU limited as they need to micromanage their sprites, so getting rid of CPU work at the cost of GPU work is often a net win, as the GPU was sitting idle most of the time anyway.

 - If you can draw all stumps first, then all leaves, you only have two shader switches regardless of the number of trees, so the number of switches stays constant. In that case, 2 switches instead of 1 is insignificant.
3  Discussions / Miscellaneous Topics / Re: What I did today on: 2017-05-10 14:03:30
A couple of days ago, I managed to get some huge improvements in NNGINE's performance. Seriously, it's frigging insane how much faster it got. In addition, it was really simple and just took a day to do. There were huge GPU performance improvements throughout the entire engine, especially in the G-buffer filling pass, and the memory controller load is down a lot too! A scene which could barely maintain 60 FPS is now doing a crazy 280 FPS!!!

TL;DR: Upgraded from one GTX 770 to two Zotac GTX 1080s, with a 5-year warranty to avoid one of them breaking 2 months after the 3-year warranty ends again...
4  Game Development / Newbie & Debugging Questions / Re: [Solved] GDX Bullet - Lag Spikes on: 2017-05-10 09:22:08
VisualVM is awesome, yeah. It's really good for finding out where your garbage is coming from and can give you a good indication of hotspots in your code. It can sometimes be a bit confusing, though, and the profiler generally makes games drop to 0.01 FPS or so. =/
5  Game Development / Newbie & Debugging Questions / Re: [Solved] GDX Bullet - Lag Spikes on: 2017-05-10 04:38:09
Wait, so your issue was just a crapload of hashmap insertion? =P
6  Game Development / Newbie & Debugging Questions / Re: GDX Bullet - Lag Spikes on: 2017-05-10 04:16:44
Well, try running some memory profiling using VisualVM. It can log stack traces of all allocations you do.
7  Game Development / Game Mechanics / Re: Skeletal Animation structure on: 2017-05-10 03:52:31
My suggestion:

class Bone{
    Bone parent;
    double x, y;
    double sin, cos; //of the rotation
}


When you want to animate a bone, you can then easily get the parent (if there is one), transform it by the parent and go on. You don't generally need a list of children in each bone as long as you have a "global" list of all bones in a skeleton.
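
As a rough sketch of that parent walk (the helper name and the packed return value are mine, not from the original post), resolving a bone's world-space position and rotation could look something like this:

static double[] worldTransform(Bone b) {
    double x = b.x, y = b.y, sin = b.sin, cos = b.cos;
    for (Bone p = b.parent; p != null; p = p.parent) {
        // rotate the accumulated offset by the parent's rotation, then translate by the parent
        double rx = x * p.cos - y * p.sin;
        double ry = x * p.sin + y * p.cos;
        x = rx + p.x;
        y = ry + p.y;
        // compose the two rotations using the angle-addition identities for sin/cos
        double newSin = sin * p.cos + cos * p.sin;
        double newCos = cos * p.cos - sin * p.sin;
        sin = newSin;
        cos = newCos;
    }
    return new double[]{x, y, sin, cos}; // world position + world rotation (as sin/cos)
}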
8  Java Game APIs & Engines / OpenGL Development / Re: [LWJGL] Conditional rendering? on: 2017-05-03 16:31:43
For such small pieces of geometry, hardware instancing is said to be fairly inefficient. Since your geometry seems to be essentially <6 vertices each, you're probably better off packing all the vertex data of a chunk into a single buffer, using an index buffer to construct polygons out of triangles. If your per-object matrices are static, then manually premultiplying them into the vertex data is a good idea. Otherwise, you can add a manual instance ID variable to each vertex (just 4 bytes per vertex), which you can in turn use to look up a per-object matrix from a texture buffer. Then you can reduce each chunk to a single draw call, and you can adjust the CPU/GPU tradeoff by modifying the chunk size.
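
To illustrate the per-vertex instance ID idea above, here's a minimal sketch (the buffer layout and helper name are assumptions, not from the post): each vertex stores xyz plus a 4-byte object ID that a vertex shader can use to fetch the right per-object matrix from a texture buffer.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

static ByteBuffer packChunkVertices(float[][] objectPositions, int verticesPerObject) {
    int objectCount = objectPositions.length;
    // 3 floats of position + 1 int instance ID = 16 bytes per vertex
    ByteBuffer buf = ByteBuffer.allocateDirect(objectCount * verticesPerObject * 16)
                               .order(ByteOrder.nativeOrder());
    for (int obj = 0; obj < objectCount; obj++) {
        for (int v = 0; v < verticesPerObject; v++) {
            buf.putFloat(objectPositions[obj][v * 3]);     // x
            buf.putFloat(objectPositions[obj][v * 3 + 1]); // y
            buf.putFloat(objectPositions[obj][v * 3 + 2]); // z
            buf.putInt(obj); // manual "instance ID" used to index the per-object matrix
        }
    }
    buf.flip();
    return buf;
}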
9  Game Development / Performance Tuning / Re: Pathfinding over too large grid on: 2017-04-27 20:12:57
I investigated a lot of different solutions, from hierarchical systems to waypoint systems, but they all had different problems, ranging from expensive precomputation to not being able to handle dynamic worlds (units can block a tile). In the end I decided to try optimizing the existing implementation and see how far that got me.

The first issue was the memory usage of the array that holds all the nodes. As the worlds we generate are mostly inaccessible terrain like mountains or ocean, a majority of the elements in the array are going to be null. However, a 40-million-element array of 32-bit references still uses 150MBs, so the single monolithic array had to go. I split up the world into chunks of 16x16 tiles. That left me with a 480x320 chunk array, which only uses around half a megabyte of memory. This cut the constant overhead by around 150MBs. Chunk objects are then allocated (and reused) on demand when the pathfinder first enters a chunk for each job. Chunks have all their nodes preallocated, so there's a tiny bit of inefficiency along the edge of an inaccessible area; if just one tile in a chunk is accessible and explored by the pathfinder, the entire chunk has to be allocated/reserved.
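
A rough sketch of how such lazily allocated chunks could be laid out (class and field names are mine; the real implementation surely differs):

class Chunk {
    final Node[] nodes = new Node[16 * 16]; // all nodes preallocated per chunk
}

class ChunkGrid {
    static final int SHIFT = 4, MASK = 15;  // 16x16 tiles per chunk
    final Chunk[][] chunks;                 // 480x320 for a 7680x5120 world

    ChunkGrid(int worldWidth, int worldHeight) {
        chunks = new Chunk[worldWidth >> SHIFT][worldHeight >> SHIFT];
    }

    Node node(int x, int y) {
        Chunk c = chunks[x >> SHIFT][y >> SHIFT];
        if (c == null) {
            // allocate (or grab a recycled) chunk the first time the pathfinder enters it
            c = chunks[x >> SHIFT][y >> SHIFT] = new Chunk();
        }
        return c.nodes[((y & MASK) << SHIFT) | (x & MASK)];
    }
}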

The second problem was the size of each node. The Node object I used to store the pathfinding data of each tile had the following data:
      int x, y;
      int lastOpenedStart, lastOpenedGoal;
      Node parentStart, parentGoal;

For some crazy reason, this ended up taking 40 bytes per node, plus 4 bytes per node in the arrays used to hold the Nodes in each chunk!!! I count 24 bytes of actual data in there. 24 bytes of actual data taking 44 bytes! I decided to look for ways of reducing the size of the information.

As our max world size is 7680x5120, it's clear that for x and y we really don't need the full range of a 32-bit integer. With 13 bits for each of X and Y, we can accommodate up to 8192x8192 worlds, which is just enough for us.

The reason why the lastOpened* values are ints instead of booleans is to avoid having to reset the entire world after each pathfinding job. If those were booleans, I'd need to go through all opened nodes and reset them back to false after each job. By making them ints, I can implicitly "reset" the entire world by simply doing jobID++; once in the pathfinder. Then when I want to mark a node as opened, I do lastOpenedStart = jobID;, and to check if a node is opened, I just do if(lastOpenedStart == jobID). However, with the chunk system, it became trivial (and necessary anyway) to reset chunks when they're reused, so such a system isn't necessary here. Hence, we can compress the two opened fields down to just 2 bits in total.

Lastly, the two parent references are used for backtracking once a path has been found. However, in our simple tile world there are only 4 possible directions you can travel in, so we can store each parent with just 2 bits used to encode which of the 4 directions the parent is in.

This all sums up to a grand total of...... exactly 32 bits per node. In other words, my "node array" is now just an int[] into which I pack all the data. I can fit an entire node in just 4 bytes instead of the 44 bytes used before. In addition, the chunk system means that the worst-case memory usage is rarely reached in practice, thanks to more efficient reuse of chunks between multiple pathfinding jobs. The new system averages less than 50MBs, with an absolute worst case of under 75MBs. The improvement in memory locality also almost doubled the performance of the system.
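
One possible way to pack those fields into the 32 bits described above (the exact bit layout here is my guess, not the actual code): 13+13 bits for x/y, one bit per "opened" flag and 2 bits per parent direction.

static int packNode(int x, int y, boolean openedStart, boolean openedGoal,
                    int parentDirStart, int parentDirGoal) {
    return  (x & 0x1FFF)                     // bits  0-12: x (13 bits)
         | ((y & 0x1FFF)           << 13)    // bits 13-25: y (13 bits)
         | ((openedStart ? 1 : 0)  << 26)    // bit     26: opened by the start search
         | ((openedGoal  ? 1 : 0)  << 27)    // bit     27: opened by the goal search
         | ((parentDirStart & 0x3) << 28)    // bits 28-29: parent direction (start search)
         | ((parentDirGoal  & 0x3) << 30);   // bits 30-31: parent direction (goal search)
}

static int unpackX(int node) { return node & 0x1FFF; }
static int unpackY(int node) { return (node >>> 13) & 0x1FFF; }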


Here's a comparison between normal breadth-first search and bidirectional breadth-first search:

Normal:


Bidirectional:


There's a huge difference in the number of tiles explored by the two (the red area). The normal search explores almost every single reachable tile in the world, while the bidirectional one generally avoids exploring half the world in worst case scenarios. This reduces peak memory usage by quite a bit.
10  Discussions / Miscellaneous Topics / Re: What I did today on: 2017-04-27 20:02:44
Two monitor person here. Best decision ever. Code on one monitor, talk about anime girls with your best friend on the other.
SHHHHHH!!! That was supposed to be our secret! >___<


Jokes aside, two monitors is great. Having Eclipse on one monitor and Skype/an internet resource/a second Eclipse window on the other helps a lot.
11  Game Development / Performance Tuning / Pathfinding over too large grid on: 2017-04-26 02:23:11
Hello.

I'm taking a look at the pathfinder we have for RF, and to my horror I realized that it had a super bad worst-case scenario for memory usage. At worst, this memory usage could skyrocket to over a gigabyte! >__<

The problem is that our worlds are really big, up to 7680x5120 or so. The game world is essentially a grid, and the data is split up into chunks of 32x32 tiles that can be paged to disk when they aren't used. The pathfinder I'm using is a bidirectional breadth-first search that searches from both the start and the goal and finds where the two searches meet, to reduce the number of tiles that get checked. Each tested tile is checked for accessibility, which may require loading data from disk. The last N chunks are cached in RAM until a new chunk is needed. The pathfinding may potentially need to explore the entire map to find a path (or admit failure), but this is rare (yet still possible).

The problem I'm having is that the pathfinding requires a huge amount of data per tile. For each visited tile, I need to know if it's been opened by the start search and the goal search, which I store using a job-ID system to avoid having to clear nodes. Secondly, to be able to find neighbors and to backtrack, I need to track the XY position (stored as shorts right now) and two parent references. This all sums up to 28 bytes per tile. Hell, just the array for holding the nodes of the entire world is huge, as 7680*5120*4 = 150MBs. The tiles then add an additional 7680*5120*28 = 1050MBs, for a total of 1200MBs of raw pathfinding data. It's insane! The entire point of the chunk system goes out the window. In addition, I can't track pathfinding data only for loaded chunks, because if the search is wide we'd start unloading the chunks looked at at the start of the search (around the start and the goal), which would make backtracking impossible. There's no way of telling which nodes we'll need for the final path.

I tried limiting the search area to reduce memory usage. The idea was to preallocate a square area of nodes. This square grid of nodes would then be centered on the average of the start and goal coordinates so that the search could run with "local" nodes. This obviously limited the maximum distance between the start and the goal, and also made it impossible to find paths that need to venture outside the square area of preallocated nodes. In the end, I realized that less than 1000 tiles of search distance in all directions (2048x2048 square search area) was the minimum I could drop it to, and that's still hundreds of megabytes of data.

The thing is that the number of traversable tiles is pretty low. Most of the world is either ocean, mountains or forest, so if I could eliminate those tiles I could cut memory usage by a lot. However, the 150MBs of memory needed just for the array is a lot, and it's mostly wasted when a majority of the elements will be null anyway. If I could find a better way to store the pathfinding objects than in an array, I could eliminate a lot of that data. Similarly, if I can avoid allocating the ~70% of the nodes that aren't traversable, that would cut memory usage from 1050MBs to ~350MBs. Finally, I could reduce the memory usage of the node class from 28 bytes to 20 bytes, another 100MBs of savings, for a final total of 250MBs. That would be...... acceptable.

So I guess my real question is: What's a good data structure for storing a very sparse/lazily allocated grid? It needs to quickly be able to retrieve a node given its coordinates (or null if none exist). Maybe a custom hashmap that doesn't use linked lists to resolve collisions? Any other ideas?
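
As one illustration of that last idea, here's a sketch of an open-addressing map from packed tile coordinates to node data, resolving collisions with linear probing instead of linked lists (all names are mine, and growth/resizing is omitted for brevity):

class SparseNodeMap {
    private final int[] keys;   // packed (x, y) coordinates, -1 = empty slot
    private final int[] values; // packed node data (or an index into node storage)
    private final int mask;

    SparseNodeMap(int capacityPowerOfTwo) {
        keys = new int[capacityPowerOfTwo];
        values = new int[capacityPowerOfTwo];
        java.util.Arrays.fill(keys, -1);
        mask = capacityPowerOfTwo - 1;
    }

    private static int key(int x, int y) { return (y << 13) | x; } // fits 13-bit coordinates

    void put(int x, int y, int node) {
        int k = key(x, y);
        int i = ((k * 0x9E3779B1) >>> 16) & mask; // cheap hash, then linear probing
        while (keys[i] != -1 && keys[i] != k) i = (i + 1) & mask;
        keys[i] = k;
        values[i] = node;
    }

    int get(int x, int y) { // returns -1 if the tile has no node yet
        int k = key(x, y);
        int i = ((k * 0x9E3779B1) >>> 16) & mask;
        while (keys[i] != -1) {
            if (keys[i] == k) return values[i];
            i = (i + 1) & mask;
        }
        return -1;
    }
}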
12  Discussions / General Discussions / Re: Programmer jokes on: 2017-04-24 23:53:28
Quote: "Sorry, but even I don't have any advice on how to make UI design fun."

Hahahahahaha! Hahaha! Hahaha... Haha... Ha..... Ha....... Q___Q
13  Discussions / Miscellaneous Topics / Re: What I did today on: 2017-04-20 15:01:41
The quote could be interpreted as "Things always get better, even when they get worse.", which I assume is the cause of Riven's distress.
14  Game Development / Game Play & Game Design / Re: Graphics Backend Abstraction on: 2017-04-19 20:35:51
Porting being a no-op is, I think, an illusion. Different platforms require different gameplay and different performance trade-offs. Whether or not your graphics engine can seamlessly transition from HTML5/WebGL to multiple dedicated GPUs is a mere side note. The effort that is put into building a VK-like API on top of OpenGL is impressive, but by the time you're ready to publish your game, VK will probably have gone mainstream, and WebGL will support it by then.

It all reminds me of some of the efforts Cas and I took upon ourselves to scale the sprite engine from low-budget to high-end GPUs, seeking fast paths through outdated drivers and/or hardware. It just got in the way of publishing the games, and that framerate bump ultimately wasn't worth it.

If this technical tinkering takes you 18 months, in that time high-end GPU performance will have doubled, but more interestingly: your lowest supported hardware will have doubled in performance too. To make all that raw power and that super-efficient API/engine worth the effort, you need a small army of artists and designers. A one- or two-man team most likely cannot come close to what the big (and free) engines support, so why not pick an art style that hides the technical inefficiencies? It's hard enough to turn a tech demo into a polished, playable, purchasable game. Very little of game programming is performance sensitive, so why focus all your energy on it?

My $0.02 Kiss
Of course, that different platforms require different trade-offs is a given. A game designed for a keyboard will not work in app form, and a graphics engine designed for GPUs with 500GB/sec of bandwidth will not run well on a mobile GPU. However, there most definitely are overlaps in the technical department. For example, allowing the user of an Android tablet to plug in a controller and play the game just like on PC is definitely a plus, and a shitty Intel GPU has more in common with a mobile GPU than with a dedicated $400 GPU from AMD or Nvidia. In other words, alternate ways of controlling the game will be a nice feature to have on mobile/tablets, while a version of the engine tweaked for mobile will work very well as a "toaster mode" setting for weak desktop GPUs too, something we've gotten requests for for WSW.

Note: By "engine", I mean the engine built on top of the abstraction, not the abstraction itself. In the example above, I was specifically referring to using deferred rendering for desktop to take full advantage of the huge amounts of processing power and bandwidth available on desktop to achieve great lighting, while falling back to forward rendering on mobile GPUs where bandwidth is scarce and a 24 byte/pixel G-buffer isn't practical. Again, this is pretty much exactly the same problem that Intel GPUs have as they use system RAM, so using forward rendering on Intel GPUs could easily double or triple performance for us (but only at the absolute lowest settings with minimal lighting of course).

I appreciate your advice regarding technical stuff, but I think that this will be worth it. SkyAphid focuses 100% on developing games using the stuff I write, and I spend most of my time on helping out with that. It's worked out so far. If anything, a solid backend will allow us to work faster when creating the actual game as there's much less to worry about when it comes to performance.
15  Game Development / Game Play & Game Design / Re: Graphics Backend Abstraction on: 2017-04-17 21:29:32
- Who will use the API?
Initially, only me and SkyAphid. We will be basing NNGINE, our 3D engine, on it, so WSW and all our future games will use this abstraction. If the engine becomes powerful enough to attract attention, we may attempt to market and license it. At the moment, we have no concrete plan to commercialize it. This is mainly for our own purposes.

- How likely is it to have anyone add a backend?
The abstraction is modular. The idea is that you can add multiple backends for different features. The purpose of the abstraction is essentially to make a more powerful and efficient abstraction than LibGDX that we can use to support multiple platforms and APIs, so I do not have any expectations that others will implement their own backends. Regardless, I want it to be possible to support new APIs easily.

- Why would this be desirable or even useful?
We have specific games that are a good fit for Android and HTML5 planned. The ability to switch between APIs is a great feature IMO as it means the games made on top of the abstraction are completely API independent, meaning that porting to a different API is a no-op.

- Do you plan to make a living from providing an API?
We do not plan on it, but keep it open as a potential idea. I do not think it will make me rich.

- Why is Vulkan only not good enough?
Backwards compatibility, especially on Android. Also, seamlessly supporting WebGL is nice (although the user will need to be aware of the limitations of GWT).

- Do you need it for your game?
For our games currently in development, yes-ish. Vulkan support for We Shall Wake would help immensely as the game's CPU scalability is currently only limited by single-threaded API calls. Even just the OpenGL backend I'm developing should give a significant boost to WSW's performance thanks to software command buffers and a dedicated rendering thread, let alone the Vulkan backend. In addition, we have a concrete game with an almost complete design document which would only be viable with Android and GWT support. This abstraction is essentially the convergence of a number of requirements.

- Aren't you just procrastinating to hide your creativity block from yourself?

The last one killed all my game dev ambitions years ago by keeping myself occupied with tech details until my money ran out, so don't fall for that. (You might need to work for insurance companies otherwise)
No, I prioritize work on RF and paid work, developing it in my spare time. I appreciate your concern and I can relate to the money troubles. I am currently living very cheaply and getting by, and I know what awaits me if I end up not being able to support myself as I've worked at a bank writing Java web applications.


I guess it's time for me to speed up my work on this. I've been stuck on some stupid technical details that don't have a clean solution for a little bit too long now. =P

I don't like telling people about WIP in that sense unless it's something concrete that others can appreciate. I don't want praise for plans with no real substance. Hence, I'll be back about this once I have something concrete to show.
16  Game Development / Game Play & Game Design / Re: Graphics Backend Abstraction on: 2017-04-17 15:18:14
@theagentd Now how would you combine OpenGL (which from Java uses raw values) and Vulkan (which from Java uses structs) together into one API? Would the API manage the memory or is the user responsible for that?
I just hide the structures.

Example:
1. I create a texture and get an abstract texture object back, which wraps either an OGL int target and int textureID, or a Vulkan long pointer.
2. I want to add it to a descriptor set, so I pass it into a descriptor set. The OGL descriptor set casts it to the OGL implementation and gets the target + textureID, while the Vulkan one casts it to the VK implementation and gets the pointer.
3. OGL backend emulates descriptor sets, adding the texture info to some temp memory which can later be bound. VK sets up a stack allocated structure and updates the descriptor set.
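
A bare-bones sketch of what "hiding the structures" could look like (the class names here are mine, not the actual NNGINE types):

public interface Texture { /* no backend details exposed to the user */ }

final class GLTexture implements Texture {
    final int target;    // e.g. GL_TEXTURE_2D
    final int textureID; // handle from glGenTextures()
    GLTexture(int target, int textureID) { this.target = target; this.textureID = textureID; }
}

final class VkTexture implements Texture {
    final long imageViewHandle; // Vulkan object handle stored as a long
    VkTexture(long imageViewHandle) { this.imageViewHandle = imageViewHandle; }
}

// Inside the OpenGL descriptor set emulation, for example:
// GLTexture t = (GLTexture) texture; // safe as long as backends are never mixed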

In the case of batchable operations (descriptor set updates, command buffer submissions, swapchain swaps, etc etc etc), the interface also exposes a batch system where you get a batch object, something like this:
CommandSubmissionBatch batch = graphics.getCommandSubmissionBatch();
for(int i = 0; i < 32; i++){
    batch.add(commandBuffers[i]);
}
batch.submit();

In this case, the OGL version will just loop over the command buffers and execute them, while the VK one will submit them all in a single queue submission.
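
For instance, the OpenGL side of that batch could plausibly look like the following (types and names are assumptions, not the actual abstraction): the "submission" is just replaying each software command buffer in order, while a Vulkan implementation would hand the whole array to one queue submission.

interface CommandBuffer {}
interface CommandSubmissionBatch { void add(CommandBuffer cb); void submit(); }

class GLCommandBuffer implements CommandBuffer {
    void executeCommands() { /* replay the recorded GL calls immediately */ }
}

class GLCommandSubmissionBatch implements CommandSubmissionBatch {
    private final java.util.List<GLCommandBuffer> buffers = new java.util.ArrayList<>();

    public void add(CommandBuffer cb) {
        buffers.add((GLCommandBuffer) cb); // downcast to this backend's implementation
    }

    public void submit() {
        for (GLCommandBuffer cb : buffers) {
            cb.executeCommands();
        }
        buffers.clear();
    }
}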
17  Game Development / Game Play & Game Design / Re: Graphics Backend Abstraction on: 2017-04-16 12:22:07
@CopyableCougar4

Nah, that's way too frustrating, error-prone and hard to extend. If you want to add a backend with that system, you need to modify the core classes, meaning that if a user adds their own backend, their code will be incompatible with future updates to the main code. It also essentially forces you to have all the backends in the same file, not to mention the huge number of switches/if-statements needed in literally every single function.

Using interfaces/abstract classes and having the backends implement them is much more robust, allows for extension without modifying the core classes and is equally fast, since with just one implementation of an interface loaded the function calls can be inlined perfectly.
18  Game Development / Game Play & Game Design / Re: Graphics Backend Abstraction on: 2017-04-14 12:55:45
I think the concepts in JavaFX have it right.

Cas Smiley
Care to elaborate?

In my code i have engine specific enums, so i would build your scene graph (as an example) with no back end specific values. When you then convert this to a data model that your back end renderer will use you then convert the enums to back end specific values. This way you have a distinct separation between rendering and the data model
The entire point of this thread was to figure out how to do that efficiently. =P
19  Game Development / Game Play & Game Design / Re: Graphics Backend Abstraction on: 2017-04-12 21:07:49
What I'm getting at is, why abstract this, when your "engine" API is already the abstraction? The underlying rendering API - Vulkan or OpenGL (or even DX12!) - is what your engine will be using to turn its own ideas about what it has to render into rendering commands for the specified low-level API. You write a specific plug-in to render what your "engine" contains using the specific API. There's no real point in abstracting the rendering APIs and then using that abstraction to feed to your engine... do it directly. I'm probably not explaining this very well though.

Cas Smiley
I think I understand what you're saying, but I've concluded that it's simply too much work to maintain. IMO, the graphics API is such a vital part of a game engine that the graphics needs to be well integrated to get good performance. In addition, the graphics APIs actually have a lot in common. The optimal strategy is essentially the same on all APIs, but how you accomplish it can differ a bit. The point is that regardless of the API, most of your logic will stay the same.

Obviously, graphics is just a small part of any complete game engine, as there is a lot of logic and API-independent functionality that the engine needs. Hence, it makes sense to at least compartmentalize the graphics API as a completely independent module of the game engine. Now, you could just write N different versions of this module, one for each API, but this has a number of big drawbacks. You will end up with a lot of code duplication, as a lot of functionality is almost independent of the API used, to the point where you simply want to call an API function with a different name to accomplish the same thing. Having to rewrite higher-level functionality for each API means a lot of wasted time, more maintenance and an increased risk of bugs. There is also a severe limitation in the extensibility of the engine by the user. Lots of techniques require very optimized, tailor-made rendering modules, for example advanced terrain systems, fluid/water rendering, etc., that need a level of control over draw calls that usually cannot be achieved effectively with a general-purpose model renderer. By forcing users to write one version for each graphics API they want to support, you force them to learn all of those APIs (which is exactly what you want to avoid), which in turn encourages them to limit their support to a few specific APIs. This is a major blow to cross-platform compatibility, as something as trivial as rewriting a renderer from OpenGL to OpenGL ES can be prohibitively time consuming. Essentially, such API integration would severely diminish the point of the entire engine.

I have a number of goals with my abstraction:
1. I want to minimize the amount of work I have to put in.
2. I want to minimize the time taken to maintain and add new features to the system.
3. I want to minimize the risk of bugs by avoiding code duplication whenever possible.
4. I want to completely hide the graphics API used to avoid users having to learn all the APIs they want to support and subsequently locking themselves into a single API to save time.
5. I want the user to be able to work very close to the API when writing their own renderers for maximum performance and flexibility, without actually exposing the API being used.

A completely different version of the engine would violate essentially all of these goals. To facilitate point 5, a low-level abstraction of the graphics API is needed, so that the user can take full control of the rendering if needed to accomplish some exotic rendering task. If I'm going to write a low-level abstraction of each API for the user anyway, it makes a lot of sense to base the entire engine on top of that abstraction too. This reduces the amount of code duplication and also forces me to test the abstraction fully, as a bug in the abstractions will show up when I implement the built-in renderers of the game engine.

As an example, consider texture streaming. In OpenGL, this involves creating a separate OpenGL context, mapping buffers, uploading texture data and managing GLsync objects. In Vulkan, it involves a transfer-specialized queue (if available), mapping buffers, uploading texture data and managing fences/semaphores. These concepts are easily abstracted 1-to-1 into a common interface. By abstracting that away at a very low level, I can both write a single texture streaming system based on this common interface and avoid tons of code duplication, and even allow others to write tailor-made texture streaming systems with the same performance as mine for their very own purpose, possibly based on GPU feedback on which texture data is missing, using sparse/virtual textures, etc.

20  Game Development / Game Play & Game Design / Re: Graphics Backend Abstraction on: 2017-04-12 15:51:56
OpenGL and Vulkan are already abstractions... you probably don't want to abstract over them any more.

Cas Smiley
I don't agree. This is common practice in the engine world. You write your own abstraction layer over the different APIs you want to support, mainly to support console specific APIs, and then write your engine on top of this abstraction. This makes your engine independent of the API you use.

The huge mistake that people make is that they see Vulkan and all its complexity and say "Damn, this is too complicated. I'm gonna write an abstraction that lets me use Vulkan as if it's OpenGL!". Essentially, what this means is that they'll be writing their own OpenGL driver on top of Vulkan, and trust me, nobody has more resources than Nvidia's driver team when it comes to OpenGL drivers. You're going to fail, it's going to suck, it's going to be a nightmare of complexity and it's going to be slow.

The correct approach which I'm taking is to do the opposite thing: Write a Vulkan emulator on top of OpenGL. Basically, emulate command buffers and descriptor sets, discard all the extra information that Vulkan requires, etc. In fact, OpenGL either trivially maps to Vulkan in most cases, or is simpler than Vulkan. Vulkan render passes just contain a mapping from a list of render targets to a set of draw buffers, while Vulkan framebuffers (together with render passes) trivially map to an array of OpenGL FBOs. Emulating Vulkan on OpenGL is trivial.

The result however is not at all a simplification of the underlying libraries. Rather, it is the most explicit kind of abstraction you can imagine. You essentially have to provide all information that any of the underlying libraries would require for everything, with the unneeded information being silently discarded by each implementation. In practice, this means essentially coding for Vulkan and getting OpenGL/DirectX/Metal/WebGL/OpenGL ES for free.

That being said, you still get a lot of things for free with my abstraction. The most complicated parts of Vulkan are arguably swapchain management, synchronization and queue management. My abstraction will completely hide the swapchain management, simplify and abstract the synchronization, and automate the queue construction. In the end you only need to worry about the "fun" parts of Vulkan, where you actually gain a shitload of performance and flexibility.
21  Game Development / Performance Tuning / Re: Question: Branch Prediction Alleviating Performance Loss in Looped Conditionals? on: 2017-04-09 05:10:45
Sound like using a @FunctionalInterface might be the thing to do.
I actually did some benchmarking for this some time ago where I compared virtual function calls vs a switch statement used to enter different functions. Essentially, I needed it to implement a software command buffer for OpenGL where I queue up commands and then I can execute/replay that list of commands as many times as I want. In my test, I had four very simple "commands" that I stored in different ways and measured the time it took to execute the commands. The commands simply did one of four mathematical operations on an int (addition, subtraction, multiplication or division). I tried 4 different methods for "encoding" the commands:

 - The commands are stored as Command objects which hold the arguments for the command and a simple execute() function. To execute the commands, I simply loop over the Command[] and call execute() on each of the commands.
 - The commands are stored in an int[] as an int command ID followed by their arguments. To execute the commands, I loop over the int[], extracting the ID. Then I use a switch() statement on the int ID to go to the right function for that command, which in turn extracts its arguments from the int array.
 - The commands are stored as singleton Command objects (so only 4 Command objects exist in total) in a Command[], with the arguments for the commands in a separate int[]. To execute the commands, I loop over the Command[] and call a different execute() function which extracts its arguments from the int[] and then executes the command.
 - The commands are encoded as an int command ID followed by their arguments into a native memory pointer allocated and written to using Unsafe. To execute the commands I use a loop that extracts the next command ID. A switch() statement is used to go to the right function based on the command ID and the function itself extracts its arguments.

Here are the performance timings of executing 16 384 commands with the four techniques described above:
Object: 0.247459 ms
Int:    0.117624 ms
Poly:   0.239245 ms
Unsafe: 0.108612 ms


As you can see, the techniques using virtual functions are around half as fast as the ones using a switch statement. Using a simple array of Command instances turned out to be the slowest solution due to worse memory coherency (not all the Command objects get laid out linearly in memory after each other, and this can get worse as the program runs and the GC moves things around). Even using only four singleton Command instances gave pretty bad performance, which tells us a lot about the raw overhead of virtual function calls.

The two techniques using switch() statements on a command ID were consistently over twice as fast as the two that use virtual function calls. Most likely this improved performance comes from the ability of the compiler to inline the non-virtual calls in each case of the switch() statement. The cost of picking the right function is the same, but the cost of CALLING the right function disappears thanks to inlining.
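
For reference, a condensed sketch of the int[]-encoded variant (the command IDs and the tiny instruction set here are illustrative, not the actual benchmark code):

static final int CMD_ADD = 0, CMD_MUL = 1;

static int execute(int[] commands, int length) {
    int value = 0;
    int pos = 0;
    while (pos < length) {
        int id = commands[pos++];   // read the next command ID
        switch (id) {               // static dispatch; each case body can be inlined
            case CMD_ADD: value += commands[pos++]; break; // one int argument follows the ID
            case CMD_MUL: value *= commands[pos++]; break;
            default: throw new IllegalStateException("Unknown command: " + id);
        }
    }
    return value;
}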
22  Game Development / Performance Tuning / Re: Question: Branch Prediction Alleviating Performance Loss in Looped Conditionals? on: 2017-04-08 20:28:58
What you're doing is probably the fastest solution, as in this case the switch statement only needs to run once per sprite instead of once per pixel. However, you're right that branch prediction reduces the performance cost of branches if the branching is very coherent, which is obviously the case when every single pixel in the sprite takes the same path. However, branch prediction is far from perfect, and you can overwhelm the branch prediction "cache" which keeps track of the previous results of branches, as the branch prediction hardware can only track a certain number of branches. This is extremely hardware dependent, so it's hard to reason about best practices in that sense. My tip is simply to test it: try simplifying your drawSprite() function in the way you described (move the switch into the loop) and see what kind of effect it has on performance. Note that it is sometimes not worth branching if it will only save you a couple of instructions, as the cost of the branch (even with branch prediction) can be higher than simply always executing the operations.

The whole thing is quite annoying indeed. It'd be really nice if the compiler could just deal with stuff like this. For example, let's say that you have a loop that repeatedly calls a virtual function:
for(int i = 0; i < array.length; i++){
    interfaceImplementation.virtualFunction(array[i]);
}

AFAIK, in this case you will have to pay the overhead of calling a virtual function (where Java essentially does a switch() statement on the class of the object to determine which implementation of the function to use). It'd be nice if the compiler would realize that the same function is called repeatedly, look up the function once before the loop starts, and then just reuse that for the entire loop. In addition, you'd kind of want the compiler to be able to handle your case too (which one could argue is simpler). By coding:
public void drawSprite(blah... blah)
{
    //precalculate clipping offsets
    ...
    //interpret transforms
    ...

    for(i){
        //precalculate y index stuff blah blah
        for(j){
            switch(transform){
                case 1:
                    //interesting stuff; actual blending and setting of pixels
                    break;

                case 2:
                    //interesting stuff; actual blending and setting of pixels
                    break;

                case 3:
                    //interesting stuff; actual blending and setting of pixels
                    break;

                default:
                    //interesting stuff; actual blending and setting of pixels
                    break;
            }
        }
    }
}

you'd want the compiler to compile the whole thing down to the code that you wrote, essentially placing the switch statement outside the loops and then generating one loop body for each of the possible branches. It'd be soooo cool if this was an actual thing... =/ I'm afraid I don't have a good solution to your problem, but I'd probably choose the flexibility and maintainability of having the switch statement inside the loop (so you can avoid duplicating the for-loops). I suspect the performance penalty will be relatively small.

TL;DR: Benchmark it (and post the results in this thread =P).
23  Game Development / Newbie & Debugging Questions / Re: Right RGB Colors for day&night cycle? on: 2017-04-08 19:55:10
Hmm.. idk. When I look outside at day it's usually more blueish and not red at all. More like:
(0.7, 0.9, 1.0)
I guess it depends on where on earth you live (i.e. how near the sun gets to the zenith).
The sun is always a little bit red/orange due to the blue light being scattered more heavily. Due to this, the direct light from the sun is orange, but the light from the sky is blue. In direct illumination from the sun, these two should roughly cancel out resulting in white. In addition, nights aren't actually blue; this is just a stylization thing that is very common in games and movies to visualize nighttime, and people are quite used to it, so they interpret it as nighttime. Basically, I answered with values that I believe "look good". They aren't based on IRL colors.
24  Game Development / Newbie & Debugging Questions / Re: Right RGB Colors for day&night cycle? on: 2017-04-07 15:01:41
These ratios work decently well for HDR effects too, but they depend a bit on your tone mapping functions as well, so may require additional tweaking.
25  Game Development / Newbie & Debugging Questions / Re: Right RGB Colors for day&night cycle? on: 2017-04-07 12:09:25
The most common values are dark bluish for night, orange for morning/dawn and white for day, possibly with a small tint of orange there too.

Something like
 morning: (0.6, 0.4, 0.3)
 day: (1.0, 0.9, 0.8 )
 dawn: (0.6, 0.4, 0.3)
 night (0.1, 0.1, 0.2), could be darker.

You can tweak these values as you want, but approximately these ratios between colors usually work well.
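
As a small illustration (my own helper, not from the post), you can blend between key colors like these with plain linear interpolation over a 0..1 time-of-day value:

static float[] skyColor(float timeOfDay, float[][] keys) {
    float scaled = timeOfDay * keys.length;      // which segment of the cycle we're in
    int i = (int) scaled % keys.length;
    int j = (i + 1) % keys.length;               // wrap around back to the first key
    float f = scaled - (int) scaled;             // blend factor within the segment
    float[] out = new float[3];
    for (int c = 0; c < 3; c++) {
        out[c] = keys[i][c] * (1.0f - f) + keys[j][c] * f; // linear interpolation per channel
    }
    return out;
}

// Usage with the values above (ordering of the keys is up to you):
// float[][] keys = { {0.1f, 0.1f, 0.2f}, {0.6f, 0.4f, 0.3f}, {1.0f, 0.9f, 0.8f}, {0.6f, 0.4f, 0.3f} };
// float[] rgb = skyColor(timeOfDay, keys);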
26  Java Game APIs & Engines / OpenGL Development / Vulkan multi-GPU support is out! + other features on: 2017-04-01 21:15:27
http://www.youtube.com/v/rQqwG_rQx7A

Vulkan multi-GPU support is now out as an experimental KHX extension to Vulkan called KHX_device_group!


EDIT2: It IS possible to synchronize memory between GPUs in a device group! See below!

I was expecting Vulkan's multi-GPU support to be extremely complicated, forcing you to essentially manage two completely different Vulkan devices and replicating commands to both separate GPUs, but this is simply not at all the case. Vulkan's multi-GPU support seems to be very similar to DirectX 12's. Basically, multiple GPUs from the same vendor (usually connected with an SLI/Crossfire bridge for direct data transfer between the GPUs) are reported as a device group, which is still exposed as a single logical device to the user. In other words, your code will look almost exactly identical, with just a few extra calls to direct commands to different devices, making adding multi-GPU support significantly easier, to the point where I'd even say it's trivial.

When allocating device memory, you're now able to tell the driver exactly which devices should actually allocate the memory. In other words, you do not need to have the same memory allocated on all devices. This can lead to some reductions in memory usage, but in all likelihood it's not a significant gain. You're still most likely going to want to upload your textures to all devices and allocate your render targets on all of them.

There are a couple of features for controlling which device does what. A number of functions take in a device mask, which controls which devices are supposed to actually execute the commands. Additionally, when recording a command buffer there is a new command, vkCmdSetDeviceMaskKHX(), that is used to control which GPUs are to execute the following commands.

A big part of the additions is related to presenting images from a device group. Again, this seems to be mostly automated and I haven't exactly figured out how it's supposed to work, but it seems to just be a change to image acquiring and presenting where you tell the driver which GPU should have access to which image. This part is rather uninteresting; presenting just got even more complicated. =_=


From what I can tell, multi-GPU support is lacking a number of important features.

 - DirectX 12 is currently showing off asymmetrical device groups formed by the user manually, allowing you to do some pretty crazy stuff like combining a mobile Nvidia GPU with the integrated Intel GPU and managing the workload between them. Vulkan is limited to the device groups the driver exposes, which limits multi-GPU support to two identical GPUs for Nvidia, or two similar GPUs from the same generation on AMD.

- There doesn't seem to be a way to manually copy a texture or buffer from one device in a device group to another. o_O This is basically the critical feature that is the whole point of manual multi-GPU support, as it's the foundation of everything. Without it, the fancy new pipelined multi-GPU technique that people are talking about is completely impossible to implement. Hell, even split-frame rendering (SFR), checkerboard rendering and non-trivial alternate frame rendering (AFR) are impossible without the ability to manually trigger copies. However, there seems to be a presentation mode that sums up the colors of all the images of all devices. My guess is that this is supposed to be used with split-frame rendering, allowing each device to render part of the frame (when starting a render pass, you can give each device its own sub-rectangle to render to) and letting the presentation engine merge the result. This is however extremely limiting, as manually synchronizing resources for SFR is necessary for a large number of effects, like bloom, SSAO, etc.
EDIT: A separate extension called EXT_discard_rectangles allows the user to define a number of rectangles that the rendering is clipped against. In addition, these rectangles can be set per device in a device group, allowing for checkerboard multi-GPU rendering, but without manual synchronization this would again break any postprocessing effect that requires neighboring pixels.


All in all, I'm positively surprised by the simplicity of the system they've chosen, but also disappointed in primarily the seeming lack of manual synchronization control, rendering the whole thing kinda useless. =/ The ability to manually start an asynchronous copy using the transfer engine over the SLI/Crossfire bridge is the entire point of manual multi-GPU support, as that's the work that the driver teams of Nvidia and AMD have been doing manually for a decade now, leading to buggy, hacky solutions that either don't even work or never got implemented in a huge number of AAA games, let alone indie games. Since DirectX 12 seems to support this, I can only assume that the lack of this feature is the reason why the extension is classified as an experimental extension, and hence will be updated to include this before actual release.

Still, I am very happy to get reliable information on how multi-GPU support is going to work in Vulkan, to the point where I can confidently continue on my abstraction layer without fear of having to rewrite it.



EDIT2: There IS support for synchronizing memory between GPUs! It is called "peer memory". The flow seems to be something like this:
 - In a device group with multiple discrete GPUs, the device local GPU heap flags will have the VK_MEMORY_HEAP_MULTI_INSTANCE_BIT_KHX flag set. This indicates that data allocated from this heap by default is replicated on all GPUs in the group.
 - Peer memory features can be queried per device group, heap type and device pair (local + remote). The device group must support copying to and from peer memory (=copying between GPUs), but can also support even generic access to memory on other GPUs (=any access, for example texture reads, SSBO reads, etc)!
 - When allocating memory, the VK_MEMORY_ALLOCATE_DEVICE_MASK_BIT_KHX flag bit can be sent in together with a device mask, meaning that memory is only allocated for certain devices in the device group. However, if the subsetAllocation property is not available to the device group, the allocation may consume memory from all devices regardless of which devices are selected using the device mask. Regardless, this means that each device in the device group can have its own instance of memory.
 - When binding memory to a buffer or image, you can additionally specify which instance of memory is bound for which device in the device group. This allows you to essentially create two buffer objects that read from the same memory location, but from different GPUs. Confused yet?



Let's say you have a device group with four GPUs and you want to render using AFR (Alternate Frame Rendering), meaning that each GPU renders every fourth frame completely on its own. However, you realize that to render a frame you need access to the previous color buffer for some temporal anti-aliasing you're doing. In other words, each GPU needs to read an image from the previous GPU. Here's what you'd do to initialize the whole thing:

1. Query the device group properties for how you can share your memory. Let's say that only COPY_DST access is available (the only one required to be available by the spec), meaning that we need to copy the image from one GPU to another to be able to sample it like a texture/image.
2. We allocate the memory for the image we want to synchronize as normal, just making sure that the heap we're allocating from has the VK_MEMORY_HEAP_MULTI_INSTANCE_BIT_KHX bit set. This means that each GPU will allocate its own instance of the image's memory. Let's call this "memory 1".
3. To be able to copy the image from one GPU to another and to simplify the synchronization, we also need to have another identical image allocated. As our device group doesn't support direct access to other GPUs, this image is needed. Its memory is allocated exactly like the previous one. Let's call this "memory 2".
4. We create a Vulkan image ("image 1") and bind it as usual to memory 1. This image object will be used when each GPU draws its own frame.
5. We create a Vulkan image ("image 2") and bind it to memory 2 the same way. This image object will hold the previous image copied from the previous GPU.
6. We create a third Vulkan image object ("image 3"), also bound to memory 2, but to the next device's instance of that memory. This is done by passing in a device index list which makes each device read from a different instance of the memory. In the previous two memory bindings, we (implicitly) told each of the four devices to use instances {0, 1, 2, 3} of the memory, which simply means that device 0 uses instance 0, device 1 uses instance 1. In other words, each device uses its own local instance. For this third binding, we're going to tell the devices to bind to the "next" device's instance by passing in device indices {1, 2, 3, 0}. This gives device 0 access to device 1's instance, device 1 access to device 2's instance, etc. This means that each device can use this weird third image object to copy to the next GPU's memory.

Now, the actual synchronization process is fairly complicated.
1. We direct all our rendering commands to device 0 for the first frame. We have no previous frame yet so we simply ignore it.
2. Device 0 renders to image 1 (in other words, it renders to its own instance of memory 1). We attach a semaphore ("semaphore 1") which is signaled when the rendering to the image is completed.
3. Still on device 0, we go to the dedicated transfer queue and submit a command buffer containing a copy from image 1 to image 3 (in other words, a copy from device 0's current image to device 1's previous image). We tell this command buffer to await semaphore 1 (so we don't start transferring before rendering is complete) and tell it to signal another semaphore ("semaphore 2") when the copy is complete.
4. Device 0 continues with some extra postprocessing, finishes up its frame and presents its result to the window.
We're now gonna start submitting commands for Device 1.
5. Device 1 starts rendering its own frame to its own instance of image 1 until we reach the point where we need access to the previous frame.
6. For this part, we submit a command buffer and tell it to await semaphore 2. Once semaphore 2 is signaled, the copy from device 0 to device 1 is complete, and device 1 can access the previous frame from device 0 using image 2!
7. Go to step 3 and repeat (but for device 1 instead, then 2, then 3, then 0, etc).

Phew! What a mess! But it should work! It's actually not that complicated in practice... xd



Additions from KHR_descriptor_update_template:
 - Allows you to create a descriptor update template, which can be used to update a specific part of a descriptor set for a huge number of descriptor sets quickly. The use case is when you need to update a large number of descriptor sets with the same change. To do that, a template for the change is created, and then vkUpdateDescriptorSetWithTemplateKHR() is called to update all descriptor sets with the new change in one call. This is faster than a manual vkUpdateDescriptorSets() with one VkWriteDescriptorSet struct for each descriptor set to update.


Additions from KHR_push_descriptor:
 - Normally, the user allocates descriptor sets, binds them in command buffers and is then unable to modify the descriptor set until the command buffers that use the set have completed execution. This extension allows the user to define descriptor set layouts that instead read their data from the command buffer, called push descriptors. When a command buffer begins recording, the push descriptors are all undefined, so the user needs to call vkCmdPushDescriptorSetKHR() to update all push descriptors used by the shader. From what I can tell, this is mostly just a convenience feature so that you can avoid having to set up a new descriptor set each frame for descriptor sets that change every frame. Instead you'd just use push descriptors and inject the updates into the command buffer. Really nice feature, as it removes the need for per-frame tracking of descriptor sets that update each frame.


Additions from KHR_incremental_present:
 - Allows you to only present individual rectangles of the screen. This means that you can avoid redrawing the entire screen if only a small section of the screen needs to be redrawn. This is mostly a win on mobile where it can save a lot of battery.


Additions from KHR_maintenance1:
 - Copy/aliasing compatibility between 2D texture arrays and 3D textures.
 - Allowing negative viewport heights to make it possible to do a y-inversion of the scene. This is great for OGL compatibility!!!
 - Command pool trimming, to hint to the driver to release unused memory. Better than force releasing everything when clearing a command pool. Think ArrayList.trimToSize() for command pools.
 - Some error reporting improvements.
27  Game Development / Game Play & Game Design / Re: Graphics Backend Abstraction on: 2017-03-25 00:11:04
If you're going to have the variables inside the enums for speed, I'd use ServiceLoader to lookup the required implementation at runtime and request the underlying values from it, rather than have the implementation have to configure the enums itself - that's ... yuck!  persecutioncomplex
What is a ServiceLoader? How would that work with my enums? What kind of advantages are there here?


From my point of view, an abstraction layer shouldn't contain backend-specific data in the first place.

The abstract model should have its own, backend-independent way to differentiate between
geometry types (POINTS, LINES, TRIANGLES, QUADS, SPLINES, SPHERES, something completely different) and the backends
have to map it to their own model at some point.
I agree with this, which is why I want to go with enums for the abstraction to get compile-time checking of arguments. My initial idea had the OpenGL and Vulkan constants in the enum, which was bad as it meant having backend-specific data in the abstraction. However, I think putting an int field in the enum that the backend can use to map the enum to a constant isn't the same thing and shouldn't be bad design, as it has the best performance and arguably the lowest complexity and maintenance requirements.
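
A sketch of that "int field in the enum" idea (the enum, field and init code below are mine, not the actual abstraction): the abstraction defines a backend-neutral enum, each backend fills in its own constant once at startup, and every later lookup is just a field read.

public enum PrimitiveTopology {
    POINTS, LINES, TRIANGLES;

    public int backendConstant; // set once by whichever backend is loaded
}

// e.g. in the OpenGL backend's initialization (LWJGL constants):
//   PrimitiveTopology.POINTS.backendConstant    = GL11.GL_POINTS;
//   PrimitiveTopology.TRIANGLES.backendConstant = GL11.GL_TRIANGLES;
// and in the Vulkan backend's initialization:
//   PrimitiveTopology.POINTS.backendConstant    = VK10.VK_PRIMITIVE_TOPOLOGY_POINT_LIST;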

The mapping actually has to occur just once, for example when the geometry is read from a file.
Not in my abstraction. It's just a thin layer over OpenGL and Vulkan, so data in a buffer for example isn't tied to a specific geometry topology like triangles or points. The mapping has to be done every time you submit a draw call, and there are a lot of other cases where I'll be needing to map lots of enums to int constants for OGL and VK.

28  Game Development / Game Play & Game Design / Re: Graphics Backend Abstraction on: 2017-03-24 17:51:34
@CopyableCougar4: The problem is that I want the user to be able to pass in an enum (say PrimitiveTopology.Points) and the backend then converts that enum to the symbolic constant of the API it uses (either GL_POINTS or VK_PRIMITIVE_TOPOLOGY_POINT_LIST) on the fly. It's the problem of doing this mapping quickly that I'm worried about. I don't really see your example code solving that, as extracting the int constants to variables at initialization time doesn't really help when the user later passes in enums.
29  Game Development / Game Play & Game Design / Re: Graphics Backend Abstraction on: 2017-03-24 15:34:14
You would just use the hashmap when you first load the program. Then you could store that values in static variables and lookup in that class when you need to.
How do you mean? That doesn't make sense to me.
30  Game Development / Game Play & Game Design / Re: Graphics Backend Abstraction on: 2017-03-24 15:32:23
Benchmark for mapping enum to int: http://www.java-gaming.org/?action=pastebin&id=1525

HashMap: 5.521456 ms    hashMap.get(enum)
EnumMap: 1.240702 ms    enumMap.get(enum)
Array:   0.560504 ms    constants[enum.ordinal()]
Field:   0.338337 ms    enum.constant