  4 or more threads in render loop  (Read 14338 times)
Offline jakethesnake

Senior Devvie


Medals: 8
Projects: 1



« Posted 2014-10-03 10:16:52 »

Hello. The game loop of any Java game is a thread in itself. Since processors are getting more and more cores, would it be an idea to divide your update and render methods across several threads, so that all cores can be fully utilized?
Offline Gibbo3771

JGO Kernel


Medals: 128
Projects: 5
Exp: 1 year


Currently inactive on forums :(


« Reply #1 - Posted 2014-10-03 11:05:37 »

no

"This code works flawlessly first time and exactly how I wanted it"
Said no programmer ever
Offline Roquen

JGO Kernel


Medals: 518



« Reply #2 - Posted 2014-10-03 11:18:12 »

Only if you're software rendering.
Offline princec

« JGO Spiffy Duke »


Medals: 1146
Projects: 3
Exp: 20 years


Eh? Who? What? ... Me?


« Reply #3 - Posted 2014-10-03 11:28:08 »

Well... I've successfully done it and achieved a pretty tidy factor of speed increase... but as usual the thing with threads is, if you actually have to ask this sort of question then the answer for you, at this time, is "no".

Cas :)

Offline jakethesnake

Senior Devvie


Medals: 8
Projects: 1



« Reply #4 - Posted 2014-10-03 11:54:43 »

Well, I understand that it can be done in an update loop if one knows what one's doing and doesn't mess up the data. But what about a render loop? Theoretically, if you have an image you want to display on the screen, would it improve performance to divide the image into four squares and give four threads one piece each to render? Or would it all get clogged up in some bottleneck on its way to the display, rendering the effort useless?
Offline ags1

JGO Kernel


Medals: 367
Projects: 7


Make code not war!


« Reply #5 - Posted 2014-10-03 19:21:41 »

Don't forget your 4 CPU threads are feeding 1 GPU.

Offline Catharsis

JGO Ninja


Medals: 76
Projects: 1
Exp: 21 years


TyphonRT rocks!


« Reply #6 - Posted 2014-10-03 20:29:32 »

I'd be curious if Cas would like to share more info on what aspects of his engine efforts were split between threads.

Typically, an update / render loop should be one thread. Where multiple threads really come into play is marshalling data to and from the network and, depending on the use case, journalling (useful for debugging and for replaying game sequences, especially network data). I'd say look at the Disruptor architecture for a solution and technical info in this regard:
http://lmax-exchange.github.io/disruptor/

For game engine use cases I can see streaming data to the GPU (say, new texture or geometry data for large levels) being facilitated by CPU threads and the GL threading API.

In my OpenGL ES video engine for Android I handle encoding and rendering to the screen in separate threads that share two GL contexts for texture / FBO data. This allows rendering to the screen at 30+ FPS while encoding at the same time; if these operations ran sequentially on the same CPU thread, performance for both would drop to ~15 FPS.

As a baseline you'll want to examine the threading API of OpenGL ES 3.0 for the essentials of coordinating GL operations. ARM has a good tutorial:
http://malideveloper.arm.com/downloads/deved/tutorial/SDK/android/1.6/thread_sync.html

In general though, as has been mentioned already: unless you have a specific use case the answer is usually no; use a single thread for update / render.

Check out the TyphonRT Video Suite:
http://www.typhonvideo.com/

Founder & Principal Architect; TyphonRT, Inc.
http://www.typhonrt.org/
http://www.egrsoftware.com/
https://plus.google.com/u/0/+MichaelLeahy/
Offline Riven
Administrator

« JGO Overlord »


Medals: 1371
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #7 - Posted 2014-10-03 20:32:57 »

update( ) and render( ) run on the same thread, but their implementation can do a fork-join to spread some specific heavy workload over all available cores.

Examples: AI / pathfinding for N units, filling multiple VBOs with vertex data.
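
A minimal sketch of that fork-join idea, assuming Java 8 for `ForkJoinPool.commonPool()`. The `Unit` class, `expensivePathfind`, and the threshold of 256 are hypothetical stand-ins and not taken from any engine in this thread. The game loop thread calls `update`, which blocks until every chunk has been processed:

```java
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Hypothetical game unit, only here to give the tasks something to work on.
final class Unit {
    final int x;
    int pathCost = -1;
    Unit(int x) { this.x = x; }
}

final class ForkJoinUpdate {
    private static final int THRESHOLD = 256; // arbitrary split size; tune by measuring

    // Splits the index range in half until it is small enough to process directly.
    private static final class PathTask extends RecursiveAction {
        private final List<Unit> units;
        private final int from, to;

        PathTask(List<Unit> units, int from, int to) {
            this.units = units;
            this.from = from;
            this.to = to;
        }

        @Override
        protected void compute() {
            if (to - from <= THRESHOLD) {
                for (int i = from; i < to; i++) {
                    Unit u = units.get(i);
                    u.pathCost = expensivePathfind(u); // each unit is touched by one task only
                }
                return;
            }
            int mid = (from + to) >>> 1;
            invokeAll(new PathTask(units, from, mid), new PathTask(units, mid, to));
        }
    }

    // Placeholder for real per-unit work such as pathfinding.
    static int expensivePathfind(Unit u) {
        return Math.abs(u.x) * 2 + 1;
    }

    // Called from the single game-loop thread; returns only when all units are done.
    static void update(List<Unit> units) {
        ForkJoinPool.commonPool().invoke(new PathTask(units, 0, units.size()));
    }
}
```

Because `invoke` waits for the whole task tree, the rest of `update( )` and `render( )` can stay single-threaded and needs no extra locking, provided the tasks only write to their own units.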

Offline princec

« JGO Spiffy Duke »


Medals: 1146
Projects: 3
Exp: 20 years


Eh? Who? What? ... Me?


« Reply #8 - Posted 2014-10-03 22:22:46 »

In my case the sprite engine I use animates, sorts, transforms, and writes out all the sprite vertex data using a thread-per-core and achieves a very tidy speedup as a result (when we're talking tens of thousands of sprites). I do a few other things using multithreading such as particles and emitters, but unfortunately my current game requires deterministic processing so I couldn't do AI on multiple threads (which would have been great).

Cas :)

Offline Catharsis

JGO Ninja


Medals: 76
Projects: 1
Exp: 21 years


TyphonRT rocks!


« Reply #9 - Posted 2014-10-03 23:10:04 »

In my case the sprite engine I use animates, sorts, transforms, and writes out all the sprite vertex data using a thread-per-core and achieves a very tidy speedup as a result.

Chatting a bit about data synchronization may give some bread crumbs to the folks interested in multithreading CPU side.

To achieve synchronization do you use double the memory and have active render buffers and buffers to fill and swap them when ready with an AtomicReference, etc?

I've found in general synchronization use cases for game / video engine dev that the CAS (compare and swap) operations are sufficient for synchronization CPU side without having to go full bore ala ring buffer or the direction the Disruptor takes for really high performance throughput. Because one is dealing with full buffers of data and not many discrete individual events / data updates CAS operations are plenty quick to handle synch for typical game engine use cases.
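
One way the buffer swap could look, as a rough sketch rather than anyone's actual engine code. It uses three buffers instead of two so that the filler never writes into the buffer the renderer is still reading, and it assumes exactly one filling thread and one rendering thread. `FrameSwap`, `Slot`, and the method names are made up for illustration:

```java
import java.util.concurrent.atomic.AtomicReference;

// Single-producer / single-consumer frame hand-off built on AtomicReference.getAndSet.
final class FrameSwap {
    // Immutable holder so the buffer and its "fresh" flag change in one atomic step.
    private static final class Slot {
        final float[] data;
        final boolean fresh;
        Slot(float[] data, boolean fresh) { this.data = data; this.fresh = fresh; }
    }

    private float[] back;   // touched only by the filling thread
    private float[] front;  // touched only by the render thread
    private final AtomicReference<Slot> middle;

    FrameSwap(int size) {
        back = new float[size];
        front = new float[size];
        middle = new AtomicReference<>(new Slot(new float[size], false));
    }

    // Filling thread: the buffer to write the next frame into.
    float[] backBuffer() { return back; }

    // Filling thread: hand over the filled buffer and take the spare one in exchange.
    void publish() {
        back = middle.getAndSet(new Slot(back, true)).data;
    }

    // Render thread: returns the newest published frame, or the previous one if nothing new arrived.
    float[] acquire() {
        if (middle.get().fresh) {
            front = middle.getAndSet(new Slot(front, false)).data;
        }
        return front;
    }
}
```

The render thread just calls `acquire()` each frame and never blocks; if the filler publishes twice between two renders, the older frame is silently skipped. The cost is one small `Slot` allocation per swap.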

The Disruptor architecture though is pretty bad ass because it allows creation of a producer -> multiple consumer dependency graph; IE really useful for network marshaling and journaling that has many discrete events / data leading to one business logic / game logic thread, then back out via another Disruptor chain. And that is a basic example.

Offline Roquen

JGO Kernel


Medals: 518



« Reply #10 - Posted 2014-10-04 09:54:00 »

I thought we were talking about rendering loop.  Everything changes if you're talking about the engine as a whole.  If you need disruptor, then you're over-engineering.
Offline jakethesnake

Senior Devvie


Medals: 8
Projects: 1



« Reply #11 - Posted 2014-10-04 10:58:15 »

Yeah, well, I was just curious whether multithreading is something that's done in Java games, and whether there's an advantage to be gained, or whether this is done in the JVM itself. It makes sense, since having four threads each produce a large prime integer and print it on the console will be faster than one thread producing four large primes...
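
The prime example can be sketched like this. It is only an illustration: `ParallelPrimes` is a made-up name, and any speedup depends on having idle cores, since the JVM does not spread a single thread's work across cores by itself:

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelPrimes {
    // Generates `count` probable primes of `bits` bits, one per worker thread.
    static List<BigInteger> generate(int count, int bits) {
        ExecutorService pool = Executors.newFixedThreadPool(count);
        try {
            List<Callable<BigInteger>> jobs = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                // Each job is independent and owns its Random, so nothing is shared.
                jobs.add(() -> BigInteger.probablePrime(bits, new Random()));
            }
            List<BigInteger> primes = new ArrayList<>();
            for (Future<BigInteger> f : pool.invokeAll(jobs)) {
                primes.add(f.get()); // waits for that job to finish
            }
            return primes;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

This works well precisely because the four jobs share no data. Game update and render code rarely splits that cleanly, which is what the rest of the thread is about.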
Offline princec

« JGO Spiffy Duke »


Medals: 1146
Projects: 3
Exp: 20 years


Eh? Who? What? ... Me?


« Reply #12 - Posted 2014-10-04 11:25:13 »

I thought we were talking about rendering loop.  Everything changes if you're talking about the engine as a whole.  If you need disruptor, then you're over-engineering.
I wouldn't know a disruptor if it came up and bit me on the arse, but I think what I'm doing here for performance is basically "rendering" (less the particle logic and such).

Cas :)

Offline Roquen

JGO Kernel


Medals: 518



« Reply #13 - Posted 2014-10-04 13:14:26 »

I'm speaking kinda generally here.  You might need something like disruptor if you're doing an MMO.  You might use multiple threads for rendering...a case that comes to mind is software occlusion queries (I don't even really think of that as rendering either...just a rendering related task, like simulation).  Generally I'd say you want a thread to be as independent a task as possible...and if you ever start thinking about moving beyond single-producer/single-consumer...you might want to step back and re-think and make sure that's what you really-really-really want to do.
Offline Catharsis

JGO Ninja


Medals: 76
Projects: 1
Exp: 21 years


TyphonRT rocks!


« Reply #14 - Posted 2014-10-04 19:27:46 »

I wouldn't know a disruptor if it came up and bit me on the arse, but I think what I'm doing here for performance is basically "rendering" (less the particle logic and such).

Yes, yes, yes; a little less misdirection...  :-* Rendering is the main discussion... re:

"In my case the sprite engine I use animates, sorts, transforms, and writes out all the sprite vertex data using a thread-per-core and achieves a very tidy speedup as a result (when we're talking tens of thousands of sprites)."

I posited a direction of least resistance in regard to how one can handle synchronizing a multithreaded CPU side buffer filling mechanism as you outlined. For the benefit of others, and heck, even myself, would you be interested in commenting on the synch mechanisms you use? I.e. two buffers swapped via an AtomicReference when ready, while the render loop / thread just chugs along picking up the current buffers stored in the AtomicReference, oblivious of any multithreading going on to fill those buffers. The question was generally how you are solving the synch issues between the worker threads and the render thread, since it's not a fork / join type task. This will help others think about the problem and come up with a solution.

While I don't have an optimized sprite engine, I do have a general computing demo (grid counting) that is implemented in OpenCL and in various CPU implementations, from serial to multithreaded, using a similar mechanism to the one I described above. The multithreaded CPU version gives about 6x improvement in speed over the single threaded solution when mapped to approximately the same number of threads as cores on the CPU. Is that about what you are seeing?  It's nice because I can use a software OpenCL implementation to test if my Java multithreaded implementation is efficient and it's pretty close.

I'm speaking kinda generally here.  You might need something like disruptor if you're doing an MMO.

So was I here on general patterns. The Disruptor pattern works well for high throughput discrete event passing. The way LMAX uses it for even higher performance is that the slots in the ring buffer are actually byte arrays instead of objects. The reason I'm keen on the Disruptor architecture (let's be honest here; it's a fancy ring buffer; are we scared of talking about ring buffers because there is a name attached to a particular use case / implementation?) is that what I'm doing is highly event driven via the EventBus pattern (though a different implementation to Otto or Guava's), meaning that it is a good fit for my architecture in general for data coming on and off the wire. Stop me now before I talk about the "DisruptorBus" ::) I'd have dug into this area a lot more already if it ran on Android and didn't depend on sun.misc.Unsafe features which aren't supported on Android. Like all things, this is infrastructure for a particular use case / problem at hand (certainly leaning toward MMO scale as Roquen mentioned), and as an engine developer it potentially allows me to provide APIs to general developers, you know, game developers, who won't have to think about the complexities of threading under the hood.

My uses of CPU side multithreading presently are much closer to the complete buffer filling and swap scenario as originally discussed.

Offline princec

« JGO Spiffy Duke »


Medals: 1146
Projects: 3
Exp: 20 years


Eh? Who? What? ... Me?


« Reply #15 - Posted 2014-10-04 21:49:39 »

I use a circular buffer of VBOs and just pick the next one each frame. The only synchronising is done with OpenGL depending on what version of OpenGL I've got.
But basically it's:
glMapBufferRange(... GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT | GL_MAP_INVALIDATE_RANGE_BIT ...);

Using a circular buffer of several VBOs means I don't seem to fall foul of synchronisation issues. Ideally I'd use the newfangled way of doing things where you permanently map the buffer and then manage it yourself but that's only available on pretty new GPUs and I'm sticking to OpenGL3.0 as a baseline.

There's no other synching going on anywhere else - it's all just parallel workloads made by chopping stuff up into equal size chunks, one for each core. It's not the very most efficient way to do things but it's easy and simple and gives great results for the effort.
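
The "circular buffer of VBOs" part can be illustrated without any OpenGL at all. In this sketch each slot is a plain direct `ByteBuffer` standing in for one VBO; in real code a slot would hold a VBO handle that gets mapped with `glMapBufferRange` as shown above. `BufferRing` and its methods are hypothetical names:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Ring of buffers; picking a different slot each frame means the slot being written
// is not the one that was handed to the GPU in the previous frame or two.
final class BufferRing {
    private final ByteBuffer[] slots;
    private int current = -1;

    BufferRing(int slotCount, int bytesPerSlot) {
        slots = new ByteBuffer[slotCount];
        for (int i = 0; i < slotCount; i++) {
            slots[i] = ByteBuffer.allocateDirect(bytesPerSlot).order(ByteOrder.nativeOrder());
        }
    }

    // Call once per frame: advances to the next slot and returns it ready for writing.
    ByteBuffer next() {
        current = (current + 1) % slots.length;
        ByteBuffer slot = slots[current];
        slot.clear(); // resets position and limit only; old bytes are simply overwritten
        return slot;
    }
}
```

How many slots are enough depends on how many frames the driver queues up, which is why this avoids synchronisation problems in practice rather than by guarantee.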

Cas :)

Offline Catharsis

JGO Ninja


Medals: 76
Projects: 1
Exp: 21 years


TyphonRT rocks!


« Reply #16 - Posted 2014-10-04 22:48:27 »

I use a circular buffer of VBOs and just pick the next one each frame.
Using a circular buffer of several VBOs means I don't seem to fall foul of synchronisation issues.

This sounds similar to the generally naive approach I take with the render / encoder threads in my video engine for the OpenGL ES 2.0 implementation except replace VBO w/ FBO; just use a blocking queue that the encoder thread waits upon. It works in practice on every device I've tested, but could potentially be unsafe. I'll be beefing things up a bit w/ OpenGL ES 3.0 threading API quite likely soon.
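
A bare-bones sketch of that blocking-queue hand-off, with integers standing in for rendered FBO frames. Nothing here touches OpenGL, and `RenderEncodePipeline`, the queue depth of 2, and the `POISON` marker are all assumptions made for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class RenderEncodePipeline {
    private static final int POISON = -1; // tells the encoder thread to stop

    // "Renders" frameCount frames on the calling thread and returns the ids the encoder saw.
    static List<Integer> run(int frameCount) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(2);
        List<Integer> encoded = new ArrayList<>();

        Thread encoder = new Thread(() -> {
            try {
                // take() blocks until the render side has put a frame in the queue.
                for (int id = queue.take(); id != POISON; id = queue.take()) {
                    encoded.add(id); // stand-in for encoding the frame
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "encoder");
        encoder.start();

        try {
            for (int i = 0; i < frameCount; i++) {
                queue.put(i); // blocks if the encoder has fallen two frames behind
            }
            queue.put(POISON);
            encoder.join(); // after join, reading `encoded` here is safe
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return encoded;
    }
}
```

The queue only orders the CPU side. Whether the GPU has actually finished writing a frame before the other context reads it is a separate question, which is the potentially unsafe part mentioned above.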

There's no other synching going on anywhere else - it's all just parallel workloads made by chopping stuff up into equal size chunks, one for each core.

What you describe sounds fine for one worker thread filling buffers and one render thread. Likely the worker thread will always have the next buffer full before the render thread renders it. I'm trying to figure out the multiple producer / single consumer scenario you mention without synch. Say there are 100k sprites and each of (say) 5 worker threads deals w/ 20k sprites, all filling a single buffer (sync issues). Or is it 5 worker threads each filling their own buffer of all 100k sprites into a unique VBO in the circular buffer of VBOs (starvation in rendering, or skipping to the most recent buffer filled when render occurs / IE wasting CPU on worker threads)?

Obviously it's working and you see an improvement. Just curious and all.. 

 

Offline princec

« JGO Spiffy Duke »


Medals: 1146
Projects: 3
Exp: 20 years


Eh? Who? What? ... Me?


« Reply #17 - Posted 2014-10-05 10:11:51 »

Sprite engine render method:
http://pastebin.java-gaming.org/4816e91800a12

VBO class:
http://pastebin.java-gaming.org/816e1009a0211

Threadulator:
http://pastebin.java-gaming.org/16e101a920115

It's nothing fancy... took me weeks to perfect it mind. Maybe someone can find some bugs or ways to make it faster.

Cas :)

Offline CopyableCougar4
« Reply #18 - Posted 2014-10-05 23:09:49 »

(Disclaimer: I hadn't heard of a ForkJoinPool until now)

The ForkJoinPool just blew my mind :) Now I just want to find a place where I can add something similar to your Threadulator!

CopyableCougar4

Either wandering the forum or programming. Most likely the latter :)

Github: http://github.com/CopyableCougar4
Offline Riven
Administrator

« JGO Overlord »


Medals: 1371
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #19 - Posted 2014-10-06 17:36:35 »

VBO class 169: check for availability

Offline ra4king

JGO Kernel


Medals: 508
Projects: 3
Exp: 5 years


I'm the King!


« Reply #20 - Posted 2014-10-06 17:50:16 »

VBO class 169: check for availability
What do you mean by availability for FlushMappedBufferRange?

Offline Riven
Administrator

« JGO Overlord »


Medals: 1371
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #21 - Posted 2014-10-06 17:57:43 »

Cas will understand.

Offline princec

« JGO Spiffy Duke »


Medals: 1146
Projects: 3
Exp: 20 years


Eh? Who? What? ... Me?


« Reply #22 - Posted 2014-10-06 18:04:38 »

He means I've not checked the GL caps to ensure that function is available (it throws an NPE in LWJGL when you forget to check and it's not present).
Right now the engine's mandated to be OpenGL3.0+ so I'm not sure it'll help my specific case but it's a good idea in library code (such as this is).

Cas :)

Offline Catharsis

JGO Ninja


Medals: 76
Projects: 1
Exp: 21 years


TyphonRT rocks!


« Reply #23 - Posted 2014-10-07 05:16:04 »

So when did you stop supporting Java 1.4? ::) OK.. It's been a while since I've read anything about your backward compatibility decisions. I think if you had said "I'm using ForkJoinPool and invoke", that would have gotten around a lot of the chat above. Yeah, it's hard to say how much faster one can get in that particular direction w/ the invoke-and-wait strategy. The only thing I can think of offhand is not storing things in a Sprite class, so a bare array of data to render, to be more cache friendly (of course there are limitations in this direction / less readable code, etc.). I was under the impression that you were filling buffers concurrently with a separate rendering thread. Figuring out a strategy like that will make things faster, as you fill separately and render immediately with the latest filled data. Even a separate single thread filling buffers could be rather quick. I'm not saying implementing this is easy; it sounds like you are plenty happy with performance presently, so you know... keep making that game. Can't wait to see it.
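
To make the "bare array of data" suggestion concrete, here is a tiny sketch of sprite positions kept in parallel arrays instead of one object per sprite. `SpriteArrays` and its fields are invented for the example, and whether this is actually faster for a given engine is something only profiling can tell:

```java
// Positions of all sprites stored as parallel float arrays (structure of arrays).
final class SpriteArrays {
    final float[] x;
    final float[] y;
    final int count;

    SpriteArrays(int count) {
        this.count = count;
        this.x = new float[count];
        this.y = new float[count];
    }

    // Tight loop over contiguous floats; no per-sprite object is dereferenced.
    void translate(float dx, float dy) {
        for (int i = 0; i < count; i++) {
            x[i] += dx;
            y[i] += dy;
        }
    }
}
```

The trade-off is the one already mentioned: code that indexes into arrays is harder to read and extend than code that works with a `Sprite` class.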

I'm quite glad Doug Lea and all put out most of the java.util.concurrent code as public domain:
http://g.oswego.edu/dl/concurrency-interest/

Been stuck in Java 6 land w/ Android, but thinking about integrating the JSR-166 code eventually into TyphonRT.

Offline princec

« JGO Spiffy Duke »


Medals: 1146
Projects: 3
Exp: 20 years


Eh? Who? What? ... Me?


« Reply #24 - Posted 2014-10-07 08:24:17 »

There are various smaller optimisations in the sprite engine to try and help a bit. For example I can mark a sprite as "frozen": after calculating all its rendering data once, I stash it in a ByteBuffer and simply copy it verbatim into the rendering VBO on subsequent frames. And I can change the sorting algorithm on individual sprite layers (eg. "no sort", handy for terrain), and also specify that an entire layer is "frozen". Stuff like that. The real win though was multithreading the animation and multithreading the sorting/gathering/writing.
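
A guess at what the "frozen" trick might look like in miniature; this is not the code from the pastebin links. `FrozenSprite`, its two-float layout, and `computeCount` are hypothetical, and the target `ByteBuffer` stands in for a mapped VBO:

```java
import java.nio.ByteBuffer;

// A sprite whose vertex bytes are computed once and then copied verbatim every frame.
final class FrozenSprite {
    private final float x, y;
    private ByteBuffer cached; // null until the first write
    int computeCount;          // only here so the caching is observable

    FrozenSprite(float x, float y) {
        this.x = x;
        this.y = y;
    }

    void writeTo(ByteBuffer vbo) {
        if (cached == null) {
            // Imagine the expensive matrix transforms happening here, once.
            ByteBuffer data = ByteBuffer.allocate(2 * Float.BYTES).order(vbo.order());
            data.putFloat(x).putFloat(y);
            data.flip();
            cached = data;
            computeCount++;
        }
        // duplicate() gives a fresh position, so the cached bytes can be copied again next frame.
        vbo.put(cached.duplicate());
    }
}
```

A real version would also need a way to unfreeze the sprite and drop the cache when it moves or changes frame.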

In case anyone's not fallen asleep yet the gist of that code does this:

1. Use a thread-per-layer to sort sprites (if necessary) and gather up a count of all the visible sprites in that layer. Wait.
2. Back in the main thread, now that I know how much sprite data there is, I ensure that my VBOs are big enough and map them.
3. Then I chop up the sprites into a number of threads and calculate and write the sprite data out to the VBOs concurrently (here lies the biggest performance gain). So if there are 100k sprites and 4 cores I'm writing 25k sprites out in each thread. The calculations of the sprite vertex data can be quite complex (sometimes requiring multiple matrix transforms, as sprites can be defined hierarchically). At the same time I build up the rendering command lists, one per thread. Wait.
4. Back in the main thread I then iterate through the rendering command lists in order.

So I'm basically hopping in and out of a thread pool and waiting on all the jobs finishing each time, which is my "synchronisation". There's probably a bit of time spent idle but it's nice and easy.
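
Step 3 above can be sketched roughly as follows. This is an illustration under assumptions, not the Threadulator: the buffer is an ordinary `ByteBuffer` standing in for a VBO mapped on the main thread, each sprite is just two floats, and a real engine would reuse one pool instead of creating it per call. `ChunkedSpriteWriter` is a made-up name:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ChunkedSpriteWriter {
    static final int FLOATS_PER_SPRITE = 2; // hypothetical layout: just x and y
    static final int BYTES_PER_SPRITE = FLOATS_PER_SPRITE * Float.BYTES;

    // Writes spriteCount sprites into target using `threads` workers, then returns.
    static void write(ByteBuffer target, int spriteCount, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Void>> jobs = new ArrayList<>();
            int chunk = (spriteCount + threads - 1) / threads; // equal-size chunks, rounded up
            for (int t = 0; t < threads; t++) {
                final int from = Math.min(spriteCount, t * chunk);
                final int to = Math.min(spriteCount, from + chunk);
                // Each worker gets its own view (own position) over a disjoint byte range.
                final ByteBuffer view = target.duplicate().order(ByteOrder.nativeOrder());
                view.position(from * BYTES_PER_SPRITE);
                view.limit(to * BYTES_PER_SPRITE);
                jobs.add(() -> {
                    for (int i = from; i < to; i++) {
                        view.putFloat(i * 1.0f); // stand-in for transformed x
                        view.putFloat(i * 2.0f); // stand-in for transformed y
                    }
                    return null;
                });
            }
            // invokeAll plus get() is the "Wait." in the list above.
            for (Future<Void> f : pool.invokeAll(jobs)) {
                f.get();
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Since no two workers touch the same bytes, waiting for all jobs to finish is the only synchronisation needed before the main thread unmaps the buffer and issues the draw calls.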

Cas :)

Offline basil_

« JGO Bitwise Duke »


Medals: 418
Exp: 13 years



« Reply #25 - Posted 2014-10-07 08:58:17 »

sounds great! just wondering, "writing concurrently to VBOs" ..

do you map the buffer in the main thread and pass the ByteBuffer to the other threads .. or do you create a gl-context per thread and map there ?
Offline princec

« JGO Spiffy Duke »


Medals: 1146
Projects: 3
Exp: 20 years


Eh? Who? What? ... Me?


« Reply #26 - Posted 2014-10-07 09:08:43 »

The former. Multiple contexts with shared VBOs is just asking for bugs.

Cas :)

Offline basil_

« JGO Bitwise Duke »


Medals: 418
Exp: 13 years



« Reply #27 - Posted 2014-10-07 09:15:02 »

thought so ;)
Offline ra4king

JGO Kernel


Medals: 508
Projects: 3
Exp: 5 years


I'm the King!


« Reply #28 - Posted 2014-10-07 20:07:45 »

Huh, I was wondering about separating out the sprite transformation and filling of the VBOs into separate threads using a thread pool. Is this fast enough to be done every frame?

Offline princec

« JGO Spiffy Duke »


Medals: 1146
Projects: 3
Exp: 20 years


Eh? Who? What? ... Me?


« Reply #29 - Posted 2014-10-07 20:35:00 »

Well, what do you think?

Cas :)
