Java-Gaming.org    
Featured games (78)
games approved by the League of Dukes
Games in Showcase (429)
Games in Android Showcase (89)
games submitted by our members
Games in WIP (468)
games currently in development
News: Read the Java Gaming Resources, or peek at the official Java tutorials
 
    Home     Help   Search   Login   Register   
Pages: [1] 2
  ignore  |  Print  
  Fully utilising Multi cores/Multi cpus?  (Read 9901 times)
0 Members and 1 Guest are viewing this topic.
Offline moogie

JGO Knight


Medals: 11
Projects: 6
Exp: 10 years


Java games rock!


« Posted 2007-10-24 01:55:38 »

Some people have tested the real time ray tracer i am making on different computers with different number of cpus and cores per cpu and it is apparent that the more cpus/cores a computer has, the more dramatic the level of cycles are wasted. (on a 8 core machine an average of 50% utilisation )

I am quite inexperienced with multi-cpu/core programming and am wondering whether there is someing fundamental I am doing incorrectly which would account for this waste and whether there are steps i can to reduce it?

At a high level the ray tracer logic flows in the following steps:

1. obtain the number of cpus/cores
2. create a render thread for each core
3. nominate one thread to the be the master thread
4. split the screen into tiles and give each thread an equal portion of the tiles

5. All render threads apart from the master thread wait by call await() on the "state" CyclicBarrier
6. master thread updates world state and calls await() on the "state" CyclicBarrier awaking all threads.
7. all threads render their tiles and then wait by call await() on the "render" CyclicBarrier.
8. once all threads have called await() on the "render" CyclicBarrier all threads wake and the non-master threads goto step 5.
9. the master thread displays the rendered image and then goes to step 6.'

An applet of a recent version of the raytracer can be found here:

http://javaunlimited.net/hosted/moogie/testApplet.html

The above version has some banding on the specular reflection which is fixed in a later version. To see fps an executable JAR version can be found here:

http://javaunlimited.net/hosted/moogie/jrtrt_specular_fixed.jar



Offline broumbroum

Junior Member





« Reply #1 - Posted 2007-10-24 05:28:03 »

Multi- and Hyper-Threading cannot be confused because the latter is some TradeMark sold by Intel solutions, while we'd say HT from Intel is Multi-Threading which is quite correct.
But in fact, softwares are somekind of logic-systems that cannot be confused with hardware logic-systems like CPU, MIPS, GPU etc.
Whereas we have the opportunity to impl. multiple Threads simultaneously, that cannot have a direct impact on how the processor will receive the datas to compute, that means one HT-CPU vs. Celeron CPU (saying multiple- vs. single- pipelines) have equal chance to find out the results at equal frequency. Of course, multi-Threaded  softwares can be run on either multi-core or single-core (or pipeline(s)) without changing the source code, but if there's any specific compiler solution, then we can observe significant ("full") difference between 2 types of processors.
By the way it can be a good idea to have a list of all specific compiler solutions Java or Sun can bring out to their developers.  Embarrassed 
What you're explaining is a bit tricky and may be understood if one can figure out how a very CPU works... Cheesy

::::... :..... :::::: ;;;:::™ b23:production 2006 GNU/GPL @ http://b23prodtm.webhop.info
on sf.net: /projects/sf3jswing
Java (1.6u10 plz) Web Start pool
dev' VODcast[/ur
Offline moogie

JGO Knight


Medals: 11
Projects: 6
Exp: 10 years


Java games rock!


« Reply #2 - Posted 2007-10-24 06:54:03 »

When i say multi core/cpu i mean it in a general sense... i.e. a CPU which is able to perfrom multiple "threads" of execution simultaneously.

I have been discussing my particular problem with a work collegue and it may be beneficial to change the paradigm from each thread owning tiles to render to having a stack of tiles to render and each thread poping a tile from the stack when they have finished rendering a tile. This might reduce the idle time experienced in the current implementation. However it remains to see whether the added cost of using a syncronised method to pop from the stack will be less than the cost of idle cycles.
Games published by our own members! Check 'em out!
Legends of Yore - The Casual Retro Roguelike
Offline Don Kiddick

Junior Member





« Reply #3 - Posted 2007-10-24 10:40:27 »

The more processors you have the less likely you'll max them all out. That's inevitable.
You might want to try having more threads than processors, say twice or three times as much and see what the result is. Generally having more threads will increase processor utilization and benefit performance until the overhead from synchronization and thread creation counterbalances that.
Experiment.

D.
Online Riven
« League of Dukes »

JGO Overlord


Medals: 613
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #4 - Posted 2007-10-24 20:11:35 »

Having a single point of tile-distribution can be (but in this case certainly won't be) a bottleneck.

In this case the proceesing of a single tile is most likely so much work, that the sync-time will be roughly 0.00001% of the CPU-time.

In extreme cases (like each ray is a 'runnable' and popped off a stack) you might want the thread to pop a bunch of tasks, with a local runnable-queue for each thread.

When some thread runs out of runnables, and another seems to be grinding to a halt on it's local-queue, the thread can go hijacking runnables from other threads local-queues. You need sync-ing here too, but when you have a threadpool of N, you have N locks (pick a random thread and hijack some task), thus reducing the sync-bottleneck by factor N. Building this correctly is extremely hard for 99.93% of all people.

AFAIK this is the most efficient way to distribute tasks from a single source.

Anyway, sync-ing can't be your bottleneck, the tasks are simply too big. You'll probably notice CPU-usage goes from 50% to 99.9%. With the design describes above, you might get 99.99%.

Also... never block on a barrier with all your worker-threads, just start with the next frame. Obviously, for this to work, you need to make a copy of your world for each frame.

P.S.
The numbers in this post are completely fictional, yet pretty good guesstimates.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline moogie

JGO Knight


Medals: 11
Projects: 6
Exp: 10 years


Java games rock!


« Reply #5 - Posted 2007-10-25 00:32:23 »

That is my though also (sync time is negligible compare to task processing time) so all it remains to do is implement the stack based approach Smiley

I do not believe that i will need to explicitly balance the load amongst the worker threads as I would have thought that by the nature of the stack approach the balance is  implicitly handled. i.e. once a thread has finished a task it will pop the next task from the stack which means some threads will be able to process more tasks while one thread is processing a much larger task.

As you have suggested, I have thought about "double" buffering the world state so that each thread can render while the world is being updated, however the world will eventually exist of potentially millions of objects and I would imagine that there would a significant cost in both memory requirements and time to copy the world state each frame. I have not actually attempted it so I cannot be certain about the net loss however.
Offline Don Kiddick

Junior Member





« Reply #6 - Posted 2007-10-25 10:27:39 »

NB. the stack based approach is basically done for you in the 1.5 concurrency package.

e.g.
1  
2  
3  
 
ExecutorService  executorService = Executors.newFixedThreadPool(1000)
executorService.invokeAll(myCollectionOfTasks);


D.
Offline moogie

JGO Knight


Medals: 11
Projects: 6
Exp: 10 years


Java games rock!


« Reply #7 - Posted 2007-10-25 14:06:54 »

I definitely will look into the ExecutorService.. I am a little apprehensive about any object creation over head.

Hmm... the more i look into using Executor Service, the more i think to use it i will need to make significant changes... Currently a "task" needs to know what thread is processing it so it can use the variables defined in the thread's scope. With an executor service is looks like there is no way to achieve this without massive replication of data and the potential performance cost involved.

Offline Don Kiddick

Junior Member





« Reply #8 - Posted 2007-10-25 16:37:11 »

Thread.currentThread() gives you a handle to the current. Not sure why you would want that though.
Offline moogie

JGO Knight


Medals: 11
Projects: 6
Exp: 10 years


Java games rock!


« Reply #9 - Posted 2007-10-26 00:21:08 »

That might work... depending whether I can pass the Executor a derived class from Thread to use.

To help understand why i need to know what thread is currently processing a task I have set up a hypothetical scenario:

I have three threads:

T1,T2,T3

each thread has a pool of objects. This pool is represented by a linked list.

in the main logic loop of each thread:

A Thread grabs task from task queue
same thread calls a "dowork" method on the task.
Task requests an object from the pool and uses it internally.


Thus, if i do not know what thread is currently processing the task I will have to either have one super large pool with synchronised access which is in a scope common to the tasks, or have a pool of objects for each task. The former introducing performance over head and the latter increasing memory usage profoundly.

Perhaps my current implementation is not particuarly efficient? can someone else offer a better solution? Basically i have shifted all object creation to the initialisation phase and simply use these pre-allocated objects during the main runtime of the program. It has given a massive speed increase!
Games published by our own members! Check 'em out!
Legends of Yore - The Casual Retro Roguelike
Online Riven
« League of Dukes »

JGO Overlord


Medals: 613
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #10 - Posted 2007-10-26 01:27:19 »

I doubt you can gain much performance from here. If all CPUs are running >99% on your tasks, is it worth the effort to squeeze out the last percent?

It's often better to work on the bottleneck, like... optimizing the tasks themselves.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline moogie

JGO Knight


Medals: 11
Projects: 6
Exp: 10 years


Java games rock!


« Reply #11 - Posted 2007-10-26 02:56:35 »

At the moment on multi core CPUs i am not achieving 99% utilisation I believe this is due to the current implementation with each thread being assigned a static list of tasks and all threads waiting until all tasks are finished. I will attempt a stack based implementation to see whether i can achieve higher utilisation.

I agree, once i achieve 90%+ cpu utilistion I need to heavily optimise the hot spots.
Offline Don Kiddick

Junior Member





« Reply #12 - Posted 2007-10-26 10:30:58 »

Quote
each thread has a pool of objects. This pool is represented by a linked list.

in the main logic loop of each thread:

A Thread grabs task from task queue
same thread calls a "dowork" method on the task.
Task requests an object from the pool and uses it internally.

I'm envisaging a different algorithm.

For each task in the queue, create a class that implements Callable, takes a task and calls dowork in it's call method.
Create an ExecutorService with the amount of threads you want (e.g Executors.newFixedThreadPool() or cachedThreadPool())
Submit all the Callables to the service via invokeAll

Done!

The executor service handles the rest for you, monitoring the queue of tasks, handing them out to idle threads, and retasking threads when they have finished.

thanks, D.
Offline blahblahblahh

JGO Coder


Medals: 1


http://t-machine.org


« Reply #13 - Posted 2007-10-26 12:59:20 »

That might work... depending whether I can pass the Executor a derived class from Thread to use.

To help understand why i need to know what thread is currently processing a task I have set up a hypothetical scenario:


I think you want ThreadLocal instead - have a look at the javadocs to see what it does.

malloc will be first against the wall when the revolution comes...
Offline moogie

JGO Knight


Medals: 11
Projects: 6
Exp: 10 years


Java games rock!


« Reply #14 - Posted 2007-10-26 15:21:41 »

Thanks for the heads up Smiley

I will look into both suggestions.

This project will be going on hiatus for two weeks... it my wedding this weekend and then i am going to singapore and hong kong for a honey moon Smiley

When i get back i will see which suggestion best suits my needs.

Cheers
Offline CommanderKeith
« Reply #15 - Posted 2007-10-26 15:32:16 »

That's a pretty good excuse.  Congratulations!  Smiley

Offline AI Guy

Senior Newbie





« Reply #16 - Posted 2007-11-02 16:22:44 »

I agree with java.util.concurrent being one of the best ways to go for single applications benefiting from multi-core cpu's.  Would strongly reccomend the book "Java Concurrency in Practice", by Brian Goetz.  Like the title suggests, it is not just a bunch of hard to understand theory.  This will make things alot easier than just reading the Java docs, since these type of applications can fail in very difficult ways to understand and reproduce, unless a regiment is followed.
Offline zendar

Senior Newbie




Go Java!


« Reply #17 - Posted 2007-11-06 00:12:48 »


The garbage collector needs to stop all the threads across cpus/cores to do it's work. Even if you use a parallell collection GC, it needs to stop everything for some tasks. Generating and discarding a lot of objects is never a good idea, but you probably have to be even more careful as the number of parallell threads running on seperate cores increases.

It would seem that because of the jvm having a stop-the-world point in garbage collection, it is pretty important to ensure that the gc doesn't run too often on multiple cores.

Just a thought. :-)
Offline cylab

JGO Knight


Medals: 34



« Reply #18 - Posted 2007-11-06 09:45:14 »

The garbage collector needs to stop all the threads across cpus/cores to do it's work.

First time I hear this. Any pointers for reference?

Mathias - I Know What [you] Did Last Summer!
Offline bienator

Senior Member




OutOfCoffeeException


« Reply #19 - Posted 2007-11-06 11:54:15 »

First time I hear this. Any pointers for reference?
that does just happen on special circumstances (eg. if the VM runs OutOfMemory). You can track this easily if you debug the GC. Everytime you read "full collection" it was a stop the world collection... You should always tweak your realtime app for minor collections they run concurrently.

Offline Chris61182

Junior Member





« Reply #20 - Posted 2007-11-06 18:41:56 »

First time I hear this. Any pointers for reference?

http://www.ibm.com/developerworks/java/library/j-jtp11253/

And there are probably more, that was just the first that I found with a quick Google search.
Offline moogie

JGO Knight


Medals: 11
Projects: 6
Exp: 10 years


Java games rock!


« Reply #21 - Posted 2007-11-14 01:24:41 »

I am back from my honeymoon which was a nice diversion from real life Smiley

I have not yet had time to get back into this project but just for kicks i decided to make a native windows executable of the real time raytracer to see what the performance would be.

I was both very shocked and very disappointed! When using Excelsior Jet to make a native win32 exe there is a  33% increase in performance as compared to java 1.6 JRE. I did not think that there would be such a marked difference... infact i thought the hotspot complier would be able to perform better optimisations...

Here is the native executable (9.9 meg) and there is the jar it was compiled from.

Please let me know if you see similar results... perhaps it is just an oddity with my system?

During my holiday i have thought up some other optimisations which may help speed up the rtrt. However the first thing i will attempt is the stack based tile render discussed earlier.
Offline VeaR

Junior Member





« Reply #22 - Posted 2007-11-14 01:47:31 »

Try the 1.7EA JDK. I was getting 30% speed increase with my own tests.

Offline moogie

JGO Knight


Medals: 11
Projects: 6
Exp: 10 years


Java games rock!


« Reply #23 - Posted 2007-11-14 01:58:55 »

thanks for the link. It will be interesting to see if any improvment as going from JRE 1.5 to 1.6 gave about 20% increase.
Offline moogie

JGO Knight


Medals: 11
Projects: 6
Exp: 10 years


Java games rock!


« Reply #24 - Posted 2007-11-14 02:29:09 »

hmm... going from 1.6 to 1.7 seems to have only a minor increase 1-3% which could be easily be staistically insigificant

more interestingly is that on a core duo system, java 1.7 is actually slower (~27fps compared to ~30fps with jre1.6)
on the same core duo system the native exe version is running at ~41 fps
Offline moogie

JGO Knight


Medals: 11
Projects: 6
Exp: 10 years


Java games rock!


« Reply #25 - Posted 2007-11-16 12:53:35 »

I have had some free time and have implemented the tile-stack based render thread load balancing and from my tests with a core duo laptop i am achieving a high level of cpu(s) utilisation.

I would like to ask if those with two or more cpu(s)/cores could run the latest version and let me know if you achieve a high level of processor(s) utilisation? If you could list your system and the fps that would be great as well Smiley

You can compare it to the version 8 in the above post which does not implement the stack based approach.
Online princec

JGO Kernel


Medals: 285
Projects: 3
Exp: 16 years


Eh? Who? What? ... Me?


« Reply #26 - Posted 2007-11-16 13:18:35 »

What that really shows you is just how good Jet is Wink

What about the 1.6 server VM? In days gone by I generally found the server VM to be about 30% faster than the client VM, matching Jet typically (except of course on startup time which Jet rules).

Cas Smiley

Offline Martin Strand

Junior Member





« Reply #27 - Posted 2007-11-16 13:24:29 »

athlon 64 x2 4200+
3gb ram
1.6 client vm

29 fps / 88% cpu with version 8
32 fps / 99% cpu with version 10
Offline brackeen

Junior Member





« Reply #28 - Posted 2007-11-16 20:28:45 »

2.16 Ghz Core 2 Duo, Java 5 on Leopard (Activity Monitor reports 200% CPU if both cores are fully utilized)

33 fps, 160% CPU with version 8
33 or 36 fps, 175% CPU with version 10

Version 8 was consistent - always around 33fps.
Version 10 varied on each execution.  One execution would be 33 fps, another would be 36fps. Maybe something to do with thread scheduling?

Offline oNyx

JGO Coder


Medals: 1


pixels! :x


« Reply #29 - Posted 2007-11-16 21:07:36 »

AMD 64 x2 4200+
2gb ram
1.6 client vm
winxp

v8: 33fps@88%
v10: 35fps@95%

(Guess my ram is faster than Martin's.)

edit:
1.6 server vm

v8: 46fps@85%
v10: 50fps@91%

弾幕 ☆ @mahonnaiseblog
Pages: [1] 2
  ignore  |  Print  
 
 
You cannot reply to this message, because it is very, very old.

 

Add your game by posting it in the WIP section,
or publish it in Showcase.

The first screenshot will be displayed as a thumbnail.

theagentd (6 views)
2014-04-24 23:00:44

xsi3rr4x (83 views)
2014-04-15 18:08:23

BurntPizza (75 views)
2014-04-15 03:46:01

UprightPath (86 views)
2014-04-14 17:39:50

UprightPath (69 views)
2014-04-14 17:35:47

Porlus (86 views)
2014-04-14 15:48:38

tom_mai78101 (109 views)
2014-04-10 04:04:31

BurntPizza (169 views)
2014-04-08 23:06:04

tom_mai78101 (265 views)
2014-04-05 13:34:39

trollwarrior1 (217 views)
2014-04-04 12:06:45
List of Learning Resources
by SHC
2014-04-18 03:17:39

List of Learning Resources
by Longarmx
2014-04-08 03:14:44

Good Examples
by matheus23
2014-04-05 13:51:37

Good Examples
by Grunnt
2014-04-03 15:48:46

Good Examples
by Grunnt
2014-04-03 15:48:37

Good Examples
by matheus23
2014-04-01 18:40:51

Good Examples
by matheus23
2014-04-01 18:40:34

Anonymous/Local/Inner class gotchas
by Roquen
2014-03-11 15:22:30
java-gaming.org is not responsible for the content posted by its members, including references to external websites, and other references that may or may not have a relation with our primarily gaming and game production oriented community. inquiries and complaints can be sent via email to the info‑account of the company managing the website of java‑gaming.org
Powered by MySQL Powered by PHP Powered by SMF 1.1.18 | SMF © 2013, Simple Machines | Managed by Enhanced Four Valid XHTML 1.0! Valid CSS!