Java-Gaming.org    
Yet another particle engine update!
Offline theagentd
« Posted 2012-11-10 18:12:00 »

Hey, guys. I know you're getting tired of this but I just can't stop it!  Grin

I managed to get 3.7 million particles running at 60 FPS!



This latest version is simply an improvement to fix a problem with the old transform feedback particle engine: It didn't work with SLI (multiple GPUs). The driver does not explicitly synchronize the buffer memory between GPUs after using transform feedback, so trying to render particles from the last frame simply did nothing. I solved this by letting each GPU have its own particle buffer and update it twice (for 2 GPUs that is) but only render it once. That way my two GPUs can work with only their own feedback buffers and no driver synchronization. It's not very efficient of course, since the updating has to be done twice per GPU now, but at least the rendering of the particles is only done once per GPU, which pretty much doubles fill-rate. Even with just my smoothed pixel-sized point particles I got a performance increase from 3.0 million particles to 3.7 million particles, an increase of roughly 23%. Note that this was with an Nvidia GTX 295 which came out in January 2009; high-end at the time but not very spectacular today.

The main limitation at the moment is actually memory usage. My transform feedback code is extremely unoptimized when it comes to memory usage (I could reduce it by 25-30% with relative ease). The real problem, however, is that the driver isn't smart enough to figure out that each buffer is only used by one GPU, so every buffer is allocated on both GPUs. For 2 GPUs, I need 4 full particle buffers to be able to ping-pong between two of them on each GPU. 3 700 000 * 36 bytes * 2 * 2 = 508 MB of data... Of course, you don't need that many particles in a real game, so memory usage will be a much smaller problem there. The high efficiency of this technique combined with the fact that I got it working at all on SLI/Crossfire systems still makes it worth using even if you "only" have 100k particles or so.
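The memory figure above can be reproduced with a few lines of Java. This is only a sketch of the arithmetic in the post; the class and method names are made up for illustration, and "508 MB" is read here as mebibytes (1024 * 1024 bytes).

```java
import java.util.Locale;

final class ParticleMemory {
    /**
     * Total bytes when each GPU needs its own set of ping-pong buffers
     * and the driver mirrors every buffer on every GPU.
     */
    static long totalBytes(long particles, int bytesPerParticle, int gpus, int buffersPerGpu) {
        return particles * bytesPerParticle * gpus * buffersPerGpu;
    }

    /** Converts a byte count to mebibytes. */
    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // Numbers from the post: 3.7 million particles, 36 bytes each, 2 GPUs, 2 buffers per GPU.
        long bytes = totalBytes(3_700_000L, 36, 2, 2);
        System.out.println(String.format(Locale.ROOT, "%d bytes = %.1f MiB", bytes, toMiB(bytes)));
    }
}
```

Plugging in a more game-like count (say 100 000 particles) shows why the author expects memory to be a much smaller problem in practice.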

As I wrote above, fill-rate is basically double what it was before, while the cost of updating particles is the same. With 3 million particles with a point size of 1 (4 pixels covered per particle due to smoothing) I "only" got a 23.3% increase in performance (60 --> 74), but with 100 000 particles with a point size of 43 (1 849 pixels per particle) the performance increase was 93.3% (60 --> 116). Fill-rate scales linearly with the number of GPUs, while particle update performance does not scale at all. My program can handle any number of GPUs (= up to 4, a limitation of SLI/Crossfire), but memory usage may become a problem on quad-SLI systems. =S
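One way to read these two measurements is through Amdahl's law: if only the fill-rate part of a frame scales with the GPU count, the observed speedup tells you how large that part was. The sketch below is an interpretation, not something from the post; it assumes the frame splits cleanly into a part that scales perfectly and a part that does not scale at all, and the names are hypothetical.

```java
import java.util.Locale;

final class ScalingEstimate {
    /**
     * Amdahl's law, speedup = 1 / ((1 - p) + p / gpus), solved for p:
     * the fraction of the original frame time that scaled with the GPU count.
     */
    static double scalableFraction(double fpsBefore, double fpsAfter, int gpus) {
        if (gpus < 2) {
            throw new IllegalArgumentException("need at least 2 GPUs to estimate scaling");
        }
        double speedup = fpsAfter / fpsBefore;
        return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / gpus);
    }

    public static void main(String[] args) {
        // FPS pairs taken from the post above.
        double tinyPoints = scalableFraction(60, 74, 2);   // 3 million particles, point size 1
        double bigPoints = scalableFraction(60, 116, 2);   // 100 000 particles, point size 43
        System.out.println(String.format(Locale.ROOT,
                "scalable share: %.1f%% (small points), %.1f%% (large points)",
                tinyPoints * 100, bigPoints * 100));
    }
}
```

Under that assumption, roughly 38% of the frame was fill-bound in the small-point case and roughly 97% in the large-point case, which matches the claim that the update cost is the part that does not scale.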

It should be possible to further optimize this by simply doing the update once on each GPU but with twice the delta, but this may cause inconsistencies between the GPUs due to floating point errors that build up over the lifetime of a particle. Might be worth investigating though, since particles generally have very short lifetimes. I'd estimate the performance of such an implementation at at least 5 million particles on my graphics card, since it'd scale perfectly with any number of GPUs.

Myomyomyo.
Offline Danny02
« Reply #1 - Posted 2012-11-10 18:49:17 »

Quote from theagentd: "Hey, guys. I know you're getting tired of this but I just can't stop it!  Grin"

OK, that's it!!!  Angry

CHALLENGE ACCEPTED!
Offline matheus23

« Reply #2 - Posted 2012-11-10 20:01:16 »

Quote from theagentd: "Hey, guys. I know you're getting tired of this but I just can't stop it!  Grin"

Quote from Danny02: "OK, that's it!!!  Angry CHALLENGE ACCEPTED!"

Wut? You try that too now?

See my development blog: http://matheusdev.tumblr.com | Or look at my RPG: Ruins of Revenge | Or simply my coding: on GitHub
Offline Ultroman

« Reply #3 - Posted 2012-11-10 22:07:15 »

Everyone get a chair and some popcorn!

- Jonas
Offline Sickan

« Reply #4 - Posted 2012-11-10 22:10:45 »

I'm jealous of, and amazed by, the wizardry you pull off.

Cheers! Cool
Offline theagentd
« Reply #5 - Posted 2012-11-11 01:44:43 »

Quote from theagentd: "Hey, guys. I know you're getting tired of this but I just can't stop it!  Grin"

Quote from Danny02: "OK, that's it!!!  Angry CHALLENGE ACCEPTED!"

Heh, I quickly hacked together a new version which does what I wrote before:
Quote from theagentd: "It should be possible to further optimize this by simply doing the update once on each GPU but with twice the delta, but this may cause inconsistencies between the GPUs due to floating point errors that build up over the lifetime of a particle. Might be worth investigating though, since particles generally have very short lifetimes. I'd estimate the performance of such an implementation at at least 5 million particles on my graphics card, since it'd scale perfectly with any number of GPUs."
However, I "solved" the floating point problems by simply doing the updating twice in the shader, with a for-loop, for the particles that needed it, instead of doing the whole transform feedback pass twice. This turned out to be free (it's probably memory bottlenecked). I just hacked it all together, so it explodes for >2 GPUs and I honestly don't know exactly how it's working ^^', but I've compared it frame by frame with my original (non-SLI) version and it's identical. Performance speaks for itself: 5 790 000 particles at 62 FPS. If I increase the number of particles any more than that I run out of VRAM (I only have 896 MB, minus the 41 MB Windows uses) and performance drops to 1-5 FPS due to swapping. Optimizing memory usage would probably improve performance too, since it seems to be bottlenecked by that.
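To see why looping the update twice is not the same as one update with a doubled delta, here is a CPU-side Java sketch. The actual shader is not shown in this thread, so the explicit Euler integrator, the gravity constant and all names below are assumptions made purely for illustration.

```java
final class SubstepDemo {
    // Hypothetical constant acceleration; the thread does not say which forces are simulated.
    static final float GRAVITY = -9.81f;

    /** One explicit Euler step in 1D. Returns {position, velocity}. */
    static float[] step(float pos, float vel, float dt) {
        float newPos = pos + vel * dt;
        float newVel = vel + GRAVITY * dt;
        return new float[] {newPos, newVel};
    }

    /** Runs `substeps` Euler steps of size dt, like a for-loop inside the update shader. */
    static float[] update(float pos, float vel, float dt, int substeps) {
        float[] state = {pos, vel};
        for (int i = 0; i < substeps; i++) {
            state = step(state[0], state[1], dt);
        }
        return state;
    }

    public static void main(String[] args) {
        float dt = 1f / 60f;
        float[] looped = update(0f, 10f, dt, 2);    // two frames' worth, one frame at a time
        float[] doubled = step(0f, 10f, 2f * dt);   // one step with twice the delta
        System.out.println("looped position:  " + looped[0]);
        System.out.println("doubled position: " + doubled[0]);
    }
}
```

The looped version performs exactly the same operations as two separate passes, which is consistent with the author's frame-by-frame comparison coming out identical, while the doubled-delta version lands on a different position after a single frame.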

I'm, uh, awaiting your counter-attack? =S

Myomyomyo.
Offline Joshua Waring

« Reply #6 - Posted 2012-11-11 06:09:05 »

My particle system can only get 100K particles at 23 FPS... although that's only one thread on the CPU... and it only uses a few megabytes and has point gravity...

EDIT: My counter is better than yours! cough

The world is big, so learn it in small bytes.
Offline theagentd
« Reply #7 - Posted 2012-11-11 06:22:13 »

Quote from Joshua Waring: "My particle system can only get 100K particles at 23 FPS... although that's only one thread on the CPU... and it only uses a few megabytes and has point gravity... EDIT: My counter is better than yours! cough"

I get around 2100-2200 FPS (<0.5 ms) with 100k particles...

javaw.exe uses 25 MB of RAM. VRAM usage is around 31 MB, including 6 MB for the 1080p framebuffer (theoretically the particle buffers use around 14 MB). CPU usage is close to 0%, since the only thing the CPU does is generate new particles and issue a few OpenGL commands per frame.

Myomyomyo.
Offline Joshua Waring

« Reply #8 - Posted 2012-11-11 06:28:26 »

The point here is that the GPU completely devastates the CPU for these calculations. I would like to learn OpenCL, and LWJGL has the capabilities  Smiley

I would also like to test it at home on my 6970.

PS: Is there a .jar we can play around with  Smiley?

The world is big, so learn it in small bytes.
Offline Sickan

« Reply #9 - Posted 2012-11-11 15:19:04 »

I'm sorry if this sounds a bit leechy, but can I see your source code somewhere? I'm very, very curious! Thanks in advance! Cheesy
Offline ra4king

« Reply #10 - Posted 2012-11-11 16:35:10 »

Quote from Sickan: "I'm sorry if this sounds a bit leechy, but can I see your source code somewhere? I'm very, very curious! Thanks in advance! Cheesy"

I second this notion.

I wonder how many particles my GTX 580 with 1.5GB of VRAM could handle........ Wink

Offline Danny02
« Reply #11 - Posted 2012-11-11 17:02:12 »

Time flies so fast  Cool
I haven't even really started yet (only some research).

It would be nice to compare the sources later and do real benchmarks. What kind of features does your particle system/simulation have?
Offline theagentd
« Reply #12 - Posted 2012-11-11 18:11:15 »

I discovered a stupid bug in my barely working new SLI version. The ordering of the particles is different between GPUs. This of course doesn't affect performance, but does cause some flickering. I didn't see it since the particles are so small they rarely overlap, and when they do it's so chaotic it's impossible to spot. When I increased the particle size I could easily see it. Well, it doesn't affect performance, so meh.

Quote from Joshua Waring: "The point here is that the GPU completely devastates the CPU for these calculations. I would like to learn OpenCL, and LWJGL has the capabilities  Smiley I would also like to test it at home on my 6970. PS: Is there a .jar we can play around with  Smiley?"

Quote from ra4king: "I wonder how many particles my GTX 580 with 1.5GB of VRAM could handle........ Wink"

Not yet, I'll throw something together... Will be interesting to see how well it performs on other architectures. My old GTX 295 seems to work pretty well with transform feedback considering that it's very fast compared to my older shader/OpenCL implementations, but my laptop's GTX 460M takes a pretty big performance hit from it, probably because it has a lot less memory bandwidth. It seems like it's very tied to which architecture the GPU has, so it will be very interesting to see how it performs on later Nvidia cards and Radeon cards. I plan to release the single GPU version soon.

Note that transform feedback has nothing to do with OpenCL. It's an OpenGL feature that's available on OGL3 cards (the basic version is core in OpenGL 3.0; the transform_feedback2/3 extensions became core in OGL4).

Quote from Sickan: "I'm sorry if this sounds a bit leechy, but can I see your source code somewhere? I'm very, very curious! Thanks in advance! Cheesy"

I'm still planning on creating a tutorial on transform feedback. I might be able to do it tonight, but don't count on it...

Quote from Danny02: "Time flies so fast  Cool I haven't even really started yet (only some research). It would be nice to compare the sources later and do real benchmarks. What kind of features does your particle system/simulation have?"

Right now? Almost no features at all. Not even texturing. The point is that after updating the particles with transform feedback you have a perfectly formatted VBO with data that you can do whatever you want with. Want to draw an asteroid 3D model for each particle? Just use instancing. 2D sprites? Use a geometry shader.

Myomyomyo.
Offline StumpyStrust
« Reply #13 - Posted 2012-11-12 07:25:09 »

Hmmm... this would take just about everything off the CPU. The only question I have is how much of the GPU you take away. It is great for pure simulations, but when it comes to actually using it in a game you still have all those triangles you need to render.

Offline theagentd
« Reply #14 - Posted 2012-11-12 09:20:45 »

Quote from StumpyStrust: "Hmmm... this would take just about everything off the CPU. The only question I have is how much of the GPU you take away. It is great for pure simulations, but when it comes to actually using it in a game you still have all those triangles you need to render."

Performance of the AFR version on my GPU was 358 980 000 particles per second (5 790 000 * 62), so around 350 million particles per second, including some cheap rendering. 1 million particles run at around 350 FPS, so the math seems to work out correctly. Since particles are usually fragment limited, I think completely eliminating the updating and reuploading from the CPU and doing it basically for free on the GPU is a clear gain. 100k particles should in theory run at 3500 FPS and therefore take around 0.28 ms to update and render. When I get home (I'm at uni) I can benchmark it with rasterizing disabled to check the raw updating performance of it.

Myomyomyo.
Offline Joshua Waring

« Reply #15 - Posted 2012-11-12 10:53:24 »

I assume you're using OpenCL?

The world is big, so learn it in small bytes.
Offline Danny02
« Reply #16 - Posted 2012-11-12 18:21:12 »

University is a bit boring at the moment, so I threw together a first version:
~4 million at 30 FPS
~16 million at 5 FPS

So far it's only running on my laptop with an Nvidia 550, and without the Nvidia driver, on Mesa on Linux.
I have to test it on my desktop ^^
Offline theagentd
« Reply #17 - Posted 2012-11-12 18:53:56 »

Quote from Joshua Waring: "I assume you're using OpenCL?"

Not at all! Just OpenGL 3 with extensions!

Quote from Danny02: "University is a bit boring at the moment, so I threw together a first version: ~4 million at 30 FPS, ~16 million at 5 FPS. So far it's only running on my laptop with an Nvidia 550, and without the Nvidia driver, on Mesa on Linux. I have to test it on my desktop ^^"

How are you updating and drawing your particles?

EDIT:
Benchmark without any rendering (only updating with transform feedback):
5 000 000 particles at 110 FPS = 550 000 000 particles per second. One million particles take about 1.8ms to update. That's definitely less than it would take to just upload that amount of data each frame.

If anyone missed it: http://www.java-gaming.org/topics/opengl-transform-feedback/27786/view.html

Myomyomyo.
Offline Joshua Waring

« Reply #18 - Posted 2012-11-13 08:21:55 »

Could we expect this much performance from OpenCL? O.o

The world is big, so learn it in small bytes.
Offline theagentd
« Reply #19 - Posted 2012-11-13 08:45:46 »

In some ways it might, but keep in mind that for me, OpenCL performed identically to OpenGL using textures to store the data and a shader to update them. Without transform feedback, you also have the huge problem of keeping track of which particles are alive. That approach has good peak performance when all particles are alive, but a frame costs almost as much when you have no particles at all, since you have to render a point for every allocated particle each frame to find the alive ones. Transform feedback solves this since it compacts the alive ones to the beginning of the VBO and also allows you to draw only the number of alive particles, but it might be a little slower on some hardware. I also don't think that multi-GPU rendering will work with OpenCL.

OpenCL was a bit disappointing. It's really only faster if you manage to utilize the shared memory of the clusters to do calculations (which you can't for particles), and even then it might not be faster. Just getting it up to OpenGL's performance was hard, since you need to care about how you read memory so it's aligned and so on.
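The compaction idea described above can be mimicked on the CPU to make it concrete. This is a plain-Java sketch, not the author's GPU code: the 5-float layout, the field order and every name are assumptions, and on the GPU the "skip" would typically be a shader stage that simply emits nothing for a dead particle.

```java
final class CompactionSketch {
    // Hypothetical layout, shortened for brevity: x, y, vx, vy, lifeLeft.
    static final int STRIDE = 5;
    static final int X = 0, Y = 1, VX = 2, VY = 3, LIFE_LEFT = 4;

    /**
     * Updates the first `count` particles in src and writes only the survivors
     * to the front of dst. Returns the number of particles still alive.
     */
    static int updateAndCompact(float[] src, int count, float[] dst, float dt) {
        int alive = 0;
        for (int i = 0; i < count; i++) {
            int in = i * STRIDE;
            float life = src[in + LIFE_LEFT] - dt;
            if (life <= 0f) {
                continue; // dead particle: nothing is written, so no hole is left in dst
            }
            int out = alive * STRIDE;
            dst[out + X] = src[in + X] + src[in + VX] * dt;
            dst[out + Y] = src[in + Y] + src[in + VY] * dt;
            dst[out + VX] = src[in + VX];
            dst[out + VY] = src[in + VY];
            dst[out + LIFE_LEFT] = life;
            alive++;
        }
        return alive;
    }

    public static void main(String[] args) {
        float[] a = {
            0f, 0f, 1f, 2f, 1.0f,    // survives this frame
            5f, 5f, 0f, 0f, 0.01f,   // dies this frame
            1f, 1f, -2f, 0f, 2.0f    // survives this frame
        };
        float[] b = new float[a.length];
        int alive = updateAndCompact(a, 3, b, 0.5f);
        System.out.println("alive after frame 1: " + alive);
        // Ping-pong: the next frame reads b and writes a, touching only `alive` particles.
        alive = updateAndCompact(b, alive, a, 0.5f);
        System.out.println("alive after frame 2: " + alive);
    }
}
```

Because the survivors are always packed at the front, both the next update and the draw call only need the returned count, so a frame with few live particles costs correspondingly little.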

Myomyomyo.
Offline StumpyStrust
« Reply #20 - Posted 2012-11-14 01:51:25 »

theagentd: please make a program better than this

http://www.youtube.com/watch?v=1utq8kW8Yog

you know...to show how boss java is.

Offline sproingie

« Reply #21 - Posted 2012-11-14 02:03:29 »

Java isn't the one that's "boss" in all this, it's GLSL, or in the case of that demo, HLSL (basically the same).  Java is just the glue layer here, which should perform as well as nearly anything else.  It's more like DX vs GL rather than anything vs Java.

That demo does use a compute shader, which only has a direct equivalent in OpenGL 4.3, but is otherwise morally equivalent to OpenCL (or perhaps a subset of it).  The author does link to the source (I'll link it here too) so it would be interesting to see how much of it is directly portable.

Offline theagentd
« Reply #22 - Posted 2012-11-14 02:23:31 »

Quote from StumpyStrust: "theagentd: please make a program better than this http://www.youtube.com/watch?v=1utq8kW8Yog you know...to show how boss java is."

My desktop doesn't have an OGL4 graphics card, so I can't test it on my computer. There's a big chance that compute shaders are faster, but they're still not as flexible as transform feedback. Sure, you might be able to squeeze out a few more particles, but it's extremely inefficient when you only have a few. He's probably getting better performance since his particles contain less information, most likely just 24 bytes vs my 36 bytes per particle. Besides, I just need to use a geometry shader to expand the points into quads, which is exactly what I did for that sprite engine. =S

Quote from sproingie: "Java isn't the one that's "boss" in all this, it's GLSL, or in the case of that demo, HLSL (basically the same). Java is just the glue layer here, which should perform as well as nearly anything else. It's more like DX vs GL rather than anything vs Java. That demo does use a compute shader, which only has a direct equivalent in OpenGL 4.3, but is otherwise morally equivalent to OpenCL (or perhaps a subset of it). The author does link to the source (I'll link it here too) so it would be interesting to see how much of it is directly portable."

I wouldn't say that OpenCL = compute shaders; compute shaders are much easier to use correctly (handling memory). I should really make one using OGL 4.3...

Myomyomyo.
Offline StumpyStrust
« Reply #23 - Posted 2012-11-14 06:12:05 »

Man, everyone takes things so literally. I know it is shaders and not Java; I'm just saying that that app is wicked cool.

Offline ra4king

« Reply #24 - Posted 2012-11-14 07:18:20 »

Quote from theagentd: "He's probably getting better performance since his particles contain less information, most likely just 24 bytes vs my 36 bytes per particle."

What's in those 24 bytes and 36 bytes?

Offline Joshua Waring

« Reply #25 - Posted 2012-11-14 10:28:09 »

I'm trying to learn OpenCL and I'm currently reading Heterogeneous Computing with OpenCL and I was amazed by the lack of tutorials on the internet.
Is OpenCL really that unpopular?

The world is big, so learn it in small bytes.
Offline StumpyStrust
« Reply #26 - Posted 2012-11-14 10:54:53 »

It is viable but hard to implement in a game, from what I understand.

The big problem with things such as particle systems/physics is that they can be very computationally intensive. With particle systems, just having 100k particles means that if everything is done on the CPU, the CPU has to calculate the position 100k times, calculate anything else the particle has 100k times, and still needs to send the updated particles to the GPU.

With OpenCL you can have the GPU do all the calculations, but you still need to send things to the GPU, which is where my particle system dies. theagentd's suggestion is very nice as you get a prebuilt VBO that you can simply throw at the GPU, meaning the CPU does next to nothing. Also, it is usable in OpenGL 3.0, which is very nice. I am now wondering whether I could make a sprite batcher using this, since geometry shaders are already available on OpenGL 3 hardware (core since 3.2).

Offline theagentd
« Reply #27 - Posted 2012-11-14 17:33:01 »

Quote from theagentd: "He's probably getting better performance since his particles contain less information, most likely just 24 bytes vs my 36 bytes per particle."

Quote from ra4king: "What's in those 24 bytes and 36 bytes?"

CPU version: 12 bytes {2D float position, RGBA byte color}, multithreaded, RAM bandwidth limited (I have some cheap 1600MHz DDR3 RAM), 2.0 million particles.
OpenGL version: 23 bytes (padded to 24) {2D float position, 2D float velocity, RGB byte color, short maxLife, short lifeLeft} stored in 3 textures, OGL 3 only, 3.8 million particles (one GPU).
OpenCL version: 23 bytes (padded to 24) {2D float position, 2D float velocity, RGB byte color, short maxLife, short lifeLeft} stored in 3 VBOs, updated with OpenCL, 1.1 million particles (one GPU), see below.
Transform feedback version: 36 bytes {2D float position, 2D float velocity, RGB float color, float maxLife, float lifeLeft} stored interleaved in a VBO, updated with transform feedback, 3.0 million particles (one GPU), 6 million particles (two GPUs).

It's only possible to output 4 byte floats and ints with transform feedback, so instead of compressing stuff I just converted everything to floats, hence the inflated particle byte size.
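For readers who want to see how the 36 bytes add up, here is a sketch of such an interleaved all-float layout in Java. Only the field list and the 36-byte total come from the post; the field order in memory, the byte offsets and all names are assumptions for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

final class ParticleLayout {
    // position (2) + velocity (2) + RGB color (3) + maxLife (1) + lifeLeft (1) = 9 floats
    static final int FLOATS_PER_PARTICLE = 2 + 2 + 3 + 1 + 1;
    static final int STRIDE_BYTES = FLOATS_PER_PARTICLE * Float.BYTES; // 36

    // Assumed byte offsets of each attribute inside one particle.
    static final int OFFSET_POSITION = 0;
    static final int OFFSET_VELOCITY = 2 * Float.BYTES;
    static final int OFFSET_COLOR = 4 * Float.BYTES;
    static final int OFFSET_MAX_LIFE = 7 * Float.BYTES;
    static final int OFFSET_LIFE_LEFT = 8 * Float.BYTES;

    /** Allocates a direct, native-order buffer with room for `capacity` particles. */
    static FloatBuffer allocate(int capacity) {
        return ByteBuffer.allocateDirect(capacity * STRIDE_BYTES)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
    }

    /** Appends one particle in the interleaved order assumed above. */
    static void put(FloatBuffer buf, float x, float y, float vx, float vy,
                    float r, float g, float b, float maxLife, float lifeLeft) {
        buf.put(x).put(y).put(vx).put(vy).put(r).put(g).put(b).put(maxLife).put(lifeLeft);
    }

    public static void main(String[] args) {
        FloatBuffer buf = allocate(2);
        put(buf, 0f, 0f, 1f, -1f, 1f, 0.5f, 0.25f, 60f, 60f);
        System.out.println("stride = " + STRIDE_BYTES + " bytes, floats written = " + buf.position());
    }
}
```

The stride and offsets are the values one would hand to the vertex attribute setup when binding such a buffer, which is why keeping them as named constants next to the layout tends to pay off.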

WTF IS UP WITH OPENCL AGAIN?! I am getting really tired of how sensitive OpenCL seems to be. On my laptop's GTX 460M, the exact same code performs the same as the GPU version (2.2 million particles). I think it's because the 400 series had extensive hardware changes compared to the 200 series, but I'm really not thrilled to start delving into that stuff again...

Quote from StumpyStrust: "It is viable but hard to implement in a game, from what I understand. The big problem with things such as particle systems/physics is that they can be very computationally intensive. With particle systems, just having 100k particles means that if everything is done on the CPU, the CPU has to calculate the position 100k times, calculate anything else the particle has 100k times, and still needs to send the updated particles to the GPU. With OpenCL you can have the GPU do all the calculations, but you still need to send things to the GPU, which is where my particle system dies. theagentd's suggestion is very nice as you get a prebuilt VBO that you can simply throw at the GPU, meaning the CPU does next to nothing. Also, it is usable in OpenGL 3.0, which is very nice. I am now wondering whether I could make a sprite batcher using this, since geometry shaders are already 3.0."

100k times isn't really that much since you have 4 processors doing 3 billion clock cycles per second. The problem is actually the insane amount of memory bandwidth needed. Just getting two RAM sticks and running them in dual channel gave me a 60% speed boost on a dual-core laptop compared to single channel. Most particles only need some basic math.

I wouldn't recommend a sprite batcher on the GPU. You'd need to run your whole game on the GPU to know HOW to move your sprites around.

Myomyomyo.
Offline ra4king

« Reply #28 - Posted 2012-11-15 02:16:55 »

Quote from theagentd: "Transform feedback version: 36 bytes {2D float position, 2D float velocity, RGB float color, float maxLife, float lifeLeft} stored interleaved in a VBO, updated with transform feedback, 3.0 million particles (one GPU), 6 million particles (two GPUs)."

What's the use for "maxLife"?

Quote from theagentd: "I wouldn't recommend a sprite batcher on the GPU. You'd need to run your whole game on the GPU to know HOW to move your sprites around."

Wouldn't that be awesome? Haha  Grin

Offline theagentd
« Reply #29 - Posted 2012-11-15 12:58:37 »

Quote from ra4king: "What's the use for 'maxLife'?"

Quote from ra4king: "Wouldn't that be awesome? Haha  Grin"

Ah, I couldn't come up with a better name for maxLife. It's just how much life (= how many frames) the particle is supposed to have in total, while life is the amount of life left (= how many more frames it should last). I use it to calculate the alpha: alpha = life / maxLife. I do this on the CPU for the CPU version, hence the 4-byte RGBA color there. When I had life available on the GPU I only needed 3 bytes of color, but I padded it to 4 anyway to gain some performance.
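The fade described above is a one-liner, shown here as a small Java sketch together with the kind of 4-byte RGBA packing the CPU version would need. The clamping, the rounding and the R-G-B-A byte order are assumptions; the post only gives alpha = life / maxLife.

```java
final class ParticleFade {
    /** alpha = life / maxLife, clamped to [0, 1]; returns 0 for a non-positive maxLife. */
    static float alpha(float lifeLeft, float maxLife) {
        if (maxLife <= 0f) {
            return 0f;
        }
        float a = lifeLeft / maxLife;
        return Math.max(0f, Math.min(1f, a));
    }

    /** Packs 8-bit color channels plus a computed alpha into one int (assumed order: R, G, B, A). */
    static int packRgba(int r, int g, int b, float alpha) {
        int a = Math.round(alpha * 255f);
        return ((r & 0xFF) << 24) | ((g & 0xFF) << 16) | ((b & 0xFF) << 8) | (a & 0xFF);
    }

    public static void main(String[] args) {
        float a = alpha(30f, 60f); // halfway through its life
        System.out.println("alpha = " + a + ", packed = 0x"
                + Integer.toHexString(packRgba(255, 128, 0, a)));
    }
}
```

On the GPU versions the same division can happen in the shader, which is why those layouts only need to carry RGB plus the two life values.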

And yeah, a game almost fully run on the GPU would be awesome.

Myomyomyo.