OpenCL being slower than OpenGL
Offline theagentd
« Posted 2012-01-26 10:22:14 »

After cannibalizing SpriteShootoutCL and the sum example, I finally have something that is starting to resemble my particle engine again. I've got my CL commands sorted out, and rendering is working fine. The code is a lot cleaner and shorter than my earlier GPU particle test, which used textures to store the particle data and then drew from them. The next logical step was to use OpenCL instead of a fragment shader to update my particles, which also let me store them in a VBO and process that data directly with OpenCL. Sadly, my OpenCL knowledge is a little bit... well, terrible, so I must be doing something wrong, because I get terrible performance.

In my OpenGL particle simulation I keep the particle data in 3 textures: one RGBA 32-bit float texture for position and velocity, one RGBA8 texture for color, and one RG16 texture for life. This runs at 65 FPS with 2 million particles, with several thousand particles being created every second. My OpenCL version, however, only manages 62 FPS just moving particles around the screen, without any data for color or life. Merely keeping and copying that extra data drops performance to 54 FPS, and that is without any commands to create new particles!

This is the OpenCL program source:
typedef struct Particle {
    float2 pos;
    float2 vel;
    ushort4 colorLife;
   
} Particle;

kernel void update(const int width, const int height, const float gravity, const float airResistance, global const struct Particle* input, global struct Particle* output){
    unsigned int i = get_global_id(0);
   
    Particle p = input[i];
   
    p.pos += p.vel;
   
    if(p.pos.x < 0 && p.vel.x < 0){
        p.vel.x = -p.vel.x;
    }
   
    if(p.pos.y < 0 && p.vel.y < 0){
        p.vel.y = -p.vel.y;
    }
   
    if(p.pos.x >= width && p.vel.x > 0){
        p.vel.x = -p.vel.x;
    }
   
    if(p.pos.y >= height && p.vel.y > 0){
        p.vel.y = -p.vel.y;
    }
   
    output[i] = p;
}


The amount of data copied (24 bytes per particle -> ~46 MB) seems to be the bottleneck, since the actual logic can be commented out without any increase in performance.
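
(Rough arithmetic, assuming the same 2 million particles as in the OpenGL test: 2,000,000 × 24 bytes ≈ 46 MB read plus ~46 MB written per frame, so at ~60 FPS the plain copy alone moves on the order of 5.5 GB/s through global memory.)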

kernel void update(const int width, const int height, const float gravity, const float airResistance, global const struct Particle* input, global struct Particle* output){
    unsigned int i = get_global_id(0);
   
    Particle p = input[i];
    output[i] = p;
}


For reference, here is the update fragment shader I use in the OpenGL version:

#version 330

uniform sampler2D posVelSampler;
uniform sampler2D lifeSampler;

uniform float gravity;
uniform float airResistance;
uniform vec2 screenSize;

in vec2 texPos;

#define POSVEL 0 //RGBA 32-bit float texture
#define LIFE 1   //RG16 texture

layout(location = POSVEL) out vec4 posVelOut;
layout(location = LIFE) out vec2 lifeOut;

void main()
{
    lifeOut = texture(lifeSampler, texPos).xy;
    lifeOut.x -= 1.0/65535;
    //if(lifeOut.x == 0){
    //    discard;
    //    return;
    //}

    posVelOut = texture(posVelSampler, texPos);
    posVelOut.w += gravity;
    posVelOut.zw *= airResistance;
    posVelOut.xy += posVelOut.zw;
   
    if(posVelOut.x < 0 && posVelOut.z < 0){
        posVelOut.z = -posVelOut.z;
    }
   
    if(posVelOut.y < 0 && posVelOut.w < 0){
        posVelOut.w = -posVelOut.w;
    }
   
    if(posVelOut.x > screenSize.x - 1 && posVelOut.z > 0){
        posVelOut.z = -posVelOut.z;
    }
   
    if(posVelOut.y > screenSize.y - 1 && posVelOut.w > 0){
        posVelOut.w = -posVelOut.w;
    }
}


It doesn't make any sense for OpenGL to be faster at pure computation than OpenCL, so I assume I'm doing something wrong.

TL;DR:
Updating particles:
 - OpenGL: 65 FPS
 - OpenCL: 54 FPS
WTF?

Offline Spasi
« Reply #1 - Posted 2012-01-26 12:36:40 »

You are not comparing apples to apples. The fragment shader version uses the GPU's texture sampling hardware, which goes through a cache, while the kernel version uses raw memory. The data structures are different too. A couple of things you could try:

- Split the Particle struct into a pos/vel buffer (float4*) and a color/life buffer (ushort4*).
- Use image objects (read through samplers) instead of raw buffers.
Offline theagentd
« Reply #2 - Posted 2012-01-26 14:04:52 »

But no single value is accessed twice! Why would a cache help in this case? And how in the world do I use samplers?

Offline Spasi
« Reply #3 - Posted 2012-01-26 14:24:33 »

The cache won't make it faster per se, but texture sampling causes the fragment shader to be scheduled and executed differently from the kernel. The memory access pattern is different: in the fragment shader case the GPU is most likely reading and writing in batches of 4x4 fragments. The sampling hardware is designed for that pattern and will probably do a single read/write for each such fragment block.

Check out clCreateSampler and clCreateImage in the OpenCL spec, and also the sampler_t data type. You'll need to use the read_imageX and write_imageX functions in the kernel.
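
For illustration, a minimal kernel-side sketch of that approach, assuming the position/velocity data lives in a 2D RGBA32F image (the host-side clCreateImage2D/clCreateSampler setup is omitted, and all names here are placeholders):

kernel void update(read_only image2d_t posVelIn,    // RGBA32F: xy = pos, zw = vel
                   write_only image2d_t posVelOut,
                   const int texWidth)
{
    // Nearest filtering + unnormalized coordinates = plain array-style access.
    const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                          CLK_ADDRESS_CLAMP_TO_EDGE |
                          CLK_FILTER_NEAREST;

    int i = get_global_id(0);
    int2 coord = (int2)(i % texWidth, i / texWidth);

    float4 pv = read_imagef(posVelIn, smp, coord);
    pv.xy += pv.zw;                                 // integrate position by velocity

    write_imagef(posVelOut, coord, pv);
}
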
Offline theagentd
« Reply #4 - Posted 2012-01-26 16:10:39 »

This doesn't make any sense at all. Why should I need to treat my data as an "image" in OpenCL when I'm just updating particle motion? Isn't there some other way to make it access the data faster? There's a problem with handling it like that too: the data being processed lives in an OpenGL VBO, and I'd prefer to keep it that way since it simplifies rendering a lot.

Sorry, but I really am a n00b at this, so please bear with me! You're the author of the sprite shootout example, right? Thanks for making that! =S

Offline Danny02
« Reply #5 - Posted 2012-01-26 17:31:08 »

Your problem is that you are accessing global memory the wrong way. In most cases you should use a Struct of Arrays (SoA) instead of an Array of Structs (AoS), which is what you have now.

So use something like this:
struct Particles {
    float2*  pos;       //64 bit
    float2*  vel;       //64 bit
    ushort4* colorLife; //64 bit?
};

kernel void update(const int width, const int height, const float gravity, const float airResistance, global const struct Particles input, global struct Particles output){
    unsigned int i = get_global_id(0);
   
    output.pos[i] = input.pos[i];
    output.vel[i] = input.vel[i];
    output.colorLife[i] = input.colorLife[i];
}



explanation following...
Offline Danny02
« Reply #6 - Posted 2012-01-26 17:46:23 »

Beforehand, sorry if I mess up the terminology; I use CUDA at the moment and only know that the OpenCL memory spaces are named a little differently.
So global := device memory, shared := per-multiprocessor memory shared by one processing block.
Also, I only know the best practices for NVIDIA cards, so check for yourself whether these things hold for AMD cards as well.


So let's get our hands dirty ^^
Threads are arranged in little sets called warps, which always have a size of 32. There is a lot about warps that affects how the card works, but here I'm only interested in how they affect global memory reads.

To make it short, there is something called coalesced reads: when you have some data (float a[]), you want to make sure that all threads in a warp read exactly in their order, so t0 reads a[0], t1 reads a[1], ..., t31 reads a[31]. This works for 32-bit reads and on most hardware should also work for 64-bit reads.
Also make sure to use the predefined data types (float2, float3, ...) and not self-built ones.

I hope you can now understand why SoA is better than AoS; there is a small sketch of the two access patterns below. If you have any other questions, just ask.
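
For illustration only, here is a hypothetical pair of copy kernels contrasting the two access patterns (Particle is the 24-byte struct from the first post):

// Coalesced (SoA): thread i reads element i, so consecutive threads in a warp
// touch consecutive 8-byte addresses -> a few large memory transactions.
kernel void copy_soa(global const float2* inPos, global float2* outPos){
    unsigned int i = get_global_id(0);
    outPos[i] = inPos[i];
}

// Uncoalesced (AoS): thread i starts its read at byte offset i * 24, so a warp's
// reads straddle many aligned segments -> many extra memory transactions.
kernel void copy_aos(global const Particle* in, global Particle* out){
    unsigned int i = get_global_id(0);
    out[i] = in[i];
}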



Offline theagentd
« Reply #7 - Posted 2012-01-26 18:36:15 »

I have 24 bytes of data per particle: position (2 floats), velocity (2 floats), color (4 bytes) and life (2 shorts). For some reason it was faster to combine color and life into a ushort4 instead of having an int and a ushort2, so that's why they are combined in the kernel I posted.

I can't define my kernel to take a Particles struct like that. It either has to be a pointer (Particles*) or not be qualified global. >_> Anyway, what you mean is that I should separate the data into a different buffer for each parameter, so the easiest way to get it right is to get rid of the struct altogether and keep a float2* each for position and velocity, a uchar4* or an int* for color and a ushort2* for life. The reason I wanted the data in a single buffer is that it's easier and faster to draw it with OpenGL later...

Sorry, but I don't really understand why it's better to keep structs of arrays instead of arrays of structs...

Offline Riven
« Reply #8 - Posted 2012-01-26 18:39:49 »

Quote from theagentd:
The reason I wanted the data in a single buffer is that it's easier and faster to draw it with OpenGL later...

OpenGL also supports AoS and SoA, so rendering should be the same (just different strides and offsets).
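
For what it's worth, a minimal sketch of the two layouts on the GL side (plain C-style GL calls, which the LWJGL bindings mirror almost one-to-one; the attribute indices are placeholders, and the AoS offsets assume the 24-byte Particle layout from the first post with color packed at byte 16):

/* SoA: one tightly packed VBO per attribute, stride 0 */
glBindBuffer(GL_ARRAY_BUFFER, posVbo);
glVertexAttribPointer(0, 2, GL_FLOAT, GL_FALSE, 0, (void*)0);           /* position */
glBindBuffer(GL_ARRAY_BUFFER, colorVbo);
glVertexAttribPointer(1, 4, GL_UNSIGNED_BYTE, GL_TRUE, 0, (void*)0);    /* color    */

/* AoS: one interleaved VBO, 24-byte stride, per-attribute offsets */
glBindBuffer(GL_ARRAY_BUFFER, particleVbo);
glVertexAttribPointer(0, 2, GL_FLOAT, GL_FALSE, 24, (void*)0);          /* position at offset 0 */
glVertexAttribPointer(1, 4, GL_UNSIGNED_BYTE, GL_TRUE, 24, (void*)16);  /* color at offset 16   */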

Offline Danny02
« Reply #9 - Posted 2012-01-27 00:28:18 »

When you are doing coalesced global memory reads/writes (which is what you get with SoA but not with AoS), the hardware needs to shift around a lot less memory than when your reads/writes aren't coalesced.

Think of it like this: when you ask for a float, the hardware won't read exactly one float from global memory; instead it always reads a whole aligned chunk of memory. On old hardware this chunk was half the size of a warp, and on newer hardware it is exactly the size of a warp.

Offline theagentd
« Reply #10 - Posted 2012-01-27 09:50:36 »

Quote from theagentd:
The reason I wanted the data in a single buffer is that it's easier and faster to draw it with OpenGL later...

Quote from Riven:
OpenGL also supports AoS and SoA, so rendering should be the same (just different strides and offsets).

Yeah, it's not a big deal. I think it's slightly faster to keep the data in the same buffer, but that doesn't matter if OpenCL is faster with separate buffers.

Quote from Danny02:
When you are doing coalesced global memory reads/writes (which is what you get with SoA but not with AoS), the hardware needs to shift around a lot less memory than when your reads/writes aren't coalesced.

Think of it like this: when you ask for a float, the hardware won't read exactly one float from global memory; instead it always reads a whole aligned chunk of memory. On old hardware this chunk was half the size of a warp, and on newer hardware it is exactly the size of a warp.

It seems to be difficult to have a struct with a variable size inside the kernel (at least for me >_<), so would I get the same effect by just keeping 4 different input and output buffers?


EDIT:
kernel void update(
      const int width, const int height, const float gravity, const float airResistance,
      global const float2* inPos, global const float2* inVel, global const uchar4* inCol, global const ushort2* inLife,
      global float2* outPos, global float2* outVel, global uchar4* outCol, global ushort2* outLife
){
    unsigned int i = get_global_id(0);
   
    float2 pos = inPos[i];
    float2 vel = inVel[i];
    uchar4 col = inCol[i];
    ushort2 life = inLife[i];
   
    pos += vel;
   
    if(pos.x < 0 && vel.x < 0){
        vel.x = -vel.x;
    }
   
    if(pos.y < 0 && vel.y < 0){
        vel.y = -vel.y;
    }
   
    if(pos.x >= width && vel.x > 0){
        vel.x = -vel.x;
    }
   
    if(pos.y >= height && vel.y > 0){
        vel.y = -vel.y;
    }
   
    outPos[i] = pos;
    outVel[i] = vel;
    outCol[i] = col;
    outLife[i] = life;
}

62 FPS. It's almost as fast as the OpenGL version now... >_>

Offline Danny02
« Reply #11 - Posted 2012-01-27 12:06:41 »

Take a look at http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/OpenCL_Best_Practices_Guide.pdf, section 3.2.1.
Offline Spasi
« Reply #12 - Posted 2012-01-27 12:17:15 »

Quote from theagentd:
It seems to be difficult to have a struct with a variable size inside the kernel (at least for me >_<), so would I get the same effect by just keeping 4 different input and output buffers?

It should be fine to use a single float4 buffer for position and velocity.
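
A minimal sketch of that variant, reusing the bounce logic from the first kernel (xy = position, zw = velocity):

kernel void update(const int width, const int height,
                   global const float4* inPosVel, global float4* outPosVel){
    unsigned int i = get_global_id(0);

    float4 pv = inPosVel[i];        // one coalesced 16-byte read per particle
    pv.xy += pv.zw;

    if(pv.x < 0 && pv.z < 0)        pv.z = -pv.z;
    if(pv.y < 0 && pv.w < 0)        pv.w = -pv.w;
    if(pv.x >= width && pv.z > 0)   pv.z = -pv.z;
    if(pv.y >= height && pv.w > 0)  pv.w = -pv.w;

    outPosVel[i] = pv;
}
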
Offline theagentd
« Reply #13 - Posted 2012-01-27 15:59:25 »

Sorry, but I still feel really stupid here. Using structs, I was accessing memory in 24-byte chunks from one large buffer in global memory. Now I am accessing the same total amount of data from 4 different buffers and it's suddenly noticeably faster? Please correct me if I'm wrong, but the problem was that loading a 24-byte struct caused the memory transactions to not be aligned to 32 bytes? And by separating the data into different buffers, the individual parts of the old struct could be aligned to 32 bytes, which results in a single transaction?

Quote from theagentd:
It seems to be difficult to have a struct with a variable size inside the kernel (at least for me >_<), so would I get the same effect by just keeping 4 different input and output buffers?

Quote from Spasi:
It should be fine to use a single float4 buffer for position and velocity.
Keeping the position and velocity separate lets me get rid of the stride when rendering the particles with OpenGL later, but working with OpenCL has made me question everything I knew about memory and bandwidth... I dunno...


I recently found a particle system test that runs with OpenGL and CUDA. Compared to the dissected SpriteShootoutCL example I have (which just renders points without textures), the performance was identical within ±1 FPS at around 60 FPS with the same particle count. Now the really weird thing is that the performance of my OpenGL GPU particle test is almost exactly the same. That's insanely weird!
 - SpriteShootoutCL uses a float4 per particle. That's 16 bytes per particle.
 - I assume the CUDA implementation uses the same amount of data per particle since it does not feature individually colored particles and the particles don't have a lifetime (they are permanent just like in SSCL).
 - My OpenGL version uses 20 bytes when updating particles plus another 4 bytes for color (which are not copied during updating). The updating shader therefore uses more memory bandwidth but still achieves the same performance as the other two. HOW?! Why is OpenGL, with the overhead of rendering each particle with a geometry shader, with more data per particle, and with that data kept in textures, as fast as or faster than OpenCL or CUDA?

This isn't really an apples to apples comparison though, since the OpenGL version also creates new particles at a constant rate, so the OpenCL and CUDA versions actually have less work to do. Right now I'm working on creating an identical OpenCL version of my OpenGL particle simulation (with "living" particles). We'll see how it goes...

Offline Danny02
« Reply #14 - Posted 2012-01-27 16:44:49 »

Some more thoughts:
OpenGL is faster 'because' it is using textures and your OpenCL implementation is slower 'because' of AoS.

Remember that you can read from global memory in a coalesced (C) and in an uncoalesced (UC) way. I'll try to explain both a little better in a moment, but first compare the two with texture fetches.

From fast to slow (the numbers aren't based on facts, just assumptions):
  • coalesced global memory reads: ~100%
  • texture memory reads: ~80%
  • uncoalesced global memory reads: ~5%

So why are texture reads faster than UC but slower than C reads? First of all, your GPU has an extra cache for texture data, so you can randomly sample your texture with only a very small performance hit. Also, in some cases the Z-order memory layout of texture memory can give you some speedup because of relative memory locality.

C reads are the fastest way to read from global memory just because you have zero overhead.


Take a look at your code and think about it like this: every line gets executed simultaneously across the warp. Then look at which memory addresses you are accessing. Assuming the 24-byte AoS layout, for just your first 4 threads:
thread 1 accesses bytes 0-23
thread 2: 24-47
thread 3: 48-71
thread 4: 72-95

If your hardware has 32-byte memory banks, the threads touch the following banks:
thread 1: bank 0
thread 2: banks 0, 1
thread 3: banks 1, 2
thread 4: bank 2
With this approach we have only one read instruction in the code, but it touches 3 different memory banks, and the hardware has to do 5 reads in total because the memory accesses overlap.

Now take the SoA approach with three 8-byte arrays:
Buffer 1 - thread 1: 0-7, thread 2: 8-15, thread 3: 16-23, thread 4: 24-31
Buffer 2 - thread 1: 0-7, thread 2: 8-15, thread 3: 16-23, thread 4: 24-31
Buffer 3 - thread 1: 0-7, thread 2: 8-15, thread 3: 16-23, thread 4: 24-31

As you can see, we need 3 read instructions, but each one touches only 1 memory bank, so we have a total of 3 memory reads.


As you can see, an AoS approach needs far more memory transactions, and this applies to both reads and writes.

And another thing: AoS uses one big memory access per thread, while SoA uses several smaller ones. This affects the GPU's thread scheduling. Having more, smaller tasks lets the GPU manage its threads better: when one warp blocks because it is waiting for memory, the GPU can swap another warp into the processing stage. If every warp is waiting for memory because the memory transactions are so big, there is nothing left for the GPU to do but wait ^^

Offline theagentd
« Reply #15 - Posted 2012-01-27 16:57:34 »

Awesome explanation! That completely cleared it up for me, though it doesn't explain the performance I'm actually getting.

EDIT: Thanks for all the help, guys!!! I just got my OpenCL particle test working, and it very slightly outperformed the OpenGL version (54 vs 55 FPS at 3,072,000 particles). It works fine, but I was hoping for a bigger speed increase from getting rid of the overhead of textures and everything; I guess this just proves that OpenGL is insanely good at utilizing the available GPU resources. I believe I've squeezed out all the speed I can from my dear (not) little laptop's GPU now, so I'll just get back to actually being productive. Thanks again for all the help with memory handling! We'll see if I get tempted to implement something else in OpenCL later. Hmmm, bone interpolation perhaps? =D

Offline gimbal

« Reply #16 - Posted 2012-01-30 13:56:23 »

Quote from theagentd:
I guess this just proves that OpenGL is insanely good at utilizing the available GPU resources.

OpenGL is a specification, not something that has performance metrics. The driver you have installed is very good at utilizing GPU resources, which I don't find too amazing: it's one of its primary functions nowadays, if you believe the changelogs of both NVIDIA and AMD drivers.