Some more thoughts:
OpenGL is faster 'because' it is using textures and your OpenCL implementation is slower 'because' of AoS.
Remember that you can read from global memory in a coalesced (C) or an uncoalesced (UC) way. I'll try to explain both a bit better in a moment, but first compare those two with texture fetches.
Relative read speed, from fast to slow (these numbers aren't based on facts, just assumptions):
- coalesced global memory reads: ~100%
- texture fetches: ~80%
- uncoalesced global memory reads: ~5%
So why are texture reads faster than UC reads but slower than C reads? First of all, your GPU has an extra cache for texture data, so you can sample your texture randomly with only a very small performance hit. Also, in some cases the Z-order memory layout of texture memory can give you a speedup because of memory locality.
C reads are the fastest way to read from global memory just because you have zero overhead.
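To make the Z-order locality point a bit more concrete, here is a small sketch. It assumes a plain Morton (bit-interleaved) layout, which is only an illustration: the real texture layout is hardware specific and usually not documented. The names `morton_index` and `row_major_index` are made up for this example.

```python
def morton_index(x, y, bits=16):
    """Interleave the bits of x and y (x goes to the even bit positions)."""
    index = 0
    for i in range(bits):
        index |= ((x >> i) & 1) << (2 * i)
        index |= ((y >> i) & 1) << (2 * i + 1)
    return index


def row_major_index(x, y, width):
    """Ordinary linear layout: one full row after the other."""
    return y * width + x


# A 2x2 block of texels inside a 1024-texel-wide texture (sizes are assumed).
block = [(2, 2), (3, 2), (2, 3), (3, 3)]
morton = [morton_index(x, y) for x, y in block]
linear = [row_major_index(x, y, 1024) for x, y in block]

print("Z-order:  ", morton)  # [12, 13, 14, 15]
print("row-major:", linear)  # [2050, 2051, 3074, 3075]
```

In the Z-order layout the four neighbouring texels sit next to each other in memory, while in the row-major layout the second row is a whole texture row away. That is the kind of locality a texture cache can benefit from.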
When you take a look at your code, think about it like this: every line gets executed simultaneously by all threads. Then take a look at which memory addresses you are accessing. Assuming a 24-byte AoS element, for only your first 4 threads:
thread 1: 0-23
thread 2: 24-47
thread 3: 48-71
thread 4: 72-95
If your hardware has 32-byte memory banks, your threads access the following banks:
1: 0
2: 0,1
3: 1, 2
4: 2
With this approach we issue only 1 read but access 3 different memory banks. Also, because the accesses straddle bank boundaries, the per-thread accesses add up to 6 bank accesses in total (1 + 2 + 2 + 1).
Now we take an SoA approach with 3 buffers of 8-byte elements:
1: 0-7, 2: 8-15, 3: 16-23, 4: 24-31 Buffer1
1: 0-7, 2: 8-15, 3: 16-23, 4: 24-31 Buffer2
1: 0-7, 2: 8-15, 3: 16-23, 4: 24-31 Buffer3
As you can see, we need 3 reads but only access 1 memory bank in each read, so we have a total of 3 memory reads.
So an AoS approach needs far more memory interactions. This applies to both reads and writes.
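If you want to play with the numbers, the bank arithmetic from the example above can be reproduced with a few lines of Python. This is just a sketch of the simplified model used in this post (32-byte banks, 4 threads, thread t reads element t), not a model of any real GPU. The function names are made up for illustration, and threads are numbered from 0 here instead of 1.

```python
BANK_SIZE = 32  # bytes per memory bank, the figure assumed in this post


def banks_touched(start, size, bank_size=BANK_SIZE):
    """Return the indices of all banks overlapped by bytes [start, start + size)."""
    first = start // bank_size
    last = (start + size - 1) // bank_size
    return list(range(first, last + 1))


def access_pattern(num_threads, element_size):
    """Banks touched per thread when thread t reads element t of one buffer."""
    return [banks_touched(t * element_size, element_size)
            for t in range(num_threads)]


def summarize(pattern):
    """Return (distinct banks, summed per-thread bank accesses) for one read."""
    distinct = len({bank for banks in pattern for bank in banks})
    total = sum(len(banks) for banks in pattern)
    return distinct, total


aos = access_pattern(4, 24)  # one read of a 24-byte struct per thread
soa = access_pattern(4, 8)   # one read of an 8-byte field; same for each of the 3 buffers

aos_summary = summarize(aos)
soa_summary = summarize(soa)

print("AoS:", aos, aos_summary)             # [[0], [0, 1], [1, 2], [2]] (3, 6)
print("SoA per buffer:", soa, soa_summary)  # [[0], [0], [0], [0]] (1, 4)
```

In the AoS case two of the four threads straddle a bank boundary, while in the SoA case every read stays inside a single bank. Changing the thread count or element sizes shows how the pattern behaves for other layouts.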
And another thing: AoS uses one big memory access and SoA several smaller ones. This affects the thread scheduling of the GPU. Having more, smaller tasks allows the GPU to manage the threads better, so when one warp is blocked because it is waiting for memory, the GPU can swap in another warp to execute in the meantime. If every warp is waiting for memory because the memory instructions are so big, there is nothing else for the GPU to do but wait ^^