z-sorting versus z-buffering
Offline Mads

« Posted 2012-01-24 04:30:11 »

Just saw a tweet by Notch:
"z sorting lots of sprites is slow. z buffers are faster. z sorting when you have space partitioning is even faster as the buckets are tiny."

When I need depth layering in a landscape like Minicraft or Pokemon, I've always just sorted a list every frame. Maybe that's not what the tweet is saying, but isn't that the fastest way? What are z-buffers, and how do we use them?

Does anyone mind elaborating on this very sparsely explained tweet?

Offline theagentd
« Reply #1 - Posted 2012-01-24 05:20:08 »

Z-buffering is a technique used in 3D rendering which allows you to render your stuff in any order you want, and only the stuff closest to the screen will be visible. It works by keeping a z-value for each pixel; every time a polygon covers a pixel a "depth test" is performed, and the pixel is only updated if the new sample is closer to the screen than the sample already stored there.
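A minimal sketch of that per-pixel depth test in plain Java (assuming a software framebuffer where a smaller z means closer to the screen):

// Only write the pixel if the new sample is closer than what is already stored there.
static void plot(int x, int y, float z, int color, int[] colorBuffer, float[] depthBuffer, int width) {
    int index = y * width + x;
    if (z < depthBuffer[index]) {  // the depth test
        depthBuffer[index] = z;    // update the stored depth
        colorBuffer[index] = color;
    }
}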

Offline sproingie

« Reply #2 - Posted 2012-01-24 05:25:43 »

He's talking about using the depth buffer ("z-buffer" is an archaic term for the depth buffer) vs using the painter's algorithm and manually painting the furthest objects first, closest last. For fully opaque geometry there's simply no question: the depth buffer will run circles around everything else.

Sprites, however, are rarely fully opaque. The problem comes with blending, which doesn't interact well with depth testing and for many blend modes requires z-sorting to look right (particle systems are the big culprit here).

There are also advanced techniques (over my head) that use the depth buffer to render "soft particles" (read: fire and smoke), but I don't think he was talking about that, and you can't really get away with z-sorting every smoke particle anyway.
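A rough sketch of that split in LWJGL-style OpenGL calls (the sprite helpers and the blendedSprites field are placeholders, not real API):

import org.lwjgl.opengl.GL11;

static void renderScene() {
    // Opaque geometry: draw in any order, the depth buffer resolves visibility per pixel.
    GL11.glEnable(GL11.GL_DEPTH_TEST);
    GL11.glDepthMask(true);
    drawOpaqueSprites();                       // placeholder helper

    // Blended sprites: keep the depth *test* so they still hide behind opaque geometry,
    // but disable depth *writes* and draw them back-to-front (painter's algorithm).
    GL11.glEnable(GL11.GL_BLEND);
    GL11.glBlendFunc(GL11.GL_SRC_ALPHA, GL11.GL_ONE_MINUS_SRC_ALPHA);
    GL11.glDepthMask(false);
    sortBackToFront(blendedSprites);           // placeholder helper
    drawBlendedSprites();                      // placeholder helper
    GL11.glDepthMask(true);
}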

Offline Danny02
« Reply #3 - Posted 2012-01-24 10:35:03 »

When you have static sprites you can presort them quite easily.
If your sprites are, for example, laid out only in 2D, there are only 8 possible ways of sorting them, and you can even halve that number because of symmetry. In 3D you would have 32 => 32/2 = 16 sorting orders.

Also, when the sorting order changes very slowly, presorting can be faster than anything else. Remember that you won't have to resort them unless the camera changes.
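A sketch of what that precomputation could look like, assuming a Sprite class with float x/y fields (the view-dependent sort key is only an illustration):

import java.util.Arrays;
import java.util.Comparator;

// Precompute the few possible 2D orders once (x and y direction flipped independently),
// then pick the matching array per frame instead of re-sorting.
static Sprite[][] precomputeOrders(Sprite[] sprites) {
    Sprite[][] orders = new Sprite[4][];
    for (int i = 0; i < 4; i++) {
        final int sx = (i & 1) == 0 ? 1 : -1;
        final int sy = (i & 2) == 0 ? 1 : -1;
        Sprite[] copy = sprites.clone();
        Arrays.sort(copy, new Comparator<Sprite>() {
            public int compare(Sprite a, Sprite b) {
                return Float.compare(sx * a.x + sy * a.y, sx * b.x + sy * b.y);
            }
        });
        orders[i] = copy;
    }
    return orders;
}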
Offline theagentd
« Reply #4 - Posted 2012-01-24 12:46:27 »

Maybe you could just do a single (or a few) iterations of bubble sort each frame and hope that it stays sorted well enough?
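Something along those lines, assuming a Particle class with a depth field: one pass per frame, relying on the order changing only slightly between frames.

// A single bubble pass; nearly-sorted input from the previous frame means a few
// passes keep the array close to fully back-to-front sorted.
static void bubblePass(Particle[] particles) {
    for (int i = 0; i < particles.length - 1; i++) {
        if (particles[i].depth < particles[i + 1].depth) { // larger depth should come first
            Particle tmp = particles[i];
            particles[i] = particles[i + 1];
            particles[i + 1] = tmp;
        }
    }
}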

Offline Danny02
« Reply #5 - Posted 2012-01-24 13:41:17 »

I'm doing a lot of GPU sorting at the moment.

On GPUs radix sort is the fastest way to sort; there are plenty of open-source implementations out there for CUDA and OpenCL (e.g. libCL). You can sort around 500 million elements per second.

Also, here is what I like to do with particles: when I have a lot of particle emitters in my world, I normally get a bunch of particle chunks. If you have, say, some fireplaces, you can sort all the particles hierarchically: each fire emitter has a bounding volume, so you sort the volumes first and then the particles inside each volume. When some volumes intersect you can merge-sort them together.

PS: merge sort can be parallelized like bubble sort when the merging step is done with pivot buckets
Offline theagentd
« Reply #6 - Posted 2012-01-24 14:32:39 »

Quote from: Danny02
I'm doing a lot of GPU sorting at the moment.

On GPUs radix sort is the fastest way to sort; there are plenty of open-source implementations out there for CUDA and OpenCL (e.g. libCL). You can sort around 500 million elements per second.

Also, here is what I like to do with particles: when I have a lot of particle emitters in my world, I normally get a bunch of particle chunks. If you have, say, some fireplaces, you can sort all the particles hierarchically: each fire emitter has a bounding volume, so you sort the volumes first and then the particles inside each volume. When some volumes intersect you can merge-sort them together.

PS: merge sort can be parallelized like bubble sort when the merging step is done with pivot buckets

Very interesting! I'll definitely try to implement a radix sort in OpenCL some time! 500 million elements per second is probably enough, but that mostly depends on whether that's a high-end card or a performance card. I mean, with 2 million particles I'd need to sort 120 million particles per second at 60 FPS, which is about 1/4th of what the GPU can sort if it works solely on sorting particles (right?). I also need to render them, which obviously needs some CPU time too. Since I can update and render my particles in about 14-15 ms I only have 1-2 ms left over for sorting. If radix sort scales linearly with the number of elements, as the Wiki article says, then using about 1/4th of the GPU's sorting throughput means sorting would take roughly 1/4th of each frame = 1/4 * 16.666 ms ≈ 4 ms. By these numbers, getting 2 million sorted particles running at 60 FPS seems impossible on my laptop, but I'd still like to try. xD

You also mentioned a parallel merge sort. I've written a single-threaded CPU merge sort, but how in the world can it be implemented with threads/OpenCL? What are pivot buckets?

Offline Riven
« Reply #7 - Posted 2012-01-24 14:46:38 »

Merge sort is trivial to multithread (although I don't know anything about OpenCL).

You just divide your set into N parts, sort them concurrently (using any sorting algorithm) and push the results into a queue.
Another (set of) thread(s) pops off pairs of sets, merges the ordered sets and pushes the result back into the queue.

In the end, you have 1 ordered set left.
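A compact sketch of that scheme on int[] data with 4 parts (Java 8 lambdas used for brevity; here the pairwise merges simply run on the calling thread):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Split into 4 parts, sort each part on its own thread, then merge pairs.
static int[] parallelMergeSort(final int[] data) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    List<Future<int[]>> parts = new ArrayList<Future<int[]>>();
    for (int i = 0; i < 4; i++) {
        final int from = data.length / 4 * i;
        final int to = (i == 3) ? data.length : data.length / 4 * (i + 1);
        parts.add(pool.submit(() -> {
            int[] copy = Arrays.copyOfRange(data, from, to);
            Arrays.sort(copy);              // any sorting algorithm works here
            return copy;
        }));
    }
    int[] left  = merge(parts.get(0).get(), parts.get(1).get());
    int[] right = merge(parts.get(2).get(), parts.get(3).get());
    pool.shutdown();
    return merge(left, right);              // the final, single-threaded merge
}

// Plain two-way merge of two already-sorted arrays.
static int[] merge(int[] a, int[] b) {
    int[] out = new int[a.length + b.length];
    int i = 0, j = 0, k = 0;
    while (i < a.length && j < b.length) out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < a.length) out[k++] = a[i++];
    while (j < b.length) out[k++] = b[j++];
    return out;
}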

Offline theagentd
« Reply #8 - Posted 2012-01-24 14:57:34 »

Yes, I see how I can merge different pairs of sets in parallel, but at the end I just have two huge sets to merge. How do I merge them using multiple threads?

Offline Riven
« Reply #9 - Posted 2012-01-24 15:09:37 »

The last step is so fast that it doesn't really matter that only 1 thread is merging them.

My reasonably informed guess is that memory is the bottleneck anyway. Throwing a bunch of threads at it won't solve that problem.

Offline theagentd
« Reply #10 - Posted 2012-01-24 15:47:56 »

Quote from: Riven
The last step is so fast that it doesn't really matter that only 1 thread is merging them.

My reasonably informed guess is that memory is the bottleneck anyway. Throwing a bunch of threads at it won't solve that problem.
If I have 2048 stream processors to keep busy, the last step is sure as hell a bottleneck. Memory bottlenecks can be improved by threading too if your CPU has hyperthreading. I've gotten a slightly OVER 2.0x scaling in memory limited programs on a dual-core! =D It should be because even if one thread stalls because of a slow memory fetch from RAM it can work on the other one.

I've got a radix sort running. How do I thread it? =D

Offline Danny02
« Reply #11 - Posted 2012-01-24 15:51:36 »

Say you have a merge algorithm that merges two sorted groups into one with a single thread.

If you now want to merge two big groups with more than one thread, you have to partition the two groups into as many sub-groups as there are threads you want to merge with.

Then you merge these new pairs of sub-groups with your normal merge algorithm.


Little example, hope I can explain it.

group1:
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31

group2:
2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32

In this example I want to merge with 4 threads, so I have to split up each group into 4 sub-groups.

I do it like this: first, from each group I pick 3 values which partition that group into equal parts:
group1: 7 15 23
group2: 8 16 24

Now I sort these 6 values and get 7 8 15 16 23 24.

Now I want to merge the 1st sub-group of group1 with all elements of group2 which are smaller than 7.
I know that those values from group2 will be in the first sub-group of group2, because of the sorting of the partition values. So with a binary search I get the index of the last value in that sub-group which is smaller than 7, merge those two sub-groups, and put the result at the beginning of the output.

I do this with each sub-group of group1 and copy the remaining elements from group2 to the output.

To find the sub-group of group2 in which to search for a border value from the partitioning of group1, one can just subtract the border value's index within group1's borders from its index in the sorted list of all border values.

E.g. the border 15 from group1 has index 1 among group1's borders, but in the sorted list its index is 2. So 2 - 1 = 1 tells us that the biggest value of group2 which is smaller than 15 will be found in the 2nd sub-group of group2.


This should scale very well, because the overhead from sorting the border values and doing the binary searches is very small, since the number of values to sort/search is very small.

When doing a merge sort the number of groups halves with each merge, so one would double the number of sub-groups for each merge to keep all threads busy.
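A sketch of the same pivot-bucket idea in Java, written a bit differently: split sorted array 'a' into equal slices, binary-search each slice border in sorted array 'b', and merge the matching slice pairs independently (the output offsets follow directly from the slice sizes):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

static int[] parallelMerge(final int[] a, final int[] b, int threads) throws Exception {
    final int[] out = new int[a.length + b.length];
    // Slice borders: a[aFrom[k]..aFrom[k+1]) is merged with b[bFrom[k]..bFrom[k+1]).
    final int[] aFrom = new int[threads + 1];
    final int[] bFrom = new int[threads + 1];
    for (int k = 0; k <= threads; k++) {
        aFrom[k] = (int) ((long) a.length * k / threads);
        bFrom[k] = (k == 0) ? 0 : (k == threads) ? b.length
                 : lowerBound(b, a[aFrom[k]]);        // first b-element >= the border value
    }
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    Future<?>[] jobs = new Future<?>[threads];
    for (int k = 0; k < threads; k++) {
        final int k0 = k;
        jobs[k] = pool.submit(() -> {
            // Each job merges one slice pair into its own region of the output.
            int i = aFrom[k0], j = bFrom[k0], o = aFrom[k0] + bFrom[k0];
            while (i < aFrom[k0 + 1] && j < bFrom[k0 + 1])
                out[o++] = (a[i] <= b[j]) ? a[i++] : b[j++];
            while (i < aFrom[k0 + 1]) out[o++] = a[i++];
            while (j < bFrom[k0 + 1]) out[o++] = b[j++];
        });
    }
    for (Future<?> job : jobs) job.get();
    pool.shutdown();
    return out;
}

// Index of the first element in the sorted array that is >= key (binary search).
static int lowerBound(int[] array, int key) {
    int lo = 0, hi = array.length;
    while (lo < hi) {
        int mid = (lo + hi) >>> 1;
        if (array[mid] < key) lo = mid + 1; else hi = mid;
    }
    return lo;
}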
Offline theagentd
« Reply #12 - Posted 2012-01-24 16:04:08 »

@Danny02
Awesome! I'll need some time to let this sink in though... I think I understand some of it. Maybe.

Offline Riven
« Reply #13 - Posted 2012-01-24 18:08:53 »

Quote from: Danny02
This should scale very well, because the overhead from sorting the border values and doing the binary searches is very small, since the number of values to sort/search is very small.

The problem is that many advanced sorting algorithms become less effective on smaller sets. If you're really dealing with sets of size 6, you probably want bubble sort, even though it's O(n²).

Offline Riven
« Reply #14 - Posted 2012-01-24 18:11:51 »

Quote from: Riven
The last step is so fast that it doesn't really matter that only 1 thread is merging them.

My reasonably informed guess is that memory is the bottleneck anyway. Throwing a bunch of threads at it won't solve that problem.

Quote from: theagentd
If I have 2048 stream processors to keep busy, the last step is sure as hell a bottleneck. Memory bottlenecks can be improved by threading too if your CPU has hyperthreading. I've gotten a slightly OVER 2.0x scaling in memory limited programs on a dual-core! =D It should be because even if one thread stalls because of a slow memory fetch from RAM it can work on the other one.

I've got a radix sort running. How do I thread it? =D
Maybe it wasn't clear enough, but I was speaking about a CPU solution. :)

Offline Danny02
« Reply #15 - Posted 2012-01-24 18:24:42 »

Of course one would tackle such small problem sizes like 6 with something like an insertion sort, a bitonic sort, or a perfect sorting network.

On the other hand, one would of course only use the parallel merge for bigger problems, because its only purpose is to let you sort a very large set by effectively splitting the big problem into smaller chunks and keeping everything busy while doing it.

Radix sort also depends on some way of separating the problem into buckets, which makes it suitable for parallel sorting as well.

PS: I don't have exact values, but one wouldn't sort sets smaller than a few 1000 elements on a GPU or in parallel anyway :)
Offline theagentd
« Reply #16 - Posted 2012-01-25 03:48:51 »

Quote from: Danny02
This should scale very well, because the overhead from sorting the border values and doing the binary searches is very small, since the number of values to sort/search is very small.

Quote from: Riven
The problem is that many advanced sorting algorithms become less effective on smaller sets. If you're really dealing with sets of size 6, you probably want bubble sort, even though it's O(n²).

Well, that's a lot better than sorting getting less effective the more elements you have. ;)

Online princec

« Reply #17 - Posted 2012-01-25 09:34:25 »

If you're sorting something with less than a few hundred elements in it why worry about the performance anyway...

FWIW Radixsort for the win. Don't bother threading it, it's already fast as hell.

Cas :)
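For reference, a minimal LSD radix sort over int keys looks roughly like this (a generic sketch of the technique, not the implementation Cas is using):

// Four passes of 8 bits each; the sign bit is flipped so negative keys order correctly.
static void radixSort(int[] data) {
    int[] src = data;
    int[] dst = new int[data.length];
    for (int shift = 0; shift < 32; shift += 8) {
        int[] count = new int[257];
        // Histogram of the current byte.
        for (int value : src) {
            count[(((value ^ 0x80000000) >>> shift) & 0xFF) + 1]++;
        }
        // Prefix sums turn the counts into start offsets.
        for (int i = 1; i < count.length; i++) {
            count[i] += count[i - 1];
        }
        // Stable scatter into the destination buffer in bucket order.
        for (int value : src) {
            dst[count[((value ^ 0x80000000) >>> shift) & 0xFF]++] = value;
        }
        int[] tmp = src; src = dst; dst = tmp; // ping-pong the buffers
    }
    // After an even number of passes 'data' itself holds the sorted result again.
}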

Offline theagentd
« Reply #18 - Posted 2012-01-25 09:42:02 »

Quote from: princec
If you're sorting something with less than a few hundred elements in it why worry about the performance anyway...

FWIW Radixsort for the win. Don't bother threading it, it's already fast as hell.

Cas :)
Inb4 you realize I want to sort on the GPU... I'd prefer to use all of my 192 stream processors, not just 1. The Radeon HD 7970 even has 2048 stream processors... =3

Online princec

« Reply #19 - Posted 2012-01-25 10:00:00 »

Why not use them for something else?

... and how come you've got so much stuff to sort anyway?

Cas :)

Online princec

« Reply #20 - Posted 2012-01-25 10:01:32 »

See this for an MT RadixSort and see how you get on: http://www.codercorner.com/blog/?p=90

Cas :)

Offline Roquen
« Reply #21 - Posted 2012-01-25 10:14:25 »

Since particles have a high temporal coherence of sorted order between frames (as long as the camera's not teleporting around), just a single pass of bubble sort per execution unit seems more than reasonable, where on every odd pass the chunks are offset by 50% so the borders between chunks get sorted too.
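A sketch of that scheme, assuming a per-particle depth array and an index array that carries the order from frame to frame (the chunk layout is made up for illustration):

// One bubble pass per chunk; on odd passes the chunk grid is shifted by half a chunk
// so elements can also move across chunk borders. Each chunk could be one execution unit.
static void chunkedBubblePass(float[] depth, int[] order, int chunkSize, boolean oddPass) {
    int offset = oddPass ? chunkSize / 2 : 0;
    for (int start = offset; start < order.length; start += chunkSize) {
        int end = Math.min(start + chunkSize, order.length);
        for (int i = start; i < end - 1; i++) {
            if (depth[order[i]] < depth[order[i + 1]]) { // back-to-front: larger depth first
                int tmp = order[i];
                order[i] = order[i + 1];
                order[i + 1] = tmp;
            }
        }
    }
}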
Online princec

« Reply #22 - Posted 2012-01-25 11:20:01 »

The radix sort implementation I'm using seems pretty good at dealing with temporal coherence, but I don't know if it's more or less efficient than a bubble sort. Then again I do more than just particles - the whole sprite engine goes through it and sorts on about 8 different factors. I think the fact that it sorts, hm, 5000-odd sprites on 8 variables in something like a millisecond means it's fast enough for my purposes though :)

Cas :)

Offline Riven
« Reply #23 - Posted 2012-01-27 14:24:31 »

In Java 7 the dual-pivot quicksort algorithm has become the default implementation of Arrays.sort(..).


Having said that, I gave CPU sorting a try on this massive dataset:
data = new int[256 * 1024 * 1024]; // 1GB RAM
r.setSeed(1337L);
for (int i = 0; i < data.length; i++) {
   data[i] = r.nextInt();
}


Results:
Arrays.sort(data); // 31.4s
QuadCoreArrays.sort(data); // 12.2sec

The input is split into 4 ranges and sorted concurrently on 4 cores (5220ms).
Two pairs are merged on 2 cores (3599ms).
Final pair is merged on 1 core (3381ms).

It uses an auxiliary buffer for the merging, doubling the memory usage. In single-threaded code it would look like:
int[] aux = new int[data.length];
mergeInAux(data, data.length / 4 * 0, data.length / 4 * 1, data.length / 4, aux);
mergeInAux(data, data.length / 4 * 2, data.length / 4 * 3, data.length / 4, aux);
mergeInAux(aux,  data.length / 2 * 0, data.length / 2 * 1, data.length / 2, data);


31.4s -> 12.2s = 2.57x

Offline theagentd
« Reply #24 - Posted 2012-01-27 15:20:51 »

That's pretty awesome! Are you using Arrays.sort(...) to sort the data on each CPU? Multithreading the merging too would probably improve the scaling a bit more though.

Also, minor typo in the merging code you posted. The second argument on the second line should be data.length / 4 * 3, not 2... ;)

Offline Riven
« Reply #25 - Posted 2012-01-27 15:43:21 »

Quote from: theagentd
That's pretty awesome! Are you using Arrays.sort(...) to sort the data on each CPU?
Yes, Arrays.sort(values, fromIndex, toIndex) is pretty much the most convenient thing to use in this case.

Quote from: theagentd
Multithreading the merging too would probably improve the scaling a bit more though.
Multithreading merging is actually not that hard, but not trivial either.

You probably want to generalize the situation, ending up with something like:
public static void streamMerge(Queue<int[]> input1, Queue<int[]> input2, Queue<int[]> output, int outputChunkSize)
It would probably cut the final stage in half, shaving off 1.6sec from the total 12.2. It's hardly worth the effort.

Quote from: theagentd
Also, minor typo in the merging code you posted. The second argument on the second line should be data.length / 4 * 3, not 2... ;)
Typo indeed. This was not part of the actual code :)

Offline theagentd
« Reply #26 - Posted 2012-01-27 15:50:57 »

Quote from: Riven
It would probably cut the final stage in half, shaving off 1.6sec from the total 12.2. It's hardly worth the effort.
What?! That's like a 10-15% speed increase! It's a crime not to implement it!!! xDDD

Offline Riven
« Reply #27 - Posted 2012-01-27 17:03:01 »

Quote from: Riven
It would probably cut the final stage in half, shaving off 1.6sec from the total 12.2. It's hardly worth the effort.

Quote from: theagentd
What?! That's like a 10-15% speed increase! It's a crime not to implement it!!! xDDD
I just tried. The overhead of the additional data-copy is so large that it's slightly slower than not multithreading the merge.

Offline Riven
« Reply #28 - Posted 2012-01-27 17:08:06 »

For your entertainment: (and maybe to point at the stupidity / performance hog...)
   static void streamMerge(SimpleBlockingQueue<int[]> input1, SimpleBlockingQueue<int[]> input2, SimpleBlockingQueue<int[]> output, int maxOutputChunkSize) {
      int[] data1 = input1.take();
      int[] data2 = input2.take();
     
      int[] aux = new int[maxOutputChunkSize];
      int off = 0;
      int read1 = 0;
      int read2 = 0;
      int end1 = data1.length;
      int end2 = data2.length;
     
      while (true) {
         
         while (((maxOutputChunkSize - off - 1) | (end1 - read1 - 1) | (end2 - read2 - 1)) >= 0) {
            int a = data1[read1];
            int b = data2[read2];
            if (a < b) {
               aux[off++] = a;
               read1++;
            } else {
               aux[off++] = b;
               read2++;
            }
         }
         
         if (off > 0) {
            output.put(Arrays.copyOf(aux, off));
            off = 0;
         }
         
         if (read1 == end1) {
            data1 = input1.take();
            if (data1 == null) {
               do {
                  output.put(Arrays.copyOfRange(data2, read2, data2.length));
                  read2 = 0;
               } while ((data2 = input2.take()) != null);
               break;
            }
            read1 = 0;
            end1 = data1.length;
         }
         
         if (read2 == end2) {
            data2 = input2.take();
            if (data2 == null) {
               do {
                  output.put(Arrays.copyOfRange(data1, read1, data1.length));
                  read1 = 0;
               } while ((data1 = input1.take()) != null);
               break;
            }
            read2 = 0;
            end2 = data2.length;
         }
      }
     
      // signal done
      output.put(null);
   }


   private static void sort3(final int[] values) {
     
      final SimpleBlockingQueue<int[]>[] inputs = new SimpleBlockingQueue[4];
      for (int i = 0; i < inputs.length; i++) {
         inputs[i] = new SimpleBlockingQueue<int[]>();
      }
     
      final SimpleBlockingQueue<int[]>[] outputs = new SimpleBlockingQueue[2];
      for (int i = 0; i < outputs.length; i++) {
         outputs[i] = new SimpleBlockingQueue<int[]>();
      }
     
      final SimpleBlockingQueue<int[]> target = new SimpleBlockingQueue<int[]>();
     
      for (int i = 0; i < 4; i++) {
         final int j = i;
         pool.execute(new Runnable() {
            @Override
            public void run() {
               
               int from = values.length / 4 * (j + 0);
               int to = values.length / 4 * (j + 1);
               
               Arrays.sort(values, from, to);
               
               inputs[j].put(Arrays.copyOfRange(values, from, to));
               inputs[j].put(null);
            }
         });
      }
     
      final int chunkSize = 1024 * 1024;
     
      for (int i = 0; i < 2; i++) {
         final int j = i;
         pool.execute(new Runnable() {
            @Override
            public void run() {
               streamMerge(inputs[j * 2 + 0], inputs[j * 2 + 1], outputs[j], chunkSize);
            }
         });
      }
     
      for (int i = 0; i < 1; i++) {
         final int j = i;
         pool.execute(new Runnable() {
            @Override
            public void run() {
               streamMerge(outputs[j + 0], outputs[j + 1], target, chunkSize);
            }
         });
      }
     
      int offset = 0;
      for (int[] data; (data = target.take()) != null; offset += data.length) {
         System.arraycopy(data, 0, values, offset, data.length);
      }
   }

Offline theagentd
« Reply #29 - Posted 2012-01-27 17:11:22 »

I am entertained. =D
