Java-Gaming.org Hi !
Featured games (81)
games approved by the League of Dukes
Games in Showcase (513)
Games in Android Showcase (119)
games submitted by our members
Games in WIP (577)
games currently in development
News: Read the Java Gaming Resources, or peek at the official Java tutorials
 
    Home     Help   Search   Login   Register   
Pages: [1]
  ignore  |  Print  
  Parallel insertion sort for almost sorted data sets  (Read 2978 times)
0 Members and 1 Guest are viewing this topic.
Offline theagentd
« Posted 2014-04-29 16:40:34 »

Hello.

Sorting has several important uses in games, but performance is usually not that important due to the data set being sorted usually only consisting of a small number of elements. However, there are cases where good performance is necessary. One of those cases is the broad phase of collision detection. The idea is simple: Sort your objects along one of the axes (say the X axis) and only process objects that overlap along this axis. In this case, sorting performance is obviously very important since the sorting part can easily get more expensive than the collision detection it's supposed to optimize.

Sorting is difficult to do in parallel on multiple cores though. There are a few algorithms that can be partially threaded on the CPU (merge sort) and even some that can be run on GPUs (radix sort), but they generally do not scale that well with more cores. They are however much faster than single-threaded solution. Sorting for a broad phase has a very important quirk though: We can assume that the order of the objects does not change drastically between each collision detection pass due to the nature of how the objects move around. In the case of a space game with objects moving slowly through space, this is even more true.

Although O(n*log(n)) sorting algorithms like merge sort are fast for large data sets with randomly ordered elements, they're not optimal for almost sorted data sets. A simple insertion sort is actually a very good choice in this case due to the fact that insertion sort runs at O(n) if the list is perfectly sorted already. Moving elements a small distance is also cheap, which fits my problem description perfectly. Insertion sort cannot however be easily parallelized to run on multiple cores due to the potential of a single element being moved from the end of the array to the beginning of the array, and the algorithm itself relies on that all elements before the element being processed are already sorted. However, we can make the assumption that almost all elements will only need to be moved a small distance, and we can optimize for this special case.

The idea is relatively simple. We split up the array into evenly sized chunks and use insertion sort to sort these chunks in parallel. Then we identify "overlaps" between chunks that need to be sorted to fix the ordering, and if necessary we run a second pass of insertion sorts on these overlaps. A simple algorithm is used to identify overlaps which conservatively sorts the array to guarantee correct order. The worst case is if the array is completely random, in which case the algorithm identifies the whole array as a single overlap area and the algorithm breaks down to a slow single-threaded insertion sort.



Here is a visualization of the algorithm.

To generate test data, we start with a sorted array.


We then shuffle around the data a bit to simulate the changing order of elements.


We split it up into chunks...


and sort them concurrently on multiple cores.


We identify the overlaps...


and sort them as well.


We're done!



Performance of this algorithm is excellent. The following are the results of sorting 4 identical arrays using 4 different algorithms. The first value is for an array shuffled in the same manner as the visualization above, while the second one simply processes an already sorted list. The result is verified to be identical for all 4 algorithms (this verification is not included in the timings below). The test was run on an i7-4770K with Hyperthreading enabled. When using a single core, the processor is clocked to 3.9GHz. When using all 4 cores (8 threads for Hyperthreading) the processor drops to 3.7GHz.

Quote
Java sort unsorted: 67.185 ms
Java sort sorted:   11.238 ms
Java parallel sort unsorted: 26.689 ms
Java parallel sort sorted:   15.36 ms
Insertion sort unsorted: 62.769 ms
Insertion sort sorted:   15.984 ms
Parallel insertion sort unsorted: 10.968 ms
Parallel insertion sort sorted:   4.836 ms

Despite the lower clock speed when using all 4 cores, the parallel insertion sort achieves an almost ridiculous 6x performance boost over a single threaded insertion sort.


I also implemented the algorithm in a simple space simulation with 200 000 ships being affected by the gravity of a nearby planet. The ships first had their velocity and position updated and were then sorted by the parallel insertion sort. The below are the timings in milliseconds when using different number of cores. Both the ship updating and ship sorting is parallelized to run on any number of cores. Also note that the single core test used a non-parallel standard insertion sort to avoid the overhead of sorting the overlaps.

Quote
1 core:     46.5ms   1.00x   13% CPU load
2 cores:    23.7ms   1.96x   25% CPU load
4 cores:    13.8ms   3.37x   49-50% CPU load
8 cores:     9.5ms   4.89x   94-98% CPU load

Myomyomyo.
Offline Danny02
« Reply #1 - Posted 2014-04-29 20:27:19 »

where is the code? Wink
Offline theagentd
« Reply #2 - Posted 2014-04-29 21:55:23 »

I'm still working on the overlap detection code. Although it currently always sorts the array correctly, it's not optimal.

Myomyomyo.
Games published by our own members! Check 'em out!
Legends of Yore - The Casual Retro Roguelike
Offline theagentd
« Reply #3 - Posted 2014-04-29 23:42:53 »

I believe I have managed to perfect the overlap detection and also optimized the overlap sorting slightly by skipping elements that are known to be sorted already.



My implementation of insertion sort: InsertionSort.java
My implementation of a parallel insertion sort: ParallelInsertionSort.java

All thread management is handled by the threading library I've developed (which is a powerful library for multithreading your whole game). The sorting code is well documented and should be easy to understand when combined with the visualization in the first post, so you can probably implement this with your own threading system if you use my code as reference.



If you wish to try out the search algorithm yourself without having to write your own implementation, you need to download my threading library (a massive 70 kilobyte download).

Threading library download link.

Running the sorting algorithm is as simple as this:
1  
2  
3  
4  
5  
ParallelInsertionSort<SortedObjectTypeHere> sorter = new ParallelInsertionSort(numChunks);
MultithreadedExecutor executor = new MultithreadedExecutor(numThreads);

sorter.sort(arrayToSort, startIndex, endIndex, myComparator, executor);
//Of course, both the sorter and the executor can be reused any number of times.




Finally, here is a sorting performance test program I used to compare InsertionSort and ParallelInsertionSort with Java's Arrays.sort() and Arrays.parallelSort(). This requires the insertion sort and parallel insertion sort code above and my threading library on the classpath.

SortTest.java

EDIT: The previous test had some peculiarities due to the fact that it was using Integer wrappers around ints. Here's an improved test program:
SortTest.java

Sample results:

Quote
Java sort unsorted: 416.035 ms
Java sort sorted:   60.669 ms
Java parallel sort unsorted: 148.362 ms
Java parallel sort sorted:   82.6 ms
Insertion sort unsorted: 405.491 ms
Insertion sort sorted:   92.827 ms
Parallel insertion sort unsorted: 75.997 ms
Parallel insertion sort sorted:   28.197 ms
Unsorted: 5.34x faster
Sorted: 3.29x faster

Myomyomyo.
Online Roquen
« Reply #4 - Posted 2014-04-30 05:41:40 »

For almost sorted data you might want to look at some of the bubble sort variants.  I haven't thought at all about the pains it might require to make massively parallel.
Offline Abuse

JGO Knight


Medals: 13


falling into the abyss of reality


« Reply #5 - Posted 2014-04-30 08:09:35 »

Just from a quick skim of your initial diagrams, I'm curious what the algorithm does when an element belongs in a chunk that isn't a neighbouring chunk?
Eg if an element supposed to be in the last chunk,  was placed in the first.

Are the overlaps rechecked until the list is completely sorted?

Make Elite IV:Dangerous happen! Pledge your backing at KICKSTARTER here! https://dl.dropbox.com/u/54785909/EliteIVsmaller.png
Online Roquen
« Reply #6 - Posted 2014-04-30 09:29:15 »

Various flavors of mostly sorted is a pretty common case.

EDIT: Warm the methods.
Offline theagentd
« Reply #7 - Posted 2014-04-30 14:02:27 »

Just from a quick skim of your initial diagrams, I'm curious what the algorithm does when an element belongs in a chunk that isn't a neighbouring chunk?
Eg if an element supposed to be in the last chunk,  was placed in the first.

Are the overlaps rechecked until the list is completely sorted?

My priority here was correctness over performance, so if the algorithm detects that that a single element needs to be moved farther than 1 chunk away it will simply include all affected chunk in a single overlap. If the array is "too random", it will actually end up being a single-threaded insertion sort over almost the whole buffer. Here is an example where I have 8 chunks and 1000 elements being shuffled around a bit too much. Note how the 3rd overlap extends over a total of 3 chunks due to the dip in the third chunk overlapping the top in the first chunk.


Here's a visualization of the overlap sorting optimization I made. It simply skips the first elements that are already sorted in each chunk. The slightly darker red parts of each overlap is simply skipped over. On average this will skip half the elements being sorted (although the ones that were the cheapest to sort) as long as there are no overlaps over more than 2 chunks.


Various flavors of mostly sorted is a pretty common case.

EDIT: Warm the methods.

Yeah, I'm still working on improving the testing. More realistic numbers seem to point to around 4.1x scaling with 8 threads.

Myomyomyo.
Offline theagentd
« Reply #8 - Posted 2014-04-30 14:49:32 »

Also, the algorithm currently uses a very inefficient linear search for finding out the overlap sorting bounds (although this search is threaded per overlap). A better version would use a binary search to locate those bounds since the search is inside a single chunk and all chunks are already sorted.

Myomyomyo.
Online Roquen
« Reply #9 - Posted 2014-04-30 17:18:10 »

I'll toss together some generators for you to try.
Games published by our own members! Check 'em out!
Legends of Yore - The Casual Retro Roguelike
Offline theagentd
« Reply #10 - Posted 2014-04-30 18:47:46 »

I'll toss together some generators for you to try.

Oh, sweet! That'd be really useful! I'm currently doing something like this to get more consistent scaling measurements:

1  
2  
3  
4  
5  
6  
7  
8  
long normal = 0;
long parallel = 0;

while(true){
    normal += testInsertionSort();
    parallel += testParallelInsertionSort();
    System.out.println((double)normal / parallel + "x scaling");
}

This seems to (and should) approach a stable and reproducible value, around 4.1x (with this particular data set and shuffling algorithm).

Quote
Scaling: 4.088887102760233

I'm gonna do some detail profiling on each part of the algorithm tonight.

EDIT: Left it running over dinner. Stabilized at around 4.26x scaling.

EDIT2: Here are some preliminary profiling results:
Quote
    Sort chunks : 15.802ms
    Update overlaps : 0.001ms
    Sort overlaps : 0.089ms
Note that the only single-threaded part of the algorithm is the "Update overlaps" part, which only depends on the number of chunks, not the number of elements in the array.

Myomyomyo.
Online Roquen
« Reply #11 - Posted 2014-05-01 08:53:32 »

OK.  A first pass is here: https://github.com/roquendm/JGO-Grabbag/blob/master/src/roquen/math/seq/TestIntSequence.java

Some expectations.  The original test bed didn't warm the sorting.  If anyone bothered I'd expect that with and without warming the timings are going to be pretty much the same.  Sorting is going to be dominated by stalls, notable branch-misprediction and memory architecture related (reads: waiting on cache filling and writes: full write buffers).  Hopefully the parallel sorts are working far enough apart to not cause thrashing.

As such these test generators don't copy data but fill or modify in place using a static PRNG.
Offline Abuse

JGO Knight


Medals: 13


falling into the abyss of reality


« Reply #12 - Posted 2014-05-01 09:45:09 »

So teleporting entities from one end of the domain to the other would be a catastrophe for the efficiency of this algorithm.
Cylindrical/toroidal coordinate systems are a big no no then!

Make Elite IV:Dangerous happen! Pledge your backing at KICKSTARTER here! https://dl.dropbox.com/u/54785909/EliteIVsmaller.png
Online Roquen
« Reply #13 - Posted 2014-05-01 09:48:45 »

No problem.  Don't represent the 'universe' in a single data structure.
Offline ags1

JGO Wizard


Medals: 67
Projects: 3
Exp: 5 years


Make code not war!


« Reply #14 - Posted 2014-05-01 10:10:44 »

No problem.  Don't represent the 'universe' in a single data structure.

If the universe is not a single data structure, then I can sort the separate data structures on separate threads, rather than multi-thread sorts on each structure.

Offline theagentd
« Reply #15 - Posted 2014-05-04 19:36:29 »

So teleporting entities from one end of the domain to the other would be a catastrophe for the efficiency of this algorithm.
Cylindrical/toroidal coordinate systems are a big no no then!
It's possible to work around this. You'll always have to add and remove stuff to the list, so this needs to be addressed. Usually, you'd do the sorting after updating each object. So first you can loop through all objects, mark all objects that need to be either removed or teleported, and queue up all teleporting or newly created objects into a queue. You can then combine my parallel sorting algorithm with a double buffered list. Each chunk needs to know how many objects we currently have in it, how many of those we are going to remove and how many new/teleporting objects we are going to insert into it. Then we can simply run a merge between the (sorted) insertion queue and the chunk's old objects simultaneously with the insertion sort.

1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
//Before (normal insertion sort)
for each object in array
    insert currentObject into its correct position in array
}

//After
int writeIndex = 0;
for each object in array1 {
    if object is dead or teleported {
        continue;
    }

    while next object in insertion queue should be before the current object{
        write queued object to array2[writeIndex];
        writeIndex++;
    }
   
    insert currentObject into its correct position in array, but write to array2!!!
    writeIndex++;
}

Myomyomyo.
Online Riven
« League of Dukes »

JGO Overlord


Medals: 816
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #16 - Posted 2014-05-05 13:53:35 »

You have to consider that a merge operation is incredibly slow, as you (by definition) trash your cache by walking two input arrays and writing to one output array. A trivial merge is likely to ruin your performance.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline theagentd
« Reply #17 - Posted 2014-05-05 14:14:13 »

Why exactly would a simple merge trash the cache? Shouldn't the data easily fit the cache? Besides, branch prediction should not be a real problem since the number of objects being inserted will most likely be far lower than the number of elements that are already in the chunk, so almost all iterations will evaluate the nested while-loop as false which should be an easy pattern for branch prediction to guess (default to false).

Myomyomyo.
Offline pitbuller
« Reply #18 - Posted 2014-05-05 16:46:58 »

Modern caches work well for multiple input arrays. I just don't see the problem?
Offline theagentd
« Reply #19 - Posted 2014-05-05 18:28:29 »

Regardless of how much slower this is, you need to make a tradeoff between the cost of adding the merge operation and the reduced parallelism that occurs as a result of elements being moved large distances in the array. Depending on how rarely this occurs, it may be faster to simply insert the object at its correct position using a binary search to find the index and then inserting it by shifting all following elements to make room for it.

EDIT: The data caches in my CPU (i7-4770K) are as follows:
    L1: 32 KB, 8-way set associative, 64-byte line size per core
    L2: 256 KB, 8-way set associative, 64-byte line size per core
    L3: 8192 KB, 8-way set associative, 64-byte line size shared

Myomyomyo.
Pages: [1]
  ignore  |  Print  
 
 

 

Add your game by posting it in the WIP section,
or publish it in Showcase.

The first screenshot will be displayed as a thumbnail.

Longarmx (49 views)
2014-10-17 03:59:02

Norakomi (39 views)
2014-10-16 15:22:06

Norakomi (31 views)
2014-10-16 15:20:20

lcass (35 views)
2014-10-15 16:18:58

TehJavaDev (65 views)
2014-10-14 00:39:48

TehJavaDev (65 views)
2014-10-14 00:35:47

TehJavaDev (55 views)
2014-10-14 00:32:37

BurntPizza (72 views)
2014-10-11 23:24:42

BurntPizza (43 views)
2014-10-11 23:10:45

BurntPizza (84 views)
2014-10-11 22:30:10
Understanding relations between setOrigin, setScale and setPosition in libGdx
by mbabuskov
2014-10-09 22:35:00

Definite guide to supporting multiple device resolutions on Android (2014)
by mbabuskov
2014-10-02 22:36:02

List of Learning Resources
by Longor1996
2014-08-16 10:40:00

List of Learning Resources
by SilverTiger
2014-08-05 19:33:27

Resources for WIP games
by CogWheelz
2014-08-01 16:20:17

Resources for WIP games
by CogWheelz
2014-08-01 16:19:50

List of Learning Resources
by SilverTiger
2014-07-31 16:29:50

List of Learning Resources
by SilverTiger
2014-07-31 16:26:06
java-gaming.org is not responsible for the content posted by its members, including references to external websites, and other references that may or may not have a relation with our primarily gaming and game production oriented community. inquiries and complaints can be sent via email to the info‑account of the company managing the website of java‑gaming.org
Powered by MySQL Powered by PHP Powered by SMF 1.1.18 | SMF © 2013, Simple Machines | Managed by Enhanced Four Valid XHTML 1.0! Valid CSS!