Java-Gaming.org Hi !
Featured games (83)
games approved by the League of Dukes
Games in Showcase (539)
Games in Android Showcase (132)
games submitted by our members
Games in WIP (603)
games currently in development
News: Read the Java Gaming Resources, or peek at the official Java tutorials
 
    Home     Help   Search   Login   Register   
Pages: [1]
  ignore  |  Print  
  Code optimization article  (Read 2899 times)
0 Members and 1 Guest are viewing this topic.
Offline erikd

JGO Ninja


Medals: 16
Projects: 4
Exp: 14 years


Maximumisness


« Posted 2007-05-15 18:51:57 »

I recently came across this article:
http://www.azillionmonkeys.com/qed/optimize.html

The article (although a bit old) was written with games in mind, is geared towards C and ASM, and also deals with branch predictability and such.

I've done a small test in an inner loop in a real-life application with this 'else' removal technique explained in the article, but with no success.
I'm wondering to what degree some of these (micro-)optimization techniques (such as 'else' removal) are applicable to java programs (or in general on today's CPUs)?

The micro-optimization techniques aside, it's an interesting read.

Offline elias

Senior Devvie





« Reply #1 - Posted 2007-05-15 19:36:46 »

A guess would be that while C compilers rely on a fixed rule about branches, "else branch is rarely taken", hotspot is more than able to insert profiling code and simply measure which branch to optimize for.

 - elias

Offline rreyelts

Junior Devvie




There is nothing Nu under the sun


« Reply #2 - Posted 2007-05-15 20:12:35 »

The 'else' removal technique really depends upon the code in the if and else blocks being very small and very easy to undo. It also depends upon a branch prediction miss being likely and very expensive. You're best worrying about branch prediction within tight for loops. Even in that case, the processor's probably only going to mis-predict a few times out of all of the loop iterations. Also, you're increasing your code size, working against the instruction cache.

In general, I think we're moving away from the trend of branch prediction getting costlier, because we're moving more towards parallel programming - larger numbers of simpler cores combined with effective use of SIMD instructions. Cas made a statement a couple of years back about multi-threaded pipelines being pointless, but the reality is that anyone buying a new machine now is guaranteed to be getting at least a dual core setup, and in less than a year it's going to be quad core.

About me: http://jroller.com/page/rreyelts
Jace - Easier JNI: http://jace.reyelts.com/jace
Retroweaver - Compile on JDK1.5, and deploy on 1.4: http://retroweaver.sf.net.
Games published by our own members! Check 'em out!
Legends of Yore - The Casual Retro Roguelike
Online princec

« JGO Spiffy Duke »


Medals: 434
Projects: 3
Exp: 16 years


Eh? Who? What? ... Me?


« Reply #3 - Posted 2007-05-21 15:25:30 »

Yep, no longer entirely pointless to think about multicore designs, though the big majority of users out there are going to have single core for the next 4-5 years. In the meantime though it's nice to know that the garbage collector  and hotspot server compiler can run quite effectively in their own threads now.

Cas Smiley

Offline Riven
« League of Dukes »

« JGO Overlord »


Medals: 840
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #4 - Posted 2007-05-21 19:45:56 »

Nice find!

1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
26  
27  
28  
29  
30  
31  
32  
33  
34  
35  
36  
37  
38  
39  
   private static float simpleSum(float[] arr)
   {
      float sum = 0.0f;
      for (int i = 0; i < arr.length; i++)
         sum += arr[i];
      return sum;
   }

   private static float advancedSum(float[] arr)
   {
      int i = 0;

      float sumA = 0.0f;
      float sumB = 0.0f;
      float sumC = 0.0f;
      float sumD = 0.0f;

      int end = arr.length;

      end -= 3;

      while (i < end)
      {
         sumA += arr[i + 0];
         sumB += arr[i + 1];
         sumC += arr[i + 2];
         sumD += arr[i + 3];

         i += 4;
      }

      end += 3;

      float sum = (sumA + sumB) + (sumC + sumD);
      while (i < end)
         sum += arr[i++];

      return sum;
   }


simpleSum 6705ns (149K sums/s)
advancedSum 3911ns (255K sums/s)

71% faster!

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social
Offline CaptainJester

JGO Knight


Medals: 12
Projects: 2
Exp: 14 years


Make it work; make it better.


« Reply #5 - Posted 2007-05-22 03:52:05 »

My tests show this one is faster by 25% over the advancedSum:
1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
    private static float fastAdvancedSum(float[] arr) {
        int i = 0;
       
        float sum = 0.0f;
       
        int end = arr.length;
       
        end -= 3;
       
        while (i < end) {
            sum += arr[i++];
            sum += arr[i++];
            sum += arr[i++];
            sum += arr[i++];
        }
       
        end += 3;
       
        while (i < end) {
            sum += arr[i++];
        }

        return sum;
    }


and 35% if you do it this way:
1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
26  
27  
28  
    private static float advancedSum(float[] arr) {
        int i = 0;
       
        float sum = 0.0f;
       
        int end = arr.length;
       
        end -= 3;
       
        while (i < end) {
            sum += arr[i++];
            sum += arr[i++];
            sum += arr[i++];
            sum += arr[i++];
            sum += arr[i++];
            sum += arr[i++];
            sum += arr[i++];
            sum += arr[i++];
        }
       
        end += 3;
       
        while (i < end) {
            sum += arr[i++];
        }

        return sum;
    }

Online princec

« JGO Spiffy Duke »


Medals: 434
Projects: 3
Exp: 16 years


Eh? Who? What? ... Me?


« Reply #6 - Posted 2007-05-22 09:09:23 »

That does rather strike me as being silly Smiley The hotspot server compiler should probably be thinking about those kind of optimisations as it doesn't half make the actual source code much more difficult to understand!

Cas Smiley

Offline Riven
« League of Dukes »

« JGO Overlord »


Medals: 840
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #7 - Posted 2007-05-22 09:43:35 »

The Server VM (and Client VM sometimes) is driving me nuts...

I'm putting the contents of a float[] into a strided datastructure with a certain offset and width (like VBO interleaved stuff)

One element can look like: [x,x,X,X,X,X,x,x,x] (repeats iself)

The code for this is:
1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
   public final void putElements(int elementOffset, int elementLength, float[] array, int arrayOffset)
   {
      int dataOffset = this.dataOffsetOfElement(elementOffset);
      int lastDataOffset = dataOffset + elementLength * stride;

            while (dataOffset < lastDataOffset)
            {
               data[dataOffset + 0] = array[arrayOffset + 0];
               data[dataOffset + 1] = array[arrayOffset + 1];
               data[dataOffset + 2] = array[arrayOffset + 2];
               data[dataOffset + 3] = array[arrayOffset + 3];
               arrayOffset += width; // if I replace 'width' (which is 4) with '4', performance gets cut in half
               dataOffset += stride;
            }
   }


Now the silly part is that it's gets executed much faster with a switch-statement around it:

1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
26  
27  
28  
29  
30  
31  
32  
33  
34  
35  
36  
37  
38  
39  
40  
41  
42  
43  
44  
45  
      switch (width)
      {
         case 1:
            // this is the madness
            dataOffset += stride * elementLength;
            for (int e = elementLength - 1; e >= 0; e--)
               data[dataOffset -= stride] = array[arrayOffset + e];
            break;

         case 2:
            // code optimized for width of 2
            break;

         case 3:
            // code optimized for width of 3
            break;

         case 4: // running the code for width '4' inside the switch-statement is 21% faster than specialized method for width 4!! arg!
            while (dataOffset < lastDataOffset)
            {
               data[dataOffset + 0] = array[arrayOffset + 0];
               data[dataOffset + 1] = array[arrayOffset + 1];
               data[dataOffset + 2] = array[arrayOffset + 2];
               data[dataOffset + 3] = array[arrayOffset + 3];
               arrayOffset += width;
               dataOffset += stride;
            }
            break;

         case 5:
            // code optimized for width of 5
            break;

         default:
            // very very slow generic case
            for (int e = elementLength - 1; e >= 0; e--)
            {
               for (int i = width - 1; i >= 0; i--)
                  data[dataOffset + i] = array[arrayOffset + i];
               arrayOffset += width;
               dataOffset += stride;
            }
            break;
      }
   }


Making tiny changes, like replacing 'width' with '4' makes the HotSpot VM drop everything and create some slow execution-path (maybe it requires an additional register, and it just ran out? Should that cause a performanceloss of 50%?)

When I code for performance, I have to find the fastest way with trial and error, the difference can be up to factor 3, when applying lots of seamingly irrelevant changes, or putting a lot of never-to-be-executed code around it, to make it 21% faster. And then that will only be the fastest loop on that version of that VM vendor.

I'm going to implement this tiny loop in C and make a DLL, that way I'm certain the code is natively compiled properly on every VM. it's such a shame HotSpot isn't predicatable.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social
Offline CaptainJester

JGO Knight


Medals: 12
Projects: 2
Exp: 14 years


Make it work; make it better.


« Reply #8 - Posted 2007-05-22 11:12:21 »

That does rather strike me as being silly Smiley The hotspot server compiler should probably be thinking about those kind of optimisations as it doesn't half make the actual source code much more difficult to understand!

Cas Smiley

Could be the warm up thing.  I did't allow for the VM to run for a long time through each of the examples to give it time to optimize.

Offline CaptainJester

JGO Knight


Medals: 12
Projects: 2
Exp: 14 years


Make it work; make it better.


« Reply #9 - Posted 2007-05-22 13:37:08 »

Using this code:
1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
26  
27  
28  
29  
30  
31  
32  
33  
34  
35  
36  
37  
38  
39  
40  
41  
42  
43  
44  
45  
46  
47  
48  
49  
50  
51  
52  
53  
54  
55  
56  
57  
58  
59  
60  
61  
62  
63  
64  
65  
66  
67  
68  
69  
70  
71  
72  
73  
74  
75  
76  
77  
78  
79  
80  
81  
82  
83  
84  
85  
86  
87  
88  
89  
90  
91  
92  
93  
94  
95  
96  
97  
98  
99  
100  
101  
102  
103  
104  
105  
106  
107  
108  
109  
110  
111  
112  
113  
114  
115  
116  
117  
118  
119  
120  
121  
122  
123  
124  
125  
126  
127  
128  
129  
130  
131  
132  
133  
134  
135  
136  
137  
import java.util.Random;

public class Test {
    public static void main(String[] args) {
        float nums[] = new float[2000];
        long time1, time2, time3, time4;
        float sum1, sum2, sum3, sum4;
       
        Random random = new Random(1);
        for(int i=0;i<2000;i++) {
            nums[i] = random.nextFloat() * 100.0f;
        }
       
        long time = System.nanoTime();
        long tick = time;
        int seconds = 0;
        while(System.nanoTime() - time < 300000000000L) {
            simpleSum(nums);
            advancedSum(nums);
            fastAdvancedSum(nums);
            if(System.nanoTime() - tick > 10000000000L) {
                tick = System.nanoTime();
                seconds += 10;
                System.out.println(seconds);
            }
        }
       
        time1 = System.nanoTime();
        sum1 = simpleSum(nums);
        time1 = System.nanoTime() - time1;
        time2 = System.nanoTime();
        sum2 = advancedSum(nums);
        time2 = System.nanoTime() - time2;
        time3 = System.nanoTime();
        sum3 = fastAdvancedSum(nums);
        time3 = System.nanoTime() - time3;
        time4 = System.nanoTime();
        sum4 = fastAdvancedSum(nums);
        time4 = System.nanoTime() - time4;
       
        System.out.println("After a 5 minute warm up the execution times are\r\n");
        System.out.println("      simpleSum " + sum1 + " took " + time1 + " nanoseconds");
        System.out.println("    advancedSum " + sum2 + " took " + time2 + " nanoseconds " + ((1.0 - ((double)time2 / (double)time1)) * 100.0) + "% faster");
        System.out.println(" fixAdvancedSum " + sum3 + " took " + time3 + " nanoseconds " + ((1.0 - ((double)time3 / (double)time1)) * 100.0) + "% faster");
        System.out.println("fastAdvancedSum " + sum4 + " took " + time4 + " nanoseconds " + ((1.0 - ((double)time4 / (double)time1)) * 100.0) + "% faster");
    }

    private static float simpleSum(float[] arr) {
        float sum = 0.0f;
        for (int i = 0; i < arr.length; i++) {
            sum += arr[i];
        }

        return sum;
    }

    private static float advancedSum(float[] arr) {
        int i = 0;

        float sumA = 0.0f;
        float sumB = 0.0f;
        float sumC = 0.0f;
        float sumD = 0.0f;

        int end = arr.length;

        end -= 3;

        while (i < end) {
            sumA += arr[i + 0];
            sumB += arr[i + 1];
            sumC += arr[i + 2];
            sumD += arr[i + 3];

            i += 4;
        }

        end += 3;

        float sum = (sumA + sumB) + (sumC + sumD);
        while (i < end) {
            sum += arr[i++];
        }

        return sum;
    }
    private static float fixAdvancedSum(float[] arr) {
        int i = 0;

        float sum = 0.0f;

        int end = arr.length;

        end -= 3;

        while (i < end) {
            sum += arr[i + 0];
            sum += arr[i + 1];
            sum += arr[i + 2];
            sum += arr[i + 3];

            i += 4;
        }

        end += 3;

        while (i < end) {
            sum += arr[i++];
        }

        return sum;
    }
    private static float fastAdvancedSum(float[] arr) {
        int i = 0;
       
        float sum = 0.0f;
       
        int end = arr.length;
       
        end -= 3;
       
        while (i < end) {
            sum += arr[i++];
            sum += arr[i++];
            sum += arr[i++];
            sum += arr[i++];
        }
       
        end += 3;
       
        while (i < end) {
            sum += arr[i++];
        }

        return sum;
    }
}


I get the following results:
1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
Client VM
      simpleSum 99848.016 took 27099 nanoseconds
    advancedSum 99848.05 took 13968 nanoseconds 48.45566257057457% faster
 fixAdvancedSum 99848.016 took 30172 nanoseconds -11.339901841396372% faster
fastAdvancedSum 99848.016 took 29613 nanoseconds -0.0927709509576% faster

Server VM
      simpleSum 99848.016 took 69003 nanoseconds
    advancedSum 99848.05 took 7543 nanoseconds 89.06859122067156% faster
 fixAdvancedSum 99848.016 took 8660 nanoseconds 87.44982102227439% faster
fastAdvancedSum 99848.016 took 7543 nanoseconds 89.06859122067156% faster


After a 5 minute warm up, the server VM really does its job.  advancedSum and fastAdvancedSum run the same spped.  However, advancedSum suffers from floating point error and never returns the correct result.  fixAdvancedSum fixxes this problem, but then runs a little slower.

Games published by our own members! Check 'em out!
Legends of Yore - The Casual Retro Roguelike
Offline erikd

JGO Ninja


Medals: 16
Projects: 4
Exp: 14 years


Maximumisness


« Reply #10 - Posted 2007-05-22 14:50:37 »

That does rather strike me as being silly Smiley The hotspot server compiler should probably be thinking about those kind of optimisations as it doesn't half make the actual source code much more difficult to understand!

Cas Smiley

Well, this particular performance optimization trick in 'advanceSum' is known to degrade floating point accuracy, so HotSpot should never do this.
And besides, I can't remember having to write a for loop which sums up the contents of an entire array the last few years *and* which was also performance bottleneck  Smiley

But sure, I suppose there's still room in HotSpot for improvement.

Pages: [1]
  ignore  |  Print  
 
 
You cannot reply to this message, because it is very, very old.

 

Add your game by posting it in the WIP section,
or publish it in Showcase.

The first screenshot will be displayed as a thumbnail.

rwatson462 (30 views)
2014-12-15 09:26:44

Mr.CodeIt (23 views)
2014-12-14 19:50:38

BurntPizza (50 views)
2014-12-09 22:41:13

BurntPizza (84 views)
2014-12-08 04:46:31

JscottyBieshaar (45 views)
2014-12-05 12:39:02

SHC (59 views)
2014-12-03 16:27:13

CopyableCougar4 (57 views)
2014-11-29 21:32:03

toopeicgaming1999 (123 views)
2014-11-26 15:22:04

toopeicgaming1999 (114 views)
2014-11-26 15:20:36

toopeicgaming1999 (32 views)
2014-11-26 15:20:08
Resources for WIP games
by kpars
2014-12-18 10:26:14

Understanding relations between setOrigin, setScale and setPosition in libGdx
by mbabuskov
2014-10-09 22:35:00

Definite guide to supporting multiple device resolutions on Android (2014)
by mbabuskov
2014-10-02 22:36:02

List of Learning Resources
by Longor1996
2014-08-16 10:40:00

List of Learning Resources
by SilverTiger
2014-08-05 19:33:27

Resources for WIP games
by CogWheelz
2014-08-01 16:20:17

Resources for WIP games
by CogWheelz
2014-08-01 16:19:50

List of Learning Resources
by SilverTiger
2014-07-31 16:29:50
java-gaming.org is not responsible for the content posted by its members, including references to external websites, and other references that may or may not have a relation with our primarily gaming and game production oriented community. inquiries and complaints can be sent via email to the info‑account of the company managing the website of java‑gaming.org
Powered by MySQL Powered by PHP Powered by SMF 1.1.18 | SMF © 2013, Simple Machines | Managed by Enhanced Four Valid XHTML 1.0! Valid CSS!