Java-Gaming.org Hi !
Featured games (83)
games approved by the League of Dukes
Games in Showcase (513)
Games in Android Showcase (120)
games submitted by our members
Games in WIP (577)
games currently in development
News: Read the Java Gaming Resources, or peek at the official Java tutorials
 
    Home     Help   Search   Login   Register   
Pages: [1]
  ignore  |  Print  
  The JIT learns by example, better teach it right!  (Read 2669 times)
0 Members and 1 Guest are viewing this topic.
Online Riven
« League of Dukes »

JGO Overlord


Medals: 817
Projects: 4
Exp: 16 years


Hand over your head.


« Posted 2007-01-09 01:40:50 »

I fed (float) data with various sizes into an algorithm, and measured the performance.

Case 1:
1  
2  
3  
4  
5  
6  
7  
8  
for(int i=0 i<256; i++)
{
   process(new float[64]);
   process(new float[256]);
   process(new float[1024]);
   process(new float[4096]);
   process(new float[16384]);
}


Case 2:
1  
2  
3  
4  
5  
6  
7  
8  
for(int i=0 i<256; i++)
{
   process(new float[16384]);
   process(new float[4096]);
   process(new float[1024]);
   process(new float[256]);
   process(new float[64]);
}


How fast will these perform?

Case 1: all variations run at 2350 elements/sec
Case 2: all variations run at  850 elements/sec


It is significant!
I once had a framedrop of 50% in my (then geometry-transform bottlenecked) game after I changed something trivial that seemed to have nothing to do with it. So it's not so much about feeding the JIT realworld examples, it's trial and error, then feed what the JIT likes best.... and then you only know how to threat that very JIT.


To get the best performance with any JIT, you might launch a few sub-processes, see which version the JIT likes, then use that as a warmup for the JIT you're dealing with in your main-process. This quickly spirals out of control, when you're dealing with more than 3 performance-critical sections in an application.

Or get rid of this JIT magic and go native, as that always performs best!

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline swpalmer

JGO Coder


Exp: 12 years


Where's the Kaboom?


« Reply #1 - Posted 2007-01-09 02:19:41 »

Or get rid of this JIT magic and go native, as that always performs best!

I hate to nit-pick but the above is of course not true as it ignores the advantage of a JIT compiler's ability to adapt to the current runtime conditions.

Anyway, I'm curious about your experiment.  Can you provide more info on the "process" method?   I wonder for example if the adaptive nature of the JIT backfired by compiling the code in a manner that was more appropriate for smaller data sets before it got to the larger ones.

I really want to get a copy of a JVM that supports a fancy flag that spits out the machine code it generates.

Online Riven
« League of Dukes »

JGO Overlord


Medals: 817
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #2 - Posted 2007-01-09 02:24:14 »

I hate to nit-pick but the above is of course not true as it ignores the advantage of a JIT compiler's ability to adapt to the current runtime conditions.
True, but C/C++ always does vector code (pointers + SIMD) roughly factor 2.4 faster (on my machine)

Anyway, I'm curious about your experiment.  Can you provide more info on the "process" method?   I wonder for example if the adaptive nature of the JIT backfired by compiling the code in a manner that was more appropriate for smaller data sets before it got to the larger ones.
The JIT didn't backfire, if the first pass(es) are small arrays, it kept performing better on larger and larger arrays/buffers.

I really want to get a copy of a JVM that supports a fancy flag that spits out the machine code it generates.
I thought there IS a flag like that in Java 6.0, was searching for it... some folks talked about it on this very forum! (don't you just love this community)

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Games published by our own members! Check 'em out!
Legends of Yore - The Casual Retro Roguelike
Online Riven
« League of Dukes »

JGO Overlord


Medals: 817
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #3 - Posted 2007-01-09 02:28:00 »

Aaah... found it:
-XX:+PrintOptoAssembly

Doesn't work.. Roll Eyes

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Online Riven
« League of Dukes »

JGO Overlord


Medals: 817
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #4 - Posted 2007-01-09 03:36:01 »

1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
26  
27  
28  
29  
30  
31  
32  
33  
34  
35  
36  
37  
38  
39  
40  
41  
42  
43  
44  
45  
   public static final void main(String[] args)
   {
      int max = 9;

      int start, stop, step;

      // this controls the order
      boolean smallToLarge = true;

      if (smallToLarge)
      {
         start = 0;
         stop = max;
         step = 1;
         System.out.println("ORDER: SMALL[64] -> LARGE[16k]");
      }
      else
      {
         start = max - 1;
         stop = 0;
         step = -1;
         System.out.println("ORDER: LARGE[16k] -> SMALL[64]");
      }

      for (int i = start; (smallToLarge ? i < stop : i > stop); i += step)
      {
         int size = 1 << (i + 6);

         long elapsed = benchmark(size);

         System.out.println("size=" + size + ", \ttook: " + (elapsed/1000L) + "us \t(" + (elapsed / size) + " ns/element -> "+(size * 1000000000L / elapsed)+" elements/s)");

         for (int k = 0; k < 8; k++)
            System.gc();

         try
         {
            Thread.sleep(2 * 1000);
         }
         catch (Exception exc)
         {
            //
         }
      }
   }


1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
26  
27  
28  
29  
30  
31  
32  
33  
34  
35  
36  
37  
38  
39  
   public static final long benchmark(int size)
   {
      int byteSize = size * 4;

      FloatBuffer bufferA = ByteBuffer.allocateDirect(byteSize).order(ByteOrder.nativeOrder()).asFloatBuffer();
      FloatBuffer bufferB = ByteBuffer.allocateDirect(byteSize).order(ByteOrder.nativeOrder()).asFloatBuffer();
      FloatBuffer bufferC = ByteBuffer.allocateDirect(byteSize).order(ByteOrder.nativeOrder()).asFloatBuffer();

      int runs = 32;
      int loops = 1024;

      int recordedRuns = 0;

      long recordedElapsed = 0L;

      for (int i = 0; i < runs; i++)
      {
         if (i == runs / 2 - 1)
         {
            // reset halfway
            recordedElapsed = 0L;
            recordedRuns = 0;
         }

         long start = System.nanoTime();
         for (int k = 0; k < loops; k++)
         {
            faddN_buffer(bufferA, bufferB, bufferC, size);
            faddN_buffer(bufferB, bufferC, bufferA, size);
            faddN_buffer(bufferC, bufferA, bufferB, size);
         }
         long elapsed = System.nanoTime() - start;
         recordedElapsed += elapsed;

         recordedRuns++;
      }

      return recordedElapsed / recordedRuns;
   }


1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
26  
27  
28  
29  
30  
31  
32  
33  
34  
35  
36  
37  
38  
39  
40  
41  
42  
43  
44  
45  
46  
47  
48  
49  
50  
51  
52  
53  
54  
55  
56  
57  
58  
59  
60  
61  
62  
63  
64  
65  
66  
67  
68  
69  
70  
71  
   /**
    * method that controls the unrollment of the algorithm
    */


   private static final void faddN_buffer(FloatBuffer a, FloatBuffer b, FloatBuffer c, int n)
   {
      int n8 = n >> 3;
      int n4 = (n & 7) >> 2;
      int n1 = n & 3;

      int pA = 0;
      int pB = 0;
      int pC = 0;

      if (n8 != 0)
      {
         pA -= 8;
         pB -= 8;
         pC -= 8;
         for (int i = n8 - 1; i >= 0; i--)
            fadd8_buffer(a, b, c, pA += 8, pB += 8, pC += 8);
      }

      if (n4 != 0)
      {
         pA -= 4;
         pB -= 4;
         pC -= 4;
         for (int i = n4 - 1; i >= 0; i--)
            fadd4_buffer(a, b, c, pA += 4, pB += 4, pC += 4);
      }

      if (n1 != 0)
      {
         pA -= 1;
         pB -= 1;
         pC -= 1;
         for (int i = n1 - 1; i >= 0; i--)
            fadd1_buffer(a, b, c, pA += 1, pB += 1, pC += 1);
      }
   }

   /**
    * methods that manually unroll loops (much faster)
    */


   private static final void fadd8_buffer(FloatBuffer a, FloatBuffer b, FloatBuffer c, int pa, int pb, int pc)
   {
      c.put(pc + 0, a.get(pa + 0) + b.get(pb + 0));
      c.put(pc + 1, a.get(pa + 1) + b.get(pb + 1));
      c.put(pc + 2, a.get(pa + 2) + b.get(pb + 2));
      c.put(pc + 3, a.get(pa + 3) + b.get(pb + 3));

      c.put(pc + 4, a.get(pa + 4) + b.get(pb + 4));
      c.put(pc + 5, a.get(pa + 5) + b.get(pb + 5));
      c.put(pc + 6, a.get(pa + 6) + b.get(pb + 6));
      c.put(pc + 7, a.get(pa + 7) + b.get(pb + 7));
   }

   private static final void fadd4_buffer(FloatBuffer a, FloatBuffer b, FloatBuffer c, int pa, int pb, int pc)
   {
      c.put(pc + 0, a.get(pa + 0) + b.get(pb + 0));
      c.put(pc + 1, a.get(pa + 1) + b.get(pb + 1));
      c.put(pc + 2, a.get(pa + 2) + b.get(pb + 2));
      c.put(pc + 3, a.get(pa + 3) + b.get(pb + 3));
   }

   private static final void fadd1_buffer(FloatBuffer a, FloatBuffer b, FloatBuffer c, int pa, int pb, int pc)
   {
      c.put(pc, a.get(pa) + b.get(pb));
   }





Java 6.0 Server VM
ORDER: SMALL[64] -> LARGE[16k]
size=64,     took: 3850us      (60157 ns/element -> 16622 elements/s)
size=128,    took: 1515us      (11839 ns/element -> 84461 elements/s)
size=256,    took: 2638us      (10305 ns/element -> 97038 elements/s)
size=512,    took: 5657us      (11050 ns/element -> 90494 elements/s)
size=1024,   took: 12177us     (11892 ns/element -> 84087 elements/s)
size=2048,   took: 24268us     (11849 ns/element -> 84390 elements/s)
size=4096,   took: 48544us     (11851 ns/element -> 84376 elements/s)
size=8192,   took: 97251us     (11871 ns/element -> 84234 elements/s)
size=16384,  took: 200368us    (12229 ns/element -> 81769 elements/s)

ORDER: LARGE[16k] -> SMALL[64]
size=16384,    took: 678194us    (41393 ns/element -> 24158 elements/s)
size=8192,     took: 310363us    (37886 ns/element -> 26394 elements/s)
size=4096,     took: 149891us    (36594 ns/element -> 27326 elements/s)
size=2048,     took: 74897us     (36570 ns/element -> 27344 elements/s)
size=1024,     took: 37610us     (36728 ns/element -> 27226 elements/s)
size=512,      took: 18856us     (36829 ns/element -> 27152 elements/s)
size=256,      took: 9516us      (37174 ns/element -> 26900 elements/s)
size=128,      took: 4693us      (36668 ns/element -> 27271 elements/s)
size=64,       took: 2405us      (37581 ns/element -> 26608 elements/s)

Note: performance difference: factor 3+



Java 6.0 Client VM
ORDER: SMALL[64] -> LARGE[16k] (all 4.8k to 5.2k elements/s)
ORDER: LARGE[16k] -> SMALL[64] (all 5.3k to 5.4k elements/s)

Note: slower than Client 5.0 (400%) and Server 5.0 (700%) !! Shocked (see below)




Java 5.0 Server VM
ORDER: SMALL[64] -> LARGE[16k] (all 33k to 37k elements/s)
ORDER: LARGE[16k] -> SMALL[64] (all 33k to 37k elements/s)

Java 5.0 Client VM
ORDER: SMALL[64] -> LARGE[16k] (all 19k elements/s)
ORDER: LARGE[16k] -> SMALL[64] (all 19k elements/s)

Note: server ~80% faster than client

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline erikd

JGO Ninja


Medals: 16
Projects: 4
Exp: 14 years


Maximumisness


« Reply #5 - Posted 2007-01-09 09:43:33 »

So if I understand correctly you've hit a Java6 VM bug, and this is not an example of a general flaw of JIT technology.
Looks like a good test case for your bug report   Smiley

Online Riven
« League of Dukes »

JGO Overlord


Medals: 817
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #6 - Posted 2007-01-09 13:34:27 »

Well, I've had similair problems in other JITs, where the initial *order of algorithm* made a significant difference in performance.

And ofcourse all the code that has exactly the same functionality, with marginally reordered bytecode, that could take or gain you 50% performance.


When I need fast code these days, I write a number of permutations of the algorithm (tiny changes) and pick the fastest.


For example, there is a tiny bug in the benchmark code.

1  
2  
3  
4  
5  
         pA -= 8;
         pB -= 8;
         pC -= 8;
         for (int i = n8 - 1; i >= 0; i--)
            fadd8_buffer(a, b, c, pA += 8, pB += 8, pC += 8);


Should have been:

1  
2  
3  
4  
5  
6  
7  
8  
         pA -= 8;
         pB -= 8;
         pC -= 8;
         for (int i = n8 - 1; i >= 0; i--)
            fadd8_buffer(a, b, c, pA += 8, pB += 8, pC += 8);
         pA += 8;
         pB += 8;
         pC += 8;



While this reduces the amount of operations with 8, performance was 95k/s, and after the fix it's 50k/s.

This is exactly what I mean with tiny changes in bytecode have major effect Undecided

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Online Riven
« League of Dukes »

JGO Overlord


Medals: 817
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #7 - Posted 2007-01-09 15:23:27 »

Implementing the same algorithm with different data-structures, also can impact performance pretty seriously

1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
      // tight loop
      for (int k = 0; k < n; k++)
         c.put(k, a.get(k) + b.get(k));

      // unrolled(4)
      for (int k = 0; k < n;)
      {
         c.put(k, a.get(k) + b.get(k));         k++;
         c.put(k, a.get(k) + b.get(k));         k++;
         c.put(k, a.get(k) + b.get(k));         k++;
         c.put(k, a.get(k) + b.get(k));         k++;
      }

      // unrolled(8)
      for (int k = 0; k < n;)
      {
         c.put(k, a.get(k) + b.get(k));         k++;
         c.put(k, a.get(k) + b.get(k));         k++;
         c.put(k, a.get(k) + b.get(k));         k++;
         c.put(k, a.get(k) + b.get(k));         k++;
         c.put(k, a.get(k) + b.get(k));         k++;
         c.put(k, a.get(k) + b.get(k));         k++;
         c.put(k, a.get(k) + b.get(k));         k++;
         c.put(k, a.get(k) + b.get(k));         k++;
      }



C [ i ] =  A [ i ] + B [ i ]
buffer:    tight loop 118k/s      unrolled(4x) 155k/s ( +31%)     unrolled(8x) 165k/s ( +40%)
array:     tight loop  89k/s      unrolled(4x) 184k/s (+106%)     unrolled(8x) 229k/s (+157%)


See the extreme difference in performance between buffer/array! (ignoring pointer for the moment)
For other algorithms (more complex - lots of random access R/W), pointer-structures almost always win.


Let's change the algorithm to an exponential one:

1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
26  
27  
28  
29  
30  
31  
32  
33  
34  
35  
36  
37  
38  
39  
40  
41  
42  
43  
44  
45  
46  
47  
48  
49  
50  
      // Case 1
      for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
               c.put(k, a.get(i) + b.get(j));

      // Case 2a
      for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++)
            for (int k = 0; k < n;)
            {
               c.put(k, a.get(i) + b.get(j)); k++;
               c.put(k, a.get(i) + b.get(j)); k++;
               c.put(k, a.get(i) + b.get(j)); k++;
               c.put(k, a.get(i) + b.get(j)); k++;
            }


      // Case 2b
      for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++)
            for (int k = 0; k < n;)
            {
               c.put(k++, a.get(i) + b.get(j));
               c.put(k++, a.get(i) + b.get(j));
               c.put(k++, a.get(i) + b.get(j));
               c.put(k++, a.get(i) + b.get(j));
            }

      // Case 2c
      for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k += 4)
            {
               c.put(k + 0, a.get(i) + b.get(j));
               c.put(k + 1, a.get(i) + b.get(j));
               c.put(k + 2, a.get(i) + b.get(j));
               c.put(k + 3, a.get(i) + b.get(j));
            }

      // Case 2d
      for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k += 4)
            {
               c.put(k, a.get(i) + b.get(j));
               c.put(k + 1, a.get(i) + b.get(j));
               c.put(k + 2, a.get(i) + b.get(j));
               c.put(k + 3, a.get(i) + b.get(j));
            }


1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
26  
27  
28  
29  
30  
31  
32  
33  
34  
35  
36  
37  
38  
39  
40  
41  
42  
43  
      // Case 2a
      for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++)
            for (int k = 0; k < n;)
            {
               c[k] = a[i] + b[j]; k++;
               c[k] = a[i] + b[j]; k++;
               c[k] = a[i] + b[j]; k++;
               c[k] = a[i] + b[j]; k++;
            }

      // Case 2b
      for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++)
            for (int k = 0; k < n;)
            {
               c[k++] = a[i] + b[j];
               c[k++] = a[i] + b[j];
               c[k++] = a[i] + b[j];
               c[k++] = a[i] + b[j];
            }

      // Case 2c
      for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k += 4)
            {
               c[k + 0] = a[i] + b[j];
               c[k + 1] = a[i] + b[j];
               c[k + 2] = a[i] + b[j];
               c[k + 3] = a[i] + b[j];
            }

      // Case 2d
      for (int i = 0; i < n; i++)
         for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k += 4)
            {
               c[k] = a[i] + b[j];
               c[k + 1] = a[i] + b[j];
               c[k + 2] = a[i] + b[j];
               c[k + 3] = a[i] + b[j];
            }



FloatBuffer
Case  1: 1801k/s
Case 2a: 2630k/s
Case 2b: 2872k/s
Case 2c: 3218k/s Kiss
Case 2d: 2623k/s Cry  22% performance-loss by removing "+ 0"

Float Array
Case  1: 3464k/s
Case 2a: 4116k/s
Case 2b: 4225k/s
Case 2c: 4226k/s
Case 2d: 4228k/s Kiss

Float Pointer
Case 1: 3336k/s
Case 2: 3571k/s




Int Buffer
Case 2a: 2593k/s
Case 2b: 2423k/s
Case 2c: 2562k/s
Case 2d: 2564k/s

Int Array
Case  1: 1784k/s
Case 2a: 1887k/s
Case 2b: 2030k/s
Case 2c: 2064k/s
Case 2d: 2064k/s

Int Pointer
Case 1: 3272k/s
Case 2: 3930k/s Kiss

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline javazoid

Junior Duke




Where's Flender?


« Reply #8 - Posted 2007-01-11 08:27:09 »

It seems to me somewhat curious that no "officials" posted a comment on all the nice stuff above.
It would be very nice to hear from them and get some good indications and hints.

Cheers,

Mik

Offline erikd

JGO Ninja


Medals: 16
Projects: 4
Exp: 14 years


Maximumisness


« Reply #9 - Posted 2007-01-11 12:03:20 »

I'm not sure what you mean by Int Pointer and Float Pointer structures, could you explain? Is it what you just demonstrated in your other thread?

Games published by our own members! Check 'em out!
Legends of Yore - The Casual Retro Roguelike
Online Riven
« League of Dukes »

JGO Overlord


Medals: 817
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #10 - Posted 2007-01-11 13:41:44 »

It's the code using sun.misc.Unsafe, not using 'mapped objects'

unsafe.putInt(pntr, value);
value = unsafe.getInt(pntr);

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline swpalmer

JGO Coder


Exp: 12 years


Where's the Kaboom?


« Reply #11 - Posted 2007-01-12 00:31:45 »

Aaah... found it:
-XX:+PrintOptoAssembly

Doesn't work.. Roll Eyes

I thought there was one too, but my quick search didn't find it.  If I recall correctly it only works in the debug builds of the JVM.. you could probably get a debug version of Mustang, er.. Java 6,  from the project site at java.net.


BTW, have you filed a bug report?  I think Sun takes performance regressions very seriously, and it appears the Java 6 client VM certainly has a problem.

Pages: [1]
  ignore  |  Print  
 
 
You cannot reply to this message, because it is very, very old.

 

Add your game by posting it in the WIP section,
or publish it in Showcase.

The first screenshot will be displayed as a thumbnail.

Longarmx (52 views)
2014-10-17 03:59:02

Norakomi (45 views)
2014-10-16 15:22:06

Norakomi (34 views)
2014-10-16 15:20:20

lcass (38 views)
2014-10-15 16:18:58

TehJavaDev (68 views)
2014-10-14 00:39:48

TehJavaDev (68 views)
2014-10-14 00:35:47

TehJavaDev (60 views)
2014-10-14 00:32:37

BurntPizza (73 views)
2014-10-11 23:24:42

BurntPizza (45 views)
2014-10-11 23:10:45

BurntPizza (87 views)
2014-10-11 22:30:10
Understanding relations between setOrigin, setScale and setPosition in libGdx
by mbabuskov
2014-10-09 22:35:00

Definite guide to supporting multiple device resolutions on Android (2014)
by mbabuskov
2014-10-02 22:36:02

List of Learning Resources
by Longor1996
2014-08-16 10:40:00

List of Learning Resources
by SilverTiger
2014-08-05 19:33:27

Resources for WIP games
by CogWheelz
2014-08-01 16:20:17

Resources for WIP games
by CogWheelz
2014-08-01 16:19:50

List of Learning Resources
by SilverTiger
2014-07-31 16:29:50

List of Learning Resources
by SilverTiger
2014-07-31 16:26:06
java-gaming.org is not responsible for the content posted by its members, including references to external websites, and other references that may or may not have a relation with our primarily gaming and game production oriented community. inquiries and complaints can be sent via email to the info‑account of the company managing the website of java‑gaming.org
Powered by MySQL Powered by PHP Powered by SMF 1.1.18 | SMF © 2013, Simple Machines | Managed by Enhanced Four Valid XHTML 1.0! Valid CSS!