Java-Gaming.org Hi !
Featured games (81)
games approved by the League of Dukes
Games in Showcase (513)
Games in Android Showcase (119)
games submitted by our members
Games in WIP (577)
games currently in development
News: Read the Java Gaming Resources, or peek at the official Java tutorials
 
    Home     Help   Search   Login   Register   
Pages: [1]
  ignore  |  Print  
  Understanding -XX:+PrintCompilation  (Read 2138 times)
0 Members and 1 Guest are viewing this topic.
Offline lhkbob

JGO Knight


Medals: 32



« Posted 2012-03-11 19:26:40 »

I was running some micro-benchmarks comparing 3x3 matrix and 4x4 matrix operations with different ways of storing and fetching the data.  I had three variations:
 1. Arrays of Matrix3 or Matrix4 objects (depending on the type being tested)
 2. Matrices packed into double[] data, and read into local Matrix3 or Matrix4 to compute everything
 3. Matrices packed into float[] data

The Matrix3 and Matrix4 all store their components in double fields, e.g. m00 for the 1st row and 1st column.

Obviously the Matrix3 multiplications were faster than Matrix4, but I noticed that as I continued to run test iterations, the Matrix3 class would drastically improve performance after the first iteration, just as if it had been JIT compiled.  The Matrix4 tests did not exhibit this behavior.  I then enabled the flag -XX:+PrintCompilation to see when things were compiled and a couple of things stood out:
 1. The tests claimed that both Matrix3 and Matrix4 multiplication methods were compiled
 2. The compilation happened at the very start of the test cases, before the drastic improvements with Matrix3 and it was never recompiled later on.

Here is a pastebin of the log output from Matrix3: http://pastebin.java-gaming.org/099d10734

This prints out 3 test iterations. Each iteration tests the above 3 variations by performing multiplications over 10,000 matrices, 10,000 times.  So each test iteration does 300,000,000 actual invokations.  The timings are the total time it took to do the 100,000,000 operations for each variation.

As you can see, the first iteration has everything running pretty slowly but at the same performance level.  The start of iteration 2 is slightly faster, but you can see something changes part way through and it speeds up by iteration 3.  By the end performance has improved 10-fold.  Unfortunately, the lines claiming to have compiled Matrix3's methods happen way earlier.

I won't show a paste bin for Matrix4's run, but its output looks very similar except that there are no performance changes for each iteration.  Whatever caused the performance improvements for Matrix3 never kicked in, even though Matrix4's methods were apparently JIT compiled.

-------

So for those of you who thought tl;dr, I'm wondering what is actually causing the JVM to optimize the Matrix3 code faster than the Matrix4 code.  Is it possible that method size is causing a problem.  The multiplication methods are unrolled so Matrix4's version is much larger.  Are there any restrictions to how many arguments a method can take that might affect inlining or optimization.  Theoretically, neither Matrix3 nor Matrix4's multiply methods should be inlined since they're both over the 35 byte inline limit.

Could the delay in performance just be a delay before the JVM started using the compiled method, even though it was compiled right away? Or perhaps System.out and whatever prints the compilations aren't synchronized?

Thanks

Offline sproingie

JGO Kernel


Medals: 202



« Reply #1 - Posted 2012-03-11 19:37:10 »

Try using -XX:+PrintOptoAssembly and see what it generates.

Even if it generates nearly the same assembly, it could be the 3x3 matrix fits better into a cache line than the 4x4 which is nearly twice the size.
Offline lhkbob

JGO Knight


Medals: 32



« Reply #2 - Posted 2012-03-11 19:38:46 »

I should post a quick summary of the performance results to demonstrate my confusion:
IterationStorage typeMatrix3Matrix4
1Matrix[]~1100ms~5300ms
1double[]~800ms~5600ms
1float[]~1200ms~6000ms
2Matrix[]~500ms~4800ms
2double[]~70ms~5200ms
2float[]~70ms~6000ms
3Matrix[]~100ms~5000ms
3double[]~70ms~5300ms
3float[]~70ms~5700ms

As you can see, the Matrix3 operations get optimized quite effectively, but the Matrix4 do not.  The compilations supposedly occur for both Matrix3 and Matrix4, at the very start of iteration 1 using the Matrix[] arrays.

Games published by our own members! Check 'em out!
Legends of Yore - The Casual Retro Roguelike
Offline lhkbob

JGO Knight


Medals: 32



« Reply #3 - Posted 2012-03-11 19:44:12 »

Try using -XX:+PrintOptoAssembly and see what it generates.

Even if it generates nearly the same assembly, it could be the 3x3 matrix fits better into a cache line than the 4x4 which is nearly twice the size.

That requires a debug JVM as I understand it.  ATM I don't have one of those, but I will try and get one to build for Mac (mayhaps I'll get OpenJDK 7 installed then, too).

Offline lhkbob

JGO Knight


Medals: 32



« Reply #4 - Posted 2012-03-11 21:13:30 »

I've written up a standalone test that combines both Matrix3 and Matrix4 into the same benchmark: http://www.java-gaming.org/?action=pastebin&id=31

Writing it gave me some the opportunity to improve upon the benchmark and I discovered the problem.  The line:
1  
   3851   6%      com.ferox.math.Matrix3Test::main @ 205 (375 bytes)

occurs right when I notice the performance boost in my original run.  I think the JVM was able to notice that my inner loops were not doing "anything" and optimized them.  In the standalone test, I accumulate the total results, and the JVM was no longer able to make that optimization in the main() method.  Performance then remained the same for all iterations for both Matrix3 and Matrix4.

So I've solved it, but people can feel free to run the benchmark if they want or not. I'd be curious to see the output if someone could run it with -XX:+PrintOptoAssembly for me.

Pages: [1]
  ignore  |  Print  
 
 
You cannot reply to this message, because it is very, very old.

 

Add your game by posting it in the WIP section,
or publish it in Showcase.

The first screenshot will be displayed as a thumbnail.

Longarmx (49 views)
2014-10-17 03:59:02

Norakomi (38 views)
2014-10-16 15:22:06

Norakomi (31 views)
2014-10-16 15:20:20

lcass (34 views)
2014-10-15 16:18:58

TehJavaDev (65 views)
2014-10-14 00:39:48

TehJavaDev (65 views)
2014-10-14 00:35:47

TehJavaDev (55 views)
2014-10-14 00:32:37

BurntPizza (72 views)
2014-10-11 23:24:42

BurntPizza (43 views)
2014-10-11 23:10:45

BurntPizza (84 views)
2014-10-11 22:30:10
Understanding relations between setOrigin, setScale and setPosition in libGdx
by mbabuskov
2014-10-09 22:35:00

Definite guide to supporting multiple device resolutions on Android (2014)
by mbabuskov
2014-10-02 22:36:02

List of Learning Resources
by Longor1996
2014-08-16 10:40:00

List of Learning Resources
by SilverTiger
2014-08-05 19:33:27

Resources for WIP games
by CogWheelz
2014-08-01 16:20:17

Resources for WIP games
by CogWheelz
2014-08-01 16:19:50

List of Learning Resources
by SilverTiger
2014-07-31 16:29:50

List of Learning Resources
by SilverTiger
2014-07-31 16:26:06
java-gaming.org is not responsible for the content posted by its members, including references to external websites, and other references that may or may not have a relation with our primarily gaming and game production oriented community. inquiries and complaints can be sent via email to the info‑account of the company managing the website of java‑gaming.org
Powered by MySQL Powered by PHP Powered by SMF 1.1.18 | SMF © 2013, Simple Machines | Managed by Enhanced Four Valid XHTML 1.0! Valid CSS!