  Microbenchmark - new vs. reuse  (Read 4235 times)
Offline walter_bruce

Junior Newbie

Performance matters.

« Posted 2003-08-20 17:53:49 »

I've created a little microbenchmark to test the relative costs of a few different ways of getting access to a temporary object inside a short method, and I thought the results might be of interest to other people.

My test computes the cross product of two 3D vectors and returns the result in the first vector.  This is a small but real-world-useful operation, and it requires some temporary space for the cross product.  I coded up multiple versions that used different techniques to get the needed temporary space as follows:
    [1] Local var.  This method just used local double variables for its temporary space.  This is the only version that does not use an object (a Vector3d) for its temporary space.
    [2] New.  Allocate a new Vector3d each time the method is called.  
    [3] ThreadLocal.  Get a temporary object using a ThreadLocal object.
    [4]  Field.  Get temporary object stored in a private field.
    [5]  Field sync.  Synchronized method which gets its temporary from a private field as in [4].
    [6] TempStack.  Get the temporary object from a TempStack, which is essentially an object pool where objects must be returned in the reverse order in which they were obtained.  The TempStack itself is obtained using a ThreadLocal.
    [7] TempStack param.  Use a TempStack passed in an explicit extra parameter.  Ugly in that it requires an extra parameter but can be relatively fast.
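To make the variants concrete, here is a minimal sketch of techniques (1), (2), and (3). The class and method names are my own, not the benchmark's actual source, and it uses the modern ThreadLocal.withInitial helper for brevity:

```java
// Minimal sketch of techniques (1)-(3) for computing a = a x b in place.
public class Vector3d {
    public double x, y, z;

    // (1) Local var: temporaries are plain doubles on the stack.
    public void crossLocal(Vector3d v) {
        double tx = y * v.z - z * v.y;
        double ty = z * v.x - x * v.z;
        double tz = x * v.y - y * v.x;
        x = tx; y = ty; z = tz;
    }

    // (2) New: allocate a fresh Vector3d for the temporary on every call.
    public void crossNew(Vector3d v) {
        Vector3d t = new Vector3d();
        t.x = y * v.z - z * v.y;
        t.y = z * v.x - x * v.z;
        t.z = x * v.y - y * v.x;
        x = t.x; y = t.y; z = t.z;
    }

    // (3) ThreadLocal: one temporary per thread; thread-safe, but not
    // usable re-entrantly or recursively.
    private static final ThreadLocal<Vector3d> TEMP =
            ThreadLocal.withInitial(Vector3d::new);

    public void crossThreadLocal(Vector3d v) {
        Vector3d t = TEMP.get();
        t.x = y * v.z - z * v.y;
        t.y = z * v.x - x * v.z;
        t.z = x * v.y - y * v.x;
        x = t.x; y = t.y; z = t.z;
    }
}
```

All three compute the same result; they differ only in where the three intermediate doubles live.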

Method 4 is not thread-safe, and methods 3, 4, and 5 cannot be used in recursive methods.  Method 2 is the cleanest of the object-based methods, but how does its performance compare to the others?  Here are some timings from my 1.7GHz Pentium 4 machine:

Test                  JVM 1.4.2 -client    JVM 1.4.2 -server
(1) Local var               0.076                0.015
(2) New                     0.144                0.120
(3) ThreadLocal             0.100                0.039
(4) Field                   0.048                0.040
(5) Field sync              0.205                0.216
(6) TempStack               0.127                0.047
(7) TempStack param         0.069                0.016

Times are in microseconds per method call, and you can get the complete source code here.

A few things to note from the results:
  • As others have noted, under 1.4.2 -server is much faster for floating point code than -client.
  • The difference between (1) and (2) gives the approximate cost of allocating and garbage collecting the temporary Vector3d object.  Allocation increases the cost of the cross product method by a factor between 2 (client) and 8 (server), so it is still a very significant cost in this case.
  • The synchronized method is the most expensive in all cases, so it is still best to avoid synchronization when possible.
  • We use a technique similar to (7) in performance-critical sections of our own code, and I would happily change the code to something cleaner like (2) if the cost were small.  However, the cleaner object-based techniques are still significantly slower.
  • Although I expected (1) to be the fastest, under -client it turns out to be actually slower than (4) and (7) for reasons I don't understand.
  • The field method (4) seems to be relatively slow under -server, for reasons I also don't understand.

Caveat: This is a microbenchmark and performance may be different in real applications. I think garbage collection is a wonderful thing, and I do not advocate abandoning it for object pools except when really necessary for performance reasons (preferably after profiling your code first).  Comments and critiques are welcome.
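The TempStack idea from techniques (6) and (7) can be sketched roughly as follows; the class and method names here are my own, and a real implementation would presumably also verify the reverse-order (LIFO) release constraint:

```java
import java.util.ArrayDeque;

// Sketch of the TempStack idea: a per-thread pool of temporaries where
// objects must be released in the reverse order of acquisition.
public class TempStack {
    private final ArrayDeque<double[]> free = new ArrayDeque<>();

    // One TempStack per thread, so no synchronization is needed.
    private static final ThreadLocal<TempStack> LOCAL =
            ThreadLocal.withInitial(TempStack::new);

    public static TempStack get() { return LOCAL.get(); }

    // Borrow a temporary 3-vector, allocating only if the pool is empty.
    public double[] acquire() {
        double[] v = free.pollFirst();
        return (v != null) ? v : new double[3];
    }

    // Return the most recently acquired temporary to the pool.
    public void release(double[] v) {
        free.addFirst(v);
    }
}
```

A caller would acquire() a temporary, compute into it, and release() it before returning; passing the TempStack as an explicit extra parameter (technique 7) avoids the ThreadLocal lookup on every call, which matches the timing gap between (6) and (7) above.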
Offline swpalmer

JGO Coder

Exp: 12 years

Where's the Kaboom?

« Reply #1 - Posted 2003-08-20 22:03:29 »

Mac OS X 10.2.6  Java 1.4.1-client  (1GHz)
Cross product local variable speed testing
1) local double var....avg 0.082891665 usecs
2) new.................avg 0.14235833 usecs
3) ThreadLocal.........avg 0.14793333 usecs
4) Instance field......avg 0.06655 usecs
5) Instance field sync.avg 0.12054167 usecs
6) TempStack...........avg 0.16006666 usecs
7) TempStack param.....avg 0.09599167 usecs

Note that #5 is cheaper than #2, #3, and #6, which is significantly different from your results with the 1.4.2 Intel VM.

Mac OS X 10.2.6  Java 1.4.1-server  (1GHz)
Cross product local variable speed testing
1) local double var....avg 0.08573333 usecs
2) new.................avg 0.1584 usecs
3) ThreadLocal.........avg 0.153775 usecs
4) Instance field......avg 0.06821667 usecs
5) Instance field sync.avg 0.124858335 usecs
6) TempStack...........avg 0.168125 usecs
7) TempStack param.....avg 0.10123333 usecs

Offline AndersDahlberg

Junior Devvie

« Reply #2 - Posted 2003-08-21 10:43:52 »

redhat 9, jre 1.4.2
local double var     avg 0.053925 usecs   total 6.471 secs
new                  avg 0.20695834 usecs   total 24.835 secs
ThreadLocal          avg 0.11985833 usecs   total 14.383 secs
Instance field       avg 0.04611667 usecs   total 5.534 secs
Instance field sync  avg 0.046491668 usecs   total 5.579 secs
TempStack            avg 0.137975 usecs   total 16.557 secs
TempStack param      avg 0.0551 usecs   total 6.612 secs

local double var     avg 0.036875002 usecs   total 4.425 secs
new                  avg 0.17416666 usecs   total 20.9 secs
ThreadLocal          avg 0.082575 usecs   total 9.909 secs
Instance field       avg 0.043675 usecs   total 5.241 secs
Instance field sync  avg 0.056941666 usecs   total 6.833 secs
TempStack            avg 0.09243333 usecs   total 11.092 secs
TempStack param      avg 0.0455 usecs   total 5.46 secs

EDIT: added a server run + modified client (as the previous one was taken while doing a lot of other stuff... idea, xmms, xine etc...)
Offline walter_bruce

Junior Newbie

Performance matters.

« Reply #3 - Posted 2003-08-21 12:45:36 »

Thanks for the additional data points.  A few observations:
  • All three client JVMs seem to preserve the oddity that using a field (4) is cheaper than using local variables (1).  I still don't understand why this is the case.
  • The relative cost of a synchronized method (5) seems to be lower on MacOSX and much lower under Redhat as compared to my Windows results.
  • The relative cost of new and garbage collection seems slightly lower under MacOSX but significantly higher under Redhat.  It is still large enough in all cases to be a potential bottleneck in truly performance-critical code.
  • Under MacOSX, -client and -server are not significantly different, which is not surprising, since my understanding is that -server is ignored under the current MacOSX JVM.
Offline AndersDahlberg

Junior Devvie

« Reply #4 - Posted 2003-08-21 13:13:04 »

If you're interested, I could run the test on the 2.6.0-test3 kernel too (the previous results come from 2.4).

Offline swpalmer

JGO Coder

Exp: 12 years

Where's the Kaboom?

« Reply #5 - Posted 2003-08-21 13:29:40 »

Yes, there is only a single shared library for the hotspot VM on Mac OS X.  I didn't realise it at first, but the server and client shared library files are both there as aliases to a single 'hotspot' library.

I guess the slight differences are caused by different parameter values passed to the same VM (e.g. compile thresholds, size of the young generation in the heap, etc.).

I too have no clue why a field is faster than a stack variable - weird.  If only there were a way to disassemble the native code produced by HotSpot.

Offline walter_bruce

Junior Newbie

Performance matters.

« Reply #6 - Posted 2003-08-22 18:09:33 »

I ran some more tests to see why the cost of synchronization seemed to vary so much, and it seems to depend on whether you are running on a single-processor or dual-processor machine.  My previous tests were run on a dual processor, which I didn't mention because it didn't seem relevant, since the test is entirely single-threaded (i.e. the other processor just sits idle).  However, I've gone back and redone my timings (with fewer other applications open) on single- and dual-processor machines that are otherwise similar, and here are the results:

JVM 1.4.2, 1.7GHz P4   client single   client dual   server single   server dual
(1) Local var               0.077          0.076          0.012          0.012
(2) New                     0.070          0.141          0.056          0.124
(3) ThreadLocal             0.102          0.100          0.043          0.039
(4) Field                   0.043          0.042          0.045          0.043
(5) Field sync              0.057          0.231          0.055          0.178
(6) TempStack               0.121          0.128          0.045          0.047
(7) TempStack param         0.053          0.072          0.016          0.016

  • Most results are similar, except that the cost of new (2) and synchronization (5) is much higher on a dual-processor machine.
  • Adding synchronized to a method is virtually free on a single processor (assuming no contention), but fairly expensive on a dual processor.
  • Using new (2) on a single-processor machine under the client JVM seems to be reasonably fast.  Only the field (4) and field sync (5) methods are faster, and not by that much.  However, on a dual processor or when using -server, there are other techniques that are much faster than using new.
  • TempStack param (7) also seems to slow down somewhat on a dual processor for reasons I don't understand, but only under -client, not -server.
  • Would we see the same slowdowns on a single processor machine with HyperThreading enabled?  (It was not enabled in any of my tests.)  I'll try to test this if I can find a suitable machine.
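For reference, the field (4) and field sync (5) variants being measured differ only in the synchronized keyword; a minimal sketch (names are mine, not the benchmark's source):

```java
// Sketch of techniques (4) and (5): the temporary lives in an instance
// field, so (4) is not thread-safe; (5) does the same work inside a
// synchronized method, whose monitor cost the timings above show is cheap
// uncontended on one CPU but expensive on two.
public class CrossField {
    private final double[] temp = new double[3];

    // (4) Field: reuse the instance field with no locking.
    public void crossField(double[] a, double[] b) {
        temp[0] = a[1] * b[2] - a[2] * b[1];
        temp[1] = a[2] * b[0] - a[0] * b[2];
        temp[2] = a[0] * b[1] - a[1] * b[0];
        System.arraycopy(temp, 0, a, 0, 3);
    }

    // (5) Field sync: identical work, plus acquiring this object's monitor.
    public synchronized void crossFieldSync(double[] a, double[] b) {
        crossField(a, b);
    }
}
```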

I wonder if the JVM actually generates different code on a single vs. dual processor machine, or if there is something else going on here (cache effects? context switching?).
Offline princec

« JGO Spiffy Duke »

Medals: 869
Projects: 3
Exp: 16 years

Eh? Who? What? ... Me?

« Reply #7 - Posted 2003-08-22 18:27:39 »

The PPC architecture uses register windowing and suchlike from its RISC beginnings, doesn't it?  It's pretty poorly adapted to stack-based architectures like the JVM, whereas the x86, with its paucity of general-purpose registers, is much better at stack ops.  So let's guess that the fields get mirrored into registers on PPC.

Cas Smiley

Offline swpalmer

JGO Coder

Exp: 12 years

Where's the Kaboom?

« Reply #8 - Posted 2003-08-22 19:02:06 »

Yes, dual processors make a big difference for synchronisation.  On a single CPU, raising the IRQ level is enough to prevent a context switch and therefore gain exclusive access for a moment.  With dual CPUs, fancier mechanisms must be used.  On Windows, the kernel will use spinlocks in dual-processor mode, operations that are no-ops with the single-processor kernel.

I don't know if Mac OS X has the same distinction for synchronisation operations.  I get the feeling that dual-processor machines are much more popular in the world of Macs than in the world of Windows.

Cas, I'm not sure why the PPC would be any worse at stack ops; the compiler can choose any register to be the stack pointer, and it will work pretty much the same as an Intel stack.  I believe the available addressing modes will mimic push/pop without the need for any extra instructions or longer execution times.  The proper set of general-purpose registers is, in most cases, a win.  The main problem, until recently with IBM's latest PPC chips, has been the lagging clock speeds of the PPC CPUs.  But this discussion is for another thread, if it is worth pursuing at all Smiley

So who has a theory on the field versus local observations?

Offline NVaidya

Junior Devvie

Java games rock!

« Reply #9 - Posted 2003-08-24 16:54:25 »

> As others have noted, under 1.4.2 -server is much faster for floating point code than -client

For something as fundamental as floating-point performance, I would just like to know the reason behind this discrepancy between the client and server options.

As a rough benchmark, I ran some cases with my Java3D particle tracking algorithm, which involves fully double-precision calculations, newing of dynamic primitive arrays and objects of that type only, no synchronization anywhere, extensive polymorphic method calls (since the cells are different kinds of polyhedra), no accounting for gc times, and no newing of objects within loops.

The times taken for creating 1000 traces over repeated invocations without any pauses are (in secs):

client:   30.98, 31.20, 32.84, 20.98, 21.48, 20.82
server: 25.02, 10.32, 11.15, 10.65, 10.25, 10.27

Looks like after the initial warm-up period, the server is roughly twice as fast.

What would it take to get Sun to spruce up the client so it would number crunch as fast as the server?

Edit: Forgot to ask if anyone has compared Java vs. C++ FP performance with the JVM in server mode?  That may be very interesting... I have done some comparisons here, but not exactly apples-to-apples ones.

Gravity Sucks !
Offline swpalmer

JGO Coder

Exp: 12 years

Where's the Kaboom?

« Reply #10 - Posted 2003-08-25 02:45:35 »

I suspect the reason is that the server VM is allowed to spend more time compiling bytecode to machine code, and the algorithm required to generate more optimized floating point instructions (e.g. the SSE instructions that are used on Intel by the server VM) takes too much processing time in the compiler.  For a client VM, the lag caused by a runtime compilation pause would be considered unacceptable.  That's my guess anyway.

BTW, I did a test 10 months ago with MS Visual C++ 6.0 doing a simple conversion from RGB colour space to YUV.  The Java version ran faster than the C++ .exe.  The reason, I suspect, was the very poor performance of the Microsoft compiler for floating point to integer conversions.  Apparently it sets the floating point rounding mode twice every time a conversion to int is required (once to set the mode to C-style rounding, once to set it back to natural round-to-nearest).  Intel's C++ compiler at the time was considered vastly superior.  The GNU C++ compiler on Intel, at least prior to version 3, also produced VERY poor code - so bad that I know of one project that abandoned the idea of a Linux port because they didn't want to release something with such poor performance and possibly ruin the reputation of the company.

Offline walter_bruce

Junior Newbie

Performance matters.

« Reply #11 - Posted 2003-08-25 19:44:40 »

--- Field (4) faster than local variables (1) under -client
I've profiled the code using VTune, which has the side benefit of allowing one to view the assembly code produced by the HotSpot compiler.  I've posted the resulting assembly code for the local variable and field routines here (sorry for the strange formatting).
I'm not an x86 assembly expert, but perhaps there is an expert out there who can analyze the differences.  One thing I noticed is that the local variable code computes the results using fp registers and then copies them using int registers, while the field code uses fp registers throughout.

--- Field (4) much slower than local variables (1) under -server
This turns out to be an inlining effect.  HotSpot inlines method (1) by default but not method (4).  If I disable all inlining (using the -XX:MaxInlineSize=1 -XX:FreqInlineSize=1 flags), the local var method slows down to 0.038 us (or just a hair faster than field).  However, I could not find any parameter setting that would convince HotSpot to inline the field method (4) the way it inlines method (1) by default.

Incidentally, I don't think it's actually any more difficult to generate the SSE/SSE2 instructions instead of x87 for the floating point code (in fact the SSE/SSE2 code is simpler and probably easier to generate).  HotSpot -server does not use the SIMD parts of SSE/SSE2, just the scalar instructions, as shown in the assembly code linked below.  I think SSE/SSE2 fp code is a feature that is likely to migrate down into the client JVM in the next version, especially if enough people request it.
Offline NVaidya

Junior Devvie

Java games rock!

« Reply #12 - Posted 2003-08-26 12:08:24 »

Much appreciate the info swpalmer and walter_bruce.

Speedups of 2-8x are quite incredible - imagine all the hoopla people go through with itty-bitty micro-optimizations.

Anyone interested in pursuing this further and taking it to Sun for the client option?  I'd love that.


Edit: OK!  The upper bound should actually read 5 and not 8, based on the microbenchmark.

Gravity Sucks !