swpalmer
|
 |
«
Posted
2004-01-17 04:19:02 » |
|
I'm trying to understand the generational collector and I don't seem to have it quite right.
I think in general it works like this: when Eden is filled, GC runs on it. Any objects in Eden that are still alive with a generation count that exceeds some threshold are promoted to the next generation and the Eden space is compacted; those objects that remain in Eden have their generation count incremented. The new object is created in the available space that is reclaimed. If the new object still doesn't fit in Eden it goes straight to the older generation, which usually has a larger max size.
Is that correct? If so...
What happens when you allocate several large objects in a row, such that they are too large to fit in Eden after it is collected but too small for the VM to know to place them in the larger heap right away? Does the 2nd large object and those after it cause GCs in Eden that basically do redundant collections that don't reclaim any/enough new Eden space yet still bump the generation count on the young objects so that they are prematurely promoted?
Or after an Eden GC does the VM track the amount of KNOWN free space in Eden? Hmm, no, that isn't possible: everything left in Eden could be garbage by the time the next allocation happens, so there just may be enough space, and it will have to try to collect.
I'm trying to learn how to best tune both the GC parameters and object creation patterns.
E.g. I have an app that generates lots of short-lived, large objects (images from motion-JPEG video frames). I had some issues with GC pauses. These mostly went away when I tried -XX:+UseConcMarkSweepGC but the video is still not 100% smooth. I wonder how I might tune the generation sizes to optimize this. I don't use JMF because last time I tried it, it didn't work well enough, and it is fairly complex.
|
|
|
|
gregorypierce
|
 |
«
Reply #1 - Posted
2004-01-17 15:16:25 » |
|
You're thinking along the lines of where I've been headed. I keep wondering to myself - if I don't have large dynamic memory needs... why not just create one big ass eden or eden + generation one where I know everything is going to fit, thus meaning that the garbage collector would never need to be invoked?
|
http://www.gregorypierce.com
She builds, she builds oh man When she links, she links I go crazy Cause she looks like good code but she's really a hack I think I'll run upstairs and grab a snack!
|
|
|
Jeff
|
 |
«
Reply #2 - Posted
2004-01-17 20:20:30 » |
|
Okay,
First off, there is no eden "compaction".
It works more like this.
Objects are allocated until the eden is filled. When it is filled, some very tricky data structures that track use of eden objects are scanned. Any objects that are still in use at that point in time are moved to the next generation. The eden is then wiped clean.
The next generation is one of two things depending on whether or not you have incremental collection turned on.
If the incremental collector is on then the next generation is what are called the "trains". In simple terms they are a set of linked lists that "bubble up" objects through them in progression during partial collects.
Finally, if they get to the end of the trains without going unreferenced they move to the final generation.
Each generation's code/algorithm is "tuned" to be better for the kind of object it's likely to encounter. The eden is easy to collect but costs you a bit more in space and processor for live objects. In contrast, the final generation is the cheapest for keeping things alive but the most expensive to collect.
There really IS a reason for many generations.
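The eden behaviour described in this post can be sketched as a toy model. This is only an illustration of the description above, not how HotSpot is implemented; the class `EdenSketch`, its fields, and the rule that an object larger than the whole eden goes straight to the next generation are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of "fill eden, move the survivors, wipe eden clean". Hypothetical names throughout. */
final class EdenSketch {
    /** A pretend object: a size plus a flag saying whether anything still references it. */
    static final class Obj {
        final int size;
        boolean live = true;
        Obj(int size) { this.size = size; }
    }

    final int edenCapacity;
    int edenUsed = 0;
    int minorCollections = 0;
    final List<Obj> eden = new ArrayList<>();
    final List<Obj> nextGeneration = new ArrayList<>();

    EdenSketch(int edenCapacity) { this.edenCapacity = edenCapacity; }

    Obj allocate(int size) {
        Obj o = new Obj(size);
        if (size > edenCapacity) {
            // Assumption for this sketch: an object that can never fit in eden skips it entirely.
            nextGeneration.add(o);
            return o;
        }
        if (edenUsed + size > edenCapacity) {
            collectEden(); // eden is full: collect before placing the new object
        }
        eden.add(o);
        edenUsed += size;
        return o;
    }

    /** Everything still referenced moves on; everything else simply disappears with the wipe. */
    void collectEden() {
        minorCollections++;
        for (Obj o : eden) {
            if (o.live) {
                nextGeneration.add(o);
            }
        }
        eden.clear();
        edenUsed = 0;
    }
}
```

Note that in this model the cost of a collection depends only on how many objects are still live, not on how much garbage eden held, which is the property the post is pointing at.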
|
|
|
|
|
|
gregorypierce
|
 |
«
Reply #3 - Posted
2004-01-18 00:36:31 » |
|
Right, but the question is - what happens if eden is never full. What if I increase the size of eden to the point where it never fills up.
a) can I do that
b) what are the implications of doing that
|
|
|
|
Jeff
|
 |
«
Reply #4 - Posted
2004-01-18 00:53:58 » |
|
Right, but the question is - what happens if eden is never full. What if I increase the size of eden to the point where it never fills up.
a) can I do that
Well, in order to do that you would have to have an app that had static allocation needs (one that did not allocate while running, only during set-up). Which means that all your objects are really long lived. Which means you DON'T want them in the eden. You are paying CPU and memory for tracking those. In fact you want them to hit the old generation as fast as possible, which means you want a SMALL eden, not a large one.
b) what are the implications of doing that
See above.
|
|
|
|
swpalmer
|
 |
«
Reply #5 - Posted
2004-01-18 02:08:21 » |
|
Jeff wrote: Objects are allocated until the eden is filled. When it is filled, some very tricky data structures that track use of eden objects are scanned. Any objects that are still in use at that point in time are moved to the next generation. The eden is then wiped clean.
I see an obvious flaw in this, well not a flaw really, but in some cases an undesirable characteristic. If eden is wiped clean when it is filled it means that the most recently allocated objects are promoted to the next generation (i.e. non-eden) prematurely. By that I mean after a VERY short life relative to the objects that were put in eden shortly after it was last "wiped clean".
My concern with this is that in my case I have relatively big objects (video frames) and that for animation smoothness I will have one or two of these buffered that are about to be displayed. As soon as they are painted over (approx 1/30 sec apart) they are garbage. So every time one image is allocated that doesn't fit in Eden, one or two images would be promoted to a generation in which they don't really belong, thus generating short-lived objects in the wrong generation.
Am I making sense? The GC algorithms could be much more complicated and handle this case. Can you tell me if they do?
|
|
|
|
princec
|
 |
«
Reply #6 - Posted
2004-01-18 11:30:47 » |
|
The lifetime of objects is not actually measured in "time" but in GC generations. It doesn't matter how long the object has been in Eden for in physical time; it'll still be cleared out if it's not referenced when the collector comes knocking. The trick is to tune the size of Eden so that during the course of your tight inner loops, the Eden space is just over completely filled by the time the video frame ends, which will cause at least one collection per frame producing almost no garbage whatsoever and costing very little to perform. The very little garbage that is still referenced when the GC occurs will end up going into the incremental collector where it is purged on a less frequent basis, but purged nonetheless, also at very little cost. Therefore, for video games, I think you need a very small Eden indeed, and even then, you should avoid constructing objects if you can help it during those inner loops. Cas 
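The sizing idea in this post comes down to simple arithmetic, sketched below. It assumes you have already measured how many bytes your inner loop allocates per frame; the class and method names are hypothetical, and the result would still have to be handed to whatever young-generation sizing option your VM version offers.

```java
/** Back-of-the-envelope helpers for "one eden collection per frame". Hypothetical names. */
final class EdenSizing {
    /** Average number of eden collections per frame if eden simply fills and is wiped. */
    static double collectionsPerFrame(long bytesAllocatedPerFrame, long edenBytes) {
        if (edenBytes <= 0) {
            throw new IllegalArgumentException("edenBytes must be positive");
        }
        return (double) bytesAllocatedPerFrame / edenBytes;
    }

    /**
     * An eden size slightly smaller than the per-frame allocation, so that eden is
     * "just over completely filled" by the end of each frame.
     * marginPercent is how far below the per-frame allocation to aim (for example 10).
     */
    static long edenBytesForOneCollectionPerFrame(long bytesAllocatedPerFrame, int marginPercent) {
        if (marginPercent < 0 || marginPercent >= 100) {
            throw new IllegalArgumentException("marginPercent must be in [0, 100)");
        }
        return bytesAllocatedPerFrame * (100 - marginPercent) / 100;
    }
}
```

For example, an inner loop that allocates about 1,000,000 bytes per frame with a 10% margin suggests an eden of about 900,000 bytes, which gives slightly more than one collection per frame in this simplified model.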
|
|
|
|
gregorypierce
|
 |
«
Reply #7 - Posted
2004-01-18 20:36:15 » |
|
Anyone know of a profiler that will be able to tell me how long my objects spend in each generation and when they move from one generation to the next?
While I think it's cheesy to have to wait for my objects to reach tenure since I know I want them to end up there (any time that it takes for them to reach the old generation is just a waste), I guess if that's the only way to do it I'll have to deal with it.
|
|
|
|
Jeff
|
 |
«
Reply #8 - Posted
2004-01-18 20:47:56 » |
|
Hmm. It's possible actually that the built-in profiler might give you some of that info if you do -verbose:gc
I don't know if the info is available to external profilers or not; the way to find out is to look up the JVMPI docs on java.sun.com
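As a complement to -verbose:gc, collection counts and times can also be read from inside the application. The sketch below uses the `java.lang.management` API, which arrived with J2SE 5.0, so it is an assumption that your VM has it. It reports totals per collector only; it does not say how long an individual object spent in each generation. The class name `GcStats` is hypothetical.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Reads per-collector GC totals from the running VM. Hypothetical helper class. */
final class GcStats {
    /** Total collections so far across all collectors; collectors reporting -1 (undefined) are skipped. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Approximate accumulated collection time in milliseconds across all collectors. */
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime();
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** One line per collector: name, collection count, accumulated milliseconds. */
    static String summary() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(gc.getName())
              .append(": count=").append(gc.getCollectionCount())
              .append(", millis=").append(gc.getCollectionTime())
              .append('\n');
        }
        return sb.toString();
    }
}
```

Sampling `totalCollections()` once per frame and logging when it changes is one way to see whether a visible hitch lines up with a collection.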
|
|
|
|
Jeff
|
 |
«
Reply #9 - Posted
2004-01-18 20:49:25 » |
|
On tuning the eden: the VM will try to tune it for you based on what's going on at run-time, but in a case like this (one set of allocs per level) it might take it a long time to get enough info.
There is a -XX flag to set the initial eden size. I pointed it out to Chris when he did his benchmarks and it may be in his article. If not, I can dig it up again if you'd like.
|
|
|
|
|
|
gregorypierce
|
 |
«
Reply #10 - Posted
2004-01-18 20:56:24 » |
|
I would appreciate it. This is the sort of thing I would love to see in a whitepaper:
"Tuning the JVM for games and other multimedia applications"
I think this is a very common and highly efficient practice for most games. At least it is for me coming from a console background where you just didn't have any choice.
"You have a certain amount of memory and that's it - suck it up and make it fit." (me to a junior cohort back in the day)
|
|
|
|
Jeff
|
 |
«
Reply #11 - Posted
2004-01-18 21:31:23 » |
|
I'll look for it.
The problem is we CAN'T publish the -XX flags in a white paper because they are -XX. The whole point of -XX is that they are VM specific and subject to change from VM version to VM version.
|
|
|
|
swpalmer
|
 |
«
Reply #12 - Posted
2004-01-18 23:21:45 » |
|
princec wrote: The trick is to tune the size of Eden so that during the course of your tight inner loops, the Eden space is just over completely filled by the time the video frame ends, which will cause at least one collection per frame producing almost no garbage whatsoever and costing very little to perform. The very little garbage that is still referenced when the GC occurs will end up going into the incremental collector where it is purged on a less frequent basis, but purged nonetheless, also at very little cost.
Well, this doesn't address my concern. I will always have a reference to a relatively large object (a bitmap image or 2 representing a couple of video frames), so such objects will always be live when the eden is filled. I know they are "short-lived" in a real time sense, but it seems that there is nothing I can do about the fact that they will get promoted to the next generation.
The thing I get out of this GC explanation is that it is impossible to not have "short-lived" objects get moved from Eden. If you have only a single live object when you attempt to allocate another and that allocation won't fit in eden, then the one live object that is in eden, regardless of chronological age, is moved out of eden. If that object is large then there is a significant cost to copy it out of eden, even if the next generation will efficiently collect it because that generation never really gets a lot of work to do. That means a slight blip in the GC usage when eden fills, even though my algorithm is designed to produce only very short lived objects - specifically to take advantage of the efficient collection that the young generation performs for such objects.
Is it best to simply reduce the size of eden in this case so that my large object is immediately placed in the next generation so that it never needs to be copied out of eden? I then have to worry about how efficient the next generation is with the many short lived objects I will be generating outside of the very space that treats them most efficiently. Comments? Suggestions?
Gregory, and others: there is a really nice program that I mentioned in this forum a few months ago. It shows the sizes of the various generations, the CPU time taken by GC... it's very nice. I can't remember the name at the moment. I'll find the thread and give it a bump.
|
|
|
|
Jeff
|
 |
«
Reply #13 - Posted
2004-01-18 23:46:01 » |
|
Hmm. So at this point I have to ask...
Do you KNOW there is a problem or is this all just supposition? The VM internally tries to tune all the memory spaces for overall efficiency.
This seems to me to be pretty deep into abstract logic which may or may not actually be causing any significant issues...
|
|
|
|
gregorypierce
|
 |
«
Reply #14 - Posted
2004-01-19 00:09:53 » |
|
Yes it is definitely a problem - one that is more and more pronounced the slower the machine is. Basically what's happening when I do a verbose gc is that I have a lot of little gcs and then some long stop the world painful gc - but I'm not allocating jack. If I'm not allocating anything, the gc needs to leave my stuff alone.
|
|
|
|
swpalmer
|
 |
«
Reply #15 - Posted
2004-01-19 01:31:30 » |
|
In my case I know that GC is affecting my program's animation because if all I change is to add -XX:+UseConcMarkSweepGC the pauses are significantly reduced.
During this phase of my program the most significant object allocation is images created from JPEG data. I could reduce this with ImageIO reading the JPEG into an existing image buffer. However, then there is a memory leak caused by ImageIO. The fix for that leak is to call some sort of re-init method on some ImageIO objects, but the side effect of that is a MASSIVE slowdown (it triggers a call to System.gc() in the ImageIO code, which in turn is a result of the evil use of finalizers), so I've settled for the older garbage-creating method. The ImageIO bug in question is 4868479; apparently it is fixed in Tiger, but that won't help me for a while yet. The Mac platform is where this matters most to me now.
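The "existing image buffer" idea can be sketched as a small ring of preallocated frames. This only shows the allocation pattern; it does not decode anything, and decoding into these buffers is exactly where the ImageIO issue described above may come into play. `FrameRing` and its methods are hypothetical names, and the ring is assumed to be larger than the number of frames in flight at once.

```java
import java.awt.image.BufferedImage;

/** A fixed set of frame buffers handed out in rotation, so steady-state playback allocates nothing. */
final class FrameRing {
    private final BufferedImage[] frames;
    private int next = 0;

    FrameRing(int count, int width, int height) {
        if (count <= 0) {
            throw new IllegalArgumentException("count must be positive");
        }
        frames = new BufferedImage[count];
        for (int i = 0; i < count; i++) {
            // All allocation happens once, up front.
            frames[i] = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        }
    }

    /** Returns the next buffer to decode into; the caller must be done with it before it comes round again. */
    BufferedImage nextFrame() {
        BufferedImage frame = frames[next];
        next = (next + 1) % frames.length;
        return frame;
    }
}
```

With one or two frames buffered ahead of the one on screen, a ring of three or four buffers would be the obvious starting point; these long-lived buffers would end up in the old generation once and stay there.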
|
|
|
|
Jeff
|
 |
«
Reply #16 - Posted
2004-01-20 23:35:44 » |
|
Yes, that's very odd. If you aren't allocating then you really shouldn't see any GC activity unless you are running incremental GC.
If you are, try shutting that off.
|
|
|
|
ChrisRijk
Senior Newbie 
Optimise or Die
|
 |
«
Reply #17 - Posted
2004-01-21 15:54:29 » |
|
http://java.sun.com/docs/hotspot/VMOptions.html has some detail on a lot of non-standard flags.
The -XX:CompileThreshold, -XX:MaxInlineSize and -XX:FreqInlineSize are interesting since they're about the only flags that affect code optimisation, apart from -client and -server. Almost everything else is GC related (compare and contrast to C compilers!) However, in my experience they almost never make much of a difference when -server is on, though one of my benchmarks in one particular setting did get a 30% boost if I remember correctly...
This isn't a criticism of HS - I think it's really cool that I don't have to bother with this stuff. Let's hear it for automatic optimisation!
Also see the docs on GC tuning with various VMs: http://java.sun.com/docs/hotspot/index.html
|
|
|
|
|
NVaidya
Junior Member  
Java games rock!
|
 |
«
Reply #18 - Posted
2004-01-22 16:49:57 » |
|
ChrisRijk wrote: The -XX:CompileThreshold, -XX:MaxInlineSize and -XX:FreqInlineSize are interesting since they're about the only flags that affect code optimisation, apart from -client and -server.
OK! The official description for these flags appears to be rather cryptic for my intelligence. Anyone know more to elaborate? TIA
|
Gravity Sucks !
|
|
|
Jeff
|
 |
«
Reply #19 - Posted
2004-01-23 08:15:50 » |
|
CompileThreshold is the number of times the interpreter has to pass over the same code before it decides it's worth compiling.
One of the key differences between the client and server VMs is that the CompileThreshold default is much lower for client, thus reducing apparent start-up slowness of GUIs.
MaxInlineSize refers to the size of code produced by inlining. I'm not 100% sure if this is the maximum size of a routine to be inlined or the maximum size code is allowed to grow to by inlining. Chris may know.
This is important because inlining to the point of overflowing your instruction cache on a modern CPU is a de-optimization. Again, Hotspot does its best to set this to a good number based on what it can know/guess about your system by probing it, but it's always possible that you know more about a particular unusual system than it does.
FreqInlineSize I don't know. I'll try to peek at the docs and see if they mean anything to me.
|
|
|
|
ChrisRijk
Senior Newbie 
Optimise or Die
|
 |
«
Reply #20 - Posted
2004-01-23 12:37:02 » |
|
I think MaxInlineSize is the cutoff for total number of bytecodes to inline (ie includes sub-functions too)
The FreqInlineSize explanation doesn't make much sense, but I guess it means that HS will never inline a function which is bigger than this value (in bytecodes).
|
|
|
|
|
NVaidya
Junior Member  
Java games rock!
|
 |
«
Reply #21 - Posted
2004-01-23 13:58:12 » |
|
Much appreciate the info folks.
Is the -Xcomp option equivalent to setting the CompileThreshold size to 0 ?
If I want to use the server for the SSE and SSE2 boosts (the client doesn't have them right ?), and at the same time want my GUI to come to life as eagerly as a client, then setting the threshold size to something like that of the client would get me the best of both worlds ?
How do I figure out the bytecode size to specify as input to MaxInlineSize (gulp...especially if it has to include the sub-functions sizes too !!!) ? And, thanks really for the caveat about going overboard in setting this size.
And something I tripped on in an IBM site: Quote: ...specifying -Xcomp will result in somewhat less efficient machine instructions being generated with the JIT, since the interpreter is preempted from running. Unquote. If such is the case, would running (micro)benchmarks with that option be in the interests of projecting Java favorably ?
TIA
|
Gravity Sucks !
|
|
|
swpalmer
|
 |
«
Reply #22 - Posted
2004-01-23 18:37:01 » |
|
NVaidya wrote: And something I tripped on in an IBM site: Quote: ...specifying -Xcomp will result in somewhat less efficient machine instructions being generated with the JIT, since the interpreter is preempted from running. Unquote. If such is the case, would running (micro)benchmarks with that option be in the interests of projecting Java favorably ? TIA
Ah of course, the VM will have no runtime profiling info, so it won't be able to make good guesses as to which branches are more or less likely to be taken, for instance. (I think some processors have a branch instruction that hints if it is likely to be taken or not to help the processor pre-fetch optimally.) I guess it is best to use warmup periods when benchmarking.
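A warm-up period before timing might look like the sketch below. The workload, the class name `WarmupBench`, and the run counts are all placeholders chosen for illustration; how many warm-up runs are enough depends on the VM and its compile threshold, so treat the numbers as assumptions to be tuned.

```java
/** Minimal warm-up-then-measure harness. Hypothetical names; not a rigorous benchmark. */
final class WarmupBench {
    /** Stand-in workload; replace with the code you actually care about. */
    static long work(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += (i % 7) * (long) i;
        }
        return sum;
    }

    /** Runs the workload untimed first, then returns the average nanoseconds per timed run. */
    static long averageNanos(int warmupRuns, int measuredRuns, int n) {
        if (measuredRuns <= 0) {
            throw new IllegalArgumentException("measuredRuns must be positive");
        }
        long sink = 0;
        for (int i = 0; i < warmupRuns; i++) {
            sink += work(n); // gives the VM a chance to compile the hot code before timing starts
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            sink += work(n);
        }
        long elapsed = System.nanoTime() - start;
        if (sink == Long.MIN_VALUE) {
            System.out.println(sink); // keeps the result observable so the loop is not optimised away
        }
        return elapsed / measuredRuns;
    }
}
```

Comparing `averageNanos(0, runs, n)` with `averageNanos(20000, runs, n)` on the same VM gives a rough feel for how much of a measurement is interpreter time.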
|
|
|
|
Jeff
|
 |
«
Reply #23 - Posted
2004-01-23 20:04:34 » |
|
NVaidya wrote: Much appreciate the info folks. Is the -Xcomp option equivalent to setting the CompileThreshold size to 0 ?
AIUI yes, this is exactly what it does.
NVaidya wrote: If I want to use the server for the SSE and SSE2 boosts (the client doesn't have them right ?), and at the same time want my GUI to come to life as eagerly as a client, then setting the threshold size to something like that of the client would get me the best of both worlds ?
It would get you closer. The issue is that compiling all that code can be a big start-up load. Client handles this by not being as deep an optimizer as server. It leaves the last 10% or so on the table in order to cut the time it takes to compile by quite a bit. So -Xcomp with server is likely to take longer to start up than even -Xcomp with client, but the result will run faster. You shouldn't see GUI "ramp-up" but it will be longer until your GUI comes up at all.
NVaidya wrote: How do I figure out the bytecode size to specify as input to MaxInlineSize (gulp...especially if it has to include the sub-functions sizes too !!!) ? And, thanks really for the caveat about going overboard in setting this size.
Erm. Well, it starts with the size of instruction cache your CPU has. You basically want an entire inlined loop to fit in the cache or you will end up thrashing the cache as you go around the loop. Beyond that you'd have to ask a real bit-twiddler, which I'm not. I generally trust Hotspot to do it right for me.
NVaidya wrote: And something I tripped on in an IBM site: Quote: ...specifying -Xcomp will result in somewhat less efficient machine instructions being generated with the JIT, since the interpreter is preempted from running. Unquote. If such is the case, would running (micro)benchmarks with that option be in the interests of projecting Java favorably ?
Interesting. IBM's VM technology is uniquely theirs. I don't *think* there's much we do with optimization generation and profile info. Rather we take the most aggressive option and then back off during run-time if need be. But for their VM, yes, this might be an issue. -Xcomp is kind of a shortcut. It's probably always better in terms of accuracy to really warm up the VM unless you intend to -Xcomp your actual program. Of course, accuracy of any microbenchmark is generally so suspect anyway that this is sort of gilding the lily.
|
|
|
|
swpalmer
|
 |
«
Reply #24 - Posted
2004-01-23 22:42:14 » |
|
Jeff wrote: Erm. Well, it starts with the size of instruction cache your CPU has. You basically want an entire inlined loop to fit in the cache or you will end up thrashing the cache as you go around the loop.
With cache sizes what they are these days, I imagine you could get away with a fair bit.
|
|
|
|
princec
|
 |
«
Reply #25 - Posted
2004-01-24 10:24:27 » |
|
The best approach is to try a whole bunch of different values, and compare the speed of the resulting code. The bigger the MaxInlineSize, the longer the compilation takes, too - by quite a significant amount. I've found that 16 is a good compromise. In fact I run Eclipse with this configuration: -server -XX:CompileThreshold=1500 -XX:MaxInlineSize=16 -XX:FreqInlineSize=32 -XX:+UseParallelGC -Xms128m -Xmx192m which gives me, after a short while, a very fast IDE that doesn't constantly thrash the swapfile and starts up quickly too. Cas 
|
|
|
|
NVaidya
Junior Member  
Java games rock!
|
 |
«
Reply #26 - Posted
2004-01-24 14:40:29 » |
|
Muchos Gracias again !
OK ! Though I arrived at the options of -server with a CompileThreshold size of 1500 (something which princec already seems to be using while I was taking my hand around my head to reach my nose), I think my logic is possibly somewhere faulty in retrospection.
Let's see the "facts" that I have gleaned - correct me if I'm wrong:
o Server is a "deeper" optimizer than client, i.e., server will take a longer time to compile a block of code than client.
o Hotspot will not compile a block of code unless that block is hot, i.e., it thinks that compiling a non-hot block is worth not the effort and that letting the block run in interpreted mode may actually be faster than trying to spend time to compile it.
o the threshold size of client is 1500 and that of server is 10000.
Given the above, on first impulse, I would have actually surmised that the threshold size of server would be *smaller* than that of the client if app. performance is all that matters *after* app. realization. IOW, if I have already made the decision to use the server for what it is, then why keep its default threshold size to be higher than that of the client ? Is it because 10000 units in server mode is not directly equivalent to that many in client mode ?
Again, given the above, and if I want to use the server option, and assuming that the numbers 1500 and 10000 are sacrosanct, I think the threshold size should optimally be a number *greater* than 1500 and less than 10000, with the twin objectives of getting a start-up time *comparable* to that of client (noting again that the server is a deeper and slower optimizer) and at the same time being much more compilation aggressive than the default size of 10000 would permit.
Agreed, the best numbers may have to be determined by trying out some values as princec says, but I just want to make sure that my understanding about the various aspects are correct.
Also, I gather that setting Xms and Xmx values to be the same might be easy on the Hotspot, but I haven't tried to examine the effect on the overall performance.
TIA
|
Gravity Sucks !
|
|
|
Jeff
|
 |
«
Reply #27 - Posted
2004-01-25 00:23:44 » |
|
Pretty much all correct.
The reason server's threshold is higher is because it does go "deeper" and thus only wants to do that on code where it really matters. Look at it this way: the deeper optimizing raises the bar as to how many times the interpreter would have to run the code to equal the cost of compilation, see?
Server VM was actually the first VM. It evidenced a performance problem on GUI apps due to its higher threshold because GUIs would come up and run first interpreted. Client was invented to handle this. The Client threshold was reduced so the GUIs would be compiled right away. With a lower threshold, it can't afford to go as deep.
See?
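The "raises the bar" argument is a break-even calculation, sketched below with made-up numbers. The cost figures are purely hypothetical and are not measurements of any real VM; the class and method names are likewise invented for the example.

```java
/** Break-even point between interpreting a method and paying to compile it. Hypothetical helper. */
final class CompileBreakEven {
    /**
     * Smallest number of calls after which compiling has paid for itself:
     * calls * (interpreted cost - compiled cost) >= compile cost.
     */
    static long breakEvenInvocations(long compileCostNanos,
                                     long interpretedNanosPerCall,
                                     long compiledNanosPerCall) {
        long savingPerCall = interpretedNanosPerCall - compiledNanosPerCall;
        if (savingPerCall <= 0) {
            throw new IllegalArgumentException("compiled code must be faster per call");
        }
        // Integer ceiling of compileCostNanos / savingPerCall.
        return (compileCostNanos + savingPerCall - 1) / savingPerCall;
    }
}
```

With an invented saving of 900 ns per call, a 1 ms compile breaks even after 1112 calls while a 5 ms compile needs 5556, which is the shape of the argument: a more expensive compiler only pays off on code that runs more often, hence the higher threshold.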
|
|
|
|
NVaidya
Junior Member  
Java games rock!
|
 |
«
Reply #28 - Posted
2004-01-26 18:48:00 » |
|
Jeff wrote: Pretty much all correct. The reason server's threshold is higher is because it does go "deeper" and thus only wants to do that on code where it really matters. Look at it this way: the deeper optimizing raises the bar as to how many times the interpreter would have to run the code to equal the cost of compilation, see?
Alright, I see now how the equation is balanced.
Jeff wrote: Server VM was actually the first VM. It evidenced a performance problem on GUI apps due to its higher threshold because GUIs would come up and run first interpreted. Client was invented to handle this. The Client threshold was reduced so the GUIs would be compiled right away. With a lower threshold, it can't afford to go as deep.
OK! Thanks again, and I would suggest that Sun's performance FAQ have more of these technical details, especially the one on "difference between client and server".
One of the 3D applications that I'm developing has a particle tracking module that does millions of polymorphic calls and floating point operations. With the server option, the performance is virtually doubled. Wish the SSE boosts were available in the client mode too! Much appreciate all the deep enlightenment.
|
Gravity Sucks !
|
|
|
swpalmer
|
 |
«
Reply #29 - Posted
2004-02-05 03:07:41 » |
|
I just noticed this in the 1.5.0 beta 1 release notes:
-XX:MaxGCPauseMillis=nnn
A hint to the virtual machine that pause times of nnn milliseconds or less are desired. The vm will adjust the java heap size and other gc-related parameters in an attempt to keep gc-induced pauses shorter than nnn milliseconds. Note that this may cause the vm to reduce overall throughput, and in some cases the vm will not be able to meet the desired pause time goal.
-XX:GCTimeRatio=nnn
The ratio of GC time to application time is 1 / (1 + nnn). For example -XX:GCTimeRatio=19 sets a goal of 5% of the total time for GC.
from http://java.sun.com/j2se/1.5.0/docs/guide/vm/gc-ergonomics.html
Should be quite useful for games if they work well. These work with the parallel collector, which will adapt generation sizes automatically to try to meet the requested goals. Cool.
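The 1 / (1 + nnn) formula quoted from the release notes is easy to check for other values. The helper below is just that arithmetic; the class and method names are made up for the example.

```java
/** Converts a -XX:GCTimeRatio value into the fraction of total time targeted for GC. */
final class GcTimeRatio {
    static double gcTimeFraction(int ratio) {
        if (ratio < 0) {
            throw new IllegalArgumentException("ratio must not be negative");
        }
        return 1.0 / (1.0 + ratio);
    }
}
```

A ratio of 19 gives 1/20, the 5% goal from the release notes, and a ratio of 99 would ask for a 1% goal.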
|
|
|
|
|