Java-Gaming.org    
  Programming to the cache  (Read 21712 times)
Offline ewjordan

Junior Member





« Posted 2008-06-22 03:04:39 »

I've been looking around for a while on this, and haven't found much info online, so I figured I'd ask here - game programmers tend to be more psychotic than most about speed!

In compiled languages, one of the most important optimizations you can make is to write your code in a cache-friendly manner, especially with processor speed pulling so far ahead of memory access.  So you always try to do things like accessing data sequentially instead of randomly.

In theory, the JVM should be optimizing this type of stuff in Java without us needing to worry about it, which is one of the reasons Java could theoretically be faster than C; but in practice, it's not clear how much this is happening.

So here's the question: does anyone know if there is anything to be gained from cache-friendly methods in Java, and how best to exploit this, or does the work that the JVM does render these techniques irrelevant?  In C++ some people see up to 10x performance increases by appropriately handling memory access, so it could be very worthwhile to figure this out.  In some cases I've been able to exhibit speed gains by interleaving arrays of built-in types in very careful ways, so I know the JVM isn't doing everything that's possible, but does anyone know any best practices that are generally useful?  I'd hate to be optimizing for a particular architecture and JVM...
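As a concrete illustration of the interleaving the question is about, here is a minimal, hypothetical sketch (the particle update rule and all names are invented for illustration, not taken from anyone's actual code):

```java
// Hypothetical sketch: "separate arrays" vs. one interleaved array.
public class InterleavedSketch {
    // Separate arrays: positions and velocities may end up far apart on the heap.
    static void stepSeparate(float[] x, float[] vx) {
        for (int i = 0; i < x.length; i++) {
            x[i] += 0.01f * vx[i];
        }
    }

    // Interleaved layout [x0, vx0, x1, vx1, ...] keeps each particle's data
    // adjacent, so one cache line tends to hold both values touched per step.
    static void stepInterleaved(float[] data) {
        for (int i = 0; i < data.length; i += 2) {
            data[i] += 0.01f * data[i + 1];
        }
    }

    public static void main(String[] args) {
        float[] x = {1f, 2f}, vx = {10f, 20f};
        float[] data = {1f, 10f, 2f, 20f};
        stepSeparate(x, vx);
        stepInterleaved(data);
        System.out.println(x[0] + " " + data[0]); // both updated identically
    }
}
```

Both versions compute the same result; the only difference is how the data sits in memory.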
Offline anarchotron

Junior Member




...precious bodily fluids.


« Reply #1 - Posted 2008-06-22 05:57:11 »

I am very interested in this topic as well.  I'm mainly a C++ programmer these days, and what you say is very true.  I've never understood how these cache coherency issues translate to Java.  Is there any kind of documentation out there on this?  Would the JVM spec indicate anything in this regard?  Basically the only strategy that I'm sure works is to keep everything as small as possible and as contiguous as possible (e.g. a single FloatBuffer of floats used as an n-dimensional array rather than an actual n-dimensional array).
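The "single buffer as an n-dimensional array" idea boils down to manual index arithmetic over one contiguous block. A minimal sketch with a plain float[] (class and field names are hypothetical):

```java
// One contiguous float[] indexed by hand, instead of a float[][]
// whose rows are separately allocated objects scattered on the heap.
public class FlatGrid {
    final int width, height;
    final float[] cells; // row-major layout: index = y * width + x

    FlatGrid(int width, int height) {
        this.width = width;
        this.height = height;
        this.cells = new float[width * height];
    }

    float get(int x, int y)          { return cells[y * width + x]; }
    void  set(int x, int y, float v) { cells[y * width + x] = v; }

    public static void main(String[] args) {
        FlatGrid g = new FlatGrid(4, 3);
        g.set(2, 1, 5f);
        System.out.println(g.get(2, 1)); // prints 5.0
    }
}
```

Iterating x in the inner loop then walks memory strictly sequentially, which is the access pattern the cache rewards.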
Offline ewjordan

Junior Member





« Reply #2 - Posted 2008-06-22 11:14:51 »

My impression is that the JVM specs guarantee literally nothing about any sort of contiguous allocation - by my reading, an array of floats could very well be scattered all across the heap.  Apparently even the stack is usually not contiguous in Sun's VMs.

I would think, though, that the JVM would take the fact that you're allocating a bunch of floats in an array as a good hint that it's worth putting them near each other in memory.  And I have, indeed, been able to show small improvements in performance by carefully interleaving array data from several arrays into one.  What I'm not sure about is a) whether this is consistent and exploitable in general, and b) what happens with arrays of objects, and whether there is even any point in trying to do something similar - for instance, will the array of references ever be allocated so that it is interleaved with the referents for better locality?

I'm hoping that someone knows how HotSpot actually handles this stuff in practice, since in reality, we're normally programming to that implementation, and some rules of thumb would be great.  And maybe someone knows whether this stuff changes every version or is stable enough to try to work out.

BTW, re: FloatBuffers, are those faster than a single dimensional float[]?  I couldn't exhibit a simple case where that seemed to be the case.  Any chance you've got a snippet that could demonstrate a good use?
Offline DzzD
« Reply #3 - Posted 2008-06-22 13:57:00 »

this is logical, but in any case: access arrays sequentially and "incrementally" (from offset 0 to the maximum offset), and they seem to give better results when working in 64 KB blocks. Also, use only one-dimensional arrays, as multidimensional arrays work differently than in C (multiple blocks of memory are allocated rather than a single one).

avoid casts when they are not cheap:
double => int == very slow
int => double: acceptable
avoid casting to a lower precision, and from floating point to fixed point.

for complex computations, do not work directly on the array value; keep it in a local and put it back:
int x=array[i];
x+=y; => Perform some complex ops
...
...
array[i]=x;

when you process array values you should do only one read and one write per value in a loop; arrays are slow because there is a bounds check for each read/write op.

when possible, work on one block of memory at a time: it seems that working through one array and then another is faster than working on both at the same time (especially when there are a lot of arrays). This is also true for objects; I guess it is related to the cache.

note that the above is only from my own experience.
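The "one array at a time" observation above is what compiler literature calls loop fission. A hypothetical sketch of the two shapes being compared (whether the split wins depends on array count and sizes, so treat this purely as an illustration):

```java
// Sketch of loop fission: each fissioned pass streams through fewer
// arrays sequentially instead of ping-ponging between all of them.
public class LoopFission {
    // Fused: touches a, b and out in every iteration.
    static void fused(float[] a, float[] b, float[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] * 2f + b[i] * 3f;
        }
    }

    // Fissioned: two simple sequential passes, each over fewer arrays.
    static void fissioned(float[] a, float[] b, float[] out) {
        for (int i = 0; i < out.length; i++) out[i]  = a[i] * 2f;
        for (int i = 0; i < out.length; i++) out[i] += b[i] * 3f;
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f}, b = {10f, 20f};
        float[] o1 = new float[2], o2 = new float[2];
        fused(a, b, o1);
        fissioned(a, b, o2);
        System.out.println(o1[1] + " " + o2[1]); // same result either way
    }
}
```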

EDIT:

Quote
BTW, re: FloatBuffers, are those faster than a single dimensional float[]?  I couldn't exhibit a simple case where that seemed to be the case.  Any chance you've got a snippet that could demonstrate a good use?
nothing can be faster than a simple float[] array in pure Java; FloatBuffers may be faster for exchanging data with external/native functions, or may have some special functions implemented.
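One hedged illustration of the "exchanging with external/native functions" case: a direct FloatBuffer lives outside the garbage-collected heap, so native bindings (an OpenGL wrapper, for instance) can read it without an extra copy. A minimal sketch, with the helper name invented here:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// Copies a float[] into a direct, native-ordered FloatBuffer,
// the form most native bindings expect.
public class DirectBufferSketch {
    static FloatBuffer toDirectBuffer(float[] data) {
        FloatBuffer buf = ByteBuffer
                .allocateDirect(data.length * 4)   // 4 bytes per float
                .order(ByteOrder.nativeOrder())    // match the platform's byte order
                .asFloatBuffer();
        buf.put(data);
        buf.flip(); // rewind so the consumer reads from the start
        return buf;
    }

    public static void main(String[] args) {
        FloatBuffer buf = toDirectBuffer(new float[] {1f, 2f, 3f});
        System.out.println(buf.isDirect() + " " + buf.remaining());
    }
}
```

For pure-Java number crunching, though, the plain float[] remains the baseline to beat, as the post says.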



Offline princec

JGO Kernel


Medals: 343
Projects: 3
Exp: 16 years


Eh? Who? What? ... Me?


« Reply #4 - Posted 2008-06-22 14:47:29 »

See 4820062

Cas

Offline VeaR

Junior Member





« Reply #5 - Posted 2008-06-22 23:13:25 »

See 4820062

Cas

The solution with "@Struct" would be absolutely ugly.

Reading the comments, I get the feeling that people actually don't want games to be written in Java. A conspiracy of the C++ programmers? The JVM is written by C++ programmers, isn't it?
Offline bienator

Senior Member




OutOfCoffeeException


« Reply #6 - Posted 2008-06-23 00:09:27 »

The solution with "@Struct" would be absolutely ugly.
I think they're simply trying to avoid introducing a new keyword into the language. I remember when they introduced 'enum' and let several projects break, including Apache projects and even NetBeans. An annotation-based solution is probably the only feasible way to get it into a roughly 15-year-old language like Java.

Offline princec

JGO Kernel


Medals: 343
Projects: 3
Exp: 16 years


Eh? Who? What? ... Me?


« Reply #7 - Posted 2008-06-23 00:21:01 »

Nah, that's just a lame excuse. After all, binary compatibility wouldn't be broken, only source compatibility, and how hard is it to do a trivial rename refactoring on the very rare bits of code with the keyword "struct" in them?

Cas

Offline bienator

Senior Member




OutOfCoffeeException


« Reply #8 - Posted 2008-06-23 01:00:38 »

Nah, that's just a lame excuse. After all binary compatibility wouldn't be broken, only source, and how hard is it to do a trivial renaming refactoring on the very rare bits of code with the keyword "struct" in?
It's actually no work at all to refactor it... provided you have the source and are allowed to recompile. They don't see it as a main feature, which is why they avoid causing any trouble with it.
(But to be honest, I personally really don't care whether it is called 'struct' or '@Struct'.)

IMO annotations are a pretty cool workaround for introducing language features; you can even make Java "feel" like Erlang if you want.
http://www.youtube.com/watch?v=37NaHRE0Sqw
http://www.malhar.net/sriram/kilim/

but that's completely off topic ;-)

Offline ewjordan

Junior Member





« Reply #9 - Posted 2008-06-23 10:49:50 »

(Edit: posted some timing results, but just realized they might be flawed due to a C++ timing bug, will fix up and repost)
Offline jezek2
« Reply #10 - Posted 2008-06-23 11:44:50 »

(Edit: posted some timing results, but just realized they might be flawed due to a C++ timing bug, will fix up and repost)

Can you also provide results with the client compiler? (At least, I think you tested only the server compiler.)

Or is the server compiler usable on a multicore machine without the initial low performance and very visible stuttering due to compilation?
Offline princec

JGO Kernel


Medals: 343
Projects: 3
Exp: 16 years


Eh? Who? What? ... Me?


« Reply #11 - Posted 2008-06-23 13:14:50 »

Soon to be rendered irrelevant by the combined client+server JVM, so let's not worry about the difference any more.

Cas

Offline ewjordan

Junior Member





« Reply #12 - Posted 2008-06-23 21:05:48 »

Ok, verified the results again, so here's the post I deleted:

Some initial experimentation shows some interesting things.  I ran a simple 1D particle simulation (increment position based on velocity, apply a tiny bit of damping to velocity) in both Java and C++ comparing the behavior depending on whether I had two separate arrays for velocity and position or one interleaved array.  I can post the code if anyone cares (and yup, I am very aware of the pitfalls when writing a Java microbenchmark!), but for now I won't clutter things up.  Here are the results, in FPS, higher is better:

(20000 particles over 10000 frames of simulation, 40 tests run and averaged)

C++ Results:
Normal floats: 5950.77
Interleaved floats: 30524.33

Java Results (Apple JVM 1.6, -server - I'll check this on a Sun VM pretty soon, too):
Normal floats: 22105.749484383392
Interleaved floats: 22242.982339072023

So by a MASSIVE amount, C++ with separate arrays performs the worst, and we see a 5:1 improvement there by going to interleaved arrays.  But in Java, there's essentially no difference, at least at this particle count - trust me, the sub percentage point difference you see there is essentially just noise.

The really surprising thing (to me, at least) is that Apple's JVM, at any rate, actually does appear to be making good on the claim that it can achieve better cache locality than AOT-compiled languages, as the non-cache-friendly Java code performs four times as well as the corresponding C++.

I wondered whether the performance difference between Java and C++ could be accounted for by the bounds check, so I inserted a basic bounds check into the C++ (just breaks out if bounds check fails, which is probably not quite as complex or expensive as Java's):

C++ (with bounds check):
Normal floats: 5956.94
Interleaved floats: 25678.14

No difference at all for the regular float version (two arrays), but brings the interleaved version within spitting distance of the Java one.

Somewhat relevant to Cas' suggestion, I also tried using an array of Particle objects in Java (just holders for x and vx, in this case):

Java using Particle class: 17139.75

Using an array of objects is still better than the worst C++, but still a ways behind optimized Java; and we haven't even touched the issue of creating objects on the fly or garbage collecting them, either, which are the real slowdowns that I tend to see in my code.

So there's no real conclusion here, except that short of getting rid of those bounds checks, pure float arrays in Java are handled about as well as can be expected, and the JVM actually is doing quite a bit of work to keep the cache miss rate low.

This is now totally off topic, but I ran into one thing that I should warn you about - the initial version of the particle update code I wrote read something like this:

for (int i=0; i<particleList.length; ++i) {
   particleList[i].x += .01f * particleList[i].vx;
   particleList[i].vx *= 0.99f;
}

...and I was surprised to find that my timing information was heavily dependent upon the number of frames that I was simulating.  But FPS should not depend on the number of frames you check, as long as it's large, and I was surprised to be hitting a cutoff around 9000 where to push any higher meant an 80% drop in frame rate! (interestingly enough, the interleaved method weathered this drop far better than the others in Java)

The cause: it turns out that numerical underflow will freaking MURDER your program's performance in either C++ or Java - that vx *= 0.99f line was bringing all the velocities down to the level of underflow after about 9000 frames, and from that point on it's a major drag on performance.  In this case I wasn't interested in working around that issue, so I just changed the decay factor to 0.999f (putting off the inevitable a little further), but be aware of this in your own code, if you have a large number of variables that are used a lot and could exponentially decay to zero, it might be worth checking them against a threshold.
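A sketch of the threshold check suggested above; the cutoff value is an arbitrary choice for illustration, picked to sit far below visible motion but far above the denormal range where float math gets slow:

```java
// Flush exponentially decaying velocities to exact zero before they
// underflow into denormals, which are very slow on many CPUs.
public class DenormalFlush {
    static final float EPSILON = 1e-20f; // assumed cutoff, illustrative only

    static void step(float[] vx) {
        for (int i = 0; i < vx.length; i++) {
            vx[i] *= 0.99f;
            if (vx[i] > -EPSILON && vx[i] < EPSILON) {
                vx[i] = 0f; // multiplying an exact zero is cheap again
            }
        }
    }

    public static void main(String[] args) {
        float[] vx = {1f, 1e-21f}; // second value is already below the cutoff
        step(vx);
        System.out.println(vx[0] + " " + vx[1]);
    }
}
```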

Things I've learned:

- The overall conclusion: the JVM does exactly what it promises to, at least with floating point arrays, optimizing them in almost every case to the point that they are similar in performance to well-used C++ arrays.
- I could not find a SINGLE way to affect the timing via a local (non-algorithmic) optimization - storing the loop limit away, changing to a while loop, replacing with a try/catch and a while(true), etc. - every one of these things performed EXACTLY the same
- final keyword makes no difference at all to performance anymore, anywhere, at any time.  Only use it when it "should" be used, not because you think it will make things faster.
- Scope does not affect speed within a class (passing an array as a parameter vs. storing it as a member doesn't make a difference)
- Visibility of array does not affect speed
- Using getters and setters to marshal access to private object members does not degrade performance
- for (Particle p:particles) loop performed exactly the same
- Any optimizations that are happening for fast array access do not seem to happen at garbage collection time - performance is identical before a single GC happens
- Memory layout management is probably NOT responsible for the fast access - disrupting the access patterns does not seem to make any difference, so the JVM must be doing some intelligent prefetching

@jezek2: on the Mac, the -server mode in 1.6 doesn't do very much, I think it just changes the way the heap sizing operates, which is irrelevant for these tests.  I tried in 1.5, though, and got the following:

Apple 1.5 without -server:
Normal FPS:      7530.917239361543
Interleaved FPS:   7536.807696165959
Object FPS:      6943.34652777474

Apple 1.5 with -server:
Normal FPS:      22029.56632155148
Interleaved FPS:   21603.01266973487
Object FPS:      18347.889937613505
Local FPS:      13744.841045723999

So at least on OS X, 1.5 should DEFINITELY be used with the -server flag.  I included that "Local FPS" entry in the last one because that shows that at least Apple's 1.5 -server has a speed problem passing arrays as local variables, I'm not sure why - I omitted that from all the other tests because it never differed.

I know that most people aren't using Apple's VM, so I'm going to boot into Windows right now and see how things work there...I'll post some more results in half an hour or so.

And btw, I've never had problems with stuttering or anything like that due to the server flag; I tend to find that compilations pretty much cease after the first several seconds of gameplay, which you can keep an eye on with the -XX:+PrintCompilation flag - have you had trouble with this in the past?
Offline Riven
« League of Dukes »

JGO Overlord


Medals: 757
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #13 - Posted 2008-06-23 21:46:53 »

Let me start by saying I read your whole post.


Quote
for (Particle p:particles) loop performed exactly the same
Over a Collection, an enhanced for-loop creates an Iterator, and that overhead can be significant in nested loops (over an array it compiles down to a plain indexed loop).


Further, I agree with most of your findings.
I recently began to change my coding where performance really mattered:
With out-of-order CPUs, you can actually do quite a bit while the CPU is waiting for the cache, or worse: for system RAM.

Let's look at this loop:
long[] data = new long[1024];
long xor = 0;
for(int i=0; i<data.length; i++)
   xor ^= data[i];

It's obvious that this loop basically spends most of its time waiting. You can do quite a bit meanwhile, in the same thread.

it's actually so bad that you can do this in exactly the same time:
long[] data = new long[1024];
long xor = 0;
for(int i=0; i<data.length; i++)
   xor ^= swap64(data[i]);

   public static final long swap64(long val)
   {
      long swapped = 0;
      swapped |= (val & 0x00000000000000FFL) << 56;
      swapped |= (val & 0x000000000000FF00L) << 40;
      swapped |= (val & 0x0000000000FF0000L) << 24;
      swapped |= (val & 0x00000000FF000000L) << 8;
      swapped |= (val & 0x000000FF00000000L) >> 8;
      swapped |= (val & 0x0000FF0000000000L) >> 24;
      swapped |= (val & 0x00FF000000000000L) >> 40;
      swapped |= (val & 0xFF00000000000000L) >>> 56;
      return swapped;
   }


We're adding maybe 32 instructions per loop iteration, and there is no speed penalty at all.



So with this knowledge, I began to interleave my algorithms (very messy), which yielded great (!) performance improvements.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline DzzD
« Reply #14 - Posted 2008-06-23 21:57:03 »

/**
 * @(#)Bench.java
 *
 * Simple bench showing difference between acces on Array
 *
 * @version 1.00 08/06/23
 */

 
import java.awt.*;
import java.applet.*;

public class Bench extends Applet {
   
   private long getTime()
   {
      return System.currentTimeMillis();
   }
   
   
   public void init()
   {
      this.testMultiArray();
      this.testSingleArray();
      //Another try
     System.out.println("Another try");
      this.testSingleArray();
      this.testMultiArray();

   }

   private void testSingleArray()
   {
      System.out.println("Single array test");  
      float a1[]=new float[1024*20*1024];
      int nb=0;
      long s=getTime();
      for(int y=0;y<1;y++)
      {
         for(int x=0;x<1024*1024*20;x++)
         {
            a1[x]+=1f;
            nb++;
         }        
      }
      long e=getTime();
      long time=(e-s);
      System.out.println(time+"ms for "+nb+" loops");      
     
   }
     
   private void testMultiArray()
   {
      System.out.println("Multi array test");  
      float a1[][]=new float[1024*20][1024];
      int nb=0;
      long s=getTime();
      for(int y=0;y<1024;y++)
      {
         for(int x=0;x<1024*20;x++)
         {
            a1[x][y]+=1f;
            nb++;
         }        
      }
      long e=getTime();
      long time=(e-s);
      System.out.println(time+"ms for "+nb+" loops");      
     
   }

   public void paint(Graphics g)
   {
   
   }
}


Results:

Multi array test
3859ms for 20971520 loops
Single array test
218ms for 20971520 loops
Another try
Single array test
234ms for 20971520 loops
Multi array test
3875ms for 20971520 loops

Offline ewjordan

Junior Member





« Reply #15 - Posted 2008-06-23 22:33:57 »

Well, that will teach me to make the assumption that Apple's VM has anything in common with Sun's!  Here are my results using 6u10 in XP on the same machine:

(still 20000 particles, now over 1000 frames so I don't drive myself nuts waiting, so bear in mind these results are slightly more variable)
With -server:
Normal FPS:      14993.227896306866
Interleaved FPS:   20555.42182483608
Object FPS:      14358.932379346927
Local FPS (interleaved):      20807.751451388936

Without -server:
Normal FPS:      8315.078683988102
Interleaved FPS:   7129.550702202638
Object FPS:      11372.472596502801
Local FPS (interleaved):      7685.328248452919

So with Sun's VM, -server is DEFINITELY worth it.  Interestingly enough, the performance characteristics are almost exactly inverted between the client and server VMs.  Here there does appear to be a difference between using a single interleaved array and two separate ones.

Also, it appears that Apple's VM is a bit faster at this than Sun's as of 1.6, delivering more even performance overall.

Since I just noticed that this forum scrolls code posted rather than inlining it, I might as well post mine in case anyone has any beef with it:

public class CacheBenchmark {
   private float[] floats;   //interleaved
  private float[] floatsx;  //normal position
  private float[] floatsvx; //normal velocity
  private Particle[] particleList; //particle holder
 
   private boolean firstTime = true;
   
   private static String dumpString = "";

   public static void main(String[] args) {
      CacheBenchmark myBench = new CacheBenchmark();
     
      int warmupiters = 100;
      int iters = 100;
      int frames = 1000;
      int particles = 20000;
     
      double interleavedDiff = 0.0f;
      double normalDiff = 0.0f;
      double objectDiff = 0.0f;
      double localDiff = 0.0f;
       
      long nanos = System.nanoTime();
        long diff = 0;
        System.out.println("Warming up...");
        for (int i=0; i<iters+warmupiters; i++){
           System.gc();
           
           myBench.setupTest(particles);
           
           nanos = System.nanoTime();
           for (int j=0; j<frames; ++j) {
                 myBench.stepFloatsInterleaved();
           }
           diff = System.nanoTime() - nanos;
            myBench.dumpResults();
            double fps = frames / ( diff / ((double)1000000000) );
            System.out.println("Interleaved:\t "+fps+",  ");
            if (i > warmupiters) interleavedDiff += diff;
           
            /////////////////////////
           nanos = System.nanoTime();
           for (int j=0; j<frames; ++j) {
                 myBench.stepFloatsNormal();
           }
            diff = System.nanoTime() - nanos;
            myBench.dumpResults();
            fps = frames / ( diff / ((double)1000000000) );
            System.out.println("Normal:\t\t "+fps+",  ");
            if (i > warmupiters) normalDiff += diff;

            /////////////////////////
         
            nanos = System.nanoTime();
           for (int j=0; j<frames; ++j) {
                 myBench.stepParticles();
           }
            diff = System.nanoTime() - nanos;
            myBench.dumpResults();

            fps = frames / ( diff / ((double)1000000000) );
            System.out.println("Particles:\t "+fps+",  ");
            if (i > warmupiters) objectDiff += diff;
           
            /////////////////////////
           
            float[] localFloats = new float[2*particles];
            for (int l=0; l<particles; ++l) {
             float x = l*l*.01f;
             float vx = l*.1f;
             localFloats[2*l] = vx;
             localFloats[2*l+1] = x;
          }
           
            nanos = System.nanoTime();
           for (int j=0; j<frames; ++j) {
                 myBench.stepParameterInterleaved(localFloats);
           }
            diff = System.nanoTime() - nanos;
            myBench.dumpResults();

            fps = frames / ( diff / ((double)1000000000) );
            System.out.println("Local int.:\t "+fps+",  ");
            if (i > warmupiters) localDiff += diff;
           
           

        }
       
        normalDiff /= iters;
        interleavedDiff /= iters;
        objectDiff /= iters;
        localDiff /= iters;
       
        double normalFPS = frames / ( normalDiff / ((double)1000000000));
        double interleavedFPS = frames / ( interleavedDiff / ((double)1000000000));
        double objectFPS = frames / ( objectDiff / ((double)1000000000));
        double localFPS = frames / ( localDiff / ((double)1000000000));
        System.out.println("Normal time elapsed:\t\t"+normalDiff);
        System.out.println("Interleaved time elapsed:\t"+interleavedDiff);
        System.out.println("Object time elapsed:\t\t"+objectDiff);
        System.out.println("Local time elapsed:\t\t"+localDiff);
        System.out.println("Normal FPS:\t\t"+normalFPS);
        System.out.println("Interleaved FPS:\t"+interleavedFPS);
        System.out.println("Object FPS:\t\t"+objectFPS);
        System.out.println("Local FPS:\t\t"+localFPS);
       
       
        System.out.println("\n\nDumping some crap, please ignore:");
        printDump();
   }
   
   public void setupTest(int particles) {
      if (firstTime) {
         floats = new float[2*particles];
         floatsx = new float[particles];
         floatsvx = new float[particles];
         particleList = new Particle[particles];
      }
     
      for (int i=0; i<particles; ++i) {
         float x = i*i*.01f;
         float vx = i*.1f;
         floats[2*i+1] = vx;
         floats[2*i] = x;
         floatsx[i] = x;
         floatsvx[i] = vx;
         if (firstTime) particleList[i] = new Particle();
         particleList[i].setX(x);
         particleList[i].setVx(vx);
      }
     
      firstTime = false;
   }
   
   private void stepFloatsInterleaved() {
      for (int i=0; i<floats.length/2; ++i) {
         int x = 2*i;
         int vx = 2*i+1;
         floats[x] = floats[x] + .01f * floats[vx];
         floats[vx] *= 0.999f;
      }
   }
   
   private void stepFloatsNormal() {
      for (int i=0; i<floatsx.length; ++i) {
         floatsx[i] = floatsx[i] + .01f * floatsvx[i];
         floatsvx[i] *= 0.999f;
      }
   }
   
   private void stepParticles() {
      for (int i=0; i<particleList.length; ++i) {
         particleList[i].setX(particleList[i].getX() + .01f * particleList[i].getVx());
         particleList[i].setVx(particleList[i].getVx() * 0.999f);
      }
   }
   
   private void stepParameterInterleaved(float[] floatArray) {
      for (int i=0; i<floatArray.length/2; ++i) {
         int x = 2*i;
         int vx = 2*i+1;
         floatArray[x] = floatArray[x] + .01f * floatArray[vx];
         floatArray[vx] *= 0.999f;
      }
   }
   
   
   public void dumpResults() {
      for (int i=0; i<floatsx.length; ++i) {
         if (Math.random() < 0.0001) dumpString += floats[2*i]+floats[2*i+1]+floatsx[i]+floatsvx[i]+particleList[i].getX()+particleList[i].getVx();
      }
   }
   
   public static void printDump() {
      System.out.println(dumpString);
   }

}

class Particle{
   private float x;
   private float vx;
   public void setX(float x) {
      this.x = x;
   }
   public float getX() {
      return x;
   }
   public void setVx(float vx) {
      this.vx = vx;
   }
   public float getVx() {
      return vx;
   }
}


@DzzD: There's one issue with that benchmark, which brings up a very interesting point (maybe that was the thing you were demonstrating?) - you're running through the multi-array loop in the "wrong" order; if you switch the inner and outer loops in the testMultiArray function, I get the following results:

Multi array test
62ms for 20971520 loops
Single array test
47ms for 20971520 loops
Another try
Single array test
47ms for 20971520 loops
Multi array test
47ms for 20971520 loops

Of course, this difference itself is a cache effect, and it shows that the JVM is not optimizing multidimensional array access very well, nor is it reordering accesses when possible.  Try the following altered code: play with xWidth and yWidth to change the dimensions of the array and see the difference between traversing it the right way and the wrong way; sometimes it is extremely dramatic:

/**
 * @(#)Bench.java
 *
 * Simple bench showing difference between access on Array
 *
 * @version 2.00 08/06/23
 */

 
import java.awt.*;
import java.applet.*;

public class Bench extends Applet {
   
   private final int yWidth = 20;
   private final int xWidth = 500*1024;
   
   private long getTime()
   {
      return System.currentTimeMillis();
   }
   
   
   public void init()
   {
      this.testMultiArray();
      this.testMultiArray2();
      //Another try
     System.out.println("Another try");
      this.testMultiArray2();
      this.testMultiArray();

   }
     
   private void testMultiArray()
   {
      System.out.println("Badly traversed multi array test");  
      float a1[][]=new float[xWidth][yWidth];
      int nb=0;
      long s=getTime();
      for(int y=0;y<yWidth;y++)
      //for (int x=0; x<1024*20;++x)
     {
         for(int x=0;x<xWidth;x++)
         //for(int y=0;y<1024;++y)
        {
            a1[x][y]+=1f;
            nb++;
         }        
      }
      long e=getTime();
      long time=(e-s);
      System.out.println(time+"ms for "+nb+" loops");      
     
   }
   
   private void testMultiArray2()
   {
      System.out.println("Well traversed multi array test");  
      float a1[][]=new float[xWidth][yWidth];
      int nb=0;
      long s=getTime();
      //for(int y=0;y<2;y++)
     for (int x=0; x<xWidth;++x)
      {
         //for(int x=0;x<1024*20;x++)
        for(int y=0;y<yWidth;++y)
         {
            a1[x][y]+=1f;
            nb++;
         }        
      }
      long e=getTime();
      long time=(e-s);
      System.out.println(time+"ms for "+nb+" loops");      
     
   }

   public void paint(Graphics g)
   {
   
   }
}


@Riven: that's extremely interesting, I'll have to play around with that type of effect.  Do you have any advice as to when you can generally get away with stuff like that?
Offline Riven
« League of Dukes »

JGO Overlord


Medals: 757
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #16 - Posted 2008-06-24 00:10:25 »

@Riven: that's extremely interesting, I'll have to play around with that type of effect.  Do you have any advice as to when you can generally get away with stuff like that?

You can only do calculations that can operate fully on the registers, 'while' waiting for memory I/O.

IIRC, a memory lookup can take up to 40 cycles (depending on memory speed and CPU architecture) when it's not in cache. Even cache is slow compared to registers. This is the sole reason why programming in ASM can give such great performance: manually fiddling with registers in a way that compilers simply can't get anywhere near. In Java we can only put as much as possible (primitives, not arrays, which involve memory I/O) in local variables and hope for the best.
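The idea Riven describes can be sketched roughly like this. This is a made-up micro-benchmark, not code from the thread; the class name, the array size, and the stride constant are all my own, and the timings will vary wildly by JVM and hardware. The point is only that a loop dominated by memory latency can absorb extra register-only arithmetic almost for free, because the ALU sits idle while the load is outstanding.

```java
class StallDemo {
    static final int N = 1 << 22;          // 4M ints, far larger than any cache
    static final int[] data = new int[N];
    static { for (int i = 0; i < N; i++) data[i] = i; }

    // Pure memory traversal with a cache-hostile pseudo-random stride.
    static long memoryOnly() {
        long sum = 0;
        for (int i = 0; i < N; i++) {
            sum += data[(i * 4097) & (N - 1)];
        }
        return sum;
    }

    // Same traversal, plus register-only work squeezed in "while waiting".
    static long memoryPlusAlu() {
        long sum = 0;
        int x = 1;
        for (int i = 0; i < N; i++) {
            sum += data[(i * 4097) & (N - 1)];
            x = (x ^ (x << 5)) + i;          // stays in registers
        }
        return sum + (x & 1);                // keep x live so it isn't eliminated
    }

    public static void main(String[] args) {
        memoryOnly(); memoryPlusAlu();       // crude warmup
        long t0 = System.nanoTime(); long a = memoryOnly();
        long t1 = System.nanoTime(); long b = memoryPlusAlu();
        long t2 = System.nanoTime();
        System.out.println("memory only:  " + (t1 - t0) / 1000000 + " ms (sum " + a + ")");
        System.out.println("memory + ALU: " + (t2 - t1) / 1000000 + " ms (sum " + b + ")");
    }
}
```

If Riven is right, the gap between the two loops should be much smaller than the cost of the ALU work run on its own; switching the stride to a plain sequential `data[i]` should make the extra arithmetic visible again.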

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline Riven
« Reply #17 - Posted 2008-06-24 00:14:48 »

I read the source code of your benchmark, and it's rather poor, unfortunately.

Quote
for (int i=0; i<floatArray.length/2; ++i)

Have you got any idea how long a division takes? It might be like 25-50% of all the time spent in that loop.

change it to:
Quote
int end = floatArray.length/2;
for (int i=0; i<end; ++i)


also x*2 is much slower than x<<1




Further, changing the order of the tests in the benchmark will probably have a significant effect on the outcome.


Offline DzzD
« Reply #18 - Posted 2008-06-24 00:57:18 »

I understand the | ^ & part related to ASM, but I have trouble understanding where the stall happens. I mean, for the value to be passed to the function it must first be read, and then the function is called, no?

Anyway, it seems very interesting; can you provide a different sample, please, for better understanding?

Offline ewjordan
« Reply #19 - Posted 2008-06-24 06:03:07 »

I read your sourcecode of your benchmark, and it's rather poor, unfortunately.
Ouch.  But I took another look, and that's actually a fair assessment, though only for one of the reasons you posted - the ordering criticism is valid and I definitely should have known better than to ignore it (which does effectively render the whole test moot, so you're ultimately correct).  But for the sake of nitpicking, I have to disagree with the other stuff. Wink

Quote
Have you got any idea how long a division takes? It might be like 25-50% of all time spend in that loop.
(re: for (int i=0; i<floatArray.length/2; ++i) )

This division, expensive as it might be, only happens once, not every time through the loop, since the JVM can figure out that it's constant.  It might even be converted to a shift.  This is actually the first thing I tested (I trimmed all that code out for simplicity's sake), since these types of optimizations are often the lowest hanging fruit in C.  Fortunately (or not, if you're looking for places to optimize!) inefficiencies in loop handling code seem to be completely optimized away by the JVM, as I mentioned before.  You can even do all sorts of nasty stuff like calling simple functions like getters to get the upper range for the loop, and for the most part it adds no overhead as long as the limit is actually constant (there's probably some trickery you could pull to make the JVM worry that it won't be, in which case you would be right, but this is not one of those cases).

I tried anyways, though - replacing my original code with yours resulted in no significant difference (I removed all but the interleaved test and ran it separate times to hopefully avoid ordering issues, and got roughly the same results on each run):

Original code:   20565.894852276786 fps
After replacements:   20468.498175720106 fps

Quote
also x*2 is much slower than x<<1
Again, not on today's VMs, especially if the 2 is written as a compile time constant.  This is one of the easiest optimizations for a compiler to make, so I would be shocked if the JVM wasn't doing it.  If I'm correct it's even done in most C/C++ compilers these days, though I haven't checked that. 

(BTW, I'm only making a claim about desktop Java on a recent JIT - on embedded, all this stuff is vital to getting passable performance, and it's probably increasingly important as you move back from 1.6 even on the desktop)

I threw together a quick benchmark to verify all this.  I think I worked around the ordering issue by picking from the two variants (which should be identical apart from your suggested changes, and perform an array copy with a transformation and even/odd flop so as to try and confound any unexpected cleverness from the JVM) randomly, but let me know if you think there's still likely a bias here:

import java.util.ArrayList;

public class LoopBench {
   private final int length = 100000;
   private int[] source;
   private int[] dest;
   
   // To store away results so that no dead-code can be eliminated
   private ArrayList<Integer> holder = new ArrayList<Integer>();
   
   public static void main(String[] args) {
      LoopBench myBench = new LoopBench();
     
      int warmupiters = 1000;
      int iters = 10000;
     
      ArrayList<Long> unoptimizedTimes = new ArrayList<Long>();
      ArrayList<Long> optimizedTimes = new ArrayList<Long>();
     
      long nanos = System.nanoTime();
        long diff = 0;
       
        for (int i=0; i<iters+warmupiters; i++){
           myBench.initialize();
           
           boolean doOptimized = false;
           if (Math.random() >= 0.5) doOptimized = true;
           
           if (doOptimized) {
              nanos = System.nanoTime();
              myBench.loopOptimized();
              diff = System.nanoTime() - nanos;
           } else {
              nanos = System.nanoTime();
              myBench.loopUnoptimized();
              diff = System.nanoTime() - nanos;
           }
           myBench.dump();
           if (i < warmupiters) {
              System.out.println("Warmup iteration "+i);
              continue;
           }
           
           if(doOptimized) {
              optimizedTimes.add(diff);
              System.out.println("Optimized:\t"+diff);
           } else {
              unoptimizedTimes.add(diff);
              System.out.println("Unoptimized:\t"+diff);
           }
        }
       
        myBench.dumpHolder();
     
        double optSum = 0.0;
        for (Long l:optimizedTimes) {
           optSum += l.doubleValue();
        }
        optSum /= optimizedTimes.size();
       
        double unoptSum = 0.0;
        for (Long l:unoptimizedTimes) {
           unoptSum += l.doubleValue();
        }
        unoptSum /= unoptimizedTimes.size();
        System.out.println("Optimized average: "+optSum + " over "+optimizedTimes.size()+" tests");
        System.out.println("Unoptimized average: "+unoptSum + " over "+unoptimizedTimes.size()+" tests");
       
   }
   
   private void initialize() {
      source = new int[2*length];
      dest = new int[2*length];
      for (int i=0; i<source.length; ++i) {
         source[i] = (int)(1000*Math.random());
      }
   }
   
   // Use all the data, somehow
   private void dump() {
      int x = 0;
      for (int i=0; i<source.length; ++i) {
         x += source[i] - dest[i];
      }
      holder.add(x);
   }
   
   // Dump the useless data to the console
   private void dumpHolder() {
      System.out.println("Useless data:");
      long sum = 0;
      for (Integer i:holder) {
         sum += i.intValue();
      }
      System.out.println(sum);
      System.out.println("End useless data.");
   }
   
   private void loopUnoptimized() {
      for (int i=0; i<getLength()/2; ++i ) {
         dest[2*i] = someFunctionUnoptimized(source[2*i+1]);
         dest[2*i+1] = someFunctionUnoptimized(source[2*i]);        
      }
   }
   
   private void loopOptimized() {
      int limit = source.length >> 1;
      for (int i=0; i<limit; ++i ) {
         dest[(i<<1)] = someFunctionOptimized(source[(i<<1)+1]);
         dest[(i<<1) + 1] = someFunctionOptimized(source[(i<<1)]);        
      }  
   }
   private int getLength() {
      return source.length;
   }
   
   private int someFunctionUnoptimized(int x) {
      return (2*x + 1);
   }
   
   private int someFunctionOptimized(int x) {
      return ((x<<1) + 1);
   }
}


My results (time taken, not fps):
Optimized average: 338101.0054556476 over 4949 tests
Unoptimized average: 339458.2603444862 over 5051 tests

The difference between the results differs from run to run and depending on the size of the array, sometimes putting the optimized version ahead, sometimes the unoptimized one; in any case, I'm not convinced by these results that there's anything going on here.

Feel free to prove me wrong, though, I'm just going off of what I've noticed myself, and you definitely know a lot more about the internals than I do.

Quote
Further, changing the order of the tests in the benchmark will probably have a significant effect on the outcome.
There you're 100% right, and I'm a damn fool since I know that's always an issue.  I thought I had checked for this but I just reordered things manually and the results come out different on Windows (haven't checked OS X).  My bad.

The earlier conclusion (from the OS X tests I ran, not the Windows ones) is actually amplified by this mistake, so I'll restate it: there seems to be no predictable, exploitable performance difference between a single interleaved array and two separate arrays, since performance depends on a lot of factors you won't have much control over in a live app. And no matter what, as long as you use the server VM, performance beats cache-unfriendly C++ code, so the JVM is doing a lot of lifting even for simple loops.
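For reference, the layout comparison under discussion boils down to something like the harness below. This is my own minimal sketch, not the original benchmark; the class name, sizes, and iteration counts are invented, and it still suffers from exactly the warmup and ordering pitfalls discussed in this thread, so treat any numbers it prints as suggestive at best.

```java
class LayoutBench {
    static final int N = 1 << 20;
    static final float[] interleaved = new float[2 * N]; // x0, vx0, x1, vx1, ...
    static final float[] xs  = new float[N];
    static final float[] vxs = new float[N];

    // One integration step over the interleaved (x, vx) pairs.
    static float stepInterleaved(float dt) {
        float sum = 0f;
        for (int i = 0; i < N; i++) {
            interleaved[2 * i] += interleaved[2 * i + 1] * dt;
            sum += interleaved[2 * i];
        }
        return sum;
    }

    // The same step over two parallel arrays.
    static float stepSeparate(float dt) {
        float sum = 0f;
        for (int i = 0; i < N; i++) {
            xs[i] += vxs[i] * dt;
            sum += xs[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        for (int i = 0; i < N; i++) {
            vxs[i] = i * 0.001f;
            interleaved[2 * i + 1] = vxs[i];
        }
        long tI = 0, tS = 0;
        for (int run = 0; run < 200; run++) {
            long t0 = System.nanoTime(); stepInterleaved(0.01f);
            long t1 = System.nanoTime(); stepSeparate(0.01f);
            long t2 = System.nanoTime();
            if (run >= 100) { tI += t1 - t0; tS += t2 - t1; } // discard warmup half
        }
        System.out.println("interleaved: " + tI / 1000000 + " ms, separate: " + tS / 1000000 + " ms");
    }
}
```

Both step functions perform identical floating-point operations in identical order, so they must produce bit-identical sums; any timing difference is purely a memory-layout effect.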

Which brings me to the next obvious question: is there any logic to how test ordering affects timing results?  Since this seems to affect things up to 50% or more, it would be fantastic to have a better idea how to avoid those hits against speed whenever possible, but it initially seems to be totally random - surely there's some method to the madness?
Offline Riven
« Reply #20 - Posted 2008-06-24 07:51:16 »

Quote
Quote
Quote
also x*2 is much slower than x<<1
Again, not on today's VMs, especially if the 2 is written as a compile time constant.  This is one of the easiest optimizations for a compiler to make, so I would be shocked if the JVM wasn't doing it.  If I'm correct it's even done in most C/C++ compilers these days, though I haven't checked that.


No no no no! (no. Smiley) Indeed C compilers have done this forever, but the JVM never has.
If you don't see a difference, that means your bottleneck (like memory I/O) is somewhere else.

You might have read my thread about MappedObjects I created yesterday. I added code to handle the case where the stride is a power of two and replace a single *16 by a <<4 (or *32 by <<5, for that matter), and saw the performance double. That was because the bottleneck was actually there. As you saw, in another situation, I could squeeze in a 'massive' swap64 method without a speed penalty. So if you're not seeing your optimisation paying off, that doesn't mean your code hasn't become faster, just that the CPU is waiting. Once you remove the other bottleneck, you'll notice that your earlier optimisation did speed up your code.

The VM is really not that smart, even though it's trivial to replace a constant multiplication by a bitshift.



I'll do some actual tests with your benchmark (yesterday I didn't have the time). There are so many more tricks to make that code faster (that's why I said it was rather poor). Your main slowdown is in the loop overhead. AFAIK the JVM does some loop unrolling, but doing it yourself (copying the body 2-4 times inside the loop) is much faster.
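The manual unrolling Riven suggests looks something like this (my own illustration, not his code): the body is copied four times inside the loop, and a scalar tail handles lengths that aren't a multiple of four.

```java
class UnrollDemo {
    // Straightforward loop, one element per iteration.
    static int sumSimple(int[] a) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    // Body copied 4x per iteration, so the loop test and increment
    // run a quarter as often; the tail loop handles the remainder.
    static int sumUnrolled(int[] a) {
        int sum = 0;
        int i = 0;
        for (int limit = a.length - 3; i < limit; i += 4) {
            sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        }
        for (; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }
}
```

Whether this actually wins over the JIT's own unrolling has to be measured on the VM in question, which is the whole argument of this thread.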

Again, it's hard to do micro-benchmarking, not only because the JVM rewrites most of your code, but also because you might think the bottleneck is somewhere else, or may even be unaware of certain types of bottlenecks.


Last, you have that giant method that runs all the benchmarks. The JIT stops optimising methods that are longer than a certain number of instructions. That 'magic amount' is rather low, and your method will certainly not be turned into the most efficient native code. You can take a look at the output of the debug VMs (release candidates with debug command-line options, IIRC) from Sun, which will tell you when the JIT gives up.


I'll post some tests later today.

Offline DzzD
« Reply #21 - Posted 2008-06-24 10:53:02 »

arf, just noticed that most of your code is unoptimized; one example below:

private void loopOptimized() {
      int limit = source.length >> 1;
      for (int i=0; i<limit; ++i ) {
         dest[(i<<1)] = someFunctionOptimized(source[(i<<1)+1]);
         dest[(i<<1) + 1] = someFunctionOptimized(source[(i<<1)]);        
      }  
   }



for example, it can be replaced by
private void loopOptimized() {
      int limit = source.length >> 1;
      for (int i = 0; i < limit; i += 2) { // I heard that i++ is not optimized (not translated to an ASM inc)
         int i1 = i + 1;
         dest[i] = someFunctionOptimized(source[i1]);
         dest[i1] = someFunctionOptimized(source[i]);
      }
   }


It is always good to factor out repetitive patterns.

or


for (int i=0; i<particles; ++i) 
{
 float x = i*i*.01f;
 float vx = i*.1f;
 floats[2*i+1] = vx;
 floats[2*i] = x;
 floatsx[i] = x;
 floatsvx[i] = vx;
 if (firstTime) particleList[i] = new Particle();
 particleList[i].setX(x);
 particleList[i].setVx(vx);
}


replaced by:

Particle p;
for (int i=0; i<particles; i++)
 particleList[i]=new Particle(); //Use pool object...

for (int i=0; i<particles; i++)
{
 int i2=i<<1;
 float x = i*i*0.01f;
 float vx = i*0.1f;
 floats[i2+1] = vx;
 floats[i2] = x;
 floatsx[i] = x;
 floatsvx[i] = vx;
 p=particleList[i];
 p.setX(x);
 p.setVx(vx);
}






Offline ewjordan
« Reply #22 - Posted 2008-06-24 12:17:06 »

Riven - Yes yes yes yes yes.  (No, actually, I'm just kidding, I know enough to know that I'm probably wrong here.  Smiley )

That all makes sense (er, well, let me rephrase: let's just say that your explanation makes sense - the behavior itself is %^$&@ maddening, that simple local optimizations like that are not somewhere in the compilation process when they would seem so straightforward to implement at literally any point along the path from .java text to native code, and I'm further surprised to see firsthand that manual unrolling of trivial loops actually helps things in any significant manner, but you're right, it absolutely does!  End rant.).  I suppose I'm still too often confusing the "should be" with the "is" when it comes to the JVM (a habit I must re-disabuse myself of), and unfortunately, as you mentioned, it's tricky to accurately measure these things, so a lot of times there's confirmation bias at work.

I wrote a longer response before with some more questions, but just had a computer crash so I lost it, and I'm about to go to sleep, so I'll hit the main point.  Please, don't waste too much effort optimizing that code I posted before, it was just a quick stab, not anything I actually need, especially since I pretty much already have the conclusion I wanted from it (major speed boosts based on simple cache considerations when picking data representation are not quite as automatic in Java as in C++, so more detailed examination is necessary).

To me it would actually be most useful to see a simple example exhibiting, for instance, a well enough optimized loop that I could see for myself an effect like the speed difference between a multiply and a shift - while I definitely trust you on that, I tried briefly and wasn't able to make the difference visible, despite a very no nonsense loop unrolled a bit with no other obvious bottlenecks.  But it's becoming clear to me that a lot of my conceptions about what is causing stalls have to shift considerably, especially in Java, and a lot of things that I would have thought would hurt performance may actually help it, so it would be great if you could point me to a pared down example that I could take as a baseline and work from.

BTW, do you know of any good references about this stuff that are somewhat recent?  Googling seems to pull up mostly either ancient optimization guides from the dot com era or feel good assurances from "experts" that insist that we don't need to optimize Java code because the next version of the VM will do everything including our taxes for us anyways...

@DzzD: yeah, a lot of that could be played with quite a bit, however what I was intending to test there was whether the sole difference between a multiply and a shift was visible with all else remaining the same, so that was mostly just contrived to use a bunch of shifts for no purpose but to use them; turns out that I sort of misinterpreted what Riven was saying anyways, which renders the example moot.

I always feel vaguely bad every time I pull out subexpressions like that in Java, though, as if it's a) wasted effort, and b) perhaps doing more damage than good - am I again just feeling unreasonable pangs of guilt based on what Some Guy At Sun once said about how I should let the JVM do that type of work for me, or do those types of optimizations actually happen inside HotSpot?  I mean, this is compilers 101 stuff, but I suppose without verification I should not assume much.

i++ not optimized?  Hm, can anyone confirm that?  ++i and i++ should definitely be translated differently from each other, but it would surprise me to learn that they are not each being converted to their best representations in bytecode.

(also, careful - the new code doesn't do the same thing anymore, you'd need to pull the shift from the limit declaration)
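For completeness, the fix ewjordan is pointing at would keep DzzD's i += 2 stride but let the limit cover the whole array instead of half of it. A hedged sketch, wrapped in a tiny class of my own so it stands alone (the field and method names mirror the benchmark above):

```java
class SwapLoop {
    final int[] source;
    final int[] dest;

    SwapLoop(int[] src) {
        source = src.clone();
        dest = new int[src.length];
    }

    int someFunctionOptimized(int x) {
        return (x << 1) + 1;
    }

    // DzzD's stride-2 version, with the limit pulled back up to the full
    // (even) length so it touches the same elements as the original loop.
    void loopOptimized() {
        int limit = source.length & ~1; // full length, rounded down to even
        for (int i = 0; i < limit; i += 2) {
            int i1 = i + 1;
            dest[i]  = someFunctionOptimized(source[i1]);
            dest[i1] = someFunctionOptimized(source[i]);
        }
    }
}
```

With `limit = source.length >> 1` and `i += 2`, only the first quarter of the pairs would be processed; rounding the full length down to even restores the original semantics.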
Offline DzzD
« Reply #23 - Posted 2008-06-24 13:06:11 »

Quote
@DzzD: yeah, a lot of that could be played with quite a bit, however what I was intending to test there was whether the sole difference between a multiply and a shift was visible with all else remaining the same, so that was mostly just contrived to use a bunch of shifts for no purpose but to use them; turns out that I sort of misinterpreted what Riven was saying anyways, which renders the example moot.
ok, so the problem in your bench is that *2 or <<1 uses so little CPU that the final bench result won't be affected, as most of the time is spent in the array[i] read/write ops.

<<1 is faster than *2; this is a fact in any language, since it comes down to the ASM/machine op codes. e.g. *2 requires loading two registers, performing a mul, and sometimes storing the result back.

But let's say <<1 takes about 2 CPU cycles, *2 maybe about 5, and all the array[i] + for + other stuff... maybe 200 cycles.

In one case your full loop will spend 202 cycles and in the other 205, so you won't see any difference in your results.
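Following that arithmetic, the only way to make the 2-vs-5-cycle difference visible at all is to strip out the ~200 cycles of array and loop overhead. A register-only sketch of my own: the two functions are arithmetically identical (including int overflow, since x*2 and x<<1 wrap the same way), so any timing gap can only come from the mul-vs-shift choice.

```java
class ShiftVsMul {
    static int mulLoop(int n) {
        int x = 1;
        for (int i = 0; i < n; i++) {
            x = x * 2 + i;       // constant multiply
        }
        return x;
    }

    static int shiftLoop(int n) {
        int x = 1;
        for (int i = 0; i < n; i++) {
            x = (x << 1) + i;    // the equivalent shift
        }
        return x;
    }

    public static void main(String[] args) {
        mulLoop(10000000); shiftLoop(10000000);   // warmup
        long t0 = System.nanoTime(); int a = mulLoop(200000000);
        long t1 = System.nanoTime(); int b = shiftLoop(200000000);
        long t2 = System.nanoTime();
        System.out.println("*2:  " + (t1 - t0) / 1000000 + " ms -> " + a);
        System.out.println("<<1: " + (t2 - t1) / 1000000 + " ms -> " + b);
    }
}
```

Whether a gap actually shows up depends on the JIT in use, which is exactly the point under dispute in this thread.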

Quote
I always feel vaguely bad every time I pull out subexpressions like that in Java, though, as if it's a) wasted effort, and b) perhaps doing more damage than good - am I again just feeling unreasonable pangs of guilt based on what Some Guy At Sun once said about how I should let the JVM do that type of work for me, or do those types of optimizations actually happen inside HotSpot?  I mean, this is compilers 101 stuff, but I suppose without verification I should not assume much.
  Shocked hum.. factoring out makes the code both optimised and cleaner (more readable); this is good practice!! Just imagine that for some reason you have to change i+1 to i+2 ....

Quote
Some Guy At Sun once said about how I should let the JVM do that type of work

I don't think they were thinking of things like the following, which is absolutely awful code:
y=x*2;
z=x*2;
w=x*2;
t=t+x*2;


Offline DzzD
« Reply #24 - Posted 2008-06-24 20:57:20 »

Quote
- final keyword makes no difference at all to performance anymore, anywhere, at any time.  Only use it when it "should" be used, not because you think it will make things faster.
When a small method is set final I guess the JIT can inline it; setting two methods final in one of my demos made the FPS grow from 27 to 31.

Offline ewjordan
« Reply #25 - Posted 2008-06-25 02:41:19 »

(We're now thoroughly off topic, which is fine by me, this is all interesting stuff - just wanted to put a disclaimer that nothing in this post relates to cache friendly programming anymore, this is 100% in the nitpicky micro-optimization category!)

Quote
<<1 is faster than *2; this is a fact in any language, since it comes down to the ASM/machine op codes. e.g. *2 requires loading two registers, performing a mul, and sometimes storing the result back.
Right, and I would never make the claim that the actual multiplication operation was as fast as a shift, I'm well aware of the differences in the operations once we're down to bare metal.  But in most programming languages, there actually is no difference in the assembly generated when you type in n*2 as opposed to (n<<1), they are both turned into left shifts; if that's happening, then it's generally better practice to just write what you mean (esp. given how easy it is to screw up the precedence rules with shifts, for instance by accidentally writing (x<<1 + 1) to replace (2*x + 1)).  I was simply surprised by the claim that the JVM would let this type of thing slide by, as it's a textbook example of an easy to implement peephole optimization.
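The precedence slip mentioned above is easy to demonstrate (a toy example of my own): << binds more loosely than +, so the "converted" expression silently computes something else.

```java
class ShiftPrecedence {
    public static void main(String[] args) {
        int x = 5;
        int wrong = x << 1 + 1;    // parses as x << (1 + 1), i.e. x * 4
        int right = (x << 1) + 1;  // the intended 2*x + 1
        System.out.println(wrong + " vs " + right); // prints "20 vs 11"
    }
}
```

This is exactly why writing `n * 2` and letting the compiler pick the instruction is the safer habit when the two really do compile to the same thing.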

After another failed stab at making a performance difference visible, I took a dig through the HotSpot 1.6.3 source to try and see exactly what's going on and why this might have been left out (it's all quite interesting, I'm absolutely going to dig through more of it to see what's happening under the hood), and I see that it definitely does have code that can recognize an integer multiply by a constant power of two and transform it to a left shift (see share->vm->opto->multnode.cpp at line 145 in MulINode::Ideal(PhaseGVN *phase, bool can_reshape)).  But I don't know enough about the internals to immediately see whether it's live or not, or whether it's perhaps a low priority optimization that is usually never reached, and don't have the time right now to get this beast compiling and throw in some traces to see whether that code is getting hit.  Maybe if I have some extra time this week.

Riven, is it possible that the performance improvements you're seeing as a result of those replacements are happening in functions that for some reason or another are getting skipped over during optimization, perhaps for method length reasons like you mentioned before?  I totally expect you to prove me wrong on this again, by the way, but I can't help but ask now that I'm actually looking at the source that's supposed to do this transform!

Quote
dont think they was thinking of things like the following wich is absolutly awfull code :
y=x*2;
z=x*2;
w=x*2;
t=t+x*2;
Yes, that code is bad bad bad from a maintenance and readability perspective.  But I'm not sure off the top of my head whether it will be any slower, ultimately, than if you eliminated the extra multiplies, as some compilers will recognize that they can simplify and generate some very optimal assembly from that code, especially in such a simple situation.  I don't know if HotSpot would do very well at it, though I'd be curious to know, because most of the time when I omit common subexpression eliminations it's because the algorithm is such that the change would actually reduce the readability or logic of the code, so there is a bit of a real tradeoff.  This happens a lot with math expressions, where you are carrying out a tricky geometric procedure or something like that, and for clarity it's best to preserve the geometric objects' locality in the code rather than slicing and dicing to reduce the operation count.
Offline DzzD
« Reply #26 - Posted 2008-06-25 14:13:23 »

Quote
there actually is no difference in the assembly generated when you type in n*2 as opposed to (n<<1), they are both turned into left shifts;
arf, since it's not 100% true, it's better to get in the habit of using <<1 or <<2 etc.; you will eventually use it automatically, and after several years of thinking in binary you won't see any difference between *2 and <<1 when reading code.

Offline ewjordan
« Reply #27 - Posted 2008-06-26 00:28:49 »

@DzzD: I think we're fundamentally in agreement here about the speed issue, we just differ a little bit as far as style goes. Smiley

Oh, I was curious, so I looked into the increment issue you mentioned before, and I don't think you need to worry about it, if I understand your concern correctly - both ++i and i += 1 are translated to the same bytecode (IINC 1 1, ILOAD 1), and i++ turns into (ILOAD 1, IINC 1 1), as expected.  Though while we're on the topic, it's worth mentioning that 32767 is the cutoff for the IINC operator: i += 32768 translates to (ILOAD 1, LDC 32768, IADD, ISTORE 1, ILOAD 1), which could potentially be slower.  Not that anyone's usually "incrementing" by numbers that big, but...
Offline erikd

JGO Ninja


Medals: 16
Projects: 4
Exp: 14 years


Maximumisness


« Reply #28 - Posted 2008-06-26 09:44:12 »

...
It's obvious that this loop basicly waits a lot of the time. You can do quite a bit meanwhile, in the same thread.

it's actually so bad, that you can do this, in exactly the same time:
...

This is quite interesting.
It makes me wonder though why java's bounds checks seem as expensive as they are. Especially with lots of array access, it should basically be free then? It isn't though.

Offline DzzD
« Reply #29 - Posted 2008-06-26 09:48:57 »

Quote
@DzzD: I think we're fundamentally in agreement here about the speed issue, we just differ a little bit as far as style goes.
oki  Wink

Quote
This is quite interesting.
It makes me wonder though why java's bounds checks seem as expensive as they are. Especially with lots of array access, it should basically be free then? It isn't though.
This slow array access is, for me, the biggest performance issue in Java.
