Java-Gaming.org Hi !
Featured games (83)
games approved by the League of Dukes
Games in Showcase (538)
Games in Android Showcase (132)
games submitted by our members
Games in WIP (600)
games currently in development
News: Read the Java Gaming Resources, or peek at the official Java tutorials
 
    Home     Help   Search   Login   Register   
Pages: [1] 2 3 ... 7
  ignore  |  Print  
  Once again! fast MappedObjects implementation  (Read 31956 times)
0 Members and 1 Guest are viewing this topic.
Online Riven
« League of Dukes »

« JGO Overlord »


Medals: 840
Projects: 4
Exp: 16 years


Hand over your head.


« Posted 2008-06-23 19:15:35 »

Edit: Latest version:
Quote
http://indiespot.net/files/published/mappedobject-0.11.jar (view source)
Main-class: org.lwjgl.util.mapped.TestMappedObject
JARs asm-3.2.jar and asm-util-3.2.jar (asm.ow2.org) must be on the classpath.


Usage:
1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
@MappedType(sizeof = 12)
public class MappedVec3 extends MappedObject
{
   public float x;

   public float y;

   public float z; // you can hardcode this field's byte-offset using: @MappedField(offset=8)

   public float length()
   {
      return (float) Math.sqrt(x*x + y*y + z*z); // natural field access
   }

   @Override
   public String toString()
   {
      return "[" + x + "," + y + "," + z + "]";
   }
}

1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
26  
27  
28  
29  
30  
31  
32  
33  
34  
35  
36  
37  
38  
public class TestMappedObject
{
   public static void main(String[] args)
   {
      MappedObjectTransformer.register(MappedVec3.class);

      // we 'fork' the current application. the forked version will have transformed classes
      if (MappedObjectClassLoader.fork(TestMappedObject.class, args))
      {
         // if this is the version of the classloader without transformed classes, stop right here
         return;
      }

      // business as usual

      ByteBuffer bb = ByteBuffer.allocateDirect(4096);
      bb.order(ByteOrder.nativeOrder());
      MappedVec3 vecs = MappedVec3.map(bb);

      // naturally we start at mapping 0
      vecs.x = 1.30f;
      vecs.y = 2.50f;
      vecs.z = 3.60f;

      vecs.view = 1; // go to mapping at index 1
      vecs.x = 0.13f;
      vecs.y = 0.25f;
      vecs.z = 0.36f;

      float x0 = bb.getFloat(0 << 2);
      float y0 = bb.getFloat(1 << 2);
      float z0 = bb.getFloat(2 << 2);

      float x1 = bb.getFloat(3 << 2);
      float y1 = bb.getFloat(4 << 2);
      float z1 = bb.getFloat(5 << 2);
   }
}





Everything below this line is outdated...







I decided to give implementing MappedObjects another try, after the previous attempt went down the drain last year.

Let's say you have this 'struct' in C (yes, my C-skills are somewhere below sea level):
1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
struct Sphere
{
   float x, y, z, r;

   bool intersects(Sphere that)
   {
      float xd = this->x - that->x;
      float yd = this->y - that->y;
      float zd = this->z - that->z;
      float d2 = (xd * xd) + (yd * yd) + (zd * zd);
      float r = this->r + that->r;
      return d2 <= r * r;
   }
}


Currently you'd have to fiddle with FloatBuffers or float[]s with offsets and strides, which is extremely boring and error-phrone to code. I wrote an API that does this for you, which a nice layer over it so that is supports byte[]s, direct and heap ByteBuffers. Last but not least, it's extremely quick! Working with float[]s instead of byte[]s gains you only 5-7% performance, which it probably a good tradeoff for 99.99% of us.

Unfortunately, we still need to tell Java which fields have which offset, so... the code to construct the MappedObject is a bit verbose. But as we define the abstract class (or interface), we can simply talk to the interface, so that we keep all those nice features in our IDEs.

In the Java implementation it becomes:
1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
26  
27  
28  
   @MappedType(stride = 16)
   public abstract class Sphere // or interface...
   {
      // sliding window
      public abstract void map(int element);

      // getters (with annotations)
      public abstract @MappedField(offset = 0,  type = float.class) float x();
      public abstract @MappedField(offset = 4,  type = float.class) float y();
      public abstract @MappedField(offset = 8,  type = float.class) float z();
      public abstract @MappedField(offset = 12, type = float.class) float r();

      // setters
      public abstract void x(float x);
      public abstract void y(float y);
      public abstract void z(float z);
      public abstract void r(float r);

      public boolean intersects(Sphere that)
      {
         float xd = this.x() - that.x();
         float yd = this.y() - that.y();
         float zd = this.z() - that.z();
         float d2 = (xd * xd) + (yd * yd) + (zd * zd);
         float r = this.r() + that.r();
         return d2 <= r * r;
      }
}



So much for the boiler-plate! Now it's time to make a MappedObject over a byte[] or a ByteBuffer:

1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
      MappedObjectProvider<Sphere> provider = new MappedObjectProvider<Sphere>(Sphere.class);

      int sphereCount = 128;
      ByteBuffer data1 = ByteBuffer.allocate(      provider.sizeof() * sphereCount);
      ByteBuffer data2 = ByteBuffer.allocateDirect(provider.sizeof() * sphereCount);
      byte[]     data3 = new byte[                 provider.sizeof() * sphereCount];
     
      Sphere sA = provider.newMappedObject(data1, 0, sphereCount); // datasource, base, elementCount
      Sphere sB = provider.newMappedObject(data2, 0, sphereCount);
      Sphere sC = provider.newMappedObject(data3, 0, sphereCount);

      sA.map(5);  // slide the MappedObject to element 5 (offset = base+5*stride)
      sB.map(13); // slide the MappedObject to element 13

      // sA.x = sB.x;
      sA.x(sB.x());


The sourcecode (and binaries) are here:
http://213.247.55.3/~balk1242/html_stuff/ (for browsable packages, with source in formatted HTML)
jawnae.mapped.PrimitiveConverter

Although I'm using sun.misc.Unsafe to write all primitives in the byte[] at 'native speed', it's impossible to get a buffer overflow.

There are automatic optimisations when stride is power-of-two, which makes is roughly 40% faster.


Give it a try and let me known what you think!

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social
Online Riven
« League of Dukes »

« JGO Overlord »


Medals: 840
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #1 - Posted 2008-06-23 19:20:15 »

Performance is best with byte[], direct bytebuffers are about 10% slower,
and for heap-bytebuffers I simply use the backing byte[], so they run at full speed too.




Needless to say you can finally have fields that overlap persecutioncomplex

for whatever that's worth

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social
Online Riven
« League of Dukes »

« JGO Overlord »


Medals: 840
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #2 - Posted 2008-06-28 21:02:19 »

I guess I should have titled this topic: structs for java with runtime class generation

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social
Games published by our own members! Check 'em out!
Legends of Yore - The Casual Retro Roguelike
Offline oNyx

JGO Coder


Medals: 2


pixels! :x


« Reply #3 - Posted 2008-06-30 09:37:26 »

I guess the biggest issue is that most people here (me included) don't have a clue what it's supposed to be good for. So maybe you could outline some specific real-life use case and how much of a performance improvement you managed to get.

弾幕 ☆ @mahonnaiseblog
Online kappa
« League of Dukes »

JGO Kernel


Medals: 80
Projects: 15


★★★★★


« Reply #4 - Posted 2008-06-30 11:37:55 »

I guess the biggest issue is that most people here (me included) don't have a clue what it's supposed to be good for. So maybe you could outline some specific real-life use case and how much of a performance improvement you managed to get.

Yup was thinking the same thing.
Online Riven
« League of Dukes »

« JGO Overlord »


Medals: 840
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #5 - Posted 2008-06-30 15:34:58 »

Hm... I thought this was a pretty well covered topic on these forums! Guess it's not familiar territory to everybody Smiley


It's about not having 10.000 instances of classes like Vector3f, Matrix16f, Normal3f and TexCoord2f, because they are spread all over the heap, so you can't really have good cache-locality and certainly cannot push all that data to the GPU in one call.


So now you have this 'sliding window', over your data, where you can pick/map a certain element on which you get/set your values or do your calculations.



Slow way, OO style:
1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
26  
27  
28  
29  
30  
31  
32  
33  
34  
35  
36  
37  
38  
39  
40  
class Vector3f
{
   public float x, y, z;
}

class Normal3f
{
   public float x, y, z;
}


Vector3f[] vectors = new Vector3f[1024];
Normal3f[] normals = new Normal3f[1024];
// ...initialize...


for(int i=0; i<vectors.length; i++)
{
   vectors[i].add(vectors[1024-i-1]);
   vectors[i].mul(vectors[1024-i-1]);
   normals[i].normalize();
}

// pump to FloatBuffer to update some VBO
FloatBuffer vec_norm_buf = ...;
int p = 0;
for(int i=0; i<vectors.length; i++)
{
   Vextor3f v3f = vectors[i];
   fbuf.put(p + 0, v3f.x);
   fbuf.put(p + 1, v3f.y);
   fbuf.put(p + 2, v3f.z);

   Vextor3f n3f = normals[i];
   fbuf.put(p + 3, n3f.x);
   fbuf.put(p + 4, n3f.y);
   fbuf.put(p + 5, n3f.z);

   p += 6;
}



You can also work directly on the FloatBuffer, but the code to write is just a heck of a lot boiler-plate, and a pain in the ass to modify (for example when you change your interleaved buffer from VectorNormal to VectorNormalTexcoord.

1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
26  
27  
28  
29  
30  
31  
32  
33  
34  
35  
36  
37  
38  
39  
40  
41  
42  
43  
44  
45  
46  
47  
48  
49  
   @MappedType(stride = 24)
   public abstract class Vector3f
   {
      public abstract void map(int element);

      public abstract @MappedField(offset = 0,  type = float.class) float x();
      public abstract @MappedField(offset = 4,  type = float.class) float y();
      public abstract @MappedField(offset = 8,  type = float.class) float z();

      public abstract void x(float x);
      public abstract void y(float y);
      public abstract void z(float z);
  }

   @MappedType(stride = 24)
   public abstract class Normal3f
   {
      public abstract void map(int element);

      public abstract @MappedField(offset = 12,  type = float.class) float x();
      public abstract @MappedField(offset = 16,  type = float.class) float y();
      public abstract @MappedField(offset = 20,  type = float.class) float z();

      public abstract void x(float x);
      public abstract void y(float y);
      public abstract void z(float z);
  }


      MappedObjectProvider<Vector3f> vProvider = new MappedObjectProvider<Vector3f>(Vector3f.class);
      MappedObjectProvider<Normal3f> nProvider = new MappedObjectProvider<Normal3f>(Normal3f.class);
      byte[] data = new byte[(vProvider.sizeof() + nProvider.sizeof()) * 1024];

      Vector3f vMap1 = vProvider.newMappedObject(data, 0, 1024);
      Vector3f vMap2 = vProvider.newMappedObject(data, 0, 1024);
      Normal3f nMap = nProvider.newMappedObject(data, 0, 1024);

for(int i=0; i<1024; i++)
{
   vMap1.map(i);
   vMap2.map(1024-i-1);
   nMap.map(i);

   vMap1.add(vMap2);
   vMap1.mul(vMap2);
   nMap.normalize();
}

// no need to copy everything to a Buffer



Note the stride of 24 (6 floats), and offsets for the floats in Vector3f and Normal3f.



This is as near as you get to structs in Java.

Next step would be to add a bytecode transformer that allows you to call fields instead of getters/setters, (which are transformed into getters/setters again at runtime).

A simple Vector3f.add() would look like:
1  
2  
3  
4  
5  
6  
public void add(Vector3f that)
{
   this.x(this.x() + that.x());
   this.y(this.y() + that.y());
   this.z(this.z() + that.z());
}


With a bytecode transformer it would look like:
1  
2  
3  
4  
5  
6  
public void add(Vector3f that)
{
   this.x += that.x;
   this.y += that.y;
   this.z += that.z;
}




Anyway, there is no performance penalty, even when writing floats into byte[]s or ByteBuffers.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social
Offline princec

« JGO Spiffy Duke »


Medals: 429
Projects: 3
Exp: 16 years


Eh? Who? What? ... Me?


« Reply #6 - Posted 2011-04-06 12:20:04 »

Hmm how did I miss this thread the last time.
<edit>Links at the top are broken - still got it all lying around somewhere?

Cas Smiley

Online Riven
« League of Dukes »

« JGO Overlord »


Medals: 840
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #7 - Posted 2011-04-06 14:57:23 »

Not really. The data behind the links was lost when JGO got infected and I had to get a new VM instance. The actual sourcecode was deleted by me because I basically didn't write it for myself, but for others here and it turned out nobody cared / understood. It shouldn't take too long to write it again though.

Keep in mind it uses sun.misc.Unsafe (in a safe way) and Janino for the code generation. It's basically a tiny layer over direct memory access (writing a float into a byte[], by doing a unsafe.putFloat(instance, offset, value) for example. with direct buffers I used memory addresses instead of offsets from objects) With short,char,int,long you get byte-order problems, but that's pretty much guaranteed anyway when interfacing with raw memory.

I just had an enlightened moment (...) on how to enable the developer to write code using fields as opposed to those getters & setters. The bytecode transformations shouldn't be too hard after that.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social
Offline princec

« JGO Spiffy Duke »


Medals: 429
Projects: 3
Exp: 16 years


Eh? Who? What? ... Me?


« Reply #8 - Posted 2011-04-06 14:59:23 »

I watch with interest Smiley Direct field access -> buffer is exactly what I'm looking for... whether that turns into an intrinsic move / load operation at the end of it is of course the holy grail and puts it on a par with pure C for performance.

Cas Smiley

Online Riven
« League of Dukes »

« JGO Overlord »


Medals: 840
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #9 - Posted 2011-04-06 15:01:02 »

The overhead of Unsafe is significant. It's about twice as slow as your regular Vector3f instance. (yay!)

At least, that's what I recall from 2008.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social
Games published by our own members! Check 'em out!
Legends of Yore - The Casual Retro Roguelike
Offline delt0r

JGO Knight


Medals: 30
Exp: 18 years


Computers can do that?


« Reply #10 - Posted 2011-04-06 15:37:26 »

Any real numbers on the difference between the "OO" way and the "struct" way?

Also when you say you can have overlapping offsets, do you mean C unions? (C unions are evil Angry ).

I had one trick when using the OO way. Every now and then re-allocate the entire set of objects in access order. This gave me a 2-4x speed increase on Intels and no increase or penalty on AMDs. But this was back in java 1.3 IIRC.

I have no special talents. I am only passionately curious.--Albert Einstein
Offline lhkbob

JGO Knight


Medals: 32



« Reply #11 - Posted 2011-04-06 15:41:23 »

I just re-read that and was wondering how the abstraction and JNI affected performance.  Sure you get the memory locality and fewer instances lying around, but it felt like you were adding a lot of extra work in reading/writing the "fields".

Online Riven
« League of Dukes »

« JGO Overlord »


Medals: 840
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #12 - Posted 2011-04-06 15:56:00 »

The JVM got better since 2008 Grin

It's still non-deterministic though, and you can have wildly varying performance between JVM runs... (it either takes 18us or 30us, after warmup, on my machine)

Here is a benchmark to give a rough indication of Unsafe performance:
1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
26  
27  
28  
29  
30  
31  
32  
33  
34  
35  
36  
37  
38  
39  
40  
41  
42  
43  
44  
45  
46  
47  
48  
49  
50  
51  
52  
53  
54  
55  
56  
57  
58  
59  
60  
61  
62  
63  
64  
65  
66  
67  
68  
69  
70  
71  
72  
73  
74  
75  
76  
77  
78  
79  
80  
81  
82  
83  
84  
85  
86  
87  
88  
89  
90  
91  
92  
93  
94  
95  
96  
97  
98  
99  
100  
101  
102  
103  
104  
105  
106  
107  
108  
109  
110  
111  
112  
113  
114  
115  
116  
117  
118  
119  
120  
121  
122  
123  
124  
125  
126  
127  
128  
129  
130  
131  
132  
133  
134  
135  
136  
137  
138  
139  
140  
141  
142  
143  
144  
145  
146  
147  
148  
149  
150  
151  
152  
153  
154  
155  
156  
157  
158  
159  
160  
161  
162  
163  
164  
165  
166  
167  
168  
169  
170  
171  
172  
173  
174  
175  
176  
177  
178  
179  
180  
181  
182  
183  
184  
185  
186  
187  
188  
189  
190  
191  
192  
193  
194  
195  
196  
197  
198  
199  
200  
201  
202  
203  
204  
205  
206  
207  
208  
209  
210  
211  
212  
213  
214  
215  
216  
217  
218  
219  
220  
221  
222  
223  
224  
225  
226  
227  
228  
229  
230  
231  
232  
233  
234  
235  
236  
237  
238  
239  
240  
241  
242  
243  
244  
245  
246  
247  
248  
249  
250  
251  
252  
253  
254  
255  
256  
257  
258  
259  
260  
261  
262  
263  
package eden;

import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.util.Arrays;

import sun.misc.Unsafe;

public class MappedObject
{
   private static final long nanoTimeOverhead;
   static
   {
      nanoTimeOverhead = measureNanoTimeOverhead();
      System.out.println("Java version: " + System.getProperty("java.version"));
      System.out.println("JavaVM version: " + System.getProperty("java.vm.version"));
      System.out.println("nanoTimeOverhead: " + nanoTimeOverhead + "ns");
   }

   private static Unsafe grabUnsafeInstanceFromBuffer()
   {
      try
      {
         ByteBuffer dummy = ByteBuffer.allocateDirect(1);
         Field unsafeField = dummy.getClass().getDeclaredField("unsafe");
         unsafeField.setAccessible(true);
         return (Unsafe) unsafeField.get(dummy);
      }
      catch (Throwable t)
      {
         throw new Error(t);
      }
   }

   public static void main(String[] args)
   {
      Unsafe unsafe = grabUnsafeInstanceFromBuffer();

      //

      int vectors = 4 * 1024; // small, to keep it in cache
      int floats = vectors * 4;
      int bytes = floats * 4;

      //

      int runs = 256;

      for (int i = 0; i < 16; i++)
      {
         System.out.println();

         runFloatObjectBenchmark(vectors, floats, bytes, runs);
         runFloatBufferBenchmark(vectors, floats, bytes, runs);
         runFloatArrayBenchmark(vectors, floats, bytes, runs);
         runFloatPointerBenchmark(unsafe, vectors, floats, bytes, runs);
      }
   }

   static class FloatVector
   {
      public float x, y, z, w;
   }

   private static final String formatNanos(long nanos)
   {
      if (nanos > 10 * 1000000000L)
         return (nanos / 1000000000L) + "s";
      if (nanos > 10 * 1000000L)
         return (nanos / 1000000L) + "ms";
      if (nanos > 10 * 1000L)
         return (nanos / 1000L) + "us";
      return nanos + "ns";
   }

   private static final void printResults(String prefix, long[] durations)
   {
      System.out.println(prefix + ".best:   " + formatNanos(durations[(int) (durations.length * 0.00f)]));
      System.out.println(prefix + ".good:   " + formatNanos(durations[(int) (durations.length * 0.05f)]));
      System.out.println(prefix + ".median: " + formatNanos(durations[(int) (durations.length * 0.50f)]));
      System.out.println(prefix + ".bad:    " + formatNanos(durations[(int) (durations.length * 0.95f) - 1]));
      System.out.println(prefix + ".worst:  " + formatNanos(durations[(int) (durations.length * 1.00f) - 1]));
   }

   private static final void runFloatObjectBenchmark(int vectors, int floats, int bytes, int runs)
   {
      FloatVector[] a = new FloatVector[vectors];
      FloatVector[] b = new FloatVector[vectors];
      FloatVector[] dst = new FloatVector[vectors];

      // allocate the objects sequentially, per array
      for (int i = 0; i < vectors; i++)
         a[i] = new FloatVector();
      for (int i = 0; i < vectors; i++)
         b[i] = new FloatVector();
      for (int i = 0; i < vectors; i++)
         dst[i] = new FloatVector();

      long[] durations = new long[runs];
      for (int i = 0; i < durations.length; i++)
      {
         long t0 = System.nanoTime();
         testFloatObjectMulPerformance(a, b, dst, vectors);
         long t1 = System.nanoTime();
         long took = (t1 - t0) - 2 * nanoTimeOverhead;
         durations[i] = took;
      }

      Arrays.sort(durations);
      printResults("Vector", durations);
   }

   private static final void runFloatArrayBenchmark(int vectors, int floats, int bytes, int runs)
   {
      float[] a = new float[floats];
      float[] b = new float[floats];
      float[] dst = new float[floats];

      long[] durations = new long[runs];
      for (int i = 0; i < durations.length; i++)
      {
         long t0 = System.nanoTime();
         testFloatArrayMulPerformance(a, b, dst, vectors);
         long t1 = System.nanoTime();
         long took = (t1 - t0) - 2 * nanoTimeOverhead;
         durations[i] = took;
      }

      Arrays.sort(durations);
      printResults("float[]", durations);
   }

   private static final void runFloatBufferBenchmark(int vectors, int floats, int bytes, int runs)
   {
      FloatBuffer a = ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder()).asFloatBuffer();
      FloatBuffer b = ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder()).asFloatBuffer();
      FloatBuffer dst = ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder()).asFloatBuffer();

      long[] durations = new long[runs];
      for (int i = 0; i < durations.length; i++)
      {
         long t0 = System.nanoTime();
         testFloatBufferMulPerformance(a, b, dst, vectors);
         long t1 = System.nanoTime();
         long took = (t1 - t0) - 2 * nanoTimeOverhead;
         durations[i] = took;
      }

      Arrays.sort(durations);
      printResults("FloatBuffer", durations);
   }

   private static final void runFloatPointerBenchmark(Unsafe unsafe, int vectors, int floats, int bytes, int runs)
   {
      long a = malloc_aligned(unsafe, bytes, 4 * 4);
      long b = malloc_aligned(unsafe, bytes, 4 * 4);
      long dst = malloc_aligned(unsafe, bytes, 4 * 4);

      long[] durations = new long[runs];
      for (int i = 0; i < durations.length; i++)
      {
         long t0 = System.nanoTime();
         testFloatPointerMulPerformance(unsafe, a, b, dst, vectors);
         long t1 = System.nanoTime();
         long took = (t1 - t0) - 2 * nanoTimeOverhead;
         durations[i] = took;
      }

      Arrays.sort(durations);
      printResults("float*", durations);
   }

   private static final long measureNanoTimeOverhead()
   {
      long nanoTimeOverhead = Long.MAX_VALUE;
      for (int k = 0; k < 64; k++)
      {
         for (int i = 0; i < 1024; i++)
            System.nanoTime();

         long runNanoTimeOverhead = Long.MAX_VALUE;
         for (int i = 0; i < 64; i++)
         {
            long currentOverhead = -(System.nanoTime() - System.nanoTime());
            if (currentOverhead < runNanoTimeOverhead)
               runNanoTimeOverhead = currentOverhead;
         }

         if (runNanoTimeOverhead < nanoTimeOverhead)
            nanoTimeOverhead = runNanoTimeOverhead;
      }
      return nanoTimeOverhead;
   }

   private static long malloc_aligned(Unsafe unsafe, long bytes, long align)
   {
      long base = unsafe.allocateMemory(bytes + align);
      long pntr = base + (align - (base % align));
      return pntr;
   }

   private static void testFloatObjectMulPerformance(FloatVector[] a, FloatVector[] b, FloatVector[] dst, int vectorCount)
   {
      for (int i = 0; i < vectorCount; i++)
      {
         int p = i;
         dst[p].x = a[p].x * b[p].x;
         dst[p].y = a[p].y * b[p].y;
         dst[p].z = a[p].z * b[p].z;
      }
   }

   private static void testFloatArrayMulPerformance(float[] a, float[] b, float[] dst, int vectorCount)
   {
      for (int i = 0; i < vectorCount; i++)
      {
         int p = i << 2;
         dst[p + 0] = a[p + 0] * b[p + 0];
         dst[p + 1] = a[p + 1] * b[p + 1];
         dst[p + 2] = a[p + 2] * b[p + 2];
      }
   }

   private static void testFloatBufferMulPerformance(FloatBuffer a, FloatBuffer b, FloatBuffer dst, int vectorCount)
   {
      for (int i = 0; i < vectorCount; i++)
      {
         int p = i << 2;
         dst.put(p + 0, a.get(p + 0) * b.get(p + 0));
         dst.put(p + 1, a.get(p + 1) * b.get(p + 1));
         dst.put(p + 2, a.get(p + 2) * b.get(p + 2));
      }
   }

   private static void testFloatPointerMulPerformance(Unsafe unsafe, long a, long b, long dst, int vectorCount)
   {
      for (int i = 0; i < vectorCount; i++)
      {
         int p = i << 4;

         unsafe.putFloat(dst + (p + 0), unsafe.getFloat(a + (p + 0)) * unsafe.getFloat(b + (p + 0))); // long + (int + int)
         unsafe.putFloat(dst + (p + 4), unsafe.getFloat(a + (p + 4)) * unsafe.getFloat(b + (p + 4)));
         unsafe.putFloat(dst + (p + 8), unsafe.getFloat(a + (p + 8)) * unsafe.getFloat(b + (p + 8)));
      }
   }

   private static void testFloatPointerMulPerformance_slower(Unsafe unsafe, long a, long b, long dst, int vectorCount)
   {
      for (int i = 0; i < vectorCount; i++)
      {
         int p = i << 4;
         long ap = p + a;
         long bp = p + b;
         long dstp = p + dst;

         unsafe.putFloat(dstp + 0, unsafe.getFloat(ap + 0) * unsafe.getFloat(bp + 0));
         unsafe.putFloat(dstp + 4, unsafe.getFloat(ap + 4) * unsafe.getFloat(bp + 4));
         unsafe.putFloat(dstp + 8, unsafe.getFloat(ap + 8) * unsafe.getFloat(bp + 8));
      }
   }
}

(please don't whine about how flawed the benchmark is)


Output:
1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
26  
27  
28  
29  
Java version: 1.6.0_20
JavaVM version: 16.3-b01
nanoTimeOverhead: 698ns

... [snip] ...

Vector.best:   26us
Vector.good:   26us
Vector.median: 27us
Vector.bad:    28us
Vector.worst:  52us

FloatBuffer.best:   20us
FloatBuffer.good:   20us
FloatBuffer.median: 20us
FloatBuffer.bad:    21us
FloatBuffer.worst:  112us

float[].best:   35us
float[].good:   35us
float[].median: 36us
float[].bad:    38us
float[].worst:  157us

float*.best:   17us
float*.good:   17us
float*.median: 18us
float*.bad:    19us
float*.worst:  95us


Now that we have a baseline, and it turns out Unsafe is (even more) feasible, I can try to get that MappedObject impl. up and running.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social
Online Riven
« League of Dukes »

« JGO Overlord »


Medals: 840
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #13 - Posted 2011-04-06 15:56:24 »

I just re-read that and was wondering how the abstraction and JNI affected performance.  Sure you get the memory locality and fewer instances lying around, but it felt like you were adding a lot of extra work in reading/writing the "fields".
There are no JNI calls.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social
Online Riven
« League of Dukes »

« JGO Overlord »


Medals: 840
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #14 - Posted 2011-04-06 15:58:59 »

Any real numbers on the difference between the "OO" way and the "struct" way?
See above benchmark, on how Unsafe beats performance of sequentially allocated instances

Also when you say you can have overlapping offsets, do you mean C unions? (C unions are evil Angry ).
Yes, I do. Maybe even worse, as the fields can partically overlap. This 'support' is a side effect and can easily be disabled with some checks/exceptions during initializing the sliding-window / mapped-object. These checks are advisable anyway, because on non-x86 architectures misaligned memory access can crash a process.

I had one trick when using the OO way. Every now and then re-allocate the entire set of objects in access order. This gave me a 2-4x speed increase on Intels and no increase or penalty on AMDs. But this was back in java 1.3 IIRC.
After a single GC all your hard work is lost. Still it might be a good idea to do this once every few seconds.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social
Offline delt0r

JGO Knight


Medals: 30
Exp: 18 years


Computers can do that?


« Reply #15 - Posted 2011-04-06 16:51:31 »

I think a GC will still approximately keep the order, especially a generational  GC and when the graph has the same access pattern (ie all are referenced from an array).

I have no special talents. I am only passionately curious.--Albert Einstein
Offline lhkbob

JGO Knight


Medals: 32



« Reply #16 - Posted 2011-04-06 18:47:47 »

I just re-read that and was wondering how the abstraction and JNI affected performance.  Sure you get the memory locality and fewer instances lying around, but it felt like you were adding a lot of extra work in reading/writing the "fields".
There are no JNI calls.

But Unsafe has native methods in it.  I've always just assumed that the native method meant that it used JNI, or can the JVM have super-fast native functions too?

Online Riven
« League of Dukes »

« JGO Overlord »


Medals: 840
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #17 - Posted 2011-04-06 19:16:54 »

The JVM is a native process. It can read/write at any memory address (within its virtual address space). Writing into an object field is basically using the object as a pointer and the field as an offset.

It's actually pretty easy to figure out the pointer/reference of an object, as the JVM sees it:
1. make and Object[1]
2. write the object reference into index 0
3. read the array index as an int (32 bit jvm) or long (64 bit jvm) using the Unsafe class

You now have the location of an Object on the heap, which was obviously already known by the JVM. If you'd write in that region, you'd be modifying fields.

The Unsafe class is basically a backdoor into the JVM, into what it already can do, and what you're not supposed to be able to do in Java. All direct Buffers (Floatbuffer, IntBuffer, etc) are backed by Unsafe. The reason that Buffers are still reasonably fast, despite the bunch of abstraction layers, is that the JVM turns all these calls to Unsafe methods into machine instructions.

This is why:
1  
2  
3  
         unsafe.putFloat(dst + (p + 0), unsafe.getFloat(a + (p + 0)) * unsafe.getFloat(b + (p + 0)));
         unsafe.putFloat(dst + (p + 4), unsafe.getFloat(a + (p + 4)) * unsafe.getFloat(b + (p + 4)));
         unsafe.putFloat(dst + (p + 8), unsafe.getFloat(a + (p + 8)) * unsafe.getFloat(b + (p + 8)));
is about as fast as:
1  
2  
3  
         dst[p + 0] = a[p + 0] * b[p + 0];
         dst[p + 1] = a[p + 1] * b[p + 1];
         dst[p + 2] = a[p + 2] * b[p + 2];
actually, all those Unsafe method calls are slightly faster than pure float[] access, probably because there are no bound-checks.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social
Offline princec

« JGO Spiffy Duke »


Medals: 429
Projects: 3
Exp: 16 years


Eh? Who? What? ... Me?


« Reply #18 - Posted 2011-04-06 19:44:37 »

hm, be interesting to actually look at the machine code it outputs. I'm thinking it should genuinely be as fast as C if it's intrinsified by the JVM.

Cas Smiley

Online Riven
« League of Dukes »

« JGO Overlord »


Medals: 840
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #19 - Posted 2011-04-06 19:54:54 »

The problem with Unsafe is that all pointers are long. On 32 bit JVMs this is rather inefficient because all memory address calculations have to be converted to longs, which the JVM then has to truncate back to an int. This simply can't even remotely be as fast as C code doing pointerbumps on ints. Or the JVM manages to get rid of the int->long->int conversion all together.

Still, running the above code on a 32 bit JVM vs a 64 bit JVM doesn't show any performance difference, leading to the conclusion that we have another bottleneck: either the cache is the bottleneck or the JVM doesn't remotely reach the maximum possible performance, in this benchmark, for all cases: float[], FloatBuffer and FloatVector.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social
Offline concerto49

Junior Devvie





« Reply #20 - Posted 2011-06-10 05:58:59 »

Have you got a chance to rework this? Also the bytecode transformer part? This sounds exciting!

High performance, fast network, affordable price VPS - Cloud Shards
Available in Texas, New York & Los Angeles
Need a VPS Upgrade?
Online Riven
« League of Dukes »

« JGO Overlord »


Medals: 840
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #21 - Posted 2011-06-23 23:47:48 »

First time I did bytecode transformation... but I got it to work! It's fairly basic at this point, but I will share it once I cleaned it up. You can choose whether you want your objects backed by a float[] or a FloatBuffer (only floats are supported atm) and possibly raw pointers at some point.

Meanwhile, here is some example code:
1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
26  
27  
28  
29  
30  
31  
32  
33  
34  
35  
36  
37  
38  
   private int         backing_offset = 0;
   private float[]     backing_array  = null;
   private FloatBuffer backing_buffer = null;

   // will be a const in bytecode
   public int static sizeof;              

   // these fields will not even exist at runtime
   @FieldOffset(0)   public float        x;              
   @FieldOffset(1)   public float        y;  
   @FieldOffset(2)   public float        z;  

   public void index(int index) // will be generated later on
   {
      this.backing_offset = this.sizeof * index;
   }

   public void test()
   {
      this.index(1);

      System.out.println(Arrays.toString(this.backing_array));
      this.x = 13.13f;
      System.out.println(Arrays.toString(this.backing_array));
      this.y = 14.14f;
      System.out.println(Arrays.toString(this.backing_array));
      this.z = this.x * this.y;
      System.out.println(Arrays.toString(this.backing_array));
   }

   public VectorStruct duplicate() // will probably be generated too
   {
      VectorStruct copy = new VectorStruct();
      copy.backing_offset = this.backing_offset;
      copy.backing_array = this.backing_array;
      copy.backing_buffer = this.backing_buffer;
      return copy;
   }


It outputs (writing into the 2nd struct):
1  
2  
3  
4  
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 13.13, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 13.13, 14.14, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 13.13, 14.14, 185.6582, 0.0, 0.0, 0.0]

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social
Offline Spasi
« Reply #22 - Posted 2011-06-25 23:38:15 »

If we consider multi-threaded access, for example when doing a computation using JDK7's fork/join framework, I guess we'll have to use duplicate() in a ThreadLocal's initialValue(), is that correct? So that different threads can work on a different backing_offset.
Online Riven
« League of Dukes »

« JGO Overlord »


Medals: 840
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #23 - Posted 2011-06-26 10:52:26 »

If we consider multi-threaded access, for example when doing a computation using JDK7's fork/join framework, I guess we'll have to use duplicate() in a ThreadLocal's initialValue(), is that correct? So that different threads can work on a different backing_offset.
Indeed. Just like direct ByteBuffer instances with relative access, mapped objects are not thread-safe.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social
Online Riven
« League of Dukes »

« JGO Overlord »


Medals: 840
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #24 - Posted 2011-06-26 11:11:39 »

I rewrote the implementation to use direct memory access. All primitives are supported now.

You'll map a type using:
1  
MappedVec2 vec2s = MappedVec2.map(ByteBuffer)


As adding a Java agent to your application might be a tad unobvious for average Joe, you have the option to install the bytecode transformer through code.

1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
public static void main(String[] args)
{
    if(MappedInstanceTransformer.fork(MyApp.class, args))
    {
         MappedInstanceTransformer.register(MappedVec2.class);
         MappedInstanceTransformer.register(MappedVec3.class);
         MappedInstanceTransformer.register(MappedVec4.class);
         return;
    }

    // your code

    ByteBuffer bb = ByteBuffer.allocateDirect(MappedVec2.sizeof * n);
    MappedVec2 vec2s = MappedVec2.map(bb);
}


The 'fork' method will grab the URLs of the application ClassLoader, creates a new ClassLoader that transforms the classes, and calls the main-method again, using a class from the new classloader.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social
Offline concerto49

Junior Devvie





« Reply #25 - Posted 2011-06-26 12:36:01 »

Is the source/lib still available? The download links are broken, thanks.

High performance, fast network, affordable price VPS - Cloud Shards
Available in Texas, New York & Los Angeles
Need a VPS Upgrade?
Online Riven
« League of Dukes »

« JGO Overlord »


Medals: 840
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #26 - Posted 2011-06-26 13:04:58 »

I want to clean it up first (and run some proper benchmarks).

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social
Offline princec

« JGO Spiffy Duke »


Medals: 429
Projects: 3
Exp: 16 years


Eh? Who? What? ... Me?


« Reply #27 - Posted 2011-06-26 15:28:24 »

I'll plug it into my sprite engine and see what sort of a performance boost I can get Smiley

Cas Smiley

Online Riven
« League of Dukes »

« JGO Overlord »


Medals: 840
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #28 - Posted 2011-06-26 15:30:59 »

It seems HotSpot doesn't really like the bytecodes I feed it.

At the moment it reaches 40% of the performance of field-access.

I believe I can do much better, by duplicating the style of bytecode javac generates.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social
Offline Spasi
« Reply #29 - Posted 2011-06-26 15:55:14 »

Is that on the client VM? I've had similar code using Unsafe failing miserably on the client VM, it couldn't inline all the way to the intrinsified methods. Enabling tiered compilation helped but not much. On server everything was fine and faster than non-Unsafe code.
Pages: [1] 2 3 ... 7
  ignore  |  Print  
 
 
You cannot reply to this message, because it is very, very old.

 

Add your game by posting it in the WIP section,
or publish it in Showcase.

The first screenshot will be displayed as a thumbnail.

rwatson462 (29 views)
2014-12-15 09:26:44

Mr.CodeIt (20 views)
2014-12-14 19:50:38

BurntPizza (40 views)
2014-12-09 22:41:13

BurntPizza (75 views)
2014-12-08 04:46:31

JscottyBieshaar (37 views)
2014-12-05 12:39:02

SHC (50 views)
2014-12-03 16:27:13

CopyableCougar4 (47 views)
2014-11-29 21:32:03

toopeicgaming1999 (113 views)
2014-11-26 15:22:04

toopeicgaming1999 (100 views)
2014-11-26 15:20:36

toopeicgaming1999 (30 views)
2014-11-26 15:20:08
Resources for WIP games
by kpars
2014-12-18 10:26:14

Understanding relations between setOrigin, setScale and setPosition in libGdx
by mbabuskov
2014-10-09 22:35:00

Definite guide to supporting multiple device resolutions on Android (2014)
by mbabuskov
2014-10-02 22:36:02

List of Learning Resources
by Longor1996
2014-08-16 10:40:00

List of Learning Resources
by SilverTiger
2014-08-05 19:33:27

Resources for WIP games
by CogWheelz
2014-08-01 16:20:17

Resources for WIP games
by CogWheelz
2014-08-01 16:19:50

List of Learning Resources
by SilverTiger
2014-07-31 16:29:50
java-gaming.org is not responsible for the content posted by its members, including references to external websites, and other references that may or may not have a relation with our primarily gaming and game production oriented community. inquiries and complaints can be sent via email to the info‑account of the company managing the website of java‑gaming.org
Powered by MySQL Powered by PHP Powered by SMF 1.1.18 | SMF © 2013, Simple Machines | Managed by Enhanced Four Valid XHTML 1.0! Valid CSS!