Riven
« League of Dukes » JGO Kernel      Posts: 5870 Medals: 255
Hand over your head.
|
 |
«
on:
2008-06-23 15:15:35 » |
|
Edit: Latest version:

Usage:

    @MappedType(sizeof = 12)
    public class MappedVec3 extends MappedObject {
        public float x;
        public float y;
        public float z;

        public float length() {
            return (float) Math.sqrt(x * x + y * y + z * z);
        }

        @Override
        public String toString() {
            return "[" + x + "," + y + "," + z + "]";
        }
    }

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class TestMappedObject {
        public static void main(String[] args) {
            MappedObjectTransformer.register(MappedVec3.class);

            if (MappedObjectClassLoader.fork(TestMappedObject.class, args)) {
                return;
            }

            ByteBuffer bb = ByteBuffer.allocateDirect(4096);
            bb.order(ByteOrder.nativeOrder());
            MappedVec3 vecs = MappedVec3.map(bb);

            // view 0: bytes 0..11
            vecs.x = 1.30f;
            vecs.y = 2.50f;
            vecs.z = 3.60f;

            // view 1: bytes 12..23 (sizeof = 12)
            vecs.view = 1;
            vecs.x = 0.13f;
            vecs.y = 0.25f;
            vecs.z = 0.36f;

            // read the raw floats back from the buffer
            float x0 = bb.getFloat(0 << 2);
            float y0 = bb.getFloat(1 << 2);
            float z0 = bb.getFloat(2 << 2);

            float x1 = bb.getFloat(3 << 2);
            float y1 = bb.getFloat(4 << 2);
            float z1 = bb.getFloat(5 << 2);
        }
    }
Everything below this line is outdated...
I decided to give implementing MappedObjects another try, after the previous attempt went down the drain last year. Let's say you have this 'struct' in C/C++ (yes, my C-skills are somewhere below sea level):

    struct Sphere {
        float x, y, z, r;

        bool intersects(const Sphere& that) const {
            float xd = this->x - that.x;
            float yd = this->y - that.y;
            float zd = this->z - that.z;
            float d2 = (xd * xd) + (yd * yd) + (zd * zd);
            float r = this->r + that.r;
            return d2 <= r * r;
        }
    };
Currently you'd have to fiddle with FloatBuffers or float[]s with offsets and strides, which is extremely boring and error-prone to code. I wrote an API that does this for you, with a nice layer over it so that it supports byte[]s, direct and heap ByteBuffers. Last but not least, it's extremely quick! Working with float[]s instead of byte[]s gains you only 5-7% performance, which is probably a good tradeoff for 99.99% of us.

Unfortunately, we still need to tell Java which fields have which offset, so... the code to construct the MappedObject is a bit verbose. But as we define the abstract class (or interface), we can simply talk to the interface, so that we keep all those nice features in our IDEs. In the Java implementation it becomes:

    @MappedType(stride = 16)
    public abstract class Sphere {
        public abstract void map(int element);

        public abstract @MappedField(offset = 0, type = float.class) float x();
        public abstract @MappedField(offset = 4, type = float.class) float y();
        public abstract @MappedField(offset = 8, type = float.class) float z();
        public abstract @MappedField(offset = 12, type = float.class) float r();

        public abstract void x(float x);
        public abstract void y(float y);
        public abstract void z(float z);
        public abstract void r(float r);

        public boolean intersects(Sphere that) {
            float xd = this.x() - that.x();
            float yd = this.y() - that.y();
            float zd = this.z() - that.z();
            float d2 = (xd * xd) + (yd * yd) + (zd * zd);
            float r = this.r() + that.r();
            return d2 <= r * r;
        }
    }
So much for the boiler-plate! Now it's time to make a MappedObject over a byte[] or a ByteBuffer:

    MappedObjectProvider<Sphere> provider = new MappedObjectProvider<Sphere>(Sphere.class);

    int sphereCount = 128;
    ByteBuffer data1 = ByteBuffer.allocate(provider.sizeof() * sphereCount);
    ByteBuffer data2 = ByteBuffer.allocateDirect(provider.sizeof() * sphereCount);
    byte[] data3 = new byte[provider.sizeof() * sphereCount];
    Sphere sA = provider.newMappedObject(data1, 0, sphereCount);
    Sphere sB = provider.newMappedObject(data2, 0, sphereCount);
    Sphere sC = provider.newMappedObject(data3, 0, sphereCount);

    sA.map(5);
    sB.map(13);
    sA.x(sB.x());
The sourcecode (and binaries) are here: http://213.247.55.3/~balk1242/html_stuff/ (for browsable packages, with source in formatted HTML): jawnae.mapped.PrimitiveConverter

Although I'm using sun.misc.Unsafe to write all primitives in the byte[] at 'native speed', it's impossible to get a buffer overflow. There are automatic optimisations when the stride is a power of two, which makes it roughly 40% faster.

Give it a try and let me know what you think!
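A minimal sketch of what a power-of-two stride optimisation could look like, assuming it comes down to replacing the multiply in `element * stride` with a shift. Whether the library did exactly this is an assumption; `StrideMath`, `isPowerOfTwo` and `byteOffset` are hypothetical names, not part of the original API.

```java
final class StrideMath {
    /** True when stride is a positive power of two (exactly one bit set). */
    static boolean isPowerOfTwo(int stride) {
        return stride > 0 && (stride & (stride - 1)) == 0;
    }

    /** Byte offset of an element: a shift when the stride allows it, a multiply otherwise. */
    static int byteOffset(int element, int stride) {
        if (isPowerOfTwo(stride)) {
            // e.g. stride 16 -> shift by 4
            return element << Integer.numberOfTrailingZeros(stride);
        }
        return element * stride;
    }
}
```

With the 16-byte Sphere above, `byteOffset(5, 16)` yields 80 through the shift path; a 24-byte stride falls back to the multiply. A generated class would more likely bake the shift amount in as a constant than test for it on every call.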
|
Hi, appreciate more people! Σ ♥ = ¾ Learn how to award medals... and work your way up the social rankings
|
|
|
Riven
« League of Dukes » JGO Kernel      Posts: 5870 Medals: 255
Hand over your head.
|
 |
«
Reply #1 on:
2008-06-23 15:20:15 » |
|
Performance is best with byte[], direct ByteBuffers are about 10% slower, and for heap ByteBuffers I simply use the backing byte[], so they run at full speed too. Needless to say, you can finally have fields that overlap, for whatever that's worth.
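How a heap ByteBuffer can be reduced to its backing byte[] with the standard java.nio calls, as a rough sketch. The class and method names are hypothetical, and how the library itself unwrapped buffers is an assumption.

```java
import java.nio.ByteBuffer;

final class BackingArrayDemo {
    /** Returns the backing byte[] of a heap buffer, or null for direct or read-only buffers. */
    static byte[] backingArrayOrNull(ByteBuffer buffer) {
        return buffer.hasArray() ? buffer.array() : null;
    }

    /** Index in the backing array that corresponds to buffer index 0. */
    static int baseIndex(ByteBuffer buffer) {
        return buffer.arrayOffset();
    }
}
```

The `arrayOffset()` matters for sliced buffers, where buffer index 0 is not array index 0.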
|
|
|
|
Riven
« League of Dukes » JGO Kernel      Posts: 5870 Medals: 255
Hand over your head.
|
 |
«
Reply #2 on:
2008-06-28 17:02:19 » |
|
I guess I should have titled this topic: structs for java with runtime class generation
|
|
|
|
|
|
oNyx
JGO Kernel      Posts: 2943 Medals: 5
pixels! :x
|
 |
«
Reply #3 on:
2008-06-30 05:37:26 » |
|
I guess the biggest issue is that most people here (me included) don't have a clue what it's supposed to be good for. So maybe you could outline some specific real-life use case and how much of a performance improvement you managed to get.
|
|
|
|
kappa
« League of Dukes » JGO Kernel      Posts: 2360 Medals: 59
★★★★★
|
 |
«
Reply #4 on:
2008-06-30 07:37:55 » |
|
Quote from oNyx: "I guess the biggest issue is that most people here (me included) don't have a clue what it's supposed to be good for. So maybe you could outline some specific real-life use case and how much of a performance improvement you managed to get."

Yup, was thinking the same thing.
|
|
|
|
|
Riven
« League of Dukes » JGO Kernel      Posts: 5870 Medals: 255
Hand over your head.
|
 |
«
Reply #5 on:
2008-06-30 11:34:58 » |
|
Hm... I thought this was a pretty well covered topic on these forums! Guess it's not familiar territory to everybody.

It's about not having 10,000 instances of classes like Vector3f, Matrix16f, Normal3f and TexCoord2f, because they are spread all over the heap, so you can't really have good cache-locality and certainly cannot push all that data to the GPU in one call. So now you have this 'sliding window' over your data, where you can pick/map a certain element on which you get/set your values or do your calculations.

Slow way, OO style:

    // add(), mul() and normalize() and the filling of the arrays are omitted for brevity
    class Vector3f {
        public float x, y, z;
    }

    class Normal3f {
        public float x, y, z;
    }

    Vector3f[] vectors = new Vector3f[1024];
    Normal3f[] normals = new Normal3f[1024];

    for (int i = 0; i < vectors.length; i++) {
        vectors[i].add(vectors[1024 - i - 1]);
        vectors[i].mul(vectors[1024 - i - 1]);
        normals[i].normalize();
    }

    // copy everything into an interleaved buffer before handing it to the GPU
    FloatBuffer vec_norm_buf = ...;
    int p = 0;
    for (int i = 0; i < vectors.length; i++) {
        Vector3f v3f = vectors[i];
        vec_norm_buf.put(p + 0, v3f.x);
        vec_norm_buf.put(p + 1, v3f.y);
        vec_norm_buf.put(p + 2, v3f.z);

        Normal3f n3f = normals[i];
        vec_norm_buf.put(p + 3, n3f.x);
        vec_norm_buf.put(p + 4, n3f.y);
        vec_norm_buf.put(p + 5, n3f.z);

        p += 6;
    }
You can also work directly on the FloatBuffer, but the code to write is just a heck of a lot of boiler-plate, and a pain in the ass to modify (for example when you change your interleaved buffer from VectorNormal to VectorNormalTexcoord). With mapped objects it becomes:

    @MappedType(stride = 24)
    public abstract class Vector3f {
        public abstract void map(int element);

        public abstract @MappedField(offset = 0, type = float.class) float x();
        public abstract @MappedField(offset = 4, type = float.class) float y();
        public abstract @MappedField(offset = 8, type = float.class) float z();

        public abstract void x(float x);
        public abstract void y(float y);
        public abstract void z(float z);
    }

    @MappedType(stride = 24)
    public abstract class Normal3f {
        public abstract void map(int element);

        public abstract @MappedField(offset = 12, type = float.class) float x();
        public abstract @MappedField(offset = 16, type = float.class) float y();
        public abstract @MappedField(offset = 20, type = float.class) float z();

        public abstract void x(float x);
        public abstract void y(float y);
        public abstract void z(float z);
    }

    MappedObjectProvider<Vector3f> vProvider = new MappedObjectProvider<Vector3f>(Vector3f.class);
    MappedObjectProvider<Normal3f> nProvider = new MappedObjectProvider<Normal3f>(Normal3f.class);
    byte[] data = new byte[(vProvider.sizeof() + nProvider.sizeof()) * 1024];

    Vector3f vMap1 = vProvider.newMappedObject(data, 0, 1024);
    Vector3f vMap2 = vProvider.newMappedObject(data, 0, 1024);
    Normal3f nMap = nProvider.newMappedObject(data, 0, 1024);

    for (int i = 0; i < 1024; i++) {
        vMap1.map(i);
        vMap2.map(1024 - i - 1);
        nMap.map(i);

        vMap1.add(vMap2);
        vMap1.mul(vMap2);
        nMap.normalize();
    }
Note the stride of 24 (6 floats), and the offsets for the floats in Vector3f and Normal3f. This is as near as you get to structs in Java.

Next step would be to add a bytecode transformer that allows you to use fields instead of getters/setters (which are transformed into getters/setters again at runtime). A simple Vector3f.add() would look like:

    public void add(Vector3f that) {
        this.x(this.x() + that.x());
        this.y(this.y() + that.y());
        this.z(this.z() + that.z());
    }

With a bytecode transformer it would look like:

    public void add(Vector3f that) {
        this.x += that.x;
        this.y += that.y;
        this.z += that.z;
    }
Anyway, there is no performance penalty, even when writing floats into byte[]s or ByteBuffers.
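To make the stride-24 layout concrete, here is a rough sketch that does the same addressing by hand on a plain ByteBuffer: each element occupies 24 bytes, the vector sits at byte 0 and the normal at byte 12 within it. `InterleavedLayout` and its methods are hypothetical names for illustration only, not the mapped-object API.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class InterleavedLayout {
    static final int STRIDE = 24;        // 6 floats per element
    static final int VECTOR_OFFSET = 0;  // x, y, z at bytes 0, 4, 8
    static final int NORMAL_OFFSET = 12; // x, y, z at bytes 12, 16, 20

    static ByteBuffer allocate(int elements) {
        return ByteBuffer.allocate(elements * STRIDE).order(ByteOrder.nativeOrder());
    }

    static void putVector(ByteBuffer data, int element, float x, float y, float z) {
        put3(data, element * STRIDE + VECTOR_OFFSET, x, y, z);
    }

    static void putNormal(ByteBuffer data, int element, float x, float y, float z) {
        put3(data, element * STRIDE + NORMAL_OFFSET, x, y, z);
    }

    /** Reads one float of an element; fieldOffset is in bytes within the element. */
    static float getFloat(ByteBuffer data, int element, int fieldOffset) {
        return data.getFloat(element * STRIDE + fieldOffset);
    }

    private static void put3(ByteBuffer data, int base, float x, float y, float z) {
        data.putFloat(base, x);
        data.putFloat(base + 4, y);
        data.putFloat(base + 8, z);
    }
}
```

Changing the layout to VectorNormalTexcoord would mean touching every constant here, which is the maintenance burden the annotations are meant to take away.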
|
|
|
|
princec
« League of Dukes » JGO Kernel      Posts: 8089 Medals: 96
Eh? Who? What? ... Me?
|
 |
«
Reply #6 on:
2011-04-06 08:20:04 » |
|
Hmm, how did I miss this thread the last time? Edit: Links at the top are broken - still got it all lying around somewhere? Cas
|
|
|
|
Riven
« League of Dukes » JGO Kernel      Posts: 5870 Medals: 255
Hand over your head.
|
 |
«
Reply #7 on:
2011-04-06 10:57:23 » |
|
Not really. The data behind the links was lost when JGO got infected and I had to get a new VM instance. The actual sourcecode was deleted by me because I basically didn't write it for myself, but for others here and it turned out nobody cared / understood. It shouldn't take too long to write it again though.
Keep in mind it uses sun.misc.Unsafe (in a safe way) and Janino for the code generation. It's basically a tiny layer over direct memory access: writing a float into a byte[] by doing an unsafe.putFloat(instance, offset, value), for example. With direct buffers I used memory addresses instead of offsets from objects. With short, char, int and long you get byte-order problems, but that's pretty much guaranteed anyway when interfacing with raw memory.
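The byte-order problem can be shown without Unsafe. This small sketch uses a plain ByteBuffer instead (a swapped-in technique, not the original implementation) and writes an int with one byte order, then reads it back with another; `ByteOrderDemo` is a hypothetical name.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class ByteOrderDemo {
    /** Writes value with one byte order and reads the same four bytes back with another. */
    static int writeThenRead(int value, ByteOrder writeOrder, ByteOrder readOrder) {
        ByteBuffer buffer = ByteBuffer.allocate(4);
        buffer.order(writeOrder).putInt(0, value);
        return buffer.order(readOrder).getInt(0);
    }
}
```

Writing 0x01020304 big-endian and reading it little-endian gives 0x04030201, which is why mapped buffers are normally set to `ByteOrder.nativeOrder()` before raw memory access.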
I just had an enlightened moment (...) on how to enable the developer to write code using fields as opposed to those getters & setters. The bytecode transformations shouldn't be too hard after that.
|
|
|
|
princec
« League of Dukes » JGO Kernel      Posts: 8089 Medals: 96
Eh? Who? What? ... Me?
|
 |
«
Reply #8 on:
2011-04-06 10:59:23 » |
|
I watch with interest  Direct field access -> buffer is exactly what I'm looking for... whether that turns into an intrinsic move / load operation at the end of it is of course the holy grail and puts it on a par with pure C for performance. Cas 
|
|
|
|
Riven
« League of Dukes » JGO Kernel      Posts: 5870 Medals: 255
Hand over your head.
|
 |
«
Reply #9 on:
2011-04-06 11:01:02 » |
|
The overhead of Unsafe is significant. It's about twice as slow as your regular Vector3f instance. (yay!)
At least, that's what I recall from 2008.
|
|
|
|
|
|
delt0r
Sr. Member   Posts: 412 Medals: 12
Computers can do that?
|
 |
«
Reply #10 on:
2011-04-06 11:37:26 » |
|
Any real numbers on the difference between the "OO" way and the "struct" way? Also when you say you can have overlapping offsets, do you mean C unions? (C unions are evil.)

I had one trick when using the OO way. Every now and then re-allocate the entire set of objects in access order. This gave me a 2-4x speed increase on Intels and no increase or penalty on AMDs. But this was back in Java 1.3 IIRC.
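A minimal sketch of the re-allocation trick, assuming the array is the only place the objects are referenced from and is walked in index order. Consecutive allocations typically end up next to each other in memory, but the JVM gives no guarantee of that; `Reallocate` and `Vec3` are hypothetical names.

```java
final class Reallocate {
    static final class Vec3 {
        float x, y, z;

        Vec3(float x, float y, float z) {
            this.x = x;
            this.y = y;
            this.z = z;
        }
    }

    /** Replaces every element with a fresh copy, allocated in array (access) order. */
    static void reallocateInOrder(Vec3[] items) {
        for (int i = 0; i < items.length; i++) {
            Vec3 old = items[i];
            items[i] = new Vec3(old.x, old.y, old.z);
        }
    }
}
```

Any other reference to the old instances would keep pointing at the stale copies, so this only suits data that is owned by a single array.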
|
I have no special talents. I am only passionately curious.--Albert Einstein
|
|
|
lhkbob
JGO Neuromancer     Posts: 1174 Medals: 35
|
 |
«
Reply #11 on:
2011-04-06 11:41:23 » |
|
I just re-read that and was wondering how the abstraction and JNI affected performance. Sure you get the memory locality and fewer instances lying around, but it felt like you were adding a lot of extra work in reading/writing the "fields".
|
|
|
|
Riven
« League of Dukes » JGO Kernel      Posts: 5870 Medals: 255
Hand over your head.
|
 |
«
Reply #12 on:
2011-04-06 11:56:00 » |
|
The JVM got better since 2008. It's still non-deterministic though, and you can have wildly varying performance between JVM runs (it either takes 18us or 30us, after warmup, on my machine).

Here is a benchmark to give a rough indication of Unsafe performance:

    package eden;

    import java.lang.reflect.Field;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.nio.FloatBuffer;
    import java.util.Arrays;

    import sun.misc.Unsafe;

    public class MappedObject {
        private static final long nanoTimeOverhead;

        static {
            nanoTimeOverhead = measureNanoTimeOverhead();
            System.out.println("Java version: " + System.getProperty("java.version"));
            System.out.println("JavaVM version: " + System.getProperty("java.vm.version"));
            System.out.println("nanoTimeOverhead: " + nanoTimeOverhead + "ns");
        }

        // grabs the Unsafe instance from the private 'unsafe' field of a direct buffer
        private static Unsafe grabUnsafeInstanceFromBuffer() {
            try {
                ByteBuffer dummy = ByteBuffer.allocateDirect(1);
                Field unsafeField = dummy.getClass().getDeclaredField("unsafe");
                unsafeField.setAccessible(true);
                return (Unsafe) unsafeField.get(dummy);
            } catch (Throwable t) {
                throw new Error(t);
            }
        }

        public static void main(String[] args) {
            Unsafe unsafe = grabUnsafeInstanceFromBuffer();

            int vectors = 4 * 1024;
            int floats = vectors * 4;
            int bytes = floats * 4;

            int runs = 256;

            for (int i = 0; i < 16; i++) {
                System.out.println();

                runFloatObjectBenchmark(vectors, floats, bytes, runs);
                runFloatBufferBenchmark(vectors, floats, bytes, runs);
                runFloatArrayBenchmark(vectors, floats, bytes, runs);
                runFloatPointerBenchmark(unsafe, vectors, floats, bytes, runs);
            }
        }

        static class FloatVector {
            public float x, y, z, w;
        }

        private static final String formatNanos(long nanos) {
            if (nanos > 10 * 1000000000L)
                return (nanos / 1000000000L) + "s";
            if (nanos > 10 * 1000000L)
                return (nanos / 1000000L) + "ms";
            if (nanos > 10 * 1000L)
                return (nanos / 1000L) + "us";
            return nanos + "ns";
        }

        // durations must be sorted ascending; prints selected percentiles
        private static final void printResults(String prefix, long[] durations) {
            System.out.println(prefix + ".best: " + formatNanos(durations[(int) (durations.length * 0.00f)]));
            System.out.println(prefix + ".good: " + formatNanos(durations[(int) (durations.length * 0.05f)]));
            System.out.println(prefix + ".median: " + formatNanos(durations[(int) (durations.length * 0.50f)]));
            System.out.println(prefix + ".bad: " + formatNanos(durations[(int) (durations.length * 0.95f) - 1]));
            System.out.println(prefix + ".worst: " + formatNanos(durations[(int) (durations.length * 1.00f) - 1]));
        }

        private static final void runFloatObjectBenchmark(int vectors, int floats, int bytes, int runs) {
            FloatVector[] a = new FloatVector[vectors];
            FloatVector[] b = new FloatVector[vectors];
            FloatVector[] dst = new FloatVector[vectors];

            for (int i = 0; i < vectors; i++)
                a[i] = new FloatVector();
            for (int i = 0; i < vectors; i++)
                b[i] = new FloatVector();
            for (int i = 0; i < vectors; i++)
                dst[i] = new FloatVector();

            long[] durations = new long[runs];
            for (int i = 0; i < durations.length; i++) {
                long t0 = System.nanoTime();
                testFloatObjectMulPerformance(a, b, dst, vectors);
                long t1 = System.nanoTime();
                long took = (t1 - t0) - 2 * nanoTimeOverhead;
                durations[i] = took;
            }

            Arrays.sort(durations);
            printResults("Vector", durations);
        }

        private static final void runFloatArrayBenchmark(int vectors, int floats, int bytes, int runs) {
            float[] a = new float[floats];
            float[] b = new float[floats];
            float[] dst = new float[floats];

            long[] durations = new long[runs];
            for (int i = 0; i < durations.length; i++) {
                long t0 = System.nanoTime();
                testFloatArrayMulPerformance(a, b, dst, vectors);
                long t1 = System.nanoTime();
                long took = (t1 - t0) - 2 * nanoTimeOverhead;
                durations[i] = took;
            }

            Arrays.sort(durations);
            printResults("float[]", durations);
        }

        private static final void runFloatBufferBenchmark(int vectors, int floats, int bytes, int runs) {
            FloatBuffer a = ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder()).asFloatBuffer();
            FloatBuffer b = ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder()).asFloatBuffer();
            FloatBuffer dst = ByteBuffer.allocateDirect(bytes).order(ByteOrder.nativeOrder()).asFloatBuffer();

            long[] durations = new long[runs];
            for (int i = 0; i < durations.length; i++) {
                long t0 = System.nanoTime();
                testFloatBufferMulPerformance(a, b, dst, vectors);
                long t1 = System.nanoTime();
                long took = (t1 - t0) - 2 * nanoTimeOverhead;
                durations[i] = took;
            }

            Arrays.sort(durations);
            printResults("FloatBuffer", durations);
        }

        private static final void runFloatPointerBenchmark(Unsafe unsafe, int vectors, int floats, int bytes, int runs) {
            long a = malloc_aligned(unsafe, bytes, 4 * 4);
            long b = malloc_aligned(unsafe, bytes, 4 * 4);
            long dst = malloc_aligned(unsafe, bytes, 4 * 4);

            long[] durations = new long[runs];
            for (int i = 0; i < durations.length; i++) {
                long t0 = System.nanoTime();
                testFloatPointerMulPerformance(unsafe, a, b, dst, vectors);
                long t1 = System.nanoTime();
                long took = (t1 - t0) - 2 * nanoTimeOverhead;
                durations[i] = took;
            }

            Arrays.sort(durations);
            printResults("float*", durations);
        }

        // smallest observed difference between two back-to-back nanoTime() calls
        private static final long measureNanoTimeOverhead() {
            long nanoTimeOverhead = Long.MAX_VALUE;
            for (int k = 0; k < 64; k++) {
                for (int i = 0; i < 1024; i++)
                    System.nanoTime();

                long runNanoTimeOverhead = Long.MAX_VALUE;
                for (int i = 0; i < 64; i++) {
                    long currentOverhead = -(System.nanoTime() - System.nanoTime());
                    if (currentOverhead < runNanoTimeOverhead)
                        runNanoTimeOverhead = currentOverhead;
                }

                if (runNanoTimeOverhead < nanoTimeOverhead)
                    nanoTimeOverhead = runNanoTimeOverhead;
            }
            return nanoTimeOverhead;
        }

        // over-allocates by 'align' bytes and rounds the base address up to the next multiple of 'align'
        private static long malloc_aligned(Unsafe unsafe, long bytes, long align) {
            long base = unsafe.allocateMemory(bytes + align);
            long pntr = base + (align - (base % align));
            return pntr;
        }

        private static void testFloatObjectMulPerformance(FloatVector[] a, FloatVector[] b, FloatVector[] dst, int vectorCount) {
            for (int i = 0; i < vectorCount; i++) {
                int p = i;
                dst[p].x = a[p].x * b[p].x;
                dst[p].y = a[p].y * b[p].y;
                dst[p].z = a[p].z * b[p].z;
            }
        }

        private static void testFloatArrayMulPerformance(float[] a, float[] b, float[] dst, int vectorCount) {
            for (int i = 0; i < vectorCount; i++) {
                int p = i << 2;
                dst[p + 0] = a[p + 0] * b[p + 0];
                dst[p + 1] = a[p + 1] * b[p + 1];
                dst[p + 2] = a[p + 2] * b[p + 2];
            }
        }

        private static void testFloatBufferMulPerformance(FloatBuffer a, FloatBuffer b, FloatBuffer dst, int vectorCount) {
            for (int i = 0; i < vectorCount; i++) {
                int p = i << 2;
                dst.put(p + 0, a.get(p + 0) * b.get(p + 0));
                dst.put(p + 1, a.get(p + 1) * b.get(p + 1));
                dst.put(p + 2, a.get(p + 2) * b.get(p + 2));
            }
        }

        private static void testFloatPointerMulPerformance(Unsafe unsafe, long a, long b, long dst, int vectorCount) {
            for (int i = 0; i < vectorCount; i++) {
                int p = i << 4;

                unsafe.putFloat(dst + (p + 0), unsafe.getFloat(a + (p + 0)) * unsafe.getFloat(b + (p + 0)));
                unsafe.putFloat(dst + (p + 4), unsafe.getFloat(a + (p + 4)) * unsafe.getFloat(b + (p + 4)));
                unsafe.putFloat(dst + (p + 8), unsafe.getFloat(a + (p + 8)) * unsafe.getFloat(b + (p + 8)));
            }
        }

        private static void testFloatPointerMulPerformance_slower(Unsafe unsafe, long a, long b, long dst, int vectorCount) {
            for (int i = 0; i < vectorCount; i++) {
                int p = i << 4;
                long ap = p + a;
                long bp = p + b;
                long dstp = p + dst;

                unsafe.putFloat(dstp + 0, unsafe.getFloat(ap + 0) * unsafe.getFloat(bp + 0));
                unsafe.putFloat(dstp + 4, unsafe.getFloat(ap + 4) * unsafe.getFloat(bp + 4));
                unsafe.putFloat(dstp + 8, unsafe.getFloat(ap + 8) * unsafe.getFloat(bp + 8));
            }
        }
    }
(please don't whine about how flawed the benchmark is)

Output:

    Java version: 1.6.0_20
    JavaVM version: 16.3-b01
    nanoTimeOverhead: 698ns

    ... [snip] ...

    Vector.best: 26us
    Vector.good: 26us
    Vector.median: 27us
    Vector.bad: 28us
    Vector.worst: 52us

    FloatBuffer.best: 20us
    FloatBuffer.good: 20us
    FloatBuffer.median: 20us
    FloatBuffer.bad: 21us
    FloatBuffer.worst: 112us

    float[].best: 35us
    float[].good: 35us
    float[].median: 36us
    float[].bad: 38us
    float[].worst: 157us

    float*.best: 17us
    float*.good: 17us
    float*.median: 18us
    float*.bad: 19us
    float*.worst: 95us
Now that we have a baseline, and it turns out Unsafe is (even more) feasible, I can try to get that MappedObject impl. up and running.
|
|
|
|
Riven
« League of Dukes » JGO Kernel      Posts: 5870 Medals: 255
Hand over your head.
|
 |
«
Reply #13 on:
2011-04-06 11:56:24 » |
|
Quote from lhkbob: "I just re-read that and was wondering how the abstraction and JNI affected performance. Sure you get the memory locality and fewer instances lying around, but it felt like you were adding a lot of extra work in reading/writing the "fields"."

There are no JNI calls.
|
|
|
|
Riven
« League of Dukes » JGO Kernel      Posts: 5870 Medals: 255
Hand over your head.
|
 |
«
Reply #14 on:
2011-04-06 11:58:59 » |
|
Quote from delt0r: "Any real numbers on the difference between the "OO" way and the "struct" way?"

See the benchmark above for how Unsafe beats the performance of sequentially allocated instances.

Quote from delt0r: "Also when you say you can have overlapping offsets, do you mean C unions? (C unions are evil.)"

Yes, I do. Maybe even worse, as the fields can partially overlap. This 'support' is a side effect and can easily be disabled with some checks/exceptions while initializing the sliding-window / mapped-object. These checks are advisable anyway, because on non-x86 architectures misaligned memory access can crash a process.

Quote from delt0r: "I had one trick when using the OO way. Every now and then re-allocate the entire set of objects in access order. This gave me a 2-4x speed increase on Intels and no increase or penalty on AMDs. But this was back in java 1.3 IIRC."

After a single GC all your hard work is lost. Still it might be a good idea to do this once every few seconds.
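The overlap and alignment checks mentioned above could look roughly like this. It is a sketch under assumptions: offsets and sizes are in bytes, alignment is checked relative to the start of the struct (so the base address itself must be suitably aligned), and `FieldLayoutCheck` is a hypothetical helper, not part of the library.

```java
final class FieldLayoutCheck {
    /** Throws if a field runs past the stride, is misaligned for its size, or overlaps an earlier field. */
    static void validate(int stride, int[] offsets, int[] sizes) {
        for (int i = 0; i < offsets.length; i++) {
            if (offsets[i] < 0 || offsets[i] + sizes[i] > stride) {
                throw new IllegalArgumentException("field " + i + " does not fit in the stride");
            }
            if (offsets[i] % sizes[i] != 0) {
                throw new IllegalArgumentException("field " + i + " is misaligned");
            }
            for (int j = 0; j < i; j++) {
                boolean disjoint = offsets[i] + sizes[i] <= offsets[j]
                        || offsets[j] + sizes[j] <= offsets[i];
                if (!disjoint) {
                    throw new IllegalArgumentException("fields " + j + " and " + i + " overlap");
                }
            }
        }
    }

    /** Convenience wrapper that reports the result as a boolean. */
    static boolean isValid(int stride, int[] offsets, int[] sizes) {
        try {
            validate(stride, offsets, sizes);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```

The Sphere layout (stride 16, four floats at 0, 4, 8, 12) passes, while a float at offset 2 is rejected. Dropping the overlap loop would bring back the union-like behaviour.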
|
|
|
|
delt0r
Sr. Member   Posts: 412 Medals: 12
Computers can do that?
|
 |
«
Reply #15 on:
2011-04-06 12:51:31 » |
|
I think a GC will still approximately keep the order, especially a generational GC and when the graph has the same access pattern (ie all are referenced from an array).
|
I have no special talents. I am only passionately curious.--Albert Einstein
|
|
|
lhkbob
JGO Neuromancer     Posts: 1174 Medals: 35
|
 |
«
Reply #16 on:
2011-04-06 14:47:47 » |
|
Quote from lhkbob: "I just re-read that and was wondering how the abstraction and JNI affected performance. Sure you get the memory locality and fewer instances lying around, but it felt like you were adding a lot of extra work in reading/writing the "fields"."

Quote from Riven: "There are no JNI calls."

But Unsafe has native methods in it. I've always just assumed that the native method meant that it used JNI, or can the JVM have super-fast native functions too?
|
|
|
|
Riven
« League of Dukes » JGO Kernel      Posts: 5870 Medals: 255
Hand over your head.
|
 |
«
Reply #17 on:
2011-04-06 15:16:54 » |
|
The JVM is a native process. It can read/write at any memory address (within its virtual address space). Writing into an object field is basically using the object as a pointer and the field as an offset.

It's actually pretty easy to figure out the pointer/reference of an object, as the JVM sees it:

1. make an Object[1]
2. write the object reference into index 0
3. read the array index as an int (32 bit JVM) or long (64 bit JVM) using the Unsafe class

You now have the location of an Object on the heap, which was obviously already known by the JVM. If you'd write in that region, you'd be modifying fields.

The Unsafe class is basically a backdoor into the JVM, into what it already can do, and what you're not supposed to be able to do in Java. All direct Buffers (FloatBuffer, IntBuffer, etc.) are backed by Unsafe. The reason that Buffers are still reasonably fast, despite the bunch of abstraction layers, is that the JVM turns all these calls to Unsafe methods into machine instructions. This is why:

    unsafe.putFloat(dst + (p + 0), unsafe.getFloat(a + (p + 0)) * unsafe.getFloat(b + (p + 0)));
    unsafe.putFloat(dst + (p + 4), unsafe.getFloat(a + (p + 4)) * unsafe.getFloat(b + (p + 4)));
    unsafe.putFloat(dst + (p + 8), unsafe.getFloat(a + (p + 8)) * unsafe.getFloat(b + (p + 8)));

is about as fast as:

    dst[p + 0] = a[p + 0] * b[p + 0];
    dst[p + 1] = a[p + 1] * b[p + 1];
    dst[p + 2] = a[p + 2] * b[p + 2];

Actually, all those Unsafe method calls are slightly faster than pure float[] access, probably because there are no bounds checks.
|
|
|
|
princec
« League of Dukes » JGO Kernel      Posts: 8089 Medals: 96
Eh? Who? What? ... Me?
|
 |
«
Reply #18 on:
2011-04-06 15:44:37 » |
|
hm, be interesting to actually look at the machine code it outputs. I'm thinking it should genuinely be as fast as C if it's intrinsified by the JVM. Cas 
|
|
|
|
Riven
« League of Dukes » JGO Kernel      Posts: 5870 Medals: 255
Hand over your head.
|
 |
«
Reply #19 on:
2011-04-06 15:54:54 » |
|
The problem with Unsafe is that all pointers are longs. On 32 bit JVMs this is rather inefficient, because all memory address calculations have to be done on longs, which the JVM then has to truncate back to an int. This simply can't be even remotely as fast as C code doing pointer bumps on ints, unless the JVM manages to get rid of the int->long->int conversion altogether.
Still, running the above code on a 32 bit JVM vs a 64 bit JVM doesn't show any performance difference, leading to the conclusion that we have another bottleneck: either the cache is the bottleneck or the JVM doesn't remotely reach the maximum possible performance, in this benchmark, for all cases: float[], FloatBuffer and FloatVector.
|
|
|
|
concerto49
JGO n00b  Posts: 13
|
 |
«
Reply #20 on:
2011-06-10 01:58:59 » |
|
Have you got a chance to rework this? Also the bytecode transformer part? This sounds exciting!
|
|
|
|
|
Riven
« League of Dukes » JGO Kernel      Posts: 5870 Medals: 255
Hand over your head.
|
 |
«
Reply #21 on:
2011-06-23 19:47:48 » |
|
First time I did bytecode transformation... but I got it to work! It's fairly basic at this point, but I will share it once I have cleaned it up.

You can choose whether you want your objects backed by a float[] or a FloatBuffer (only floats are supported atm) and possibly raw pointers at some point.

Meanwhile, here is some example code:

    import java.nio.FloatBuffer;
    import java.util.Arrays;

    public class VectorStruct {
        private int backing_offset = 0;
        private float[] backing_array = null;
        private FloatBuffer backing_buffer = null;

        public static int sizeof;

        // accesses to these fields are rewritten by the transformer
        @FieldOffset(0) public float x;
        @FieldOffset(1) public float y;
        @FieldOffset(2) public float z;

        public void index(int index) {
            this.backing_offset = this.sizeof * index;
        }

        public void test() {
            this.index(1);

            System.out.println(Arrays.toString(this.backing_array));
            this.x = 13.13f;
            System.out.println(Arrays.toString(this.backing_array));
            this.y = 14.14f;
            System.out.println(Arrays.toString(this.backing_array));
            this.z = this.x * this.y;
            System.out.println(Arrays.toString(this.backing_array));
        }

        public VectorStruct duplicate() {
            VectorStruct copy = new VectorStruct();
            copy.backing_offset = this.backing_offset;
            copy.backing_array = this.backing_array;
            copy.backing_buffer = this.backing_buffer;
            return copy;
        }
    }
It outputs (writing into the 2nd struct):

    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    [0.0, 0.0, 0.0, 13.13, 0.0, 0.0, 0.0, 0.0, 0.0]
    [0.0, 0.0, 0.0, 13.13, 14.14, 0.0, 0.0, 0.0, 0.0]
    [0.0, 0.0, 0.0, 13.13, 14.14, 185.6582, 0.0, 0.0, 0.0]
|
|
|
|
Spasi
JGO Ninja    Posts: 589 Medals: 26
Molon Lave
|
 |
«
Reply #22 on:
2011-06-25 19:38:15 » |
|
If we consider multi-threaded access, for example when doing a computation using JDK7's fork/join framework, I guess we'll have to use duplicate() in a ThreadLocal's initialValue(), is that correct? So that different threads can work on a different backing_offset.
|
|
|
|
|
Riven
« League of Dukes » JGO Kernel      Posts: 5870 Medals: 255
Hand over your head.
|
 |
«
Reply #23 on:
2011-06-26 06:52:26 » |
|
Quote from Spasi: "If we consider multi-threaded access, for example when doing a computation using JDK7's fork/join framework, I guess we'll have to use duplicate() in a ThreadLocal's initialValue(), is that correct? So that different threads can work on a different backing_offset."

Indeed. Just like direct ByteBuffer instances with relative access, mapped objects are not thread-safe.
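The per-thread duplicate pattern can be sketched with a plain ByteBuffer standing in for a mapped object, since the buffer's position plays the same role as backing_offset. `PerThreadView` is a hypothetical name, and the sketch assumes the shared buffer's own position is not modified while other threads create their views.

```java
import java.nio.ByteBuffer;

final class PerThreadView {
    private final ThreadLocal<ByteBuffer> view;

    PerThreadView(ByteBuffer shared) {
        // duplicate() shares the content but gives each thread its own position and limit
        this.view = ThreadLocal.withInitial(shared::duplicate);
    }

    /** Returns the calling thread's private view over the shared data. */
    ByteBuffer get() {
        return view.get();
    }
}
```

Each thread moves only its own position, so the sliding window never races. Writes to overlapping regions of the shared content still need coordination.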
|
|
|
|
Riven
« League of Dukes » JGO Kernel      Posts: 5870 Medals: 255
Hand over your head.
|
 |
«
Reply #24 on:
2011-06-26 07:11:39 » |
|
I rewrote the implementation to use direct memory access. All primitives are supported now.

You'll map a type using:

    MappedVec2 vec2s = MappedVec2.map(ByteBuffer)

As adding a Java agent to your application might be a tad unobvious for the average Joe, you have the option to install the bytecode transformer through code.

    public static void main(String[] args) {
        if (MappedInstanceTransformer.fork(MyApp.class, args)) {
            MappedInstanceTransformer.register(MappedVec2.class);
            MappedInstanceTransformer.register(MappedVec3.class);
            MappedInstanceTransformer.register(MappedVec4.class);
            return;
        }

        ByteBuffer bb = ByteBuffer.allocateDirect(MappedVec2.sizeof * n);
        MappedVec2 vec2s = MappedVec2.map(bb);
    }

The 'fork' method will grab the URLs of the application ClassLoader, create a new ClassLoader that transforms the classes, and call the main method again, using a class from the new ClassLoader.
|
|
|
|
concerto49
JGO n00b  Posts: 13
|
 |
«
Reply #25 on:
2011-06-26 08:36:01 » |
|
Is the source/lib still available? The download links are broken, thanks.
|
|
|
|
|
Riven
« League of Dukes » JGO Kernel      Posts: 5870 Medals: 255
Hand over your head.
|
 |
«
Reply #26 on:
2011-06-26 09:04:58 » |
|
I want to clean it up first (and run some proper benchmarks).
|
|
|
|
princec
« League of Dukes » JGO Kernel      Posts: 8089 Medals: 96
Eh? Who? What? ... Me?
|
 |
«
Reply #27 on:
2011-06-26 11:28:24 » |
|
I'll plug it into my sprite engine and see what sort of a performance boost I can get  Cas 
|
|
|
|
Riven
« League of Dukes » JGO Kernel      Posts: 5870 Medals: 255
Hand over your head.
|
 |
«
Reply #28 on:
2011-06-26 11:30:59 » |
|
It seems HotSpot doesn't really like the bytecodes I feed it.
At the moment it reaches 40% of the performance of field-access.
I believe I can do much better, by duplicating the style of bytecode javac generates.
|
|
|
|
Spasi
JGO Ninja    Posts: 589 Medals: 26
Molon Lave
|
 |
«
Reply #29 on:
2011-06-26 11:55:14 » |
|
Is that on the client VM? I've had similar code using Unsafe failing miserably on the client VM, it couldn't inline all the way to the intrinsified methods. Enabling tiered compilation helped but not much. On server everything was fine and faster than non-Unsafe code.
|
|
|
|
|
|