The Server VM (and Client VM sometimes) is driving me nuts...
I'm putting the contents of a float[] into a strided data structure with a certain offset and width (like interleaved VBO data).
One element can look like: [x,x,X,X,X,X,x,x,x] (this pattern repeats itself)
The code for this is:
    public final void putElements(int elementOffset, int elementLength, float[] array, int arrayOffset) {
        int dataOffset = this.dataOffsetOfElement(elementOffset);
        int lastDataOffset = dataOffset + elementLength * stride;

        while (dataOffset < lastDataOffset) {
            data[dataOffset + 0] = array[arrayOffset + 0];
            data[dataOffset + 1] = array[arrayOffset + 1];
            data[dataOffset + 2] = array[arrayOffset + 2];
            data[dataOffset + 3] = array[arrayOffset + 3];
            arrayOffset += width;
            dataOffset += stride;
        }
    }
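For anyone who wants to try the method above in isolation, here is a minimal self-contained sketch of the layout. The class name `StridedBuffer`, its constructor, and the body of `dataOffsetOfElement` are my assumptions for illustration (the real class is not shown in this post), and the inner loop is the generic width-agnostic form rather than the unrolled one.

```java
import java.util.Arrays;

/** Hypothetical holder for the strided layout; not the original class. */
final class StridedBuffer {
    final float[] data;
    final int stride; // floats per element, e.g. 9 for [x,x,X,X,X,X,x,x,x]
    final int offset; // index of the first written float inside an element, e.g. 2
    final int width;  // number of floats written per element, e.g. 4

    StridedBuffer(int elementCount, int stride, int offset, int width) {
        if (offset < 0 || width < 0 || offset + width > stride) {
            throw new IllegalArgumentException("offset + width must fit inside stride");
        }
        this.data = new float[elementCount * stride];
        this.stride = stride;
        this.offset = offset;
        this.width = width;
    }

    // Assumed definition: start of the element plus the offset inside it.
    int dataOffsetOfElement(int element) {
        return element * stride + offset;
    }

    // Copies elementLength groups of 'width' floats from array into data.
    void putElements(int elementOffset, int elementLength, float[] array, int arrayOffset) {
        int dataOffset = dataOffsetOfElement(elementOffset);
        for (int e = 0; e < elementLength; e++) {
            for (int i = 0; i < width; i++) {
                data[dataOffset + i] = array[arrayOffset + i];
            }
            arrayOffset += width;
            dataOffset += stride;
        }
    }

    public static void main(String[] args) {
        StridedBuffer buf = new StridedBuffer(2, 9, 2, 4);
        buf.putElements(0, 2, new float[] {1, 2, 3, 4, 5, 6, 7, 8}, 0);
        // Slots 2..5 of each 9-float element are filled, the rest stay 0.0
        System.out.println(Arrays.toString(buf.data));
    }
}
```

With a stride of 9, an offset of 2 and a width of 4, the first element receives 1..4 in slots 2..5 and the second receives 5..8 in slots 11..14.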
Now the silly part is that it gets executed much faster with a switch statement around it:
        switch (width) {
            case 1:
                dataOffset += stride * elementLength;
                for (int e = elementLength - 1; e >= 0; e--)
                    data[dataOffset -= stride] = array[arrayOffset + e];
                break;

            case 2:
                break;

            case 3:
                break;

            case 4:
                while (dataOffset < lastDataOffset) {
                    data[dataOffset + 0] = array[arrayOffset + 0];
                    data[dataOffset + 1] = array[arrayOffset + 1];
                    data[dataOffset + 2] = array[arrayOffset + 2];
                    data[dataOffset + 3] = array[arrayOffset + 3];
                    arrayOffset += width;
                    dataOffset += stride;
                }
                break;

            case 5:
                break;

            default:
                for (int e = elementLength - 1; e >= 0; e--) {
                    for (int i = width - 1; i >= 0; i--)
                        data[dataOffset + i] = array[arrayOffset + i];
                    arrayOffset += width;
                    dataOffset += stride;
                }
                break;
        }
    }
Making tiny changes, like replacing 'width' with '4', makes the HotSpot VM drop everything and generate some slow execution path (maybe it requires an additional register, and it just ran out? Should that cause a performance loss of 50%?)
When I code for performance, I have to find the fastest way by trial and error. The difference can be up to a factor of 3 when applying lots of seemingly irrelevant changes, or putting a lot of never-to-be-executed code around a loop to make it 21% faster. And then that will only be the fastest loop on that version of that vendor's VM.
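To make that trial and error a bit more concrete, here is a rough sketch of the kind of timing loop I mean. Everything in it (the class `LoopTiming`, the method names, the array sizes and run counts) is made up for illustration and is not the code I actually measured. It is a crude best-of-N measurement with `System.nanoTime()`, so treat the numbers as hints only; a dedicated harness such as JMH would be more trustworthy.

```java
final class LoopTiming {
    // Variant A: generic inner loop, width is a runtime value.
    static void copyVariableWidth(float[] data, float[] array, int stride, int width, int elements) {
        int dataOffset = 0, arrayOffset = 0;
        for (int e = 0; e < elements; e++) {
            for (int i = 0; i < width; i++) {
                data[dataOffset + i] = array[arrayOffset + i];
            }
            arrayOffset += width;
            dataOffset += stride;
        }
    }

    // Variant B: hand-unrolled for a fixed width of 4.
    static void copyWidth4(float[] data, float[] array, int stride, int elements) {
        int dataOffset = 0, arrayOffset = 0;
        for (int e = 0; e < elements; e++) {
            data[dataOffset + 0] = array[arrayOffset + 0];
            data[dataOffset + 1] = array[arrayOffset + 1];
            data[dataOffset + 2] = array[arrayOffset + 2];
            data[dataOffset + 3] = array[arrayOffset + 3];
            arrayOffset += 4;
            dataOffset += stride;
        }
    }

    // Runs the task warmupRuns times so the JIT can compile it,
    // then returns the fastest of measuredRuns timed runs in nanoseconds.
    static long bestNanos(Runnable task, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run();
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            task.run();
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    public static void main(String[] args) {
        final int stride = 9, elements = 10_000;
        final float[] array = new float[elements * 4];
        for (int i = 0; i < array.length; i++) {
            array[i] = i;
        }
        final float[] a = new float[elements * stride];
        final float[] b = new float[elements * stride];

        long variable = bestNanos(() -> copyVariableWidth(a, array, stride, 4, elements), 5_000, 200);
        long unrolled = bestNanos(() -> copyWidth4(b, array, stride, elements), 5_000, 200);
        System.out.println("variable width: " + variable + " ns, unrolled: " + unrolled + " ns");
    }
}
```

Running it on both the Client and the Server VM, and on different VM versions, is how the differences show up; the ranking of the two variants is not guaranteed to be the same anywhere else.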
I'm going to implement this tiny loop in C and build a DLL; that way I'm certain the code is natively compiled properly on every VM. It's such a shame HotSpot isn't predictable.