Show Posts
|
|
Pages: [1]
|
|
1
|
Game Development / Shared Code / Re: 13.8x faster atan2 (updated)
|
on: 2012-05-07 11:59:36
|
Fast/slow. It's all relative. To put some numbers on it in terms of instruction latency/throughput (both in cycles) for modern(ish) Intel. Ranges are to cover a reasonable set of hardware.
SQRTSD: 20-39/14-39 (one complaint double sqrt) On unspecified input ranges these can drop to ~6/~10. SQRTSS: 23-32/23-32 (one compliant single sqrt) On unspecified input ranges these can drop to ~6/~10. RSQRTSS: 3-6/1-4 (approximate 1/sqrt for singles. We can't do this...big fat sadface)
So what I intended to say about sqrt being "hard" is that unlike lots of other functions reducing the range or precision to get a faster version is much more difficult. About the only way you can succeed is if you know that it's very near a specific value (like when renormalizing). That old SGI trick (popularized by ID) should always be slower that simply using the hardware.
After some googling I found out these are MMX instructions, and Java can't use those afaik. Normal math happens using FPU instructions. I copied this from "Intel Architecture Optimization Reference Manual": "Additionally, while the multi-cycle floating-point instructions, fdiv and fsqrt, execute in the floating-point unit pipe, integer instructions can be executed in parallel." So it pays off to do lots of non-fp-math in inner-loops that use Math.sqrt and /. Separating (or keeping) those in different inner-loops would be bad. And from "How to optimize for the Pentium family of microprocessors" I read that FSQRT uses 70 cycles and cannot overlap integer multiplication instructions (they will block the pipeline). No idea about modern processors however. They should be much more efficient in this, especially the new Core Ix and Celeron Sandy Bridge CPUs.
|
|
|
|
|
2
|
Game Development / Shared Code / Re: 13.8x faster atan2 (updated)
|
on: 2012-05-07 04:56:25
|
@Zom-B the original poster said he used a "lookup-table." Rainbow tables are lookup tables for hash codes.
Rainbow tables are clever structures that eliminate the need for memory-eating lookup tables (think in the range of several terabytes) (@Riven do you have fast implementations for asin and acos?)
Would you use them? Haha that's a good question, I don't see why I would ever use acos and asin in a performance critical application  How about fractals / fractal art?  (actually you'd need the highest possible precision here so the statement is void) Fast/slow. It's all relative. To put some numbers on it in terms of instruction latency/throughput (both in cycles) for modern(ish) Intel. Ranges are to cover a reasonable set of hardware.
SQRTSD: 20-39/14-39 (one complaint double sqrt) On unspecified input ranges these can drop to ~6/~10. SQRTSS: 23-32/23-32 (one compliant single sqrt) On unspecified input ranges these can drop to ~6/~10. RSQRTSS: 3-6/1-4 (approximate 1/sqrt for singles. We can't do this...big fat sadface)
So what I intended to say about sqrt being "hard" is that unlike lots of other functions reducing the range or precision to get a faster version is much more difficult. About the only way you can succeed is if you know that it's very near a specific value (like when renormalizing). That old SGI trick (popularized by ID) should always be slower that simply using the hardware.
I sometimes use "float d = 1f / (float)Math.sqrt(dx * dx + dy * dy);" (more often double actually), so hopefully the compiler uses RSQRTSS there.
|
|
|
|
|
3
|
Game Development / Shared Code / Re: 13.8x faster atan2 (updated)
|
on: 2012-05-02 10:29:26
|
Who got what idea? I don't see anything in this thread even remotely related to rainbow tables. A side side note. At least on my implementation, Sun/Oracle Java's for Windows, in rt.jar (the part of the JVM that is in pure Java) all Math.xxx methods delegate to StrictMath.xxx. So directly calling StrictMath.xxx saves one call in performance.
StrictMath is almost never called from Math, the JVM inlines it with its own faster native implementation. EDIT: Hehe looks like Roquen's link already says that ^^  I didn't know that. It isn't flagged as native so I never expected the JVM would generate exceptions for those. I did some quick tests, and indeed the trig functions are in the order of 80 times faster when calling Math.  On the other hand, some functions like min/max are 35% faster when calling StrictMath (I haven't tested others yet). Edit: A more advanced test (using my second test type, and preventing code from being optimized away) shows now that StrictMath is about 10% faster than Math.  Edit2: (StrictMath is ...) sin,cos: 50% slower tan: 10% faster atan: about the same sqrt: 9000% slower cbrt: about the same floor: about 10% slower ceil: about the same max,min: about the same And I think we're drifting too much off topic. shouldn't someone make a new topic for this?
|
|
|
|
|
4
|
Game Development / Shared Code / Re: 13.8x faster atan2 (updated)
|
on: 2012-05-01 07:16:36
|
|
A side side note. At least on my implementation, Sun/Oracle Java's for Windows, in rt.jar (the part of the JVM that is in pure Java) all Math.xxx methods delegate to StrictMath.xxx. So directly calling StrictMath.xxx saves one call in performance.
|
|
|
|
|
5
|
Game Development / Shared Code / Re: 13.8x faster atan2 (updated)
|
on: 2012-04-27 06:23:51
|
In the past I have found a very elegant and reliable way to do microbenchmarking while compensating for the effects of overhead. Calculate the time it takes one loop without the function being tested, and one with. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
| public static void main(String[] args) int samples = 1000000; long l1; long l2; int n = 0;
for (int tries = 0; tries < 10; tries++) { l1 = System.nanoTime(); for (int i = 0; i < samples; i++) {} l1 = System.nanoTime() - l1;
l2 = System.nanoTime(); for (int i = 0; i < samples; i++) { n++; } l2 = System.nanoTime() - l2;
System.out.println("t0 = " + l1 + "\tt=" + l2 + "\tspeed=" + samples * 1e9f / (l2 - l1)); } } |
Result: t0 = 1251556 t=779708 speed=-2.11932659E9 t0 = 548115 t=838654 speed=3.44187878E9 t0 = 546159 t=752889 speed=4.837227E9 t0 = 531632 t=784457 speed=3.95530496E9 t0 = 523251 t=883911 speed=2.77269453E9 t0 = 546159 t=807924 speed=3.82022042E9 t0 = 543924 t=804013 speed=3.84483763E9 t0 = 545879 t=803454 speed=3.88236442E9 t0 = 543365 t=804013 speed=3.83659187E9 t0 = 546159 t=804012 speed=3.87817856E9My 2.5GHz Intel Core (duo) can do about 3.88 billion times n++ in a single thread. When removing n++ (making both loops the same), the result is: t0 = 1259099 t=602311 speed=-1.52256128E9 t0 = 544762 t=556216 speed=8.730574E10 t0 = 551467 t=540851 speed=-9.4197441E10 t0 = 666007 t=552864 speed=-8.8383724E9 t0 = 545041 t=553982 speed=1.11844311E11 t0 = 545600 t=556496 speed=9.1776795E10 t0 = 546159 t=603708 speed=1.73764956E10 t0 = 561245 t=540571 speed=-4.8369934E10 t0 = 553701 t=542527 speed=-8.9493463E10 t0 = 551467 t=607619 speed=1.78088038E10which are ridiculously high numbers (30 times higher) meaning this overhead compensation is fairly reliable relative to the measuring noise. (increasing samples increases accuracy) Now on to atan2. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
| public static void main(String[] args) { Random RND = new Random();
int samples = 1000000; long l1; long l2; long l3; float x; float y;
for (int tries = 0; tries < 10; tries++) { l1 = System.nanoTime(); for (int i = 0; i < samples; i++) { x = (RND.nextFloat() - 0.5f) * 200; y = (RND.nextFloat() - 0.5f) * 200; } l1 = System.nanoTime() - l1;
l2 = System.nanoTime(); for (int i = 0; i < samples; i++) { x = (RND.nextFloat() - 0.5f) * 200; y = (RND.nextFloat() - 0.5f) * 200; FastMath.atan2(x, y); } l2 = System.nanoTime() - l2;
l3 = System.nanoTime(); for (int i = 0; i < samples; i++) { x = (RND.nextFloat() - 0.5f) * 200; y = (RND.nextFloat() - 0.5f) * 200; StrictMath.atan2(x, y); } l3 = System.nanoTime() - l3;
float fast = samples * 1e9f / (l2 - l1); float slow = samples * 1e9f / (l3 - l1); System.out.println("speed fast=" + fast + "\tspeed slow=" + slow + "\tfactor=" + fast / slow); } } |
result: speed fast=2.5838188E7 speed slow=3102130.2 factor=8.329176 speed fast=4.2304908E7 speed slow=6922386.0 factor=6.111319 speed fast=3.6428208E7 speed slow=6051220.5 factor=6.019977 speed fast=3.3007322E7 speed slow=5712438.0 factor=5.7781496 speed fast=3.6745696E7 speed slow=6234034.0 factor=5.8943686 speed fast=4.6704136E7 speed slow=6152503.5 factor=7.5910783 speed fast=3.9707864E7 speed slow=6488173.5 factor=6.1200376 speed fast=5.0430332E7 speed slow=6682681.5 factor=7.5464215 speed fast=4.770246E7 speed slow=6730517.0 factor=7.087488 speed fast=4.857506E7 speed slow=6979997.5 factor=6.9591804I've also invented another, even more accurate, but more limited, microbenchmarking technique. Call the function once in the first loop and 10 times in the other, then solve for the run time of one execution. This is not applicable for atan2 because you can't call it multiple times in a row and expect the performance to be flat (and also calling random in between makes the technique useless).
|
|
|
|
|
6
|
Game Development / Shared Code / Re: 13.8x faster atan2 (updated)
|
on: 2012-04-26 20:01:28
|
|
I use JDK 1.7 with Eclipse Indigo on Core2(duo) 2.5GHz and 32-bit windows XP. I don't use the -server flag.
I don't see how my version could possibly be slower, unless you copied the early double variant which I soon replaced with the float variant.
My algorithm uses 3 comparisons, 1 multiplication, 1 division, 1 addition, 1 table look-up and 0~1 negations.
Riven's version uses 3 comparisons, 4 multiplications, 1 division, 1 addition, 1 table look-up, 0~2 negations, 5~7 assignments and 2 casts.
|
|
|
|
|
7
|
Game Development / Shared Code / Re: 13.8x faster atan2 (updated)
|
on: 2012-04-26 14:52:19
|
|
I just benchmarked my real application and it went from 20fps to 70fps. It's not only atan2, the 4th operation in the following sequence is atan2. Other operations include loading an image (memory intensive).
processing times in milliseconds.
before: ms=1.298769 1.79073 1.291225 31.39561 0.686121 1.968686 0.729702 1.721866
after: ms=1.283403 1.842692 1.211607 4.929677 0.801498 2.235479 0.755683 1.747324
From 31ms to 5ms is a 6x improvement.
My raw code+benchmark give this:
FastMath: 42ms, sum=-2688.061 JavaMath: 562ms, sum=-2690.9863 factor: 13.380953
FastMath: 40ms, sum=-2688.061 JavaMath: 519ms, sum=-2690.9863 factor: 12.975
In contrast, on my machine, Riven's original algorithm+benchmark was 8.5x faster out-of-the-box, as opposed to the claimed 13.8x
|
|
|
|
|
8
|
Game Development / Shared Code / Re: 13.8x faster atan2 (updated)
|
on: 2012-04-26 13:12:57
|
I improved it. Now it uses O(N) memory instead of O(N 2) and because it eliminates unnecessary assignments, multiplications, additions and 2-dimensional addressing, it's a lot faster. I increased the table size for more precision. The theory:  I added the option to not only return results from -Pi to Pi, but any range you want. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105
| public class FastMath { public static void main(String[] args) { float min = -100; float max = +100; float step = 0.12f;
for (int i = 0; i < 8; i++) { long t0A = System.nanoTime() / 1000000L; float sumA = 0.0f; for (float y = min; y < max; y += step) for (float x = min; x < max; x += step) sumA += atan2(y, x); long t1A = System.nanoTime() / 1000000L;
long t0B = System.nanoTime() / 1000000L; float sumB = 0.0f; for (float y = min; y < max; y += step) for (float x = min; x < max; x += step) sumB += Math.atan2(y, x); long t1B = System.nanoTime() / 1000000L;
System.out.println(); System.out.println("FastMath: " + (t1A - t0A) + "ms, sum=" + sumA); System.out.println("JavaMath: " + (t1B - t0B) + "ms, sum=" + sumB); System.out.println("factor: " + (float)(t1B - t0B) / (t1A - t0A)); } }
private static final int SIZE = 1024; private static final float STRETCH = Math.PI; private static final int EZIS = -SIZE; private static final float[] ATAN2_TABLE_PPY = new float[SIZE + 1]; private static final float[] ATAN2_TABLE_PPX = new float[SIZE + 1]; private static final float[] ATAN2_TABLE_PNY = new float[SIZE + 1]; private static final float[] ATAN2_TABLE_PNX = new float[SIZE + 1]; private static final float[] ATAN2_TABLE_NPY = new float[SIZE + 1]; private static final float[] ATAN2_TABLE_NPX = new float[SIZE + 1]; private static final float[] ATAN2_TABLE_NNY = new float[SIZE + 1]; private static final float[] ATAN2_TABLE_NNX = new float[SIZE + 1];
static { for (int i = 0; i <= SIZE; i++) { float f = (float)i / SIZE; ATAN2_TABLE_PPY[i] = (float)(StrictMath.atan(f) * STRETCH / StrictMath.PI); ATAN2_TABLE_PPX[i] = STRETCH * 0.5f - ATAN2_TABLE_PPY[i]; ATAN2_TABLE_PNY[i] = -ATAN2_TABLE_PPY[i]; ATAN2_TABLE_PNX[i] = ATAN2_TABLE_PPY[i] - STRETCH * 0.5f; ATAN2_TABLE_NPY[i] = STRETCH - ATAN2_TABLE_PPY[i]; ATAN2_TABLE_NPX[i] = ATAN2_TABLE_PPY[i] + STRETCH * 0.5f; ATAN2_TABLE_NNY[i] = ATAN2_TABLE_PPY[i] - STRETCH; ATAN2_TABLE_NNX[i] = -STRETCH * 0.5f - ATAN2_TABLE_PPY[i]; } }
public static final float atan2(float y, float x) { if (x >= 0) { if (y >= 0) { if (x >= y) return ATAN2_TABLE_PPY[(int)(SIZE * y / x + 0.5)]; else return ATAN2_TABLE_PPX[(int)(SIZE * x / y + 0.5)]; } else { if (x >= -y) return ATAN2_TABLE_PNY[(int)(EZIS * y / x + 0.5)]; else return ATAN2_TABLE_PNX[(int)(EZIS * x / y + 0.5)]; } } else { if (y >= 0) { if (-x >= y) return ATAN2_TABLE_NPY[(int)(EZIS * y / x + 0.5)]; else return ATAN2_TABLE_NPX[(int)(EZIS * x / y + 0.5)]; } else { if (x <= y) return ATAN2_TABLE_NNY[(int)(SIZE * y / x + 0.5)]; else return ATAN2_TABLE_NNX[(int)(SIZE * x / y + 0.5)]; } } } } |
|
|
|
|
|
|
Add your game by posting it in the WIP section,
or publish it in Showcase.
The first screenshot will be displayed as a thumbnail.
|
|