Java-Gaming.org    
Featured games (79)
games approved by the League of Dukes
Games in Showcase (476)
Games in Android Showcase (106)
games submitted by our members
Games in WIP (533)
games currently in development
News: Read the Java Gaming Resources, or peek at the official Java tutorials
 
    Home     Help   Search   Login   Register   
Pages: 1 [2]
  ignore  |  Print  
  My na(t)ive PerlinNoise impl. is 3 times slower than server VM  (Read 8286 times)
0 Members and 1 Guest are viewing this topic.
Online Riven
« League of Dukes »

JGO Overlord


Medals: 743
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #30 - Posted 2010-09-11 01:06:38 »

You could try profiling the C version with gprof
http://www.cs.duke.edu/~ola/courses/programming/gprof.html

"Basically, it looks into each of your functions and inserts code at the head and tail of each one to collect timing information"

That's going to skew results so badly... perlin-noise is taking so few cycles that injecting timing-code in all (inlined) methods will render any outcome useless.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline Wildern

Junior Member





« Reply #31 - Posted 2010-09-11 01:26:23 »

You had mentioned there was barely a difference between inline and not, I thought this was worth a shot as it is very simple/quick to try.
Online Riven
« League of Dukes »

JGO Overlord


Medals: 743
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #32 - Posted 2010-09-11 01:31:07 »

You had mentioned there was barely a difference between inline and not, I thought this was worth a shot as it is very simple/quick to try.

That's because it always gets inlined, as visible in the ASM code.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Games published by our own members! Check 'em out!
Legends of Yore - The Casual Retro Roguelike
Offline Roquen
« Reply #33 - Posted 2010-09-11 08:07:35 »

Wildern: VTune (and similar) profilers are much more interesting. 

The bigger question (in my mind) is why the JIT is do so well.  There are tons of things riven could do if he wanted to improve the native performance.

(edit) And if he was really interested in high performance..this should be done on the GPU.
Online Riven
« League of Dukes »

JGO Overlord


Medals: 743
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #34 - Posted 2010-09-13 09:42:37 »

There are tons of things riven could do if he wanted to improve the native performance.

Right, care to provide some insight instead of vague hand waving?

Everything you suggested was either just as fast or much slower. My own optimization doubled the performance.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline Roquen
« Reply #35 - Posted 2010-09-13 10:19:57 »

Use SIMD.  Branch elimination.  Add a static init method that determines hardware arch.  Use closest arch match for computation.

(edit) And use a better compiler.
Online Riven
« League of Dukes »

JGO Overlord


Medals: 743
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #36 - Posted 2010-09-13 13:28:56 »

Use SIMD.  Branch elimination.  Add a static init method that determines hardware arch.  Use closest arch match for computation.

(edit) And use a better compiler.

Thanks. I'm not too confident about using SIMD in perlin noise, as it's not that easy to vectorize at it branches everywhere. From what I read about other people's attempts, the SIMD code is just as fast, until they add _mm_prefetch() which makes is significantly faster.

One question: if I pass the compiler the arch of the CPU I'm running, I expect that the compiler is generating the 'fastest possible' code for that CPU (just as fast as writing code for that specific arch). So if passing a (more recent) arch (flag) to the compiler doesn't result in fast code, would it be safe to assume that writing specific code for that arch won't change the performance either?

I'm looking at those free versions Visual Studio, and will post the result. I'm not expecting much, but who knows.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline CommanderKeith
« Reply #37 - Posted 2010-09-13 13:45:50 »

Admirable persistence. So you're trying to replicate the performance using native code to beat or at least come close to the server VM? Pretty cool.

i never knew how many C experts there were here.

This seems like a good lesson for me in why I should not bother learning C. So much hassle, and it's still slower.

Offline Roquen
« Reply #38 - Posted 2010-09-13 14:16:57 »

I've been meaning to mention this, but keep forgetting.  Changing the floor call to truncation will cause defects (if any of the inputs are negative).  This should be choosing the lower coordinate of cube (cell) which contains the evaluation point.  You might what to try using this instead and then check the results (vs. performance):

1  
  gx = (x > 0 ? (int)x : (int)(x-1)) & 0xff; // etc for gy,gz


(edit: Of course this is introducing 3 if's. Leaving the code as-is is correct as long as the input space is limited to positive.)

And another complete aside: Perlin presented a replacement for gradient noise, called simplex noise, sometime around 2000/2001 which has less defects and is faster (although maybe not in low dimensions).

I'm not familiar with the more recent version of GCC.  If memory serves there are multiple sets of options related to CPU specific code gen (as opposed to general optimization).  The first set basically specifies the allowed set of opcodes to use and the second is the specific hardware to target for coding scheduling.  So, for instance, you could only allow opcodes present in the P3, but set the scheduling for a newer processor.  I'm pretty sure that they did add a basic switch that says "do the best you can for this machine" switch.

For SIMD, in addition to prefetching, since you're array processing you could do non-temporal stores. Well, actually you could do non-temporal stores in the scalar code..just means directly writing _mm_XXX instead of the compiler doing it for you.

@CommaderKeith: If only this was more common.  In theory runtime compiled code could virtually always be faster.  (Go LLVM & Shark!)
Online Riven
« League of Dukes »

JGO Overlord


Medals: 743
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #39 - Posted 2010-09-13 14:26:44 »

Admirable persistence. So you're trying to replicate the performance using native code to beat or at least come close to the server VM? Pretty cool.

i never knew how many C experts there were here.

This seems like a good lesson for me in why I should not bother learning C. So much hassle, and it's still slower.

The current C code is faster, but only because it uses another strategy (switch-table in C for grad(), which is not as fast in Java). Besides that, maybe GCC is just producing inefficient machinecode.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Games published by our own members! Check 'em out!
Legends of Yore - The Casual Retro Roguelike
Offline delt0r

JGO Knight


Medals: 26
Exp: 18 years


Computers can do that?


« Reply #40 - Posted 2010-09-13 15:32:00 »

I really don't know why C has a reputation of being fast (Fortan is Faster  Tongue). It can be after a lot of work. But out of the box its a rare thing. Doing asm is *not* using C. Sure it can also be fast, but at what labor cost? Basically with the compilation unit stuck at the file+macro level and poor reference semantics, the compiler simply can't optimize to the same level. A little runtime data can make some optimizations easy (ie not np hard), and more importantly the "compilation unit" for the jvm is the "fully linked code".

In my experience gcc is not bad. Intels Compiler does produce faster code on intels and I try and get the department to shell out for it for the intel clusters we have. But i wouldn't expect double performance or anything usually more like 20-50%.

C has some pretty big advantages if you need to have some asm blocks etc. Case in point there are quite a few highly optimized SSE etc asm implementations of perlin noise.   

I have no special talents. I am only passionately curious.--Albert Einstein
Offline Roquen
« Reply #41 - Posted 2010-09-29 10:33:36 »

Ran across Perlin's "Improved Noise" article while looking something up and noticed that there is an error in the gradient computation (did not check if the conditional move version is correct).  Correction (explicit vectors listed in Section 3):

1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
  switch(hash & 0xF) {
    case 0x0: return  x+y;
    case 0x1: return -x+y;
    case 0x2: return  x-y;
    case 0x3: return -x-y;
    case 0x4: return  x+z;
    case 0x5: return -x+z;
    case 0x6: return  x-z;
    case 0x7: return -x-z;
    case 0x8: return  y+z;
    case 0x9: return -y+z;
    case 0xA: return  y-z;
    case 0xB: return -y-z;
    case 0xC: return  x+y;
    case 0xD: return -x+y;
    case 0xE: return -y+z;
    case 0xF: return -y-z;

    default:
      // never happens
     return 0;
  }
Online Riven
« League of Dukes »

JGO Overlord


Medals: 743
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #42 - Posted 2010-10-15 11:46:39 »

Thanks for the patch!

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Online Riven
« League of Dukes »

JGO Overlord


Medals: 743
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #43 - Posted 2010-11-24 17:53:25 »

 Undecided

Your 'patch' had a bunch of bugs...


This one is correct:
1  
2  
3  
4  
5  
6  
7  
8  
9  
10  
11  
12  
13  
14  
15  
16  
17  
18  
19  
20  
21  
22  
23  
24  
25  
26  
27  
float grad(int hash, float x, float y, float z)
   {  
   //float u = (h < 8) ? x : y;
  //float v = (h < 4) ? y : ((h == 12 || h == 14) ? x : z);
  //return ((h & 1) == 0 ? u : -u) + ((h & 2) == 0 ? v : -v);

   switch(hash & 0xF)
   {
      case 0x0: return  x + y;
      case 0x1: return -x + y;
      case 0x2: return  x - y;
      case 0x3: return -x - y;
      case 0x4: return  x + z;
      case 0x5: return -x + z;
      case 0x6: return  x - z;
      case 0x7: return -x - z;
      case 0x8: return  y + z;
      case 0x9: return -y + z;
      case 0xA: return  y - z;
      case 0xB: return -y - z;
      case 0xC: return  y + x;
      case 0xD: return -y + z;
      case 0xE: return  y - x;
      case 0xF: return -y - z;
      default: return 0; // never happens
  }
   }

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline Roquen
« Reply #44 - Posted 2010-11-24 19:15:02 »

I rechecked it and "looks" clean to me.  This is a direct translation of the vectors given in Perlin's paper that I linked above (in the noted section) and not a expansion of the "if" version.
Offline Jono
« Reply #45 - Posted 2010-11-26 10:49:00 »

I rechecked it and "looks" clean to me.  This is a direct translation of the vectors given in Perlin's paper that I linked above (in the noted section) and not a expansion of the "if" version.
I haven't read the paper and don't know the algorithm for Perlin noise, but as a gambling man I'll put money on Riven for the following reason: in the version he put up each of x,y and z appears 5 times as a positive value and 5 times as negative. In the one you posted y is negative 6 times and positive only 4, which makes me feel that the resulting distribution is probably not uniform.
Online Riven
« League of Dukes »

JGO Overlord


Medals: 743
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #46 - Posted 2010-11-26 11:14:57 »

I viewed the document.

I think it's safe to say that the "new gradient distributions" suffer from nasty patterns: you get a lot of straight lines (in the sample render I see 5) and rectangular shapes.

IMHO it is definitely not an improvement, so I'll stick to the "old gradient distributions" which is admittedly less random, but more irregular.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Offline Nate

JGO Kernel


Medals: 145
Projects: 4
Exp: 14 years


Esoteric Software


« Reply #47 - Posted 2010-11-27 09:59:55 »

If speed is important, is simplex noise any better? Roquen mentioned it above.

Offline Roquen
« Reply #48 - Posted 2010-11-27 16:21:38 »

Riven's new version is the correct and is the new gradient set.  The only difference between the two is the cases 0xD & 0xE are swapped, which has no effect in theory, so there is no mathematical reason to prefer one over the other.

Another way to squeeze out some extra cycles would be to use the older ease (fade) formula, which shouldn't produce noticable defects for simple usages.  The defects of the old ease function become more apparent in higher dimensions:  either noise itself or usage...see the paper above figures 1a & 1b.  The problem with the older fade is that it is only first order continous where the newer one is in all dimensions. So you probably won't notice any defects if noise is simply being used to generate textures, but probably will if generating terrain (for example)...of course "beauty is in the eye of the beholder", or in this case "the developer".  For reference the older ease is:

1  
(t*t*(3.f-(t+t)));


Nate: In theory simplex noise in three and higher dimensions should be faster than gradient noise.  There are a couple of issues though.  The first is that the "look" of simplex noise isn't the same, which could be a problem (see here for some examples).  The second is I that know of no good implemenations (either for reference or direct usage).  Perlin's original design is targeting hardware and his reference implementation reflects that.  A software version should be significantly different to be "fast".  There is a paper "Simplex noise demystified" is reasonable for information, but the author's implementation is very questionable IMHO.  Specifically he falls back to using permutation tables and actually performs dot products with explict vectors (stored in tables), which is missing part of the point of simplex noise given modern hardware. (To be fair, the author states the code is more for clarity than speed) So simplex noise would require a little blood, sweat and tears (or maybe a lot).

(EDIT:)
To state the obvious to any of our dear readers: Always use the lowest dimensional noise you need to perform a given task.
Offline badlogicgames
« Reply #49 - Posted 2011-01-03 23:24:00 »

Don't use O3, O2 is often times faster. GCC's vectorizing optimizer is a bit shitty as well. I had a similar issue with some CPU skinning code. I compile my native code with this at the moment:

-c -Wall -O2 -mfpmath=sse -msse2

The skinning code is now around 10% faster than the equivalent Java code (all manually inlined, all plain old data types etc. etc.).

If you want really good performance use the Intel compiler as suggested earlier. It will make a huge difference.

http://www.badlogicgames.com - musings on Android and Java game development
Offline Roquen
« Reply #50 - Posted 2011-01-04 07:48:02 »

Your list of optimization switches is kinda short for GCC. You probably want to specify the arch for scheduling, omit frame pointers and relax FP rules.

Over the holidays I wiped out old linux old and fresh installed a newer version and to my dismay couldn't find the free version of the intel compiler.  So either my search skills are weak or they've discontinued it (big fat sadface).

I'm having a fit of work-avoidance-itus so I'll take a quick stab at a rough pass at a slightly usable 3D simplex noise.
Pages: 1 [2]
  ignore  |  Print  
 
 
You cannot reply to this message, because it is very, very old.

 

Add your game by posting it in the WIP section,
or publish it in Showcase.

The first screenshot will be displayed as a thumbnail.

pw (18 views)
2014-07-24 01:59:36

Riven (17 views)
2014-07-23 21:16:32

Riven (14 views)
2014-07-23 21:07:15

Riven (17 views)
2014-07-23 20:56:16

ctomni231 (45 views)
2014-07-18 06:55:21

Zero Volt (40 views)
2014-07-17 23:47:54

danieldean (32 views)
2014-07-17 23:41:23

MustardPeter (36 views)
2014-07-16 23:30:00

Cero (51 views)
2014-07-16 00:42:17

Riven (50 views)
2014-07-14 18:02:53
HotSpot Options
by dleskov
2014-07-08 03:59:08

Java and Game Development Tutorials
by SwordsMiner
2014-06-14 00:58:24

Java and Game Development Tutorials
by SwordsMiner
2014-06-14 00:47:22

How do I start Java Game Development?
by ra4king
2014-05-17 11:13:37

HotSpot Options
by Roquen
2014-05-15 09:59:54

HotSpot Options
by Roquen
2014-05-06 15:03:10

Escape Analysis
by Roquen
2014-04-29 22:16:43

Experimental Toys
by Roquen
2014-04-28 13:24:22
java-gaming.org is not responsible for the content posted by its members, including references to external websites, and other references that may or may not have a relation with our primarily gaming and game production oriented community. inquiries and complaints can be sent via email to the info‑account of the company managing the website of java‑gaming.org
Powered by MySQL Powered by PHP Powered by SMF 1.1.18 | SMF © 2013, Simple Machines | Managed by Enhanced Four Valid XHTML 1.0! Valid CSS!