OK, Remon posted a response on the CFXWeb forum. I'll copy it here for people to see.

-----COPIED TEXT BEGINS-----
...
Now, I'll reply here to the discussion on that board because I don't feel like registering for yet another one.

:
It's really odd that he complains about casts, seems like he is using an old VM and even then he should be able to get rid of most casts by writing his own primitive collection classes. I've never had casts impact performance with any of the 1.4 VMs. Maybe he is really confusing a cast with wrapping/unwrapping primitive types into objects when using the standard collections.
I'm mainly complaining about float <-> int casts; the thing is running on 1.4.1. And obviously, the other BIG benefit a C(++)-based engine has is pointers: without them I can't do certain tricks that could speed up my inner interpolation render loop by at least 100%.
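For anyone wondering what avoiding those casts can look like: one common workaround in software renderers (an assumption on my part, not something either engine is confirmed to do) is to step interpolated values in 16.16 fixed point, so the casts happen once per span and the inner loop only does integer adds and shifts. A minimal sketch, with made-up class and method names:

```java
// Illustrative sketch only: FixedPointLerp and interpolate() are hypothetical names.
final class FixedPointLerp {
    static final int SHIFT = 16; // 16.16 fixed point: 16 integer bits, 16 fraction bits

    /** Linearly interpolates from start to end over 'steps' samples, returning integer values. */
    static int[] interpolate(float start, float end, int steps) {
        int[] out = new int[steps];
        // The only float -> int casts happen here, once per span instead of once per sample.
        int value = (int) (start * (1 << SHIFT));
        int delta = steps > 1 ? (int) ((end - start) * (1 << SHIFT)) / (steps - 1) : 0;
        for (int i = 0; i < steps; i++) {
            out[i] = value >> SHIFT; // drop the fraction bits, no cast needed
            value += delta;          // integer add replaces the float multiply-and-cast
        }
        return out;
    }
}
```

The truncated step can drift slightly over long spans, so this trades a little accuracy for speed. Whether it actually wins on a given VM would have to be measured.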
I also vaguely remember someone from Sun coming here and asking whether we were affected by slowness in the current sin and cos implementations and mentioning that there may be a faster way to do them.
The trig methods aren't really used during raytracing, so they're not really an issue for me personally. I use vector dot products to get cos.N, no need for trig methods. The main one I have a serious problem with is Math.sqrt(d), which, as you can imagine, is called a lot, since it's used to get vector lengths amongst other things, and I can't always leave the length squared.
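To illustrate the two points above: the dot product of two unit vectors is the cosine of the angle between them, and distance comparisons can stay in squared form so Math.sqrt is only needed when the actual length is required. A rough sketch, where Vec3Math and its method names are hypothetical and vectors are plain double[3] arrays for readability:

```java
// Illustrative sketch only: Vec3Math and its methods are hypothetical names.
final class Vec3Math {
    /** Dot product; for two unit-length vectors this is the cosine of the angle between them. */
    static double dot(double[] a, double[] b) {
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
    }

    /** Squared length, which needs no square root. */
    static double lengthSquared(double[] v) {
        return dot(v, v);
    }

    /** True length; this is the case where Math.sqrt cannot be avoided. */
    static double length(double[] v) {
        return Math.sqrt(lengthSquared(v));
    }

    /** Range test done on squared values, so no sqrt is needed (maxDistance must be >= 0). */
    static boolean isWithin(double[] v, double maxDistance) {
        return lengthSquared(v) <= maxDistance * maxDistance;
    }
}
```

Normalizing a vector is the typical case where the real length is needed, which is why the sqrt calls can't all be removed.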
He mentions algorithmic differences between the applications and those could easily account for a 1:5 performance difference.
I'm afraid this isn't the case. Algorithmically, the following things are different between RealStorm and my engine at this point:
- RealStorm has spatial subdivision
RealStorm uses spatial subdivision (octrees I think, possibly KD trees) for large scenes. It has no impact on small scenes, and our test scene doesn't qualify for even a single spatial subdivision. So the effect on performance in our test scene is 0%.
- RealStorm has viewport scene subdivision
This is a big optimization, but again, our test scene will only benefit from it slightly, because the objects are reasonably big compared to the total viewport (walls, cylinders). I'd say that once I've implemented this, the test scene's FPS will increase by about 50-100%.
- RealStorm has a different lighting model
This is a large difference; however, both systems use lookup tables during tracing, so performance-wise it makes no difference. It's just the way the models are calculated that is different. 0% improvement.
- RealStorm has a different reflection mechanism
Again, different, but not slower. 0% improvement
- RealStorm has frustum culling
All objects in the scene are visible, so this would actually only add overhead. 0% improvement.
- RealStorm has primary ray optimization
This technique precalculates values that stay constant throughout a frame for entire objects. It's a big optimization I haven't done yet. 30-80% increase, I think.
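As an illustration of that last item, here is a rough sketch of what such per-frame precalculation can look like for a sphere. This is the standard ray-sphere test, not RealStorm's actual code: since all primary rays start at the camera, the camera-to-centre vector and the constant term of the quadratic only need computing once per frame. PrimarySphere and its members are hypothetical names, and the camera is assumed to be outside the sphere:

```java
// Illustrative sketch only: PrimarySphere is a hypothetical name, not RealStorm code.
final class PrimarySphere {
    // Values that stay constant for every primary ray in a frame.
    private final double ocx, ocy, ocz; // camera origin minus sphere centre
    private final double cTerm;         // |oc|^2 - radius^2

    /** Called once per frame per sphere. */
    PrimarySphere(double[] origin, double[] center, double radius) {
        ocx = origin[0] - center[0];
        ocy = origin[1] - center[1];
        ocz = origin[2] - center[2];
        cTerm = ocx * ocx + ocy * ocy + ocz * ocz - radius * radius;
    }

    /**
     * Called once per pixel. (dx, dy, dz) must be a unit-length ray direction.
     * Returns the distance to the nearest hit, or -1 if the ray misses.
     */
    double intersect(double dx, double dy, double dz) {
        double b = ocx * dx + ocy * dy + ocz * dz; // only per-ray dot product left
        double disc = b * b - cTerm;
        if (disc < 0) {
            return -1; // ray passes beside the sphere
        }
        double t = -b - Math.sqrt(disc); // nearer of the two roots
        return t >= 0 ? t : -1;          // sphere behind the camera counts as a miss
    }
}
```

Secondary rays (reflections, shadows) start from arbitrary points, so they can't reuse these values. That's why it only helps primary rays.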
So, basically, I think I can get the engine about 100% faster in total than it is now, at a reasonably steady framerate. But the 1:5 factor will still be in effect: I just got a screenshot from RS running the same scene at the same resolution at 50FPS on an AMD 3000XP. If I can get within 10 on my AMD 1800XP I'll be happy; maybe 15FPS on the 3000XP is possible, we'll see. Lack of pointers, slow array access, slow field access, slow primitive type casts... it's all not helping.

Someone can post this on the other board if needed...
Remon
-----COPIED TEXT ENDS-----
Full thing is here:
http://www.cfxweb.net/modules.php?name=Forums&file=viewtopic&p=5769#5769