Azeem Jiva
Junior Member  
Java VM Engineer, Sun Microsystems
|
 |
«
Posted
2004-09-01 15:58:15 » |
|
Just wanted to let you guys know that sin, cos, tan, ln, and log10 are all set up in the VM to use the x86 hardware instructions (when possible). This is in addition to square root and pow. This applies to both x86 and AMD64, and there are small speedups on other platforms (from speeding up the calls to these trig and transcendental functions). That pretty much wraps up this sort of work for me. Anyone have other suggestions for things that need to be sped up? Oh, and this is of course post-Tiger, so don't expect it anytime soon.
|
|
|
|
|
pepe
Junior Member  
Nothing unreal exists
|
 |
«
Reply #1 - Posted
2004-09-01 16:15:46 » |
|
Hello. Thanks for those, it is very much appreciated! I don't know if this is your field, or whether it's a proper answer to your question, but shifts and masking (on ints, for channel operations on pixels) turned out to be very, very slow. In fact, using an RGBA int was slower than using four floats for storing and handling pixel values, because of those operations. (I'm doing image filtering, and of course I need it to be fast.) If you could accelerate that too, I'd be your slave for life.
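For readers unfamiliar with the pattern, here is a minimal sketch of the kind of shift-and-mask channel handling being described. The class and method names are hypothetical, and the layout (alpha in the top byte, then red, green, blue) is an assumption that matches the test case posted later in the thread.

```java
// Hypothetical helper for illustration only; not code from the original post.
class RgbaChannels {
    // Extract each 8-bit channel from a packed int (alpha assumed in the top byte).
    static int alpha(int pixel) { return pixel >>> 24; }
    static int red(int pixel)   { return (pixel >>> 16) & 0xFF; }
    static int green(int pixel) { return (pixel >>> 8) & 0xFF; }
    static int blue(int pixel)  { return pixel & 0xFF; }

    // Pack four channel values back into one int, masking each to 8 bits.
    static int pack(int a, int r, int g, int b) {
        return ((a & 0xFF) << 24) | ((r & 0xFF) << 16) | ((g & 0xFF) << 8) | (b & 0xFF);
    }

    public static void main(String[] args) {
        int pixel = 0x80FF4020;
        // Simple greyscale filter: average the three colour channels, keep alpha.
        int grey = (red(pixel) + green(pixel) + blue(pixel)) / 3;
        System.out.println(Integer.toHexString(pack(alpha(pixel), grey, grey, grey)));
    }
}
```

Every pixel costs three shifts and three masks to unpack, plus three shifts to repack, which is the overhead the post is complaining about.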
|
|
|
|
Azeem Jiva
Junior Member  
Java VM Engineer, Sun Microsystems
|
 |
«
Reply #2 - Posted
2004-09-01 16:34:28 » |
|
Can you write up a small test case showing the problem? Something that I can look at and try to optimize? Thanks
|
|
|
|
|
|
|
pepe
Junior Member  
Nothing unreal exists
|
 |
«
Reply #3 - Posted
2004-09-01 17:51:05 » |
|
Of course, here it is. I get a constant 4.5x speedup going to float.
public class FilteringTest {

    public static void main(String args[]) {
        FilteringTest ft = new FilteringTest();
        ft.startInt();
        ft.startFloat();
    }

    static final int nbPixels = 1024*1024;
    static final int loopCount = 500;
    static int t;
    static float tf;

    void startInt() {
        long debut;
        long fin;
        int loop=0, imgloop;
        int r=0, g=0, b=0, a=0, pixel=0;
        debut = System.currentTimeMillis();
        int array[] = new int[nbPixels];
        fin = System.currentTimeMillis();
        System.out.println("init time:"+(fin-debut)+" ms");

        imgloop = 0;
        for (; imgloop<nbPixels; imgloop++) {
            array[imgloop] = 0xffffffff;
        }

        System.out.println("test1: int pixels filtering");
        debut = System.currentTimeMillis();
        for (; loop<loopCount; loop++) {
            imgloop = 0;
            for (; imgloop<nbPixels; imgloop++) {
                pixel = array[imgloop];
                a = pixel>>>24;
                r = (pixel>>>16)&0x000000ff;
                g = (pixel>>>8)&0x000000ff;
                b = pixel&0x000000ff;

                t = ((r+g+b)/3)-1;
                array[imgloop] = (t<<24)+(t<<16)+(t<<8)+t;
            }
        }
        fin = System.currentTimeMillis();
        long test1 = (fin-debut);
        System.out.println("Elapsed time: "+test1+" ms");
        int nbpix1 = (int) ((nbPixels*loopCount)/((double)test1/1000.f));
        System.out.println(nbpix1+" pixels/second, that is "+(((double)nbpix1/(720*576*25))*100)+"% of real time video filtering.");
        System.out.println("random pixel result for validity of rendering: 0x"+ Integer.toHexString( array[ (int)(Math.random() * nbPixels) ] ));
        System.out.println("result should be:0x0a0a0a0a" +"\n\n");
        array = null;
    }

    void startFloat() {
        long debut;
        long fin;
        int loop=0, imgloop;
        float r=0.f, g=0.f, b=0.f, a=0.f;
        debut = System.currentTimeMillis();
        float array[] = new float[nbPixels*4];
        fin = System.currentTimeMillis();
        System.out.println("init time:"+(fin-debut)+" ms");

        imgloop = 0;
        for (; imgloop<nbPixels; imgloop++) {
            array[imgloop] = 150000.f;
        }

        System.out.println("test2: float pixels filtering");
        debut = System.currentTimeMillis();
        for (; loop<loopCount; loop++) {
            imgloop = 0;
            for (; imgloop < nbPixels; imgloop+=4) {
                a = array[imgloop];
                r = array[imgloop+1];
                g = array[imgloop+2];
                b = array[imgloop+3];

                tf = ((r+g+b)/3.f)-1;
                array[imgloop] = tf;
                array[imgloop+1] = tf;
                array[imgloop+2] = tf;
                array[imgloop+3] = tf;
            }
        }
        fin = System.currentTimeMillis();
        long test1 = (fin-debut);
        System.out.println("Elapsed time: "+test1+" ms");
        int nbpix1 = (int) ((nbPixels*loopCount)/((double)test1/1000.f));
        System.out.println(nbpix1+" pixels/second, that is "+(((double)nbpix1/(720*576*25))*100)+"% of real time video filtering.");
        System.out.println("random pixel result for validity of rendering:"+ array[ (int)(Math.random() * nbPixels) ] );
        System.out.println("result should be:149500" +"\n\n");
        array = null;
    }
}
|
|
|
|
Mark Thornton
|
 |
«
Reply #4 - Posted
2004-09-01 19:01:08 » |
|
You should really put t &= 0xFF before creating the pixel. In any case using
t = (((r+g+b)*5592406) >>> 24)-1;
is considerably faster than
t=((r+g+b)/3)-1;
although still not as good as the float based code (at least on my Athlon XP 2500+).
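A quick way to convince yourself that the multiply-and-shift form gives the same answer as the division is to check every possible channel sum. The sketch below does that; the class and method names are made up for illustration. The constant 5592406 equals (2^24 + 2) / 3, that is, one third scaled by 2^24 and rounded up.

```java
// Hypothetical check, not part of the original post.
class DivideByThree {
    // 5592406 == (2^24 + 2) / 3: the reciprocal of 3 in 24-bit fixed point, rounded up.
    static final int RECIPROCAL = 5592406;

    static int viaMultiply(int sum) {
        // For sums near 765 the product exceeds Integer.MAX_VALUE but still fits
        // in 32 bits, so the unsigned shift >>> is what makes the result correct.
        return (sum * RECIPROCAL) >>> 24;
    }

    static int viaDivide(int sum) {
        return sum / 3;
    }

    // Compare both forms over every sum of three 8-bit channels (0..765).
    static boolean agreeOnChannelSums() {
        for (int sum = 0; sum <= 3 * 255; sum++) {
            if (viaMultiply(sum) != viaDivide(sum)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("agree for 0..765: " + agreeOnChannelSums());
    }
}
```

The equivalence only needs to hold for 0..765 here, which is why such a small fixed-point constant is enough.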
|
|
|
|
|
pepe
Junior Member  
Nothing unreal exists
|
 |
«
Reply #5 - Posted
2004-09-01 19:55:26 » |
|
Quote: "You should really put t &= 0xFF before creating the pixel."
That's interesting, but I think it's unnecessary. As the input can't be over 255, the result can't be illegal, that is, over 255.
Quote: "In any case using t = (((r+g+b)*5592406) >>> 24)-1; is considerably faster than t=((r+g+b)/3)-1;"
True, but that kind of optimisation should belong to the compiler, not the coder.
Quote: "although still not as good as the float based code (at least on my Athlon XP 2500+)."
What is your ratio between each?
|
|
|
|
dranonymous
Junior Member  
Hoping to become a Java Titan someday!
|
 |
«
Reply #6 - Posted
2004-09-01 20:03:23 » |
|
Mark - Why is the version you presented faster? My guess is that you avoid casting the ints to floats, but that's speculation.
Dr. A>
|
|
|
|
|
Mark Thornton
|
 |
«
Reply #7 - Posted
2004-09-01 20:14:46 » |
|
Quote: "That's interesting, but i think it's unnecessary. As the input can't be over 255, the result can't be illegal, that is, over 255."
That -1 means the result can be -1!
Quote: "True, but that kind of optimisation should belong to the compiler, not the coder."
While it may be practical for a compiler to replace division by a constant float with multiplication by the reciprocal, there are complications in doing the same thing for integers.
Quote: "What is your ratio between each?"
Float about 5, vs int about 8 (seconds in both cases). The original int version takes 20.
dranonymous: My revised int version is faster because multiplication is (usually) significantly faster than division. This is true for both integer and floating point; however, I suspect that in the floating-point case the division has been automatically replaced by a multiplication by the reciprocal.
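To make the -1 point concrete, here is a small sketch (hypothetical names, not from the thread) showing what the packing expression produces for a black pixel with and without the mask. It also shows two cases where the multiply-and-shift form stops matching integer division; these may or may not be the complications meant above, but they are easy to demonstrate.

```java
// Hypothetical demonstration, not part of the original post.
class MaskBeforePacking {
    // The packing expression exactly as written in the test case.
    static int packUnmasked(int t) {
        return (t << 24) + (t << 16) + (t << 8) + t;
    }

    // Same expression with the suggested mask applied first.
    static int packMasked(int t) {
        t &= 0xFF;
        return (t << 24) + (t << 16) + (t << 8) + t;
    }

    // The multiply-and-shift replacement for sum / 3.
    static int divideByThreeViaMultiply(int sum) {
        return (sum * 5592406) >>> 24;
    }

    public static void main(String[] args) {
        int t = ((0 + 0 + 0) / 3) - 1;                                // black pixel gives t == -1
        System.out.println(Integer.toHexString(packUnmasked(t)));    // prints fefefeff
        System.out.println(Integer.toHexString(packMasked(t)));      // prints ffffffff

        // Outside 0..765 the two forms of "divide by 3" disagree:
        System.out.println(-1 / 3);                                   // prints 0
        System.out.println(divideByThreeViaMultiply(-1));             // prints 255
        System.out.println(768 / 3);                                  // prints 256
        System.out.println(divideByThreeViaMultiply(768));            // prints 0 (product overflows 32 bits)
    }
}
```

Note that masking only keeps the four channels well-formed; a black pixel still wraps around to 255 rather than staying at 0, so a real filter would probably clamp instead.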
|
|
|
|
|
tom
|
 |
«
Reply #8 - Posted
2004-09-01 21:02:36 » |
|
On my computer the integer version is a factor of 1.7 slower using Mark's modification, which sounds about right as the integer version does twice the amount of work.
Quote: "here it is. i get a constant 4.5 speed increase factor going float."
What did you expect?
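How much work each loop really does depends on how you count operations, but the number of pixels each loop visits can be checked directly. The sketch below (hypothetical names) reproduces only the loop bounds from the posted test. As written, the float loop steps by 4 while stopping at nbPixels, so it visits a quarter as many pixels per pass as the int loop; the third method shows an assumed bound that would cover the whole float array.

```java
// Hypothetical iteration count check, not part of the original post.
class WorkPerPass {
    static final int NB_PIXELS = 1024 * 1024;

    // Loop shape of the posted int test: one packed pixel per iteration.
    static int pixelsVisitedByIntLoop() {
        int visited = 0;
        for (int i = 0; i < NB_PIXELS; i++) {
            visited++;
        }
        return visited;
    }

    // Loop shape of the posted float test: four floats per pixel, bounded by nbPixels.
    static int pixelsVisitedByFloatLoop() {
        int visited = 0;
        for (int i = 0; i < NB_PIXELS; i += 4) {
            visited++;
        }
        return visited;
    }

    // Assumed alternative bound that walks all nbPixels * 4 floats.
    static int pixelsVisitedByFullFloatLoop() {
        int visited = 0;
        for (int i = 0; i < NB_PIXELS * 4; i += 4) {
            visited++;
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println("int loop:        " + pixelsVisitedByIntLoop());
        System.out.println("float loop:      " + pixelsVisitedByFloatLoop());
        System.out.println("full float loop: " + pixelsVisitedByFullFloatLoop());
    }
}
```

Any pixels-per-second comparison between the two tests should account for that difference before drawing conclusions about shifts versus floats.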
|
|
|
|
Mark Thornton
|
 |
«
Reply #9 - Posted
2004-09-01 21:15:35 » |
|
Quote: "What did you expect?"
Current CPUs have lots of hardware devoted to floating point, so that they can do simultaneous additions and multiplications. On the other hand there is usually only one shifter, so the integer version probably makes less effective use of the chip (less scope for operations to be performed in parallel).
|
|
|
|
|
|
|
pepe
Junior Member  
Nothing unreal exists
|
 |
«
Reply #10 - Posted
2004-09-02 04:34:45 » |
|
Quote: "That -1 means the result can be -1!"
If I were working with an 8-bit store, that would be true. Nevertheless, 0xFF in an int is 255, not -1... 0xFFFFFFFF would be...
Quote: "While it may be practical for a compiler to replace division by a constant float with multiplication by the reciprocal, there are complications in doing the same thing for integers."
Oh, interesting. Why is that?
Quote: "float about 5, vs int about 8 (seconds in both cases). The original int version takes 20."
That's a nice improvement, I agree. Too nice, in fact. There has to be something that can be done about that division...
|
|
|
|
swpalmer
|
 |
«
Reply #11 - Posted
2004-09-02 05:00:54 » |
|
Quote: "Anyone have other suggestions for things that need to be sped up?"
Use of vector instructions (MMX/SSE/SSE2, etc.) for common patterns found in manipulating RGBA ints, as above.
|
|
|
|
|
|
Mark Thornton
|
 |
«
Reply #13 - Posted
2004-09-02 15:15:36 » |
|
Quote: "is FMA, in particular, currently in place in 1.5 (betas)?"
Unfortunately not. JSR 84, which also proposed supporting FMA, was withdrawn in March 2002, apparently due to difficulties in setting up the expert group. http://jcp.org/en/jsr/detail?id=84
|
|
|
|
|
dranonymous
Junior Member  
Hoping to become a Java Titan someday!
|
 |
«
Reply #14 - Posted
2004-09-02 16:34:14 » |
|
Pepe - In the int version you shift the alpha value, but then you never did anything with it. Did I miss where you manipulated the value again?
Mark/Pepe - Have you looked at the compiled byte code to see how it differs for those small shifting/masking areas?
Dr. A>
|
|
|
|
|
pepe
Junior Member  
Nothing unreal exists
|
 |
«
Reply #15 - Posted
2004-09-02 17:10:55 » |
|
Quote: "Pepe - In the int version you shift the alpha value, but then you never did anything with it. Did I miss where you manipulated the value again?"
No. In the first versions, the values were even all copied into temporary values, then pushed back. That class is a stripped-down version of another set where I tested how worthwhile it was to put pixel processing in a method of another class. In that old test, I had to extract all components and pass them to the filtering method, along with the image array and poke offset. That was a pretty interesting test, because doing so was faster than simply putting all the code in a single loop (server JIT only).
Quote: "Mark/Pepe - Have you looked at the compiled byte code to see how it differs for those small shifting/masking areas?"
I would love to, but we can't have a look at how the JIT compiles bytecode, if that's what you meant.
|
|
|
|
dranonymous
Junior Member  
Hoping to become a Java Titan someday!
|
 |
«
Reply #16 - Posted
2004-09-02 18:15:19 » |
|
I realize you can't see how the JIT compiled it down to native assembly, but you could see the bytecode produced in the class files and compare them. It would be interesting to see what was going on in each one.
Dr. A>
|
|
|
|
|
Mark Thornton
|
 |
«
Reply #17 - Posted
2004-09-02 19:40:32 » |
|
Byte code is a very direct representation of the Java source; little or no optimisation is done at that point. Essentially all the optimisation is done by the JIT at runtime.
|
|
|
|
|
pepe
Junior Member  
Nothing unreal exists
|
 |
«
Reply #18 - Posted
2004-09-03 08:06:51 » |
|
Byte code (compiled Java source) is very basic. No optimisations are done there, so that the JIT can recognise patterns, which simplifies its work and makes it more efficient. Assembly (compiled bytecode) is produced by the JIT, and we mortals don't have access to it. That assembly can be very different from what is in the bytecode.
|
|
|
|
|
|
crystalsquid
Junior Member  
... Boing ...
|
 |
«
Reply #20 - Posted
2004-09-09 09:32:21 » |
|
Step 1: Run MS Visual Studio (Boo Hiss!)
Step 2: Set up a new (empty) project.
Step 3: Go to debug settings, set the exe to be your IE, and set the program arguments to point to the path of a simple HTML page with an applet on it. (I only deal with applets, but you could do this with Java itself just as easily.)
Step 4: Run a debugger session, and stop the debugger somewhere.
If you are in an area called something like WIN32, or NT40.DLL or something, then you are in a system call. If you are in <unknown> or <some hex string> then you are probably in the compiled code. SOMETIMES you can tell more easily, as the compiled code will reside in memory with an address much greater than 0x40000000 (the default base address space for code loaded from an exe file).
It is then possible to track down specific parts of code by adding operations that add set constants to a static volatile variable, and then searching the disassembly for the constants. Not saying it's easy though, but it can work if you're desperate.
- Dom
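The marker trick in the last step might look something like the sketch below. The class name, the method, and the two constants are all arbitrary choices for illustration; any distinctive values that are unlikely to appear elsewhere in the disassembly would do.

```java
// Hypothetical sketch of the "marker constant" trick; names and constants are arbitrary.
class DisassemblyMarker {
    // volatile, so the JIT should keep each update as a real memory write.
    static volatile int marker;

    static int greyscale(int pixel) {
        marker += 0x12345678;                 // distinctive constant just before the code of interest
        int r = (pixel >>> 16) & 0xFF;
        int g = (pixel >>> 8) & 0xFF;
        int b = pixel & 0xFF;
        int t = (r + g + b) / 3;
        marker += 0x0BADF00D;                 // second constant just after it
        return (t << 24) | (t << 16) | (t << 8) | t;
    }

    public static void main(String[] args) {
        int sink = 0;
        // Call the method many times so the JIT has a reason to compile it.
        for (int i = 0; i < 1000000; i++) {
            sink ^= greyscale(i);
        }
        System.out.println(Integer.toHexString(sink));
    }
}
```

Searching the disassembly for the two immediates should then bracket the generated code for the shifts, masks, and division in between.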
|
|
|
|
|
|