Java-Gaming.org    
  Cortado works!  (Read 13696 times)
Offline Momoko_Fan

Junior Member


Medals: 2



« Reply #90 - Posted 2010-04-28 20:50:01 »

I don't have my profiling results handy. It was last year, so the hardware was not that old. So I don't really trust your results. I will believe that half-pel and the deblocking filter are high on the list, since they really suck up a lot of raw memory bandwidth. But Huffman decoding is really fast even when done badly, and YUV->RGB (720p?) in Java only using 5%? Even that's surprising (but perhaps not on faster modern CPUs).

What profiling tool did you use, and how long did you collect stats for? Note I added some of my own timing code, since these days I am finding it hard to get an accurate profiler. Even then I replace the "slowdown" areas with a no-op to see if it really does change the timings.
I tested it a few times already, and the results stay the same. I am using the NetBeans built-in profiler. I recalibrated it and ran the decoder for 100 frames (since it was too slow to run for longer).
Here's a screenshot of the results:
Offline Riven
« League of Dukes »

JGO Overlord


Medals: 781
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #91 - Posted 2010-04-28 21:16:46 »

There are few profilers as good as VisualVM (which NetBeans uses) at transforming the code in such a way that it alters the performance characteristics completely.

Offline delt0r

JGO Knight


Medals: 27
Exp: 18 years


Computers can do that?


« Reply #92 - Posted 2010-04-28 21:21:07 »

Yes, that's what I thought. The method that's taking all the time... is the method that calls the iDCT (hence DCT in the name). After scanning the code, I bet dollars to cents that the iDCT is really what's taking a lot of the time (and the deblocking filter). Note that both can use OpenGL for big speed-ups.

The problems I have had with profiling have been with the NetBeans and jvisual profilers. I know they do a bad job, because I can skip calling a method that takes 80% of the time according to some profiling results, and the program doesn't speed up at all. Also, I think anything that runs for less than 2 minutes *does not* give a good reflection of server HotSpot performance.

But we will see.

** missed important words

I have no special talents. I am only passionately curious.--Albert Einstein
Offline delt0r

JGO Knight


Medals: 27
Exp: 18 years


Computers can do that?


« Reply #93 - Posted 2010-04-28 22:05:14 »

Just talked to some of the Theora guys on IRC. It was suggested that testing the full Cortado would give pretty messy profiling results, as it's multi-threaded with a bunch of complicated locks. Also, different bit rates would change where they expect the CPU to spend its time. What bit rate source are we talking about here?

I have no special talents. I am only passionately curious.--Albert Einstein
Offline Momoko_Fan

Junior Member


Medals: 2



« Reply #94 - Posted 2010-04-28 23:13:43 »

Okay, so you guys don't like the NetBeans profiler. Can you suggest another one that is good and doesn't cost money? I would run the test again on it. I am just using the NetBeans one since it's easy to use and integrates into my project; it seemed to work fine for everything I used it for.
I found this post with profiling results for the C version of Theora, and they seem similar to my results:
http://osdir.com/ml/multimedia.ogg.theora.devel/2004-02/msg00078.html

Quote
The method that's taking all the time... is the method that calls the iDCT (hence DCT in the name)
No, you're wrong. Here's the method ReconInterHalfPixel2:
public static final void ReconInterHalfPixel2(short[] ReconPtr, int idx1,
                           short[] RefPtr1, int idx2, short[] RefPtr2, int idx3,
                           short[] ChangePtr, int LineStep ) {
    // coff walks the 64 residual values; the roff* offsets advance one row at a time
    int coff=0, roff1=idx1, roff2=idx2, roff3=idx3, i;

    // One iteration per row of the 8x8 block, with the 8 columns unrolled by hand
    for (i = 0; i < 8; i++ ){
      // pixel = clamp(average of the two reference pixels + residual)
      ReconPtr[roff1+0] = clamp255(((RefPtr1[roff2+0] + RefPtr2[roff3+0]) >> 1) + ChangePtr[coff++]);
      ReconPtr[roff1+1] = clamp255(((RefPtr1[roff2+1] + RefPtr2[roff3+1]) >> 1) + ChangePtr[coff++]);
      ReconPtr[roff1+2] = clamp255(((RefPtr1[roff2+2] + RefPtr2[roff3+2]) >> 1) + ChangePtr[coff++]);
      ReconPtr[roff1+3] = clamp255(((RefPtr1[roff2+3] + RefPtr2[roff3+3]) >> 1) + ChangePtr[coff++]);
      ReconPtr[roff1+4] = clamp255(((RefPtr1[roff2+4] + RefPtr2[roff3+4]) >> 1) + ChangePtr[coff++]);
      ReconPtr[roff1+5] = clamp255(((RefPtr1[roff2+5] + RefPtr2[roff3+5]) >> 1) + ChangePtr[coff++]);
      ReconPtr[roff1+6] = clamp255(((RefPtr1[roff2+6] + RefPtr2[roff3+6]) >> 1) + ChangePtr[coff++]);
      ReconPtr[roff1+7] = clamp255(((RefPtr1[roff2+7] + RefPtr2[roff3+7]) >> 1) + ChangePtr[coff++]);
      // step all three buffers down to the next row
      roff1 += LineStep;
      roff2 += LineStep;
      roff3 += LineStep;
    }
  }

It doesn't look like a DCT to me.
I found where it was used in ExpandBlock and the comment says this:
Quote
/* Fractional pixel reconstruction. */
        /* Note that we only use two pixels per reconstruction even for
           the diagonal. */
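
For readers who want to run the half-pixel step in isolation, here is a minimal sketch of the same arithmetic written as a plain nested loop. The class name HalfPelSketch and the method name reconHalfPixel are made up for illustration, and clamp255 is an assumption: the thread never shows Jheora's own clamp255, so a simple saturation to 0..255 stands in for it.

```java
// Sketch only: HalfPelSketch and reconHalfPixel are illustrative names, not Jheora API.
// clamp255 is ASSUMED to be a plain 0..255 saturation; the real one is not shown in the thread.
final class HalfPelSketch {

    // Assumed behaviour: saturate to the 8-bit pixel range.
    static short clamp255(int v) {
        return (short) (v < 0 ? 0 : (v > 255 ? 255 : v));
    }

    // Predict an 8x8 block as the average of two reference blocks, then add the residual.
    static void reconHalfPixel(short[] recon, int reconOff,
                               short[] ref1, int ref1Off,
                               short[] ref2, int ref2Off,
                               short[] change, int lineStep) {
        int c = 0; // index into the 64 residual values
        for (int row = 0; row < 8; row++) {
            for (int col = 0; col < 8; col++) {
                int avg = (ref1[ref1Off + col] + ref2[ref2Off + col]) >> 1;
                recon[reconOff + col] = clamp255(avg + change[c++]);
            }
            // move all three buffers down one row
            reconOff += lineStep;
            ref1Off += lineStep;
            ref2Off += lineStep;
        }
    }

    public static void main(String[] args) {
        short[] ref1 = new short[64], ref2 = new short[64];
        short[] change = new short[64], recon = new short[64];
        java.util.Arrays.fill(ref1, (short) 100);
        java.util.Arrays.fill(ref2, (short) 103);
        java.util.Arrays.fill(change, (short) 5);
        change[0] = 200;   // pushes the first pixel above 255
        change[1] = -200;  // pushes the second pixel below 0
        reconHalfPixel(recon, 0, ref1, 0, ref2, 0, change, 8);
        System.out.println(recon[0] + " " + recon[1] + " " + recon[2]); // prints "255 0 106"
    }
}
```

Each output pixel costs two reads, an add, a shift, another add and a clamp, which is why this routine is more about memory traffic than arithmetic.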


Quote
Just talked to some of the Theora guys on IRC. It was suggested that testing the full Cortado would give pretty messy profiling results, as it's multi-threaded with a bunch of complicated locks.
Okay, but I am not using Cortado itself. I am not allowed to use it since it's under the GPL, so I am just using the Jheora decoder that comes with it. Also, I profiled with the video decode function as the root method, so even if those locks were there, their effects would not be included in the results.


EDIT: Okay, I asked my friend to profile an HD 720p video on his Mac, using the YourKit Java profiler. Here are the results:
YUVConv - 24%
loadFrame - 12%
LoopFilter (deblocking) - 12%
ReconInterHalfPixel2 - 9%
ReconInter - 7%
IDct1 - 3%
IDct10 - 3%
IDctSlow - 3%

He's using a Mac, and the Java on the Mac is probably not as good at optimizing as the Sun one, which might explain the differences. Also, like you said, maybe it's the profiler. I don't have the YourKit profiler, but I am going to get it tomorrow and test this again.
Offline Riven
« League of Dukes »

JGO Overlord


Medals: 781
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #95 - Posted 2010-04-29 08:17:42 »

Try passing the -Xprof parameter to the VM

Offline delt0r

JGO Knight


Medals: 27
Exp: 18 years


Computers can do that?


« Reply #96 - Posted 2010-04-29 08:26:16 »

Quote
Okay so you guys don't like the NetBeans profiler, can you suggest me another one that is good and doesn't cost money?

Even with money. No.  If you find one, let me know.

To put it simply, we are doing millions and billions of operations per second with a complicated 3-tier cache system + instruction cache + branch prediction + out-of-order execution. Even changing the order of the code changes the performance. Adding profiling code *changes* the profile. And in Java, with the JIT compiling code conditionally, this is even worse. Basically, taking the measurement changes the measurement so much that the measurement is simply false. Like I said: the profiler claimed that 80% of the time was spent in a method, yet even with the method *commented out* the run time didn't change by more than 5%. IO gets even harder, since slowing everything down with instrumentation code doesn't slow down the IO. So under the profiler, IO looks many times faster relative to everything else than it really is.

The best way is to run more than one profiler (I use jvisual, hprof and Xprof), then compare and check. I check by moving the problem around to make some things worse, i.e. if my OpenGL code is fill limited, a higher resolution should make it go a lot slower...
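
The "time it yourself, then swap the suspect for a no-op" check described above could look roughly like the sketch below. Everything in it is hypothetical: StageTimer, Stage and the toy workload are illustration names, not part of Jheora or of any profiler. It only assumes System.nanoTime and a warm-up phase so that HotSpot has a chance to compile the code before timing starts.

```java
// Sketch only: StageTimer, Stage and the workloads below are hypothetical examples.
final class StageTimer {

    // A piece of work we want to time; returns a value so the result can be kept alive.
    interface Stage {
        int run(int[] data);
    }

    // Run the stage repeatedly and return the average wall-clock time per call in nanoseconds.
    static double averageNanos(Stage stage, int[] data, int warmup, int runs) {
        int sink = 0;
        for (int i = 0; i < warmup; i++) {
            sink += stage.run(data);          // untimed warm-up so the JIT can kick in
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            sink += stage.run(data);
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 42) {
            System.out.print("");             // use the result so the work is less likely to be optimized away
        }
        return (double) elapsed / runs;
    }

    public static void main(String[] args) {
        int[] data = new int[1 << 16];
        for (int i = 0; i < data.length; i++) {
            data[i] = i * 31;
        }

        // Stand-in for the method the profiler blames.
        Stage suspect = d -> {
            int s = 0;
            for (int v : d) {
                s += (v * v) >>> 3;
            }
            return s;
        };
        // Stand-in for "comment the method out".
        Stage noOp = d -> 0;

        double withStage = averageNanos(suspect, data, 2_000, 2_000);
        double without = averageNanos(noOp, data, 2_000, 2_000);
        System.out.printf("suspect: %.0f ns/call, no-op: %.0f ns/call%n", withStage, without);
    }
}
```

If the profiler attributes most of the run time to a method but the total barely moves when that method is replaced by the no-op, the profile is probably measuring its own instrumentation rather than the code.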

In this case we also have timing loops and locks.

The Theora/Jheora guys think that at low bit rates the iDCT won't be a problem, because you only have one or two non-zero coefficients. But at high bit rates they expect both the iDCT and Huffman decoding to hurt. But their profiling results were showing a huge chunk of work in the YUV2RGB path, and that matches experience in both Java and C. In fact, the Firefox decoder is apparently using GLSL for the YUV2RGB now.

Note that one of the main reasons I expect the iDCT to be high is experience. The second is back-of-the-envelope calculations (bandwidth/FLOPS). The iDCT in C (asm, in fact) is fast because that's what MMX was designed for. Java, however, does not have this, so this is one area where "Java is slow" is in fact true (the same goes for YUV2RGB).
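
To give a feel for why the YUV2RGB path shows up in profiles, here is a rough scalar sketch of a planar 4:2:0 to RGB conversion. It is not Jheora's YUVConv: the class and method names are invented, and the coefficients are the common full-range BT.601 (JFIF-style) ones in 16.16 fixed point, chosen as an assumption for illustration. The Theora stream may call for different ranges and offsets.

```java
// Sketch only: not Jheora's YUVConv. Names are illustrative, and full-range BT.601
// (JFIF-style) coefficients are ASSUMED; the actual decoder may use other constants.
final class YuvToRgbSketch {

    static int clamp255(int v) {
        return v < 0 ? 0 : (v > 255 ? 255 : v);
    }

    // Convert one pixel to packed 0xRRGGBB using 16.16 fixed-point multipliers.
    static int toRgb(int y, int cb, int cr) {
        int d = cb - 128;
        int e = cr - 128;
        int r = clamp255(y + ((91881 * e + 32768) >> 16));               // 1.402 * Cr
        int g = clamp255(y - ((22554 * d + 46802 * e + 32768) >> 16));   // 0.344 * Cb + 0.714 * Cr
        int b = clamp255(y + ((116130 * d + 32768) >> 16));              // 1.772 * Cb
        return (r << 16) | (g << 8) | b;
    }

    // Convert a planar 4:2:0 frame: each Cb/Cr sample is shared by a 2x2 block of luma samples.
    static int[] convert420(byte[] yPlane, byte[] cbPlane, byte[] crPlane, int width, int height) {
        int[] rgb = new int[width * height];
        int chromaWidth = (width + 1) / 2;
        for (int row = 0; row < height; row++) {
            for (int col = 0; col < width; col++) {
                int y = yPlane[row * width + col] & 0xFF;
                int ci = (row / 2) * chromaWidth + (col / 2);
                rgb[row * width + col] = toRgb(y, cbPlane[ci] & 0xFF, crPlane[ci] & 0xFF);
            }
        }
        return rgb;
    }

    public static void main(String[] args) {
        // A 2x2 mid-grey frame: four luma samples, one Cb and one Cr sample.
        byte[] y = {(byte) 128, (byte) 128, (byte) 128, (byte) 128};
        byte[] cb = {(byte) 128};
        byte[] cr = {(byte) 128};
        for (int p : convert420(y, cb, cr, 2, 2)) {
            System.out.println(String.format("%06x", p));
        }
    }
}
```

At 1280x720 this inner loop runs 921,600 times per frame, so at 24 frames per second it handles about 22 million pixels per second, each with several multiplies and three clamps and no SIMD help in plain Java.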

So is this 720p? What bit rate? And without profiling, do you get real time? Note that the C decoder uses less than 12% CPU for 720p24 on my 2-year-old system.

I have no special talents. I am only passionately curious.--Albert Einstein
Offline Momoko_Fan

Junior Member


Medals: 2



« Reply #97 - Posted 2010-04-29 19:04:08 »

Here are the -Xprof results for the Kick-Ass movie trailer (720p, 30 fps, 2000 bit rate):
run:
started ogg reader
new stream 16625
new stream 3801
found theora video
new stream 23968
found vorbis audio
theora dimension: 1280x720
theora aspect: 27x20
theora framerate: 60x2
ogg reader done
ellapsed: 23502

Flat profile of 23.58 secs (2208 total ticks): main

  Interpreted + native   Method                        
  1.3%    28  +     0    com.fluendo.jheora.Decode.loadAndDecode
  1.1%    25  +     0    com.fluendo.jheora.DCTDecode.UpdateUMVBorder
  0.5%     0  +    10    java.io.FileInputStream.readBytes
  0.4%     8  +     0    com.fluendo.jheora.Decode.decodeMVectors
  0.2%     4  +     0    com.fluendo.jheora.Decode.ExtractToken
  0.1%     3  +     0    com.fluendo.jheora.FrArray.getNextBInit
  0.1%     3  +     0    com.fluendo.jheora.DCTDecode.ExpandBlock
  0.1%     2  +     0    com.fluendo.jheora.Decode.decodeModes
  0.1%     2  +     0    com.fluendo.jheora.Filter.FilterHoriz
  0.1%     2  +     0    com.fluendo.jheora.ExtractMVectorComponentA.extract
  0.1%     2  +     0    com.fluendo.jheora.Recon.ReconInter
  0.0%     1  +     0    java.lang.ClassLoader.defineClass1
  0.0%     1  +     0    com.fluendo.jheora.Decode.unpackAndExpandToken
  0.0%     1  +     0    com.fluendo.jheora.Filter.SetupLoopFilter
  0.0%     1  +     0    com.fluendo.jheora.FrArray.deCodeSBRun
  0.0%     1  +     0    java.awt.color.ColorSpace.getInstance
  0.0%     1  +     0    com.fluendo.jheora.HuffEntry.read
  0.0%     1  +     0    com.fluendo.jheora.FrInit.CalcPixelIndexTable
  0.0%     1  +     0    com.fluendo.jheora.Recon.CopyBlock
  0.0%     1  +     0    com.fluendo.jheora.iDCT.IDctSlow
  0.0%     1  +     0    java.util.jar.JarFile.getEntry
  0.0%     1  +     0    com.fluendo.jheora.Quant.compQuantMatrix
  0.0%     1  +     0    com.fluendo.jheora.FrArray.quadDecodeDisplayFragments
  0.0%     1  +     0    com.fluendo.jheora.Filter.FilterVert
  0.0%     1  +     0    com.fluendo.jheora.Recon.ReconInterHalfPixel2
  5.0%   101  +    10    Total interpreted (including elided)

     Compiled + native   Method                        
 31.8%   703  +     0    com.fluendo.jheora.DCTDecode.ExpandBlock
 19.5%   430  +     0    com.fluendo.jheora.Decode.unpackAndExpandToken
 16.5%   365  +     0    com.fluendo.jheora.Filter.LoopFilter
  9.3%   206  +     0    com.fluendo.jheora.FrArray.quadDecodeDisplayFragments
  3.1%    69  +     0    com.fluendo.jheora.DCTDecode.ExpandKFBlock
  2.7%    59  +     0    com.fluendo.jheora.DCTDecode.CopyNotRecon
  2.3%    51  +     0    com.fluendo.jheora.FrArray.getNextBBit
  2.2%    48  +     0    com.fluendo.jheora.Decode.decodeMVectors
  1.8%    40  +     0    com.fluendo.jheora.DCTDecode.CopyRecon
  1.5%    34  +     0    com.fluendo.jheora.DCTDecode.ReconRefFrames
  1.5%    33  +     0    com.fluendo.jheora.Decode.decodeModes
  1.0%    21  +     0    com.fluendo.jheora.Decode.decodeBlockLevelQi
  0.8%    18  +     0    com.fluendo.jheora.Decode.unPackVideo
  0.0%     0  +     1    com.jcraft.jogg.StreamState.pagein
 94.1%  2077  +     1    Total compiled

         Stub + native   Method                        
  0.9%     0  +    19    java.lang.System.arraycopy
  0.9%     0  +    19    Total stub


Flat profile of 0.00 secs (1 total ticks): DestroyJavaVM

  Thread-local ticks:
100.0%     1             Blocked (of total)


Global summary of 23.61 seconds:
100.0%  2216             Received ticks
  0.2%     5             Received GC ticks
  3.2%    72             Compilation
BUILD SUCCESSFUL (total time: 24 seconds)


Here are the results for the 720p Big Buck Bunny, available for download at the Big Buck Bunny site:
run:
started ogg reader
new stream 884871684
found theora video
new stream 1274777508
found vorbis audio
theora dimension: 1280x720
theora aspect: 0x0
theora framerate: 24x1
ogg reader done
ellapsed: 279081

Flat profile of 279.11 secs (23219 total ticks): main

  Interpreted + native   Method                        
  1.8%   413  +     0    com.fluendo.jheora.Decode.loadAndDecode
  1.0%   242  +     0    com.fluendo.jheora.DCTDecode.UpdateUMVBorder
  0.8%     0  +   176    java.io.FileInputStream.readBytes
  0.1%    14  +     0    com.fluendo.jheora.Decode.loadFrame
  0.0%     8  +     0    com.fluendo.jheora.Filter.SetupLoopFilter
  0.0%     8  +     0    com.fluendo.jheora.Decode.decodeMVectors
  0.0%     5  +     0    com.fluendo.jheora.State.decodePacketin
  0.0%     5  +     0    com.jme3.video.OggTheoraPerf.start
  0.0%     5  +     0    com.fluendo.jheora.FrArray.getNextBInit
  0.0%     4  +     0    com.jcraft.jogg.SyncState.pageout
  0.0%     3  +     0    com.fluendo.jheora.State.decodeYUVout
  0.0%     3  +     0    com.fluendo.jheora.DCTDecode.ExpandBlock
  0.0%     2  +     0    com.fluendo.jheora.Filter.FilterHoriz
  0.0%     2  +     0    com.fluendo.jheora.Recon.ReconIntra
  0.0%     2  +     0    com.jcraft.jogg.Page.serialno
  0.0%     1  +     0    java.util.jar.JarVerifier.beginEntry
  0.0%     1  +     0    com.fluendo.jheora.FrInit.InitFragmentInfo
  0.0%     1  +     0    com.fluendo.jheora.DCTDecode.UpdateUMV_HBorders
  0.0%     1  +     0    com.fluendo.jheora.Filter.FilterVert
  0.0%     1  +     0    com.fluendo.jheora.ExtractMVectorComponentA.extract
  0.0%     1  +     0    com.jcraft.jogg.StreamState.init
  0.0%     1  +     0    com.fluendo.jheora.DCTDecode.UpdateUMV_VBorders
  0.0%     1  +     0    com.fluendo.jheora.Filter.SetupBoundingValueArray_Generic
  0.0%     1  +     0    com.fluendo.jheora.Decode.unpackAndExpandToken
  0.0%     1  +     0    com.fluendo.jheora.DCTDecode.CopyRecon
  4.0%   740  +   178    Total interpreted (including elided)

     Compiled + native   Method                        
 35.4%  8223  +     0    com.fluendo.jheora.DCTDecode.ExpandBlock
 26.8%  6216  +     0    com.fluendo.jheora.Decode.unpackAndExpandToken
 12.1%  2806  +     0    com.fluendo.jheora.Filter.LoopFilter
  8.0%  1856  +     0    com.fluendo.jheora.FrArray.quadDecodeDisplayFragments
  2.5%   573  +     0    com.fluendo.jheora.DCTDecode.CopyRecon
  2.3%   545  +     0    com.fluendo.jheora.DCTDecode.CopyNotRecon
  2.1%   494  +     0    com.fluendo.jheora.FrArray.getNextBBit
  1.6%   364  +     0    com.fluendo.jheora.Decode.decodeModes
  1.5%   357  +     0    com.fluendo.jheora.DCTDecode.ExpandKFBlock
  1.5%   345  +     0    com.fluendo.jheora.Decode.decodeMVectors
  1.1%   259  +     0    com.fluendo.jheora.DCTDecode.ReconRefFrames
  0.7%   158  +     0    com.fluendo.jheora.Decode.unPackVideo
  0.1%    23  +     0    com.fluendo.jheora.FrArray.getNextSbBit
  0.0%    11  +     0    com.jcraft.jogg.SyncState.pageseek
  0.0%     5  +     0    com.jme3.video.OggTheoraPerf.start
  0.0%     1  +     0    com.jcraft.jogg.SyncState.pageout
 95.8% 22236  +     0    Total compiled

         Stub + native   Method                        
  0.3%     0  +    64    java.lang.System.arraycopy
  0.3%     0  +    64    Total stub

  Thread-local ticks:
  0.0%     1             Class loader


Flat profile of 0.03 secs (1 total ticks): DestroyJavaVM

  Thread-local ticks:
100.0%     1             Blocked (of total)


Global summary of 279.18 seconds:
100.0% 23227             Received ticks
  0.0%     4             Received GC ticks
  0.4%    94             Compilation
  0.0%     1             Class loader
BUILD SUCCESSFUL (total time: 4 minutes 39 seconds)

The problem with the -Xprof option is that it doesn't show the sub-trees under ExpandBlock, such as ReconInter and the iDCT, so you don't really know where the bottleneck is.
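
As a cross-check on the two flat profiles above, the percentage column is simply a method's ticks divided by the thread's total ticks. The small sketch below recomputes the share of the four largest compiled entries from the tick counts printed in the logs; the class name XprofShare is made up for illustration.

```java
// Sketch only: XprofShare is an illustrative name. Tick counts are copied from the two logs above.
final class XprofShare {

    // Percentage of a thread's total ticks attributed to one method.
    static double share(int methodTicks, int totalTicks) {
        return 100.0 * methodTicks / totalTicks;
    }

    public static void main(String[] args) {
        String[] names = {"ExpandBlock", "unpackAndExpandToken", "LoopFilter", "quadDecodeDisplayFragments"};
        int[] trailer = {703, 430, 365, 206};     // Kick-Ass trailer, 2208 total ticks
        int[] bunny = {8223, 6216, 2806, 1856};   // Big Buck Bunny, 23219 total ticks

        int trailerSum = 0, bunnySum = 0;
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-28s %5.1f%% %5.1f%%%n",
                    names[i], share(trailer[i], 2208), share(bunny[i], 23219));
            trailerSum += trailer[i];
            bunnySum += bunny[i];
        }
        System.out.printf("%-28s %5.1f%% %5.1f%%%n",
                "top four combined", share(trailerSum, 2208), share(bunnySum, 23219));
    }
}
```

In both runs the same four methods account for roughly 77% and 82% of the main thread's ticks, so the two videos agree on where the time goes even though neither profile can split ExpandBlock any further.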

Quote
Without profiling, do you get real time?
It plays fine for the most part, but in high-action scenes it starts to lag and drop frames. At some point it becomes more stable (I guess some code gets compiled).
Offline princec

JGO Kernel


Medals: 362
Projects: 3
Exp: 16 years


Eh? Who? What? ... Me?


« Reply #98 - Posted 2010-04-29 19:16:57 »

Turn off inlining completely and try it again.

Cas :)
