  Cache locality and LibStruct - fixing Java's horrible memory layout  (Read 2234 times)
Offline theagentd
« Posted 2014-05-19 02:25:57 »

TL;DR: I'm making a multi-threaded server for a 2D space game, and it was performing very badly on AMD CPUs. I used a library developed by Riven called LibStruct to replace my Java instances with structs whose memory location I can control, and managed to improve performance on AMD CPUs by almost 50%.

EDIT: A BIG thanks to Spacebeans for letting me test the program on his 8-core AMD CPU! Throw a post in here and I'll give you a medal! =D

Lately I've been experimenting with a server for a game I'm trying to make, to see how feasible my idea is. The game is supposed to be a 2D space game with a massive number of entities (preferably at least 100 000), with accurate gravity calculations and collision detection. To allow for extremely large worlds with high and uniform precision, all coordinates and velocities are stored as fixed-point 96-bit integers instead of doubles. For real-time collision detection I decided to go with broad-phase collision detection: this technique basically involves sorting all the entities along the X axis and then using a smart traversal algorithm to find out which entities potentially collide. For maximum performance, I decided to thread the engine as well.
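The post's Int96 class isn't shown, but 96-bit fixed-point addition can be pictured as three 32-bit limbs with manual carry propagation. A minimal hypothetical sketch (limb names and ordering are my assumption, not from the thread):

```java
// Hypothetical 96-bit integer built from three 32-bit limbs; `lo` is the
// least significant limb. Only addition is sketched (the post says only
// addition and subtraction are needed).
final class Int96 {
    int hi, mid, lo;

    Int96(int hi, int mid, int lo) { this.hi = hi; this.mid = mid; this.lo = lo; }

    void add(Int96 o) {
        // Widen each limb to long so the carry appears in bit 32.
        long s = (lo & 0xFFFFFFFFL) + (o.lo & 0xFFFFFFFFL);
        lo = (int) s;
        s = (mid & 0xFFFFFFFFL) + (o.mid & 0xFFFFFFFFL) + (s >>> 32);
        mid = (int) s;
        hi = hi + o.hi + (int) (s >>> 32);
    }
}
```

Adding 1 to a value whose low limb is all ones carries into the middle limb, which is the behavior a double can't deliver uniformly across a huge world.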

The test program I developed currently consists of 4 main passes:
 - Updating: Updates each entity by computing gravity acceleration and updating the entity based on its velocity.
 - Sorting: Sorts all entities along the X-axis.
 - Collision detection: Identifies collision pairs.
 - Rendering: Transforms each ship for rendering. This is mainly for visualizing the results.

I wanted to see if I could multithread it all and get good scaling.
 - Updating: Threading updating was simple since each entity can be updated independently of all other entities.
 - Sorting: I decided to go with a parallel insertion sort algorithm that I developed specifically for data sets that are almost sorted (the order of objects barely changes in a single update).
 - Collision detection: I managed to modify the traversal algorithm to split up the work into independent chunks that could be processed in parallel.
 - Rendering: Simple to thread as well since all entities can be transformed in parallel.
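The updating and rendering passes above boil down to splitting the entity arrays into one contiguous chunk per worker; since each entity is independent, no synchronization is needed beyond joining the workers. A simplified sketch, with flat double arrays standing in for the real entity data (the field layout is my assumption, not the post's code):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ParallelUpdate {
    // Integrate position from velocity, one contiguous chunk per worker thread.
    static void update(double[] x, double[] vx, double dt, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int n = x.length, chunk = (n + threads - 1) / threads;
        CountDownLatch done = new CountDownLatch(threads);
        for (int t = 0; t < threads; t++) {
            final int from = t * chunk, to = Math.min(n, from + chunk);
            pool.execute(() -> {
                for (int i = from; i < to; i++)
                    x[i] += vx[i] * dt; // each entity updates independently
                done.countDown();
            });
        }
        try {
            done.await(); // join all workers before the next pass
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        pool.shutdown();
    }
}
```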



The first version of my program worked pretty well. The scene tested was a planet with a large number of ships in fast low orbits, orbiting both clockwise and counter-clockwise, which gave a decent, realistic sorting load. Collision detection worked well and could detect collisions between objects that were less than a meter large and travelling at several kilometers per second. The only problem was performance. I could not even reach 30 FPS with 50 000 entities, far from my goal of 60 FPS with 100 000 entities. After that, I switched my focus to implementing threading, and DAYUM did it make a difference. I was running my test on an i7-4770K processor with Hyperthreading.

Results of Intel i7-4770K:
Task       <timing of 1 thread> / <timing of 8 threads> = <scaling>

Update:    10.886969 / 2.430073 = 4.48x
Sort:       5.645249 / 0.888223 = 6.36x
Collision: 13.476599 / 2.127155 = 6.34x
Render:     5.608626 / 1.026621 = 5.46x

Total:     35.706997 / 6.523294 = 5.47x

What the hell? 6.36x scaling on a quad core?! Yep! These timings are the average timings when running the test for over a minute, and they're 100% consistent. This additional performance is gained by utilizing Hyperthreading. Hyperthreading allows each core to have TWO threads "loaded in" at the same time, and if one of the threads stalls for some reason (a cache miss or a branch prediction miss, for example) the core can very quickly switch to the other thread. If the program is extremely heavy on cache misses, this can indeed grant a 2x performance boost, but that is extremely rare of course. So basically, my code was extremely heavy on cache misses, which led to Hyperthreading making a huge difference. Before I attempted to look into this issue more, I asked Spacebeans to run the test on his 8-core AMD FX-8320 CPU and give me the timings. His results were much less optimistic:

Results of AMD FX-8320:
Task       <timing of 1 thread> / <timing of 8 threads> = <scaling>

Update:    16.018526 /  2.856606 = 5.61x
Sort:       6.851715 /  2.446308 = 2.80x
Collision: 16.008198 /  3.536307 = 4.53x
Render:     6.617009 /  2.420236 = 2.73x

Total:     45.587364 / 11.337404 = 4.02x

The only part of the test that scales reasonably well is the updating, which is the part that consists mostly of brute-force computations. Sorting and rendering have abysmal scaling, under 3x despite the program utilizing all 8 cores. His AMD CPU had pretty much the opposite scaling profile compared to my Intel CPU. AMD CPUs are known for having worse per-core performance than Intel CPUs, in large part because Intel has a better memory controller. If Hyperthreading helps a lot with sorting, collision detection and rendering performance, and AMD's CPUs are weaker in those tasks, it further implies that there are a shitload of cache misses going on in those parts of the test.

Taking a look at my Ship class revealed the problem rather quickly. Each Ship has 2 instances of Vector2 for position and velocity, and each Vector2 has 2 instances of Int96 for X and Y, which finally hold the actual primitive data types. Java doesn't let us allocate all these objects at the same location in memory, and may even move stuff around during garbage collections. What this means is that each entity had its data spread out over (1+2+4) = 7 different locations in RAM! When sorting the array, it would essentially first hit a cache miss when following the reference to each Ship instance, then hit a cache miss when following the position Vector2 of each Ship, and finally hit a cache miss when following the X coordinate stored as an Int96. Each Ship is essentially a linked list with 3 elements, which is horribly inefficient!

Looking into how each task works, I could tell the following:
 - Updating involves reading from all those 7 addresses, but the large amount of number crunching makes this task mostly dependent on arithmetic performance.
 - Sorting involves almost no number crunching, but does cause a large number of cache misses. In addition, we have a huge number of branch prediction failures, making this a very good case for Hyperthreading.
 - Collision detection causes fewer cache misses and branch prediction failures than sorting, but also involves number crunching.
 - Rendering has a small amount of number crunching, but seems to be dominated by a large number of cache misses.
But what can we do to solve the cache miss issue? Java doesn't let us decide where our instances are stored!
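For comparison, one way to get this control without any library is to flatten the hierarchy into a single primitive array, one contiguous slot per ship. The four-double layout below is purely illustrative (the real Ship uses two Int96 values per vector), but it shows the locality idea:

```java
// Illustrative flat layout: each ship occupies one contiguous 4-double slot,
// so sorting/updating touches one cache line per ship instead of chasing
// Ship -> Vector2 -> Int96 references. Field order is an assumption.
class FlatShips {
    static final int STRIDE = 4; // posX, posY, velX, velY
    final double[] data;

    FlatShips(int count) { data = new double[count * STRIDE]; }

    double posX(int i) { return data[i * STRIDE]; }

    void setPos(int i, double x, double y) {
        data[i * STRIDE]     = x;
        data[i * STRIDE + 1] = y;
    }

    void setVel(int i, double vx, double vy) {
        data[i * STRIDE + 2] = vx;
        data[i * STRIDE + 3] = vy;
    }

    // Linear walk over a packed array: hardware prefetch friendly.
    void update(double dt) {
        for (int i = 0; i < data.length; i += STRIDE) {
            data[i]     += data[i + 2] * dt;
            data[i + 1] += data[i + 3] * dt;
        }
    }
}
```

The trade-off is losing object identity and method dispatch, which is exactly the gap a struct library tries to paper over.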



Enter LibStruct! LibStruct allowed me to redefine my Ship, Vector2 and Int96 classes to be allocated as basic structs instead of real Java instances. Using the malloc() and free() methods provided by the library, I could essentially force a much better memory layout for my entities, so that each Ship struct lay (most of the time) right next to its Vector2s and Int96s. When the CPU hits a cache miss, it loads in a large block of data, which means that it often also fetches the Vector2s and Int96s of a Ship when the Ship is accessed. The question is, how much does this cache locality matter?

Results of Intel i7-4770K:
Task       <timing of 1 thread> / <timing of 8 threads> = <scaling>

Update:    12.405540 / 2.489921 = 4.98x
Sort:       4.348568 / 0.722247 = 6.02x
Collision: 13.828989 / 2.353232 = 5.88x
Render:     4.341144 / 0.971539 = 4.47x

Total:     34.966988 / 6.574033 = 5.32x

Damn, looks like it didn't help much here. We see some overhead in the updating and collision detection parts due to the structs, but this is outweighed by better sorting and rendering performance in both single-threaded and multi-threaded tests, implying that the goal of improving cache performance was achieved to some extent. The reduced scaling of sorting, collision detection and rendering also implies that Hyperthreading now has less of an impact. Let's take a look at the AMD FX-8320!

Results of AMD FX-8320:
Task       <timing of 1 thread> / <timing of 8 threads> = <scaling>

Update:    20.080358 / 3.403189 = 5.90x
Sort:       6.043065 / 0.941988 = 6.42x
Collision: 17.833417 / 2.317607 = 7.69x
Render:     6.068055 / 0.952332 = 6.37x

Total:     49.985323 / 7.626102 = 6.55x

Holy shit. We see a MASSIVE performance boost in the most memory-intensive parts of the test. Sorting performance is up by 160%, collision detection performance is up 53% and rendering performance is up 154%, while updating is 16% slower. In total, this adds up to a 49% boost in overall performance, making the AMD FX-8320 competitive with the Intel i7-4770K, as it should be. Scaling is also massively improved, mostly hovering at around 6x but achieving almost linear scaling for collision detection.


So there you have it!

Myomyomyo.
Online Roquen
« Reply #1 - Posted 2014-05-19 07:54:08 »

I'd want to stick with prim sized values.  Use 32-bit integers.  But I have a 64-bit machine on a 64-bit VM?  Sure, but HotSpot devs have spent less time writing transforms for 64-bit values, so you'll get better performance with 32-bit.  So you might want to think about partitioning your space and using local coordinates.

Another thing to think about is some data partitioning and laying out spatial data in a flat double buffered array.  That way the collision detection could be multithreaded.  One array is read-only for sim step 'n' and the other write only for sim-step 'n+1' so no false sharing, then flip.
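The double-buffered layout described above might look like this in miniature: one flat array is read for sim step n, the other written for step n+1, then the two are flipped. The per-entity "simulation" below is a placeholder, and the sketch is single-threaded; the point is that workers reading the step-n buffer never touch the buffer being written:

```java
// Double-buffered flat array: read step n, write step n+1, then flip.
// Readers of `read` and writers of `write` never share a cache line,
// which is what avoids false sharing in the threaded version.
class DoubleBuffer {
    double[] read, write;

    DoubleBuffer(int n) { read = new double[n]; write = new double[n]; }

    void step() {
        for (int i = 0; i < read.length; i++)
            write[i] = read[i] + 1.0; // placeholder "simulation"
        double[] tmp = read; read = write; write = tmp; // flip buffers
    }
}
```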
Offline Endos
« Reply #2 - Posted 2014-05-19 08:32:35 »

Why do you need to compute 100 000 entities if it's impossible to have all of them on screen? Compute only what is on screen and you will get 60 fps easily.

Offline ctomni231

« Reply #3 - Posted 2014-05-19 08:44:55 »

I think it is because it is simulating all the entities at once. Therefore, since it is online, I believe faking it won't cut it for accuracy of all the spaceships.

Exactly how precise does the collision detection have to be? Even though I love the idea of controlling the data through structs, possibly just packing the relevant data into one integer would yield better results. Honestly, the GC always worked against me when dealing with multiple creations and deletions even on the smallest scale.

Online Roquen
« Reply #4 - Posted 2014-05-19 08:55:43 »

Quote
Hyperthreading allows each core to have TWO threads "loaded in" at the same time, and if one of the threads stalls for some reason ...
No, that's not quite how it works.  Each core has two threads and only partial hardware support for each (like each having its own register set).  The "first" thread gets to use any available resource as if it were running by itself (not quite true, but close enough) at a given clock tick.  Once the first thread has chosen what execution units it's using that tick, the second looks at what execution units it wants to use, and if any are "free" (the first thread isn't using them) it jumps in and uses them.  So even if the first thread isn't stalled the second thread can (usually) make progress.  Of course, if the "first" thread stalls and isn't swapped out, then the "second" thread will be running at full rate.  This is the hand-wavey high-level view.
Online princec

« Reply #5 - Posted 2014-05-19 12:26:12 »

The other thing about hyperthreading is that quite often one core will stall completely waiting for memory accesses, and while it's waiting it's not using any processing units at all so the theory is the other hardware thread gets to jump in a bit more than might otherwise be expected.

Cas Smiley

Offline theagentd
« Reply #6 - Posted 2014-05-19 14:39:51 »

I'd want to stick with prim sized values.  Use 32-bit integers.  But I have a 64-bit machine on a 64-bit VM?  Sure, but HotSpot devs have spent less time writing transforms for 64-bit values, so you'll get better performance with 32-bit.  So you might want to think about partitioning your space and using local coordinates.

Another thing to think about is some data partitioning and laying out spatial data in a flat double buffered array.  That way the collision detection could be multithreaded.  One array is read-only for sim step 'n' and the other write only for sim-step 'n+1' so no false sharing, then flip.
What exactly are you proposing? I currently have an Int96 class which holds three 32-bit integers and emulates 96-bit math (I only need addition and subtraction). Having coordinates relative to certain points in space would possibly simplify certain calculations, but problems are introduced when sorting entities and running collision detection, since I essentially need to be able to determine their order. Although this can be solved, it does carry a certain amount of overhead. I have yet to experiment with this, but it might be worth looking into.

I already have very good multithreading working for my collision detection. If you're interested, feel free to look at my code: http://www.java-gaming.org/?action=pastebin&id=944. It's inspired by the Bentley-Ottmann algorithm.
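theagentd's threaded traversal is in the pastebin link above; for readers who want the underlying idea, the textbook single-threaded broad phase over entities sorted by their minimum X looks roughly like this (the interval representation is illustrative, not his code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Classic sweep-and-prune broad phase in 1D: sort by minX, then for each
// entity scan forward only until the next entity's minX passes our maxX.
class Sweep {
    // intervals[i] = {minX, maxX}; returns index pairs whose X ranges overlap
    static List<int[]> overlapPairs(double[][] intervals) {
        Integer[] order = new Integer[intervals.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (p, q) -> Double.compare(intervals[p][0], intervals[q][0]));

        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < order.length; i++) {
            int a = order[i];
            for (int j = i + 1; j < order.length; j++) {
                int b = order[j];
                // Sorted by minX: once b starts past a's end, nothing later overlaps a.
                if (intervals[b][0] > intervals[a][1]) break;
                pairs.add(new int[]{a, b});
            }
        }
        return pairs;
    }
}
```

Because the set stays almost sorted between frames, the sort step amortizes well, which is why the thread pairs it with an insertion sort.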


Why do you need to compute 100 000 entities if it's impossible to have all of them on screen? Compute only what is on screen and you will get 60 fps easily.
This is for a server, so there is no "screen". The server needs to run accurate computations of all entities all the time. The clients on the other hand will not need to run any collision detection or sorting, and will only need to show an approximation of the entities that the client can see.

Exactly how precise does the collision detection have to be? Even though I love the idea of controlling the data through structs, possibly just packing the relevant data into one integer would yield better results. Honestly, the GC always worked against me when dealing with multiple creations and deletions even on the smallest scale.
The collision detection currently has at least millimeter precision, depending on the speed difference between the two colliding entities. Acceleration and collision detection are actually done using doubles instead of Int96s. Certain values, such as the distance between an entity and a planet (for calculating gravity), are a much better fit for floating point, since the precision requirements get looser the larger the distance is, so I only calculate what I need to with my Int96 class. Here's some of the code being run when updating the game, calculating the gravity a certain planet applies to a ship:
StructVector2 temp = new StructVector2(new StructInt96(), new StructInt96()); // Stack allocated! No garbage! =D

temp.set(planet.getPosition()).sub(ship.getPosition());

// Convert dx and dy to doubles
double dx = temp.getXAsDouble();
double dy = temp.getYAsDouble();

double distSqrd = dx*dx + dy*dy;
double dist = Math.sqrt(distSqrd);

double mul = SpaceSimulation.GRAVITATIONAL_CONSTANT * mass / (dist * distSqrd);

ship.addVelocity(dx*mul, dy*mul); // Convert back to Int96

I'm not sure how packing all the data of a ship into an integer would work when just the position and velocity of the ship require around 384 bits, far exceeding what I can fit in a 32-bit integer. GC performance shouldn't be a problem if Riven has done his job right with the structs' malloc() and free() functions; the test program at the moment does not generate any garbage at all in the game loop (except for Strings when printing stuff). As a third point, although the game is real-time, it's far from twitch-based. The game is meant to be played over much longer periods of time. I was once a fan of Ogame, but was annoyed to death by the fact that it was basically a new skin over every other generic browser game in that category. It'd be perfectly acceptable to occasionally get spikes on the server side, since the clients will be around 10 seconds behind the server at any time.


Quote
Hyperthreading allows each core to have TWO threads "loaded in" at the same time, and if one of the threads stalls for some reason ...
No, that's not quite how it works.  Each core has two threads and only partial hardware support for each (like each having its own register set).  The "first" thread gets to use any available resource as if it were running by itself (not quite true, but close enough) at a given clock tick.  Once the first thread has chosen what execution units it's using that tick, the second looks at what execution units it wants to use, and if any are "free" (the first thread isn't using them) it jumps in and uses them.  So even if the first thread isn't stalled the second thread can (usually) make progress.  Of course, if the "first" thread stalls and isn't swapped out, then the "second" thread will be running at full rate.  This is the hand-wavey high-level view.
Interesting! That means that Hyperthreading is even more advanced than I thought, which makes sense considering the performance increase I got in sorting and collision detection.

Offline theagentd
« Reply #7 - Posted 2014-05-19 21:27:13 »

I did some optimizations. I realized that my Ship structs were 4 bytes larger than they were supposed to be and fixed that. I also moved some stuff that didn't have to be repeated for each iteration of a loop outside of the loop.

i7-4770K:
Task       <timing of 1 thread> / <timing of 8 threads> = <scaling>

Update:    11.785027 / 2.362425 = 4.99x
Sort:       4.363652 / 0.675312 = 6.46x
Collision: 12.803392 / 2.093591 = 6.12x
Render:     4.018574 / 0.779708 = 5.15x

Total:     32.999605 / 5.943186 = 5.55x


Got a decent 10.6% speed up, and at least when using 8 threads the struct version is now faster in all four parts of the game loop!

Offline Spacebeans
« Reply #8 - Posted 2014-05-19 22:17:25 »

I have absolutely no idea what this is, but thanks for letting me test it!
Offline Longor1996
« Reply #9 - Posted 2014-05-20 10:57:32 »

Could LibStruct be useful when making a raytracer in Java?
The biggest problems I have in my raytracer are the creation/management of vectors (and I am already reusing them), and the use of methods which I can't manually inline without causing a big mess.

That could be a good example for Multithreading when I think about it.
Gotta try that out, with LibStruct.

Have a nice day.

- Longor1996

Sorry for my bad English! That's because I am from Germany.
Offline Nate

« Reply #10 - Posted 2014-05-20 15:43:48 »

How about showing some code so I can see the difference with and without LibStruct without putting in much effort? Cheesy

Online Riven
« League of Dukes »

« Reply #11 - Posted 2014-05-20 15:46:01 »

Synthetic benchmark results:

Making 10M instances of Vec3:
tInstance1              47ms    209M/s     // new RegularVec3()
tStackAlloc1            66ms    149M/s     // new Vec3() once in 1 method, slow, there is overhead per method to fetch the struct-stack
tStackAlloc1N           16ms    610M/s     // new Vec3() 10 times in a loop in 1 method
tStackAlloc10N          14ms    702M/s     // new Vec3() 10 times unrolled in 1 method
tStackAllocArr          31ms    319M/s     // new Vec3[10] (implies stack allocation of 10 structs and a container)
tMemoryAllocFree       111ms     89M/s     // Struct.free(Struct.malloc(Vec3.class));
tMemoryAllocFreeArr     60ms    165M/s     // Vec3[] vecs = Struct.malloc(Vec3.class, 100); for(Vec3 vec: vecs) Struct.free(vec);
tMemoryAllocBulkFreeArr 35ms    280M/s     // Struct.free(Struct.malloc(Vec3.class, 100));


To reproduce, 'just' run the StructTest class with the LibStruct agent.

Hi, appreciate more people! Σ ♥ = ¾
Learn how to award medals... and work your way up the social rankings
Online Riven
« Reply #12 - Posted 2014-05-20 15:50:58 »

@Nate: I'll work on that, later today.

How does one showcase a technology that most likely slightly complicates your source code, yet reduces memory latency and thwarts GC pauses... I'm hesitant to write two full-fledged visual tech-demos Emo

Offline saucymeatman
« Reply #13 - Posted 2014-05-20 20:42:40 »

Is this way over anyone else's head? haha
Offline lcass
« Reply #14 - Posted 2014-05-20 21:01:10 »

I get a little bit of it, but he is "number dumping" a lot.
Online Riven
« Reply #15 - Posted 2014-05-20 21:09:19 »

The thing is that performance is all about numbers - it aint sexy, so it's hard to convince people to join in.

Offline theagentd
« Reply #16 - Posted 2014-05-20 22:16:21 »

... it aint sexy...
What the hell is wrong with you?  Angry

Online Riven
« Reply #17 - Posted 2014-05-20 22:21:33 »

Poor grammar skills. Let me correct myself: "it is not heartthrobish"

Online Riven
« Reply #18 - Posted 2014-05-28 11:30:43 »

How about showing some code so I can see the difference with and without LibStruct without putting in much effort? Cheesy

I added 2 versions of a demo app to the LibStruct repo: one based on objects, one based on structs.

With structs: calculation takes ~45ms, GC never runs (!) after init - we allocate about 64MB per frame.
With objects: calculation takes ~75ms, GC pauses about twice per second with ~30ms collections.

It needs to be mentioned that this demo does not take advantage of memory locality - all structs are scattered over memory - therefore it only tests allocation performance and I/O throughput while being just as cache-thrashing as the version based on objects.

https://github.com/riven8192/LibStruct/tree/master/src/test/net/indiespot/demo/softlylit/objects
https://github.com/riven8192/LibStruct/tree/master/src/test/net/indiespot/demo/softlylit/structs




A screenshot of the SoftlyLit demo with 4096 (64²) randomly placed light sources and 64 moving/rotating walls:
You can also peek at this poor quality GIF video (40MB!)




To run the demo implemented with objects:
Main Class: 
   test.net.indiespot.demo.softlylit.objects.Demo

JVM args:
   -Djava.library.path=./lib/lwjgl-2.9.1/native/windows


To run the demo implemented with structs:
Main Class: 
   test.net.indiespot.demo.softlylit.structs.Demo

JVM args:
   -javaagent:struct-agent.jar=test/net/indiespot/demo/softlylit/structs/struct-def.txt
   -Djava.library.path=./lib/lwjgl-2.9.1/native/windows

Online Roquen
« Reply #19 - Posted 2014-05-28 11:58:52 »

As a related side-note: in days of yore it was claimed that escape analysis could only mark an object as non-escaping (and stack-allocate it and make scalar replacement possible) IF it's being allocated in an effective method that isn't passing it to any other method.  By effective here I mean after any inlining of methods has occurred.  I haven't had the chance to look into this.
Online Riven
« Reply #20 - Posted 2014-05-28 12:02:23 »

For years, and to this day, Escape Analysis has only worked in extremely trivial use cases, which makes it pretty much guaranteed that it won't work where it's most needed. Further, in real life it just happens that you need to put objects in a data structure, like an array or a list. That pretty much rules out EA for the next two decades.

Anyway, LibStruct is still very immature, there are lots of optimizations left to do, bugs to fix, a few design choices to work out (compound structs, which are currently implemented through 'views') and the control flow analysis could be better, as it currently gets rather confused by switch statements, which is why the structs version of the demo uses a slow if/else chain in the performance critical section - yet still is faster.

Offline Orangy Tang

« Reply #21 - Posted 2014-05-28 12:16:55 »

theagentd: Could you post a high-level description of what's actually going on under the hood on libStruct? I'm guessing you're generating bytecode to do some of this? What does that do for portability (I'm thinking Android and libGDX/iPhone platforms)?

[ TriangularPixels.com - Play Growth Spurt, Rescue Squad and Snowman Village ] [ Rebirth - game resource library ]
Online Riven
« Reply #22 - Posted 2014-05-28 12:21:16 »

LibStruct is developed by me - theagentd plays around with it, yelling at me when it breaks again (and again). I'm his main cause of hair loss these days.

Portability wise, you're bound to the all powerful HotSpot VM.

Offline Orangy Tang

« Reply #23 - Posted 2014-05-28 12:43:26 »

LibStruct is developed by me - theagentd plays around with it, yelling at me when it breaks again (and again). I'm his main cause of hair loss these days.

Portability wise, you're bound to the all powerful HotSpot VM.
Ah, I didn't realise that, sorry :S

Anyway, could you provide a bit of an overview of what it's doing under the hood? Or point me to something about it? I remember reading your posts about such hackery a while back but didn't realise you turned it into a proper working thing. Smiley

Online Riven
« Reply #24 - Posted 2014-05-28 12:45:51 »

The idea behind LibStruct is to get rid of any instance related behaviour of a type, including all non-static methods.

Original source code:
public class Vec3 {
   public float x, y, z;

   public Vec3() {
      x = y = z = 1337;
   }

   public void load(float x, float y, float z) {
      this.x = x;
      this.y = y;
      this.z = z;
   }

   public void load(Vec3 xyz) {
      this.x = xyz.x;
      this.y = xyz.y;
      this.z = xyz.z;
   }
}

public void run() {
   Vec3 pos = new Vec3();

   Vec3 vel = new Vec3();
}


Rewritten code:
public class Vec3 {

   public static void _<init>_(int _this) {
      fput(_this, 0, 1337);
      fput(_this, 4, 1337);
      fput(_this, 8, 1337);
   }

   public static void load(int _this, float x, float y, float z) {
      fput(_this, 0, x);
      fput(_this, 4, y);
      fput(_this, 8, z);
   }

   public static void load(int _this, int _xyz) {
      fput(_this, 0, fget(_xyz, 0));
      fput(_this, 4, fget(_xyz, 4));
      fput(_this, 8, fget(_xyz, 8));
   }
}

public void run() {
   StructAllocationStack _sas = StructEnv.getFastThreadLocalStack();
   _sas.save();

   int pos = StructEnv.allocate(_sas, 12); // sizeof(Vec3) == 12 bytes
   Vec3._<init>_(pos); // run specified constructor, which is now a static method

   int vel = StructEnv.allocate(_sas, 12);
   Vec3._<init>_(vel);

   _sas.restore();
}


The int is a pointer with the lowest 2 bits cut off (they are known to be zeroes as structs are 4 byte aligned).
This allows addressing up to 16GB of memory using LibStruct, instead of the 4GB you'd expect from 32 bit integers.
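The pointer trick can be sketched directly: with 4-byte alignment the low two bits of every address are zero, so dropping them lets a 32-bit handle span 2^32 * 4 bytes = 16 GB. Long values stand in for native addresses here, and the class name is mine, not LibStruct's:

```java
// Illustrative compressed "pointer": assumes addr is 4-byte aligned and
// below 16 GB, so the low two (zero) bits can be dropped and recovered.
final class Ptr {
    static int compress(long addr) {
        return (int) (addr >>> 2); // discard the two always-zero bits
    }

    static long expand(int handle) {
        // treat the handle as unsigned before shifting back up
        return (handle & 0xFFFFFFFFL) << 2;
    }
}
```

Note the round trip survives even when the compressed handle is negative as a signed int, which is why the unsigned mask on expansion matters.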

Offline theagentd
« Reply #25 - Posted 2014-05-28 13:14:36 »

Riven, I exceeded my maximum PM limit so I can't respond anymore.  Sad

Online Riven
« Reply #26 - Posted 2014-05-28 13:17:56 »

Riven, I exceeded my maximum PM limit so I can't respond anymore.  Sad
You chatty bastard. Changing the PM rate limit will be quite an adventure, so why not post your rant in pastebin? Smiley
