Java-Gaming.org    
The JVM and bit errors
Offline Varkas
« Posted 2012-12-07 11:46:18 »

RAM now and then suffers from bit errors. That is, a bit on the heap simply flips from 0 to 1 or the other way round. It happens very rarely, but the more memory an application uses and the longer it runs, the more likely it is to be affected by such a bit error.

Does the JVM have any means to detect such errors?

At the moment I assume that if a bit error hits a value field of a Java object allocated on the heap, the value just changes. I have no idea what happens if a bit error hits a reference field, but my bet is that the JVM will just crash once the faulty reference is followed. Bit errors inside the JVM's own structures most likely result in "undefined behavior", my favorite term from the C and C++ standards.
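To make "the value just changes" concrete, here is a minimal sketch that simulates a single flipped bit in a primitive value. The class and method names (BitFlipDemo, flipIntBit, flipDoubleBit) are made up for illustration, and the flip is done in software with XOR; a real hardware fault would of course not go through any such call.

```java
final class BitFlipDemo {
    /** Returns value with one bit inverted; bit 0 is the least significant bit. */
    static int flipIntBit(int value, int bit) {
        return value ^ (1 << bit);
    }

    /** Same idea for a double, working on its raw 64-bit IEEE 754 representation. */
    static double flipDoubleBit(double value, int bit) {
        long raw = Double.doubleToRawLongBits(value);
        return Double.longBitsToDouble(raw ^ (1L << bit));
    }

    public static void main(String[] args) {
        int hitPoints = 100; // hypothetical game value
        System.out.println(flipIntBit(hitPoints, 0));  // low bit: off by one
        System.out.println(flipIntBit(hitPoints, 30)); // high bit: off by about a billion
        System.out.println(flipIntBit(hitPoints, 31)); // sign bit: value turns negative

        double speed = 1.5; // hypothetical game value
        System.out.println(flipDoubleBit(speed, 0));   // lowest mantissa bit: tiny change
        System.out.println(flipDoubleBit(speed, 62));  // top exponent bit: no longer a finite number
    }
}
```

How much damage a flip does depends entirely on which bit is hit: a low bit is barely noticeable, a high bit or an exponent bit turns the value into nonsense without any exception being thrown.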

I know such errors are extremely rare, but I'm curious if someone knows more about this topic and can shed some light on it :)

if (error) throw new Brick(); // Blog (german): http://gedankenweber.wordpress.com
Offline princec

JGO Kernel


Medals: 367
Projects: 3
Exp: 16 years


Eh? Who? What? ... Me?


« Reply #1 - Posted 2012-12-07 11:53:48 »

You're screwed basically. There is no way a JVM can recover from this. There's every chance of a bit flipping in the kernel. Or in some driver. Any which way you look at it, it is likely to lead to some subprocess crashing. The JVM will likely crash, or if you're lucky, only pure data gets corrupted and your process throws some exception instead of doing something weird but allowed.

The problem must be addressed at the hardware level with parity checked RAM and other related solutions that you find on expensive servers (ever wondered why real servers are more expensive than commodity PCs?)

Cas :)

Offline Varkas
« Reply #2 - Posted 2012-12-07 12:44:52 »

Yes, servers have error checking and correction for most data transfers and also the RAM.

But Java gamers usually don't, that's why I was asking ;)

Thanks for the answer, it clarified quite some of my questions!

Some simulation games run quite long - I mean, players start a game, play some hours, save the game and continue another day. Assuming that value errors do not immediately crash the game, they will be saved with the game data and accumulate over time. Also, these simulation games often need a lot of RAM. I wonder how likely it is that such games will be affected by this sort of error ... I'll go and google 8)

Edit: Wikipedia says:

Quote
Recent studies give widely varying error rates with over 7 orders of magnitude difference, ranging from 10^-10 to 10^-17 error/bit·h, roughly one bit error, per hour, per gigabyte of memory to one bit error, per century, per gigabyte of memory

The one bit per hour per gigabyte seems a bit much, but if people overclock their rams I wouldn't be surprised if the rate is actually that high.
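For anyone who wants to check the conversion from "error/bit·h" to "errors per gigabyte", here is a small sketch. It assumes a gigabyte of 2^30 bytes and a constant, independent error rate per bit; the class and method names are made up for this example.

```java
final class BitErrorRates {
    /** Number of bits in one GiB (2^30 bytes), an assumption about what "gigabyte" means here. */
    static final double BITS_PER_GIB = 8.0 * (1L << 30);

    /** Converts a rate in errors per bit per hour into errors per GiB per hour. */
    static double errorsPerGiBPerHour(double errorsPerBitHour) {
        return errorsPerBitHour * BITS_PER_GIB;
    }

    /** Mean number of hours between two errors for the given amount of memory. */
    static double hoursBetweenErrors(double errorsPerBitHour, double gib) {
        return 1.0 / (errorsPerGiBPerHour(errorsPerBitHour) * gib);
    }

    public static void main(String[] args) {
        // The two ends of the range from the quote above.
        for (double rate : new double[] {1e-10, 1e-17}) {
            System.out.printf("%.0e error/bit*h -> %.3g errors per GiB per hour (one error every %.3g hours)%n",
                    rate, errorsPerGiBPerHour(rate), hoursBetweenErrors(rate, 1.0));
        }
    }
}
```

At the high end this gives a bit under one error per GiB per hour, which matches the "roughly one bit error per hour per gigabyte" reading of the quote.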

Offline princec

JGO Kernel


Medals: 367
Projects: 3
Exp: 16 years


Eh? Who? What? ... Me?


« Reply #3 - Posted 2012-12-07 13:48:09 »

Depends on luck of the draw and manufacturing batch quality and all sorts like power stability. You ever wondered why PCs crash inexplicably now and again? Well, this can be one of those reasons. We just live with it. This is why they don't use cheapass computers on space stations too I suppose.

Cas :)

Online Roquen
« Reply #4 - Posted 2012-12-07 13:52:08 »

Man oh man...that's why I'm always wearing my aluminum foil hat.  I hate it when my brain experiences a bit-error.  And it works!  I've never had to reboot...well from external causes that is.
Offline Varkas
« Reply #5 - Posted 2012-12-07 13:56:11 »

I heard that brain reboots are quite painful, so lucky you :D

For the paranoid (use as JUnit test):

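import java.util.Arrays;

import junit.framework.TestCase;

/**
 * Fills a block of heap memory with a known bit pattern and re-reads it
 * forever, failing the test if any byte ever changes (JUnit 3 style).
 */
public class MemoryTest extends TestCase
{
    private static final byte PATTERN = (byte) 0xAA; // alternating bits 10101010

    public void testMemory() throws InterruptedException
    {
        int len = 1 << 18; // 256 KiB block

        long t0 = System.currentTimeMillis();

        byte[] block = new byte[len];
        Arrays.fill(block, PATTERN);

        int pass = 1;
        while (true) // runs until a bit error is found or the test is aborted
        {
            System.out.println("Starting pass " + pass + " eta: " + ((len * 6L) / 1000L) + " seconds.");

            for (int i = 0; i < len; i++)
            {
                byte v = block[i];
                if (v != PATTERN)
                {
                    System.out.println("Error at pass " + pass + "  position " + i + " with value " + v);

                    // Now make the test fail (expected value first, actual value second).
                    assertEquals(PATTERN, v);
                }
                Thread.sleep(5); // keeps CPU load low; about 5 ms per byte
            }

            long t1 = System.currentTimeMillis();
            // Elapsed time is measured from t0, so it is cumulative across passes.
            System.out.println("Pass " + pass + " finished, passed time: " + ((t1 - t0) / 1000) + " seconds.");
            pass++;
        }
    }
}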


I'm curious if this will ever detect something. Maybe I'll run it as a background task for a while 8)

Edit:

It feels a bit like running a neutrino detector, waiting for the one event per year.

Starting pass 1 eta: 1572 seconds.
Pass 1 finished, passed time: 1383 seconds.
Starting pass 2 eta: 1572 seconds.
Pass 2 finished, passed time: 2726 seconds.
Starting pass 3 eta: 1572 seconds.
Pass 3 finished, passed time: 4073 seconds.
Starting pass 4 eta: 1572 seconds.
Pass 4 finished, passed time: 5420 seconds.
Starting pass 5 eta: 1572 seconds.


Offline Varkas
« Reply #6 - Posted 2012-12-07 14:01:29 »

Quote
This is why they don't use cheapass computers on space stations too I suppose.

Radiation mostly. Space-rated electronics are not built on a modern 22nm process but use bigger structures, which can take more random influence before a signal fails. Often old designs, like the 6502 and 8080 and such. Hardened versions of those, even.

This is why robotic civilizations don't like outer space. It hurts too much in the electronics ;)

Offline Best Username Ever

Junior Member





« Reply #7 - Posted 2012-12-07 21:56:23 »

It has to be done in hardware. Since PCs also store program instructions in memory, there is no way a program could protect against such a problem. If one of the bits of some instruction got toggled, there goes your redundancy insurance.

I'm skeptical of the 10^-10 metric applying to most consumer computers.
Offline Varkas
« Reply #8 - Posted 2012-12-10 22:53:28 »

Quote
It feels a bit like running a neutrino detector, waiting for the one event per year.

Nothing so far ...

Testsuite: villagers.utils.MemoryTest
Starting pass 1 eta: 1572 seconds.
Pass 1 finished, passed time: 1333 seconds.
Starting pass 2 eta: 1572 seconds.
Pass 2 finished, passed time: 2665 seconds.
Starting pass 3 eta: 1572 seconds.
Pass 3 finished, passed time: 4008 seconds.
Starting pass 4 eta: 1572 seconds.
Pass 4 finished, passed time: 5360 seconds.
Starting pass 5 eta: 1572 seconds.
Pass 5 finished, passed time: 6720 seconds.
Starting pass 6 eta: 1572 seconds.
Pass 6 finished, passed time: 8075 seconds.
Starting pass 7 eta: 1572 seconds.
Pass 7 finished, passed time: 10741 seconds.
Starting pass 8 eta: 1572 seconds.
Pass 8 finished, passed time: 12109 seconds.
Starting pass 9 eta: 1572 seconds.
Pass 9 finished, passed time: 13439 seconds.
Starting pass 10 eta: 1572 seconds.
Pass 10 finished, passed time: 14780 seconds.
Starting pass 11 eta: 1572 seconds.
Pass 11 finished, passed time: 16131 seconds.
Starting pass 12 eta: 1572 seconds.
Pass 12 finished, passed time: 17493 seconds.
Starting pass 13 eta: 1572 seconds.

... so at least the pessimistic end of the range (1 error per GB per hour) seems not to be the common case, or I was extremely lucky with this test run.
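A rough sanity check on how much this run can tell us: the test block is only 1<<18 bytes (256 KiB), so even at the 10^-10 error/bit·h end of the quoted range the expected number of flips in it is tiny. The sketch below does the arithmetic for the 17493 seconds logged above, assuming independent errors at a constant rate; the class and method names are made up for illustration.

```java
final class MemoryTestExpectation {
    /** Expected number of bit errors in a block of the given size over the given time. */
    static double expectedErrors(long bytes, double hours, double errorsPerBitHour) {
        return bytes * 8.0 * hours * errorsPerBitHour;
    }

    public static void main(String[] args) {
        long blockBytes = 1L << 18;    // size of the block in MemoryTest
        double hours = 17493 / 3600.0; // elapsed time after pass 12 in the log
        double worstCase = 1e-10;      // high end of the quoted range, in error/bit*h

        System.out.println("Expected errors: " + expectedErrors(blockBytes, hours, worstCase));
    }
}
```

Under these assumptions the expectation comes out at roughly a thousandth of an error, so a clean run is the likely outcome at either end of the range. A much larger block or a much longer run would be needed to tell the two apart.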

Offline CaptainJester

JGO Knight


Medals: 12
Projects: 2
Exp: 14 years


Make it work; make it better.


« Reply #9 - Posted 2012-12-11 03:16:01 »

Long running games will tend to save automatically as the game progresses. So if it crashes there will be a closer point of recovery.

Offline Varkas
« Reply #10 - Posted 2012-12-11 14:04:16 »

Yes, but what if the data was modified silently by an error before the save? The buggy data will be saved with the game then - the crash will happen later in some function using the data ... and if you load the borked game after the crash, it will just crash again when the function is triggered.

But as was pointed out, there isn't anything to do about it, unless you try to protect your data with your own integrity checks or use a mainboard with ECC RAM and such.

My testing is just out of curiosity, and to get a feeling for how frequently such problems might occur. It's not that I really want to do a lot about it. It's just that I have seen saved games with a playing time of weeks to years (I guess mostly idle, but still) and was wondering how many errors might sneak into such long running games.

Well, actually the core thought was how to make code robust enough to deal with a certain amount of buggy game data, to "fix" errors, or work with corrupt data without crashing. I'm somewhat sure that one can code in a way that will break easily, and also in a way that can deal with a number of problems. The latter is my goal for my current simulation game project, and I already had a few problems which the game engine couldn't clean up anymore - not caused by hardware faults, but software bugs. Still game crippling events, which would have required the player to start again. I want to find ways to do that better, kind of "self healing game code".
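One simple form of such "self healing" is a repair pass over loaded game data that forces every field back into its legal range and drops references that point nowhere. The sketch below is only an illustration of that idea; Villager, MAX_HEALTH, homeId and the repair rules are all hypothetical and not taken from any actual game.

```java
import java.util.ArrayList;
import java.util.List;

final class SaveGameRepair {
    static final int MAX_HEALTH = 100; // hypothetical upper limit
    static final int NO_HOME = -1;     // hypothetical marker for "no home assigned"

    /** Hypothetical game object with one value field and one index-style reference. */
    static final class Villager {
        int health;
        int homeId; // index into the list of homes, or NO_HOME

        Villager(int health, int homeId) {
            this.health = health;
            this.homeId = homeId;
        }
    }

    /** Repairs the villagers in place and returns how many fields had to be fixed. */
    static int repair(List<Villager> villagers, int homeCount) {
        int fixes = 0;
        for (Villager v : villagers) {
            // Clamp out-of-range values instead of crashing on them later.
            int clamped = Math.max(0, Math.min(MAX_HEALTH, v.health));
            if (clamped != v.health) {
                v.health = clamped;
                fixes++;
            }
            // A dangling index is replaced by the "no home" marker.
            if (v.homeId != NO_HOME && (v.homeId < 0 || v.homeId >= homeCount)) {
                v.homeId = NO_HOME;
                fixes++;
            }
        }
        return fixes;
    }

    public static void main(String[] args) {
        List<Villager> villagers = new ArrayList<>();
        villagers.add(new Villager(50, 0));
        villagers.add(new Villager(1073741874, 7)); // looks like a flipped high bit and a bad index
        System.out.println("Fields repaired: " + repair(villagers, 3));
    }
}
```

This only catches values that are visibly illegal. A corrupted value that still lies inside its legal range passes unnoticed, which is where redundancy or checksums would have to come in.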


Online Riven
« League of Dukes »

JGO Overlord


Medals: 783
Projects: 4
Exp: 16 years


Hand over your head.


« Reply #11 - Posted 2012-12-11 18:49:55 »

Always assume the memory & CPU are 100% correct. You cannot recover from flipped bits, don't waste time on attempting to do so. You should spend your time on getting your game ready to be published instead. That's what everybody (else) cares about.

Hi, appreciate more people! Σ ♥ = ¾
Offline sproingie

JGO Kernel


Medals: 202



« Reply #12 - Posted 2012-12-11 20:04:08 »

Banks tend to use Tandem machines or some equivalent, which actually runs everything on two physically separate machines and compares the results to ensure they're the same.  The Space Shuttle, primitive as it was, used four processors and took the consensus result.

Your game on the other hand ... just let it break.  You should at least add a checksum to saved game files to protect against gremlins far more common than random bit flips.
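A minimal sketch of the checksum idea, using java.util.zip.CRC32 from the standard library. The layout (payload length, then CRC, then payload) and the names ChecksummedSave, wrap and unwrap are assumptions made for this example, not an established save-game format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;

final class ChecksummedSave {
    private static final int HEADER_BYTES = 4 + 8; // int length + long checksum

    /** Prefixes the payload with its length and CRC32 checksum. */
    static byte[] wrap(byte[] payload) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(payload.length);
        out.writeLong(checksum(payload));
        out.write(payload);
        out.flush();
        return bytes.toByteArray();
    }

    /** Returns the payload, or throws IOException if the file is truncated or corrupted. */
    static byte[] unwrap(byte[] file) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(file));
        int length = in.readInt();
        long expected = in.readLong();
        if (length < 0 || length > file.length - HEADER_BYTES) {
            throw new IOException("save file has an impossible payload length");
        }
        byte[] payload = new byte[length];
        in.readFully(payload);
        if (checksum(payload) != expected) {
            throw new IOException("save file checksum mismatch");
        }
        return payload;
    }

    private static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }
}
```

A game would write the result of wrap to disk and refuse (or fall back to an older autosave) when unwrap throws. Note that this only detects damage that happens after the checksum was computed; data that was already wrong in memory gets a perfectly valid checksum.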
Offline Varkas
« Reply #13 - Posted 2012-12-12 22:09:28 »

Quote
Always assume the memory & CPU are 100% correct. You cannot recover from flipped bits, don't waste time on attempting to do so. You should spend your time on getting your game ready to be published instead. That's what everybody (else) cares about.

I agree on the waste of time part.

But you can recover from bit errors in data, at least from some. It requires redundancy in the data and error correction algorithms. (Easiest way: keep all data three times, and in case of inconsistencies assume the two identical data sets to be correct and overwrite the third. In case of a double error, that is in case of three inconsistent sets, you've lost. There are better algorithms, but it's not of relevance anymore, just for the sake of completeness in case a future reader stumbles in here.)
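The "keep everything three times" scheme described above could be sketched like this. TripleStore and its methods are names invented for the example, and corrupt exists only to simulate a fault. The sketch protects nothing but the stored bytes: the code, the references to the copies and the JVM itself still sit in the same unprotected RAM.

```java
import java.util.Arrays;

final class TripleStore {
    private final byte[][] copies = new byte[3][];

    TripleStore(byte[] data) {
        for (int i = 0; i < copies.length; i++) {
            copies[i] = data.clone(); // three independent arrays on the heap
        }
    }

    /**
     * Returns the data that at least two copies agree on and overwrites the
     * remaining copy with it. Throws if all three copies differ.
     */
    byte[] read() {
        for (int i = 0; i < 3; i++) {
            int j = (i + 1) % 3;
            if (Arrays.equals(copies[i], copies[j])) {
                int k = (i + 2) % 3;
                copies[k] = copies[i].clone(); // repair (or simply refresh) the third copy
                return copies[i].clone();
            }
        }
        throw new IllegalStateException("all three copies disagree, data is lost");
    }

    /** Test hook: flips one bit in one copy to simulate a memory error. */
    void corrupt(int copy, int index, int bit) {
        copies[copy][index] ^= (byte) (1 << bit);
    }
}
```

Usage would look like `new TripleStore(bytes).read()` wherever the data is needed. Comparing whole arrays on every read is slow, which is one reason real systems use error correcting codes instead.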

Maybe I wasn't clear enough. Since princec's reply I've only continued this out of curiosity. It's not of any practical relevance for me anymore at this point. princec answered all the questions that I had, and I understood that there isn't much to do about it.

I continued this topic out of a more theoretical interest in this sort of event. If it disturbs, I'll just quit talking about it. I'm sorry :(

Some day I'll make a new topic about writing robust code and related questions, that should be something of more general interest, but this week I have too little time for discussions :(

Offline Best Username Ever

Junior Member





« Reply #14 - Posted 2012-12-12 23:42:34 »

That's called taking the consensus. How often do you want to take consensus? What happens if a reference (pointer) gets modified? Are you going to test every reference before every function call and array access? What about the stack? How will you know which variables are in a register and which aren't? Will you check every primitive type, too? What happens if the return address on the stack gets modified in the middle of another function and you return to the wrong place? What if a JUMP -10 opcode turns into a JUMP -11 opcode? Or an ADD -10 opcode?

You cannot achieve effective redundancy in software on the computer architecture of any machine you own. Think of it: your computers, phones, and game systems all store software in RAM. If you must rely on a software solution and your hardware only fetches the next opcode once, how can you possibly prevent RAM corruption from breaking your code as well as your data? The way computers work makes it impossible as well as pointless. Even if you could, it's a waste of time for something with such a tiny chance of happening. The time, development costs, hardware costs (to make up for running at a fraction of your normal speed), and energy costs would be outweighed by either buying higher quality RAM or a machine designed with redundancy in mind.
Offline theagentd
« Reply #15 - Posted 2012-12-13 00:21:08 »

You can't protect your computer from physics with software.

Myomyomyo.
Offline Varkas
« Reply #16 - Posted 2012-12-17 15:01:57 »

I'm sorry again. You're right.
