Java-Gaming.org    
advice for debugging weird error
Offline JasonB

Junior Member





« Posted 2004-06-30 18:32:11 »

One of our (server-side) apps runs perfectly for anything up to a couple of weeks, and then will suddenly start running at 100% of the CPU.  The problem is completely intermittent.  When it started happening a couple of months ago, it was during peak usage, then happened a week or so later at the lowest point of usage during the day, then happened 2 days later, then ran fine for another 3 weeks.

After the latest occurrence, top shows 3 threads that are causing the problem, but as far as I know, there's no way to map a PID to a Java thread.  The stack dump produced by kill -3 shows me nothing obvious, although perhaps I don't know what I should be looking for.
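
One possible way to make that mapping, offered as a sketch rather than a confirmed fix for this setup: on Linux, HotSpot thread dumps normally tag each thread with a field of the form nid=0x..., which is the native thread ID in hexadecimal, while top prints IDs in decimal. Whether the IDs line up this way on a 2004-era JDK and threading library is an assumption. The class name, method names and the sample dump lines below are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class NidLookup {
    // Convert a decimal thread ID (as shown by top) into the
    // "nid=0x..." token that a HotSpot thread dump uses.
    static String toNid(int tid) {
        return "nid=0x" + Integer.toHexString(tid);
    }

    // Return the dump lines whose nid token matches the given thread ID.
    // Tokens are compared whole, so 0x1092 does not match 0x10921.
    static List<String> findThreads(String dump, int tid) {
        String needle = toNid(tid);
        List<String> hits = new ArrayList<>();
        for (String line : dump.split("\n")) {
            for (String token : line.trim().split("\\s+")) {
                if (token.equals(needle)) {
                    hits.add(line.trim());
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical dump excerpt; real dumps carry more fields and stack frames.
        String dump =
            "\"worker-1\" prio=1 tid=0x08a1b2c0 nid=0x1092 runnable\n" +
            "\"worker-2\" prio=1 tid=0x08a1c3d0 nid=0x10a0 waiting on condition\n";
        // 4242 stands in for a thread ID read off top.
        for (String line : findThreads(dump, 4242)) {
            System.out.println(line);
        }
    }
}
```

Once the busy threads are identified, their stack traces in the same dump show which code they are spinning in. Taking two or three dumps a few seconds apart helps tell a thread stuck in one loop from one that is merely busy.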

The worst part about this problem is that I've been completely unable to reproduce it during testing.  Frustrated hair pulling, ahoy!

Anyone have any thoughts/advice...?

Thx
J
Offline blahblahblahh

JGO Coder


Medals: 1


http://t-machine.org


« Reply #1 - Posted 2004-06-30 19:12:50 »

1. perhaps this is better in networking, or is it nothing to do with games? :)

2. Could be a hacker. Never underestimate the number of strange problems they can cause, often through acts of random incompetence rather than insightful brilliance.

3. Which stress testing framework are you using?

malloc will be first against the wall when the revolution comes...
Offline JasonB

Junior Member





« Reply #2 - Posted 2004-06-30 20:49:02 »

1.  not really games.  sort of entertainment-related though.

2.  can't see anything obvious in terms of security breaches.  my sys admin is pretty damn on to it, in regard to security. in fact he's anal retentive, so I'm doubtful it's a hacker.  might be wrong of course, so I'll get him to take another look.

3.  using jakarta jmeter to stress test.  and I hate it.  currently hunting around for a console-based, scripted test tool.
kul_th_las
Guest
« Reply #3 - Posted 2004-07-01 21:08:41 »

Probably an obvious suggestion, but do your app logs point to anything out of the ordinary?

When the app goes to 100% CPU, is it still functioning, or does it seem to be caught in a bit of repeating code somewhere?

Does the CPU usage ever go back down? After 5 minutes? After 2 hours?

What sorts of issues do you see on the client side when the server app goes to 100%?

What's different about your test environment vs. your real-world environment? Of those things, which one of the seemingly small and unrelated things have you not tested against? For example, I once tracked down a printing problem to a change from standard time to daylight savings time - a PRINTING problem of all things. Get as far outside the box as you can.
Offline oNyx

JGO Coder


Medals: 2


pixels! :x


« Reply #4 - Posted 2004-07-01 21:49:35 »

100% cpu... hm.

There was a way to do that by sending weird paths to the servlet.

Something like http://<path to servlet>/../../../../../<a hundred times>/../../../

But that was fixed ages ago. However, it might be a good idea to update everything.

弾幕 ☆ @mahonnaiseblog
Offline JasonB

Junior Member





« Reply #5 - Posted 2004-07-02 04:52:01 »

That's my problem.  We're logging pretty comprehensively, but there's nothing obvious in the log files.

When the CPU hits 100%, everything still works, but -extremely- slowly.  Slowly enough to cause timeouts in the client browsers (mobile phones).

It certainly doesn't seem to recover after (up to) an hour.   But it has never been left longer than that, for obvious reasons.

Differences between the test and production environment...?  Well I've got no budget to duplicate production unfortunately, so we've mimicked it as best we can, on the hardware we've got to hand.  Operating System wise, they're both running Fedora and I'm pretty sure the same version.  Exactly the same version of JDK and same version of Jetty, Apache & mod_jk  (Thought about running with tomcat for a while, but that caused a few minor issues that need to be resolved first).  We're running the latest versions of everything we can.

In terms of some part of the system that isn't 'exercised' other than when the problem occurs -- there isn't one.  Everything is pretty much slammed every day.
Offline Middy

Junior Member




Java games rock!


« Reply #6 - Posted 2004-07-02 05:18:17 »

Well, what is your application solving?

Some of your algorithms might be superpolynomial (e.g. 2^n, where n is the input size) without you realising it. That means that as the data grows, the server will slowly kill itself. That could be what happened.

Don't think it has to be a very advanced algorithm that's superpolynomial. A famous NP-hard problem is SubSetSum; it goes like this:

"given a list of numbers, find out whether some subset of it sums to k"

So try to debug with the same data as you have in your server now.
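
To make the SubSetSum point concrete, here is a minimal brute-force sketch. It is only an illustration of 2^n growth, not anything taken from the application being discussed; the class name, the counter and the sample numbers are made up.

```java
class SubsetSum {
    // Number of subsets examined by the most recent call.
    static long checked;

    // Brute force: try every non-empty subset of nums and report whether
    // any of them sums to k. Assumes nums.length is below 63 so that the
    // subset can be encoded as bits of a long.
    static boolean hasSubsetWithSum(int[] nums, int k) {
        checked = 0;
        int n = nums.length;
        for (long mask = 1; mask < (1L << n); mask++) {
            checked++;
            long sum = 0;
            for (int i = 0; i < n; i++) {
                if ((mask & (1L << i)) != 0) {
                    sum += nums[i];
                }
            }
            if (sum == k) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] nums = {3, 34, 4, 12, 5, 2};
        System.out.println(hasSubsetWithSum(nums, 9) + " after " + checked + " subsets");
        System.out.println(hasSubsetWithSum(nums, 30) + " after " + checked + " subsets");
    }
}
```

In the worst case (no subset matches) the loop visits 2^n - 1 subsets, so each extra input element doubles the work. Code with this shape is fine at 20 elements and hopeless at 60.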

When do I get my makeMyGameAsILike() extension?
Offline JasonB

Junior Member





« Reply #7 - Posted 2004-07-02 07:57:30 »

Tried that.  Tested with live data a number of times.  The problem isn't easily reproducible.  Plus we've done various optimisations in recent releases which have seen performance improvements without a commensurate increase in CPU usage.  Data isn't the issue, otherwise restarting the server wouldn't fix the problem -- and it does.  When I restart, CPU usage goes back to between 1-8% depending upon the load at the time.

This is why I'm drawing a blank.
Offline blahblahblahh

JGO Coder


Medals: 1


http://t-machine.org


« Reply #8 - Posted 2004-07-02 08:11:38 »

re: hacking, I've encountered numerous situations where the security is flawless, nothing is compromised, but the weird and wonderful side-effects of e.g. a "crack Apache" script being run against a non-HTTP server stress various very very strange protocol situations that weren't covered by unit tests.

e.g. one server had a race condition where if the connection was dropped at exactly the right time, with a 1-byte full server-side read buffer, that thread would stick in a while loop. Or another time, I had a server which assumed (reasonably enough) that 32kb was enough data for a simple text messaging protocol...until a hacker tried to send a multi-megabyte windows rootkit (to a linux server ;)). The security system worked fine, and nothing would have happened even on windows, but it uncovered a subtle bug in the "buffer overflow" code that our unit tests hadn't covered.
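
As an illustration of how a dropped connection can leave a thread spinning, here is a sketch of one common shape of that bug and a guarded version. It is not the actual code from either server described above; the class and method names are hypothetical, and it uses plain blocking java.io streams rather than NIO.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

class ReadLoop {
    // Buggy shape, shown only as a comment:
    //
    //   while (total < expected) {
    //       int n = in.read(buf, total, expected - total);
    //       if (n > 0) total += n;   // ignores n == -1
    //   }
    //
    // Once the peer disconnects, read() returns -1 immediately on every
    // call, total never grows, and the thread burns CPU forever.

    // Guarded version: treat end-of-stream as an error and leave the loop.
    static byte[] readFully(InputStream in, int expected) throws IOException {
        byte[] buf = new byte[expected];
        int total = 0;
        while (total < expected) {
            int n = in.read(buf, total, expected - total);
            if (n < 0) {
                throw new EOFException(
                    "stream closed after " + total + " of " + expected + " bytes");
            }
            total += n;
        }
        return buf;
    }

    public static void main(String[] args) throws IOException {
        // A 3-byte stream stands in for a connection dropped mid-message.
        InputStream in = new ByteArrayInputStream(new byte[] {1, 2, 3});
        try {
            readFully(in, 4);
        } catch (EOFException e) {
            System.out.println("gave up cleanly: " + e.getMessage());
        }
    }
}
```

A loop like the buggy one would also match the reported symptoms: nothing in the logs, the server still answering, and the CPU pinned until restart.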

Anyway, what *I* would do first of all is grab ethereal and start doing packet sniffs on the incoming data. That should give you enough to do semi-perfect replays of the condition. There's a lot of things that won't show up in logfiles (trivial example, not particularly relevant for java: weird packet fragmentation) but will show up in packet sniffs.

Beyond that, if this server is using NIO, email me at ceo @ grexengine.com and I'll try and help some more.

malloc will be first against the wall when the revolution comes...
Offline Middy

Junior Member




Java games rock!


« Reply #9 - Posted 2004-07-02 08:13:34 »

heh, well, it's hard to know with so little info.

What about bottlenecks?

Do you have a database running?

# of transactions
# of connections

whatever?

Perhaps it's something on the server. Transfer the application to another server or replace it with a backup.

When do I get my makeMyGameAsILike() extension?
kul_th_las
Guest
« Reply #10 - Posted 2004-07-02 19:40:41 »

Hardware problem?

Perhaps you have a piece of code that's failing, but rather than throwing an exception, or putting something in the log file, it's simply trying over and over and over, because of the way that code handles errors. Perhaps trying to write something to disk or RAM, and the chip or the drive is starting to fail? Maybe trying to read from an input device (network, hard drive, database)?
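
A sketch of the "trying over and over" failure mode and one way to bound it. This assumes nothing about the real application; the class name, method name and parameters are hypothetical. The point is that every failed attempt is logged, each retry waits longer than the last, and the loop eventually gives up instead of spinning silently.

```java
import java.util.concurrent.Callable;

class BoundedRetry {
    // Run op up to maxAttempts times, sleeping between failures and
    // doubling the delay each time. Rethrows the last failure when
    // every attempt has failed.
    static <T> T retry(Callable<T> op, int maxAttempts, long initialDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                // Make the failure visible instead of swallowing it.
                System.err.println("attempt " + attempt + " failed: " + e);
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2;
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulated flaky operation: fails twice, then succeeds.
        String result = retry(() -> {
            if (++calls[0] < 3) {
                throw new java.io.IOException("simulated I/O failure");
            }
            return "ok";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

Searching the code base for catch blocks that neither log nor sleep inside a loop is a cheap way to check whether this failure mode is even possible.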

Do you have only one server?

If you have multiple servers, is it only one particular server?
Offline JasonB

Junior Member





« Reply #11 - Posted 2004-07-02 20:44:34 »

here's a basic rundown of the application:

apache running on one server, connects via mod_jk to another 2 servers each running jetty + our own application server.  We have multiple connections (using jakarta dbcp) to a postgres database running on another server.  99% of traffic goes to jetty instance (#1).  Server #2 is reserved for infrequent events which have a significantly higher load than our typical daily 1 million+ hits.  We control access using rewrite rules, so if you ain't on our IP list, you ain't getting in (so to speak).  #1 is the one with the problem.

That's a good point regarding h/w failure.  Something I hadn't considered actually (and should have).  We'll have to replace another server shortly, so we can probably roll this problematic one out and use the new one and see if the problem still occurs.  Thx kul_th_las,

J