Should you risk using NIO for hard-core networking?
Offline blahblahblahh

« Posted 2003-05-12 12:30:56 »

I was recently promoting NIO to someone who thought it wasn't worth it, and got into a little difficulty. I left the conversation with a few doubts, mainly because of the following points he made:

 - Why isn't NIO properly documented yet? Lots of critically important parts of the API (e.g. how do you detect a disconnect at each and every stage?) are not covered in the API docs... you have to read FAQs and tutorials to find them out instead. (Note: this was true of the 1.4.0 release, but perhaps 1.4.1 or .2 has corrected this?)

 - Can we trust something that contained several show-stoppers in the first post-beta release? Is this actually being tested properly by Sun? Several of the bugs are basic problems that should have been uncovered by unit testing, which sets a worrying precedent that they *weren't* found. (I'm not intimately familiar with the bugs, but I remember spotting some stuff that was fixed in 1.4.1 that surprised me; IIRC there was one bug where the NIO API wasn't actually implemented properly on Windows - it used completion ports, and was constrained by several fundamental low limits inherited from that. Also various problems of bits of network NIO just not working at all on some platforms, IIRC?)

 - How should one use network NIO for high-performance servers? There seem to be few patterns and documents anywhere on the web (I found one eventually, although it's part of a research project at Stanford, and somewhat off the beaten track) describing how to use NIO in a typical tens-of-thousands-of-clients-per-server system - i.e. with at least 20 threads, and all the associated problems. The APIs do some pretty mean and nasty things when you start multi-threading (lots of obvious ways of using the API break because of e.g. places where threads are mutex'ed in surprising ways). I have in the past had problems with the fact that SelectableChannel.register() blocks if the Selector you're registering with is actually in use; this makes it impossible to use register() in the obvious way in an MT environment - instead you have to write rather atrocious logic (from an OO point of view) in your select() loop which, immediately after a select() unblocks, registers a queue of waiting requests (a sketch of that registration-queue workaround follows this list). This is a stupid API design, and makes every user jump through the same hoop in every app that uses MT'd NIO networking.

 - How do you have 40 threads sharing the load of answering requests, working off five Selectors, and using several native-IO-subsystem pre-allocated ByteBuffers? My suggested answer to this is that if you create a mem-mapped/direct buffer and then make views on it, one for each thread, you hopefully get the same effective performance as if you had just one thread and one mem-mapped buffer (no views); a sketch of the view mechanics follows this list. The API is not entirely clear on this, saying only that mem-mapped buffers "behave no differently to direct" BBs, and that direct BB views are also direct; it doesn't make explicit what the memory requirements are for making views on direct BBs (saying only that you, as the programmer, won't be able to measure it because they are outside the Java IO system). Who really knows what the overhead is per view?
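As a concrete illustration of that registration-queue workaround (a minimal sketch with invented names, assuming the 1.4 API - not anyone's production code): other threads only queue their requests, and the select loop itself is the only caller of register(), draining the queue right after select() returns; Selector.wakeup() stops a sleeping select() from delaying the registration.

import java.io.IOException;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

// Sketch: other threads call queueRegistration(); only the select loop ever
// touches register(), so register() can never block against an in-progress select().
public class SelectLoop implements Runnable {
    private final Selector selector;
    private final List pendingRegistrations = new LinkedList(); // raw types: 1.4-era code

    public SelectLoop(Selector selector) {
        this.selector = selector;
    }

    /** Called from any thread. */
    public void queueRegistration(SelectableChannel ch, int ops, Object attachment) {
        synchronized (pendingRegistrations) {
            pendingRegistrations.add(new Object[] { ch, new Integer(ops), attachment });
        }
        selector.wakeup(); // force select() to return so the queue gets drained
    }

    public void run() {
        try {
            while (true) {
                selector.select();

                // Drain pending registrations while no select() is in progress.
                synchronized (pendingRegistrations) {
                    for (Iterator it = pendingRegistrations.iterator(); it.hasNext();) {
                        Object[] req = (Object[]) it.next();
                        ((SelectableChannel) req[0]).register(selector,
                                ((Integer) req[1]).intValue(), req[2]);
                        it.remove();
                    }
                }

                for (Iterator it = selector.selectedKeys().iterator(); it.hasNext();) {
                    SelectionKey key = (SelectionKey) it.next();
                    it.remove();
                    // ... dispatch readable/writable keys here ...
                }
            }
        } catch (IOException e) {
            e.printStackTrace(); // real code would handle this properly
        }
    }
}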
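For the per-thread-view question, the plumbing itself is just position()/limit()/slice(); whether the "direct"-ness really carries the hoped-for performance through is exactly the open question above. The mechanics look roughly like this (assumed layout: one big direct buffer carved into fixed-size, non-overlapping per-thread slices):

import java.nio.ByteBuffer;

// Sketch: one direct buffer, one independent slice per worker thread.
// Each slice has its own position/limit but shares the underlying memory,
// so threads must stay inside their own (non-overlapping) regions.
public class SharedDirectBuffer {
    public static ByteBuffer[] carve(int threads, int bytesPerThread) {
        ByteBuffer master = ByteBuffer.allocateDirect(threads * bytesPerThread);
        ByteBuffer[] views = new ByteBuffer[threads];
        for (int i = 0; i < threads; i++) {
            master.position(i * bytesPerThread);
            master.limit((i + 1) * bytesPerThread);
            views[i] = master.slice(); // a slice of a direct buffer is itself direct
        }
        return views;
    }
}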

...these were his major points, and I could only suggest ways in which they MIGHT be solved; we couldn't find details in the API docs to answer all of these questions. Does anyone know the answers to the above? Do you think it's worth worrying about these problems? (They mostly seem to be issues of poor documentation, which you eventually learn the hard way anyway, but the possible lack of testing by Sun had me worried.)

Offline Herkules

« Reply #1 - Posted 2003-05-12 14:16:16 »

Hard to tell - the 'N' stands for 'new' anyway.

My stuff works with JRE 1.4.1_01, but not with 1.4.1, due to a subtle bug in ByteBuffer#slice().
If I had only ever developed with 1.4.1, I'm not sure whether I would ever have uncovered that fact or just given up.

After I figured out some things with my own experiments (e.g. what happens when a connection closes), I have to say NIO works like a charm for me. But I admit it wasn't easy to get there.

Examples/tutorials exist, but they all cover the same simple situation. I wish there were more.

Currently I'm not sure whether I'm doing everything the way it was meant to be done, so I still feel a bit insecure.

But is there a choice? From an architectural point of view there is no other way for many-connection apps that can't afford that many threads (be careful with slice() in an MT environment, BTW).

Whether you dare to suggest it depends on the business situation, I think. For a game, there might be enough room for a little risk and for checking things out?

Offline blahblahblahh

« Reply #2 - Posted 2003-05-12 19:10:43 »

Quote

Currently I'm not sure whether I'm doing everything the way it was meant to be done, so I still feel a bit insecure.


...I know exactly what you mean :) ...

Quote

But is there a choice? From an architectural point of view there is no other way for many-connection apps that can't afford that many threads (be careful with slice() in an MT environment, BTW).


Thanks for the warning about slice...what in particular goes wrong with it?

I agree that in many cases there may seem to be no other options; generally that leads to a choice between a variety of equally crap alternatives (for networked stuff, usually: use a different language, use Java with JNI (which turns into really ugly hackery in this situation), or use 3rd party software (IF you can trust it, which often you can't)).

Offline Herkules

« Reply #3 - Posted 2003-05-13 05:58:49 »

Quote

Thanks for the warning about slice...what in particular goes wrong with it?


Nothing in particular. A slice isn't a stand-alone structure, just a view onto a buffer. So if you overwrite the buffer in one thread while another is using the slice... it's the common MT problem.
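A tiny, hedged illustration of the sharing semantics involved (nothing specific to Herkules's code): a slice is a window onto the same bytes, so a write through the parent buffer is immediately visible through the slice - which is exactly what bites if two threads touch them concurrently.

import java.nio.ByteBuffer;

public class SliceSharingDemo {
    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(8);
        buf.position(4);
        ByteBuffer view = buf.slice();   // covers bytes 4..7 of buf

        buf.put(4, (byte) 42);           // "another thread" overwrites the buffer
        System.out.println(view.get(0)); // prints 42 - the slice sees the change
    }
}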

Offline GergisKhan

« Reply #4 - Posted 2003-05-13 13:37:17 »

Herkules,

Do you have any other examples, other than what Sun has provided for NIO? I'm doing conversions to 1.4 at the moment and I'd like to use it, but Sun's example is very small and doesn't really address what I need to understand.

Further, you mentioned in another thread that using XML isn't a good way to send game data over the wire. I agree. I'd like to know if you could describe the method you use a bit more clearly. For example, you said you're sending a ByteBuffer. What are its contents (if that's OK for you to describe)?

I'm looking for a simple example.  Right now, my networking sends "sentences" comprised of a Command ID, subject, target, and additional params, like:

05:8A8B:409F:48:45:4C:4C:4F

which translates to: private message from person 'A' to person 'B': "hello".
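(For concreteness, a hedged sketch of packing such a "sentence" into a ByteBuffer; the field widths - a 1-byte command, 2-byte subject and target IDs, then the payload bytes - are guessed from the hex above, not taken from gK's actual format.)

import java.nio.ByteBuffer;

// Hypothetical encoder for the layout above:
// [1-byte command][2-byte subject][2-byte target][payload bytes]
public class SentenceCodec {
    public static ByteBuffer encode(byte command, short subject, short target, byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(1 + 2 + 2 + payload.length);
        buf.put(command);
        buf.putShort(subject);
        buf.putShort(target);
        buf.put(payload);
        buf.flip();              // ready to hand to channel.write(buf)
        return buf;
    }

    public static void main(String[] args) {
        // 05:8A8B:409F:"hello" from the example above
        ByteBuffer msg = encode((byte) 0x05, (short) 0x8A8B, (short) 0x409F,
                                "hello".getBytes());
        System.out.println(msg.remaining() + " bytes ready to send");
    }
}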

Any thoughts?

gK

"Go.  Teach them not to mess with us."
          -- Cao Cao, Dynasty Warriors 3
Offline blahblahblahh

« Reply #5 - Posted 2003-05-13 16:01:04 »

Quote

Do you have any other examples other than what Sun has provided for NIO?


I'm doing two NIO servers at the moment for different projects; one has to handle several thousand players, the other has to handle 10,000. I'm working towards maximally efficient IO, but not worrying too much about it yet - so don't take my examples as being necessarily good :D. One thing to watch out for is that relative efficiency in NIO is VERY sensitive to the sizes of Buffers, and also to lots of circumstantial details - there is also a lot of variability because the underlying implementation of NIO on different platforms is very different, using whatever native features are available.

For a server that has to handle thousands of clients at once, I'm using a couple of thread pools, so that I can independently scale the number of threads at different points in the request-calculation-response pipeline. It's also a good way (at this stage) of discovering MT problems and non-thread-safe code before we get to later development stages :).

I have a single selector for the ServerSocketChannel, served by one thread; it's possible (I've done it before) to have multiple threads on this selector, but it's tricky (I can provide a detailed example if you need), and if you are ONLY doing accept()s on the Selector, it's probably unnecessary. There might be an easy way - if so, I haven't found it yet.

That thread takes accepted connections and registers them with one of several read-selectors. The first reason for multiple read-selectors is that you can have one thread per selector, and get effective MT on reads, without having to f*** about with MT on a single selector. The second reason I have several is to ensure that the response-time per client is close to the average response-time; if you have thousands of clients in one select, the first ones will get great response time, the last ones will get poor times (this is heavily dependent on what you do next, but multiple selectors here avoid problems further down the line, even if you make mistakes later on).
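A rough sketch of that accept-and-hand-off stage (my reading of it, not blah's code; the front end is simplified to a blocking accept() loop rather than the OP_ACCEPT selector described above, and SelectLoop is the hypothetical registration-queue class sketched earlier in the thread):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Sketch: one accept thread handing new connections to N read-selector
// loops in round-robin order, which spreads clients across selectors
// so per-client response times stay close to the average.
public class Acceptor implements Runnable {
    private final ServerSocketChannel server;
    private final SelectLoop[] readLoops;
    private int next = 0;

    public Acceptor(int port, SelectLoop[] readLoops) throws IOException {
        this.readLoops = readLoops;
        server = ServerSocketChannel.open();
        server.socket().bind(new InetSocketAddress(port));
        // blocking accept() is fine here: this thread does nothing else
    }

    public void run() {
        while (true) {
            try {
                SocketChannel client = server.accept();
                client.configureBlocking(false);
                readLoops[next].queueRegistration(client, SelectionKey.OP_READ, null);
                next = (next + 1) % readLoops.length;
            } catch (IOException e) {
                e.printStackTrace(); // real code: log and carry on
            }
        }
    }
}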

Each read-selector has a single thread (as noted above, I wanted MT on one selector, but it just proved far too difficult). It uses a helper class to decide whether the incoming data constitutes a complete request - if not, it just attaches the incomplete request to the SelectionKey and moves on. If it IS a complete request, it parses it into some form of RequestObject. Note: I've had to do servers in the past, using NIO, where the only way to tell if a request/message was complete was to parse it - this can happen with, for instance, an extensible message format. Life is MUCH easier if you choose a network protocol which has an explicit "END-OF-MESSAGE" marker, and that guarantees to escape that marker if it ever appears in the body of the message. If you don't have this luxury, I can suggest a slightly different approach which works well, based on what I did before.
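A hedged sketch of that read-and-frame step, assuming a simple newline-terminated protocol (the class, constants and dispatch hook are invented for illustration; real code would also loop in case several complete messages arrive in one read, and would guard against over-long messages):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.SocketChannel;

// Sketch: accumulate bytes per connection until the explicit end-of-message
// marker ('\n' here) appears, keeping the partial message attached to the key.
public class RequestAssembler {
    private static final int MAX_MESSAGE = 8 * 1024;

    /** Called from the read-selector thread when key.isReadable(). */
    public static void handleRead(SelectionKey key) throws IOException {
        SocketChannel ch = (SocketChannel) key.channel();
        ByteBuffer partial = (ByteBuffer) key.attachment();
        if (partial == null) {
            partial = ByteBuffer.allocate(MAX_MESSAGE);
            key.attach(partial);
        }

        if (ch.read(partial) == -1) { // orderly disconnect
            key.cancel();
            ch.close();
            return;
        }

        // Scan what we have so far for the end-of-message marker.
        for (int i = 0; i < partial.position(); i++) {
            if (partial.get(i) == '\n') {
                byte[] msg = new byte[i];
                System.arraycopy(partial.array(), 0, msg, 0, i); // copy out the message
                partial.limit(partial.position()).position(i + 1);
                partial.compact();  // keep any trailing bytes for the next message
                dispatch(key, msg); // parse into a RequestObject and queue it
                break;
            }
        }
    }

    private static void dispatch(SelectionKey key, byte[] msg) {
        // parse msg and addToIncomingQueue(...) - not shown
    }
}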

Parsed RequestObjects get passed to a "RequestServicer" object, via a method like "addToIncomingQueue(...)". From here on in, you can just use standard thread-pooling techniques to achieve decent concurrency on the number of reads. Theoretically, unless your parse(...) method is particularly expensive, you also make the initial response to any particular client close to the average response time (as mentioned above).

As the queue-processing threads act on messages and generate responses, they just attach the response back onto the SelectionKey and mark the Key as "interested in OP_WRITE"; the read-selector threads also take care of writing, if a key isWritable().
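A sketch of that write-back path under the same caveats (the Connection attachment class and the method names are invented; note the wakeup(), so the interest-ops change is noticed promptly by the selector thread):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.SocketChannel;

// Sketch of the response path. Assumes the key's attachment is a small
// per-connection state object that can hold one pending response.
public class ResponseWriter {

    /** Worker-thread side: park the response on the key and ask for OP_WRITE. */
    public static void queueResponse(SelectionKey key, ByteBuffer response) {
        Connection conn = (Connection) key.attachment();
        synchronized (conn) {
            conn.pendingResponse = response;
        }
        key.interestOps(key.interestOps() | SelectionKey.OP_WRITE);
        key.selector().wakeup(); // make sure the read-selector notices the change
    }

    /** Read-selector-thread side: called when key.isWritable(). */
    public static void handleWrite(SelectionKey key) throws IOException {
        Connection conn = (Connection) key.attachment();
        ByteBuffer response;
        synchronized (conn) {
            response = conn.pendingResponse;
        }
        if (response == null) return;

        ((SocketChannel) key.channel()).write(response); // may be a partial write
        if (!response.hasRemaining()) {
            synchronized (conn) {
                conn.pendingResponse = null;
            }
            // nothing left to send: stop asking for OP_WRITE
            key.interestOps(key.interestOps() & ~SelectionKey.OP_WRITE);
        }
    }

    /** Minimal per-connection state holder (an assumption, not blah's class). */
    public static class Connection {
        ByteBuffer pendingResponse;
    }
}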

Let me know if this helps (I'd also be pleased to hear from anyone who can spot a gross inefficiency in this approach! :)). This is the first time I've tried this particular take on things - previously, I had one set of selectors dedicated to reads, and a different one dedicated to writes.

Perhaps you need something with more details? For instance, I haven't mentioned any of the hoops you have to jump through with Buffers - but that's somewhat dependent on your higher-level strategy. On this project, I'm using a combination of static pre-allocated direct ByteBuffers - passing a read-only view on each (HOPING that the "direct"-ness of the BB percolates through properly) - and dynamically generated BBs, combined using a gathering write. I've logged an RFE with Sun on the grounds that views of a BB don't let you automatically lock/synchronize with the underlying position(), making various obvious implementations a PITA. I'm not sure what difference it makes if your gathering write combines only direct BBs, or a mix of direct and non-direct - I would EXPECT it to be slower in the latter case? But the API docs don't appear to be explicit on this matter (I may just have missed the key sentence, though I've re-read them several times in the last few weeks!).
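The gathering-write combination might look something like this (a minimal sketch with a made-up protocol header; it assumes a blocking channel - a non-blocking one would re-register for OP_WRITE instead of looping on a short write):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.GatheringByteChannel;

// Sketch: a shared, pre-built direct header handed out as a read-only view,
// plus a per-response body, written in one gathering call.
public class GatheringWriteExample {
    private static final ByteBuffer STATIC_HEADER = buildHeader(); // built once, never changed

    private static ByteBuffer buildHeader() {
        ByteBuffer b = ByteBuffer.allocateDirect(16);
        b.put("MYPROTO/1.0".getBytes()); // made-up header bytes
        b.flip();
        return b;
    }

    public static void writeResponse(GatheringByteChannel ch, ByteBuffer body)
            throws IOException {
        // asReadOnlyBuffer() gives this response its own position/limit, so
        // concurrent responses don't disturb each other's view of the header
        ByteBuffer header = STATIC_HEADER.asReadOnlyBuffer();
        ByteBuffer[] pieces = { header, body };
        while (body.hasRemaining()) {
            ch.write(pieces); // gathering write; loop handles short writes
        }
    }
}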

I even experimented with doing non-blocking reads and blocking writes - on the same Socket(Channel) (this is what one of the Sun examples did!). It works; I was very dubious that it would, and suspicious even after I'd tried it myself :). But I find it quite disturbing, and I'm steering clear of doing it again :).

Offline bt_dan

« Reply #6 - Posted 2003-05-13 16:30:53 »

I am new to this site, looking in particular for more insight on this subject. We are taking a very similar approach of having multiple selectors for reads, one per thread.

Have you done any stress testing to find an efficient number of connections to be handled per read selector?

What in particular about the ByteBuffers is so sensitive?

I have seen some rather long latencies in the testing that I have done and am in the process of determining if those latencies are on the underlying network connection or somewhere inside the NIO code.

I have noticed rather long garbage collection pauses in between the selector.select() calls when running with the -verbose:gc flag. These may be a result of how I am managing my ByteBuffers (??). My approach to handling incoming connections is rather simple. Each SelectionKey attachment object reads in the first 4 bytes, which is the size of the message, then allocates a ByteBuffer of that size and does non-blocking reads until the buffer is full, then flips, reads and sends it off to be processed, at which point ByteBuffer.clear() is called (which doesn't really free up the memory, just resets the indexes) and the process starts all over again.
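For reference, a minimal sketch of that length-prefixed read for one connection (names invented; a real version would validate the length, and pooling/reusing the body buffers instead of allocating one per message is one way to cut the allocation churn that can show up as long GC pauses):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

// Sketch: non-blocking, length-prefixed framing for a single connection.
// The 4-byte header buffer is reused; the body buffer is allocated per
// message, as described above.
public class LengthPrefixedReader {
    private final ByteBuffer header = ByteBuffer.allocate(4);
    private ByteBuffer body; // null until a complete length header has arrived

    /** Returns a complete message, or null if more bytes are still needed. */
    public ByteBuffer read(SocketChannel ch) throws IOException {
        if (body == null) {
            if (ch.read(header) == -1) throw new IOException("peer closed");
            if (header.hasRemaining()) return null; // length not complete yet
            header.flip();
            int length = header.getInt();           // real code: sanity-check this
            header.clear();
            body = ByteBuffer.allocate(length);
        }
        if (ch.read(body) == -1) throw new IOException("peer closed");
        if (body.hasRemaining()) return null;       // message not complete yet
        body.flip();
        ByteBuffer complete = body;
        body = null;                                // ready for the next header
        return complete;
    }
}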

Has anyone done any load testing with a simple NIO example? What kinds of results are seen?

Thanks for any insight.
Offline leknor

« Reply #7 - Posted 2003-05-14 03:05:24 »

Quote
I'm looking for a simple example.  Right now, my networking sends "sentences" comprised of a Command ID, subject, target, and additional params, like:

05:8A8B:409F:48:45:4C:4C:4F

which translates to: private message from person 'A' to person 'B': "hello".

Any thoughts?
I'm of the opinion that unless you are trying to trim bits off your protocol, you should include a 'length of this frame' field so that older clients can recover from an unknown packet from a newer client. For a little more info see section 2.3. (Blah suggests something similar in paragraph 5 of his big post above.)
Offline blahblahblahh

« Reply #8 - Posted 2003-05-14 05:08:27 »

Quote
(Blah suggests something similar in paragraph 5 of his big post above.)


:D Very good advice for protocol design. I'd also add that you should include a "conversation ID" and a "message ID" in each message (one is constant for all messages in a back-and-forth sequence but changes for each new sequence; the other is single-use).

They are IMMENSELY helpful when debugging. If message length is REALLY critical then you can take them out for your final version. However, the length of the message is often not the bottleneck, so you can often get away with leaving them in. If you have debug code that can be switched on in the final version, it can make great use of these fields when a user discovers a problem that involves the networking.
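As a purely illustrative sketch of such a header (the field widths and names are made up, not blah's format):

import java.nio.ByteBuffer;

// A debug-friendly message header: a conversation id shared by every message
// in one back-and-forth exchange, plus a one-shot message id, then the payload.
public class DebugHeader {
    public static ByteBuffer wrap(long conversationId, long messageId, byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(8 + 8 + payload.length);
        buf.putLong(conversationId); // constant for the whole sequence
        buf.putLong(messageId);      // unique per individual message
        buf.put(payload);
        buf.flip();
        return buf;
    }
}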

Offline Herkules

« Reply #9 - Posted 2003-05-14 07:53:54 »

My selection scheme only uses one selector and one thread. I didn't check it against a large number of connections. blahblahblahh's description seems very feasible for scalability.

For each connection I maintain one preallocated ByteBuffer to avoid garbage. Data is streamed into it and roughly parsed as long as there is at least one complete message in it. This can be determined by comparing a transferred length with the limit() of the buffer. When I reach the end of the buffer, I reset the buffer, copy the last unparsed message to its beginning and continue reading.
When I have a complete message in place, I slice() the buffer and pass the slice to a system of listeners that actually decode it.
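A rough sketch of that per-connection buffer scheme (my own reconstruction, with a guessed 4-byte length prefix rather than Herkules's Identity-based framing; note that each slice shares storage with the big buffer, which is exactly the slice()-in-an-MT-environment hazard mentioned earlier - the listener must be done with it, or copy it, before the next compact()):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

// Sketch: read into one preallocated buffer, slice off complete messages,
// then compact the unparsed tail back to the start before reading more.
public class PerConnectionBuffer {
    private final ByteBuffer buf = ByteBuffer.allocate(64 * 1024);

    public void readAndDispatch(SocketChannel ch, MessageListener listener)
            throws IOException {
        if (ch.read(buf) == -1) throw new IOException("peer closed");
        buf.flip();                                  // switch to draining mode

        while (buf.remaining() >= 4) {
            int length = buf.getInt(buf.position()); // peek at the length prefix
            if (buf.remaining() < 4 + length) break; // message not complete yet

            buf.position(buf.position() + 4);
            int savedLimit = buf.limit();
            buf.limit(buf.position() + length);
            listener.onMessage(buf.slice());         // a view of just this message
            buf.position(buf.limit());               // step past it
            buf.limit(savedLimit);
        }

        buf.compact(); // copy any unparsed tail to the front, ready for the next read
    }

    public interface MessageListener {
        void onMessage(ByteBuffer message);
    }
}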

I did stress tests with several clients trying to flood a server with chat messages and the server could handle it properly.

I don't work with integer IDs in any way. My protocol works with 'Identity' objects that can be serialized to/from a wrapped ByteBuffer (NamedIdentity, ByteIdentity, LongIdentity...). But this is more of a technical detail; basically it could be a simple ID.

Each message has to carry at least one Identity object at its beginning to denote the 'channel' I am sending on. From there on, the protocol is channel-specific. Since everything in my (high-level) system is identifiable, i.e. has a getIdentity() method, things can be uniquely addressed over the network easily.
This of course has nothing to do with NIO, but with the architecture of the networking system.


For the case where client and server reside in the same application (common for games), my network system allows messages to be passed directly between them, omitting the network stuff completely. This had some tricky synchronisation issues, but it works now...

Offline leknor

« Reply #10 - Posted 2003-05-14 11:46:54 »

Quote
blahblahblahh's description seems very feasible for scalability.
It has one weakness that I can see, which could manifest itself on multi-CPU servers.

Since blah has a Thread for a bunch of connections, and each of those threads works in a serial manner, it would be possible for one thread to get backed up while the others are empty, wasting spare CPU cycles. This could lead to some users experiencing slower response times than needed.

The commonly suggested pattern of one Thread working the Selector plus worker threads has one big negative which blah avoids, and that's why I chose not to mention it earlier. In the case where one Thread works the Selector and hands off buffers to worker Threads, you run the risk of a race condition: a worker starts processing one ByteBuffer for a client, more data comes in, and a different worker starts working on a new ByteBuffer for the same client. If the contents of the second buffer depend on information in the first, which is likely in a stream protocol, then things fall apart. The solution is for the worker Thread to synchronize on an Object unique to each client until that worker is at least done processing the buffer.

So which way is better? On a single-CPU server, blah's way will probably perform better because of the lack of synchronization blocks. But as you add CPUs you are more and more likely to waste CPU cycles when there is work to be done. Blah can add more threads to increase the number of subdivisions, but he'll get diminishing returns. I'm gonna guess he needs the number of CPUs times 2 to be reasonably sure all CPUs are mostly busy. With synchronization you can get away with a number of worker Threads equal to the number of CPUs.

Now the question is, when do context switches between Threads become more expensive than synchronization overhead? I don't know the answer. People have thought long and hard about how to optimize switching between threads. Synchronization has been optimized quite a bit too, and is generally a simpler task as I understand it. I think switching between threads is gonna blow out your CPU cache more often, but I'm not convinced you can optimize CPU cache use in Java because it's so high level. I definitely think blowing out the CPU cache is worse than synchronization overhead.

Well, now that I've followed the thought process all the way to here, I'm not so sure blah's method is really all that much better even for one-CPU servers, but I'm working on a lot of speculation. Anyone have hard data to prove me wrong, or a strong counter-argument?

EDIT: small changes to try to not claim one way is better than another on a superficial level.
Offline blahblahblahh

« Reply #11 - Posted 2003-05-14 18:35:44 »

Quote

It has one weakness that I can see, which could manifest itself on multi-CPU servers.


Chuckle. Sounds like you're in the same position I was in until fairly recently - too knowledgeable about how bad threading used to be on certain OSes. When hardware and OS people now say "don't use too many threads" they seem to mean "don't go above a few hundred".

FYI the design aims were:
- simple to understand, explain, and reuse
- great for single-CPU (assuming you tweak the size of each threadpool etc)
- even better for multi-CPU (assuming you tweak the size of each threadpool etc)

Quote

Since blah has a Thread for a bunch of connections and each of those threads works in a serial manner it would be possible for one thread to get backed up while the others are empty, wasting spare CPU cycles. This could lead to some users experiencing slower response times than needed.


That is a problem it was specifically designed to overcome; the only serial bottlenecks are at the connect and parse stages. I did specifically make the comment about it being OK unless your parse was slow...


Quote

The commonly suggested pattern of one Thread working the Selector plus worker threads has one big negative which blah avoids, and that's why I chose not to mention it earlier.


Thanks for pointing that out; I'm so used to avoiding complex sync issues that I didn't think to mention it, but it's a very important issue. My reasons for avoiding sync, in THIS case:

- IN THIS CASE it is not trivially obvious where and why the synchronisation is necessary (to a maintainer), introducing some serious risks of accidental corruption by a different coder (or myself, if I'm away from this project for long enough to forget!)

- synchronization usually has a considerable overhead, especially so on highly-parallelizable code. Not only do you prevent lots of useful compiler optimizations, you also cause significant problems if you have to transfer to a machine with more CPUs.

- thread switching on modern systems is exceptionally fast (see below)

Offline blahblahblahh

« Reply #12 - Posted 2003-05-14 18:47:55 »

Quote


So which way is better? ... With synchronization you can get away with a number of worker Threads equal to the number of CPUs.


Nah; you need to update yourself on the threading capabilities of modern systems. Note:
- modern consumer CPUs have excellent built-in support for thread-switching (IIRC IA64 is really good, and even IA32 is very good)
- modern consumer OSes have dispensed with all the low-end "don't use lots of threads or performance suffers" problems. Specialist OSes are currently coping with THOUSANDS of threads
  o Win2k and above have all been given decent threading (although I can't confirm/deny how well they cope with extremely large process sizes [multi-gigabyte memory per process])
  o Linux is now pretty good at MT (although I've not yet tested this myself)
  o Solaris is still the king of MT performance, apparently.

Assuming I'm handling the IO properly, this should work extremely well with any number of CPUs. It's intended for 1 CPU up to 32 CPUs - so I decided to go for user-tweakable numbers of threads etc.

Given the hardware support for multi-threading, you would be wasting performance (well, maybe not, but certainly wasting thousands of transistors ;) ) if you only had one thread per CPU. I would not recommend it (unless you're running Windows 95 ;) ).

Quote

Now the question is, when do context switches between Threads become more expensive than synchronization overhead? I don't know the answer.


In the nicest possible way: whilst you clearly know what you're talking about on the specifics, you're barking up the wrong tree with your analysis. Also, I don't think you really understand the specific issues of networking performance.

If analysing the performance of heavily loaded servers were anything like simple enough that you could focus on just two hardware issues (in this case thread context switches and CPU caches), then a lot of people would be very very happy and much less stressed. A lot of clever people would also be out of a job.

There's little to be gained in looking at the hardware for this, because in practice the performance is highly sensitive to everything, and the moment you eliminate a software bottleneck, it's replaced by a hardware one, or (more frequently) an OS one. And vice versa.

Threading and caches:

It's common nowadays (has been for some years) to have very many (e.g. > 100) registers in your CPU, and be able to dedicate registers to threads. Context switch cost is minimal - although there are many who wish that all 128 (or even 512 in some cases IIRC) registers were available to the application programmer!

Specifics of networking performance:

Assuming it's worth looking at the hardware aspect, your talk about CPU caches still surprises me. It's been a while since I looked at network hardware architectures, so forgive me if I'm unaware of today's common practice... when I knew about this stuff, networking IO that went via a CPU cache was considered slow.

I thought this was the MAIN POINT of using NIO Buffers in networking - you get the system to do low-level IO routines (especially copies), e.g. bypassing the CPU? It has been known for a long time that if you're trying to get data into memory, or from one type of memory to another (network cards are just a "special" type of memory, as are hard disks etc), then having a CPU in the way can only make things slower.

Unless I'm greatly mistaken, cpu-cache issues have no impact on the performance of the system I described. But I'd love to be corrected if I'm wrong... (for obvious reasons to do with improving the performance of my systems!)

Offline jbanes

« Reply #13 - Posted 2003-05-14 19:02:50 »

Geez, blah. Where do you get all the time to type this stuff? Anyway:

Quote
 - modern consumer CPUs have excellent built-in support for thread-switching (IIRC IA64 is really good, and even IA32 is very good)


Actually, this is true for just about every CPU except the IA32. If you read Intel's docs, the CPU has support for hardware context switching, but it's so poor that they tell you to do your own context switching in software. Nice design, huh?

Quote
- modern consumer OSes have dispensed with all the low-end "don't use lots of threads or performance suffers" problems. Specialist OSes are currently coping with THOUSANDS of threads


True enough. Most OSes still can't handle more than 500-1000 threads though. You could easily service this many using poll/selects spread across only a few threads.

Quote
  o Win2k and above have all been given decent threading (although I can't confirm/deny how well they cope with extremely large-process-sizes [multi-gigabyte memory per-process])


W2K threading isn't bad. Unfortunately, it's dragged down by the rest of the system which insists on paging memory whenever possible. I love going to a nice Unix machine and never even touching the swap file. At least W2K/XP don't swap when you minimize. Development under NT was *real* interesting.  :-/

Quote
  o Linux is now pretty good at MT (although I've not yet tested this myself)


I'll punt on this one. Don't want to anger the Linux crowd...

Quote
  o Solaris is still the king of MT performance, apparently.  


Right on the money. You can throw threads at Solaris like they're candy. The system just chugs along with no noticeable performance drop. I don't know who the guy was that came up with the Solaris thread design, but if I ever meet him, I'd like to buy him dinner. Now if Sun could just come up with consumer SPARC hardware, we could migrate the entire world to Solaris+KDE3! Well, at least we could get the set-top boxes (big market for SPARCs). :D

Offline leknor

« Reply #14 - Posted 2003-05-14 21:33:38 »

Quote
- even better for multi-CPU
As you increase the number of CPUs, the performance increase with your method will diminish at a faster rate than with the "common" NIO algorithm. How much so, and is it a significant amount? I do not know.

Quote
- synchronization usually has a considerable overhead, especially so on highly-parallelizable code. Not only do you prevent lots of useful compiler optimizations, you also cause significant problems if you have to transfer to a machine with more CPUs.
This article claims otherwise, and when you consider that the amount of work done in micro-benchmarks is insanely trivial compared to how much work you'd normally do peeling apart a frame, the cost of entering and exiting a synchronization block isn't considerable; it's barely measurable. Combine that with the added flexibility to possibly choose an algorithm that more than makes up the difference, and synchronization doesn't look so bad.

Maybe we can get Jeff to weigh in on this issue. :)

Quote
If analysing the performance of heavily loaded servers were anything like simple enough that you could focus on just two hardware issues (in this case thread context switches and CPU caches), then a lot of people would be very very happy and much less stressed. A lot of clever people would also be out of a job.
I don't claim that there are only two performance points. It's just two points where, as a programmer, you have the ability to influence the program's performance by adjusting its interaction with the CPU, with respect to the very small chunk of code we are talking about. I don't make blanket statements about everything in one fell swoop. Blanket statements can almost always be found false [1]. Please don't continue to think I'm talking about more than I actually am. What I did question, and this supports your argument, is whether it's even possible to influence CPU cache interaction at the high level Java operates at.

Quote
There's little to be gained in looking at the hardware for this, because in practice the performance is highly sensitive to everything, and the moment you eliminate a software bottleneck, it's replaced by a hardware one, or (more frequently) an OS one. And vice versa.
Bottlenecks are everywhere, no disagreement here. But when discussing a piece of code, hardware that the code doesn't interact with, or that the software cannot control, is beyond the scope of this discussion.


Quote
It's common nowadays (has been for some years) to have very many (e.g. > 100) registers in your CPU, and be able to dedicate registers to threads. Context switch cost is minimal
The actual switch is minimal. There is a side effect, though: when you switch context you are at a different point of execution, and the relevant set of memory for each thread will be different. Accessing a different set of memory makes a cache miss more likely, which is expensive.

Quote
when I knew about this stuff, networking IO that went via a CPU cache was considered slow.

I thought this was the MAIN POINT of using NIO Buffers in networking - you get the system to do low-level IO routines (especially copy's), e.g. bypassing the CPU?
The cost of the IO is the same for both your algorithm and the "classic" one; they both use NIO in basically the same way. That said, I hope NIO gets the data from the NIC to RAM without involving the CPU.

Either way, the CPU gets involved once you read or manipulate that data. I see little utility in reading data from a network into RAM and then discarding it. Unless your server just forwards data around like a switch, any interesting work will require passing that data through the CPU. Also, since you cannot do pointer math in Java, you have to copy data from the ByteBuffer into variables to do anything interesting.

Quote
Unless I'm greatly mistaken, cpu-cache issues have no impact on the performance of the system I described. But I'd love to be corrected if I'm wrong... (for obvious reasons to do with improving the performance of my systems!)
A cache miss for the CPU is usually an order of magnitude slower than a cache hit. Add that up many times and it's not trivial. Synchronization overhead tends not to be as many times as expensive. Did you ever have to study Big "O" notation for a programming class? If so, do the math and see where it takes you.


In the end your algorithm may be better than the classic NIO one. I don't really know without doing a lot of testing. You don't know either. I've just laid out my reasons for believing you may be wrong.

All things being equal, your algorithm is easier for programmers to not screw up with. That's probably worth much more than some tiny fraction of a percent performance difference.

[1]: I didn't mean that as a joke, but it's kinda funny when you think of it as a blanket statement.
Offline swpalmer

« Reply #15 - Posted 2003-05-15 02:49:23 »

Clue me in if I'm way out to lunch here, but don't Java's synchronization details require a cache flush in some cases? Particularly on multi-CPU systems?

This may be something that has long since been dealt with, but I had heard that the Java memory model was a bit too strict in this area and that cache flushing was needed too often to comply with it.  That being said I also heard that even Sun's VM doesn't comply with the memory model 100% because it is just too darn limiting in this area.

Anyone have up to date details about this stuff?

Offline leknor

« Reply #16 - Posted 2003-05-15 05:53:42 »

Quote
Clue me in if I'm way out to lunch here, but don't Java's synchronization details require a cache flush in some cases? Particularly on multi-CPU systems?
I may be mixing up my understanding of x86 cache with the IBM PowerPC cache, but I believe that the cache is managed in 4K chunks. So when you acquire a sync lock, even though it may only be updating a few bytes in RAM, a whole 4K will be pushed across the bus. Yeah, that is wasteful, but it's far from being the whole CPU cache.

Quote
I had heard that the Java memory model was a bit too strict in this area and that cache flushing was needed too often to comply with it.
My understanding of the problem is that in a non-synchronized code block the Java Language Specification (JLS) allows a reference to an Object to be established before the Object has been initialized. In a threaded environment it would be possible for a read of a String object to return "foo" (or whatever happened to be already in memory) one moment, and then, once the String is done initializing, read "bar" the next moment. This could be bad if the String represents a file name to be deleted and the first read was a security check.

Quote
Anyone have up to date details about this stuff?
I think this is the best source: http://www.cs.umd.edu/~pugh/java/memoryModel/
Offline blahblahblahh

« Reply #17 - Posted 2003-05-15 08:47:19 »

Quote

This article claims otherwise.

Maybe we can get Jeff to weigh in on this issue. :)


Thanks for the article; interesting that people still believe sync was 50 times (!) slower than non-sync - I only worry about it being 1.5-2 times slower, up to perhaps 3 times slower in worst-case scenarios. I was going by the theoretical side-effects of synchronization, simply because the actual performance of it seems to keep changing from release to release - and I can't keep up :).

I know of platform-dependent issues w.r.t. synchronization which muddy the waters even more. Depending on the OS (and the JVM vendor!) certain approaches to implementing sync can cause lots of accidental performance problems, because they interfere with the OS's behaviour and/or assumptions.

Basically, until Sun disseminates a lot more information on sync implementations - "Facts and Figures: no benchmarks" - I don't trust it in a highly-parallelized environment. The issues I mentioned originally are theoretical, and hence unavoidable. It could be that the latest HotSpot compilers use new compiler optimizations that give sync'd code practically the same performance as non-sync'd code - but I do know that you would need NEW optimizations; there are plenty of standard ones that get screwed by syncs.

Quote

I don't claim that there are only two performance points. It's just two points where, as a programmer, you have the ability to influence the program's performance by adjusting its interaction with the CPU, with respect to the very small chunk of code we are talking about. I don't make blanket statements about everything in one fell swoop.


Sorry if my response came across badly... I didn't fully explain. The key point here is to use the classic approach to optimization - look at each possible optimization, work out the saving on each, and then apply a weighting to each saving, where the weighting is proportional to how often that saving will actually be realised in practice. AFAIAA, the optimizations you were talking about are rarely the bottleneck in a networking server, and so the optimizations there become weighted to almost nothing.

However, it varies so much from system to system that they could well be the problem on some systems - hence you have to look at each system separately, work out the amount of bottlenecking at each point in the system and THEN start looking at optimizations by area. It's a little like optimizing an app, except that it's much much harder to profile, and much much harder to predict bottlenecks :(.

Quote

What I did question, and this supports your argument, is whether it's even possible to influence CPU cache interaction at the high level Java operates at.


:) I've been hoping that NIO might herald a new direction for Java - towards more voluntary use of low-level things. In particular, perhaps one day we might see cache-hinting? Although, realistically, I suspect that HotSpot etc. are taking the execution stage so far away from the raw bytecode (i.e. it's very very hard to predict what your compiled bytecode will look like by the time it gets executed) that it would be hard to achieve.

PS: Wonder if the "register" keyword will ever get a meaning? :) ...and all the other unused reserved words...

Quote

Bottlenecks are everywhere, no disagreement here. But when discussing a piece of code, hardware that the code doesn't interact with, or that the software cannot control, is beyond the scope of this discussion.


Yes, but... not if the hardware has a bottleneck so huge that it completely dwarfs any savings you may make. E.g. if there is hardware you cannot control that is soaking up 95% of the performance at some point, even if you make the other 5% 100 times faster, you only improve performance by approx 5%.

(just to be clear, I'm not claiming 95% is typical, it's just an extreme example. However, large percentages are certainly possible)

Quote

The actual switch is minimal. There is a side effect, though: when you switch context you are at a different point of execution, and the relevant set of memory for each thread will be different. Accessing a different set of memory makes a cache miss more likely, which is expensive.


True. AFAIAA this same problem is the main reason why Xeon chips have so much RAM - oops! I mean cache :). Because the bus is shared, the chips in an MP system have to pre-cache as much RAM as possible so that they can survive until their next chance at the CPU-RAM bus without getting blocked.

However, modern L2 caches generally are large enough to satisfy quite a lot of threads - this is actually a really good reason for having a large L2.

Quote

Either way, the CPU gets involved once you read or manipulate that data. I see little utility in reading data from a network into RAM and then discarding it.


Yeah, but when you're writing a server app you don't usually care about what happens once the data is in memory. Your network card is so incredibly slow that you want to do everything you can to avoid waiting for it. I'm in a situation where my worker-threads sometimes do a lot of work for a single request, so I do actually have to be careful about CPU performance, and attempt to increase parallelism...but the cost of a context switch (including associated knock-on-effects) is vanishingly small compared to what else I'm doing in the worker thread...

Offline blahblahblahh

« Reply #18 - Posted 2003-05-15 08:47:39 »

Quote

In the end your algorithm may be better than the classic NIO one. I don't really know without doing a lot of testing. You don't know either. I've just laid out my reasons for believing you may be wrong.


If I'd found a classic NIO algo, I would have used it. All I've seen so far are tutorials that quote the same tutorial code, and Sun's examples, which are crud, hardly using NIO at all. Well, they call the APIs. But they don't take real advantage of the new features. I find this VERY frustrating.

So, this algo is an attempt to bring some of the wisdom of high-performance Unix server designers to Java; I've had to modify classic approaches a little to fit the fact that e.g. Java's select() is not quite identical. I'm pleased at the chance to open it up for criticism on JGO because, as Herkules remarked, it's rather hard to be sure you're really using NIO well - there are few experts on the Java implementation around at the moment!

I'm writing a chapter for a game-development book on high-performance servers at the moment, and would like to provide an example using Java NIO - but only if I can get a really good one together (there are plenty of poor examples around already, I don't need to add to the pile ;)). For this reason, although my immediate concern is a good algo for the project I'm using it on, I also have a long-term interest in making it REALLY good. I'll also back-port any corrections to the other project where I'm using a similar approach.

Quote

All things being equal, your algorithm is easier for programmers to not screw up with.


It's very nice to hear that :). It's one of the critical requirements of this project :). And although the book will be read by lots of hard-core game programmers, it will also be read by relative newbies, so if I want to reuse any code, it'll have to be non-error-prone :).

Offline leknor

« Reply #19 - Posted 2003-05-15 19:36:51 »

Quote
I'm writing a chapter for a game-development book on high-performance servers at the moment
Best of luck with that.

Below is a half-rant that could be misinformed, but I believe it's more informed than most rants about synchronization. It's also a chance for you to poke holes in what I think would be optimal non-blocking NIO processing.

Synchronization is part of a method signature or is implemented with the monitorenter and monitorexit bytecodes. Anyway, what is not free is the transition across a synchronization boundary. Once you're inside or outside, there is no performance difference.

People sometimes assert that synchronization prevents the JVM from doing optimizations, and that is true when the optimization would have been across a synchronization boundary. For example, a JavaBean getter method is often just a return of a private field and is commonly inlined. Making a getter synchronized is relatively expensive because you cross a boundary, read one value, and cross back across that boundary; that boundary complicates optimization. If an optimization would happen entirely inside a synchronization boundary then the optimization is no different than any other. (This sorta conflicts with the advice of keeping synchronization blocks as small as possible. I think that advice stems from the fact that it is often very hard to determine how long something will block because of contention.)

What I consider a more intelligent use of synchronization with respect to NIO is to synchronize on the Object attached to a SelectionKey. In the thread that watches the Selector, have something like:
// This method is only called by the Thread that works the Selector.
public void processSelectionKey(SelectionKey key) {

  ConsumerThread pooledThread = workerPool.acquire(); // take an idle worker from the pool (pool not shown)
  Channel channel = key.channel();
  ChannelConsumer consumer = (ChannelConsumer) key.attachment();

  pooledThread.consume(consumer, channel);
}

Then hand off to the worker ConsumerThreads, which immediately synchronize on the consumer and start running it:
import java.nio.channels.Channel;

class ConsumerThread extends Thread {

 private ChannelConsumer consumer; // hand-off slot, guarded by this
 private Channel channel;          // hand-off slot, guarded by this
 private volatile boolean done;    // set to true to shut this worker down

 // Called by the Selector thread to hand work to this worker.
 public synchronized void consume(ChannelConsumer consumer, Channel channel) {
  try {
   // wait until any previous hand-off has been picked up
   if (this.consumer != null || this.channel != null) wait();
  } catch (InterruptedException e) {
   Thread.currentThread().interrupt();
   return;
  }

  this.consumer = consumer;
  this.channel = channel;

  notify(); // Wake this worker thread up
 }

 public void run() {
  Channel channel;          // A NIO Channel
  ChannelConsumer consumer; // Something specific to your project that reads from a Channel

  try {
   while (!done) {
    synchronized (this) {
     while (this.consumer == null || this.channel == null) wait();

     consumer = this.consumer;
     this.consumer = null;

     channel = this.channel;
     this.channel = null;

     notify(); // Wake the consume method up if needed

    } // synchronized (this)

    synchronized (consumer) {
     consumer.consume(channel);
    } // synchronized (consumer)

    // Ok, we're done, release this ConsumerThread back to the thread pool.
   } // while (!done)
  } catch (InterruptedException e) {
   // interruption is treated as a shutdown request
  }
 }
}


So what do we have? Three points of synchronization. The first synchronization point could block the Thread that works the Selector, but if you look at the second synchronization point you'll see that the window for contention is very small. The third point of synchronization is to prevent a second entry into the consumer at the same time. Once you get past the third synchronization point, the code flow is the same as blah's design.

So if my assertion holds up that having the number of threads closer to the number of CPUs is best for performance, then the increased work done will outweigh the cost of the synchronization points.

Also, I'm still convinced that blah's selector-per-thread model will create more latency processing the queue of ready Channels, because if it happens that a bunch of activity is all being worked by the same Thread, then other CPUs will waste cycles. The model above automatically balances the work load so it gets to each WorkerThread evenly.

Blah could detect when a worker thread gets backed up somehow and then unregister a Channel from one Selector and move it to a non-busy one, but I think that would be hard to do without creating a window for notification of new data to get lost.

Now, it is still possible for there to be a big block from contention on my third synchronization point. This is less likely with TCP because reads and writes tend to be coalesced, since it's a stream. In other words, the probability of two updates on the same connection arriving back to back is slim.

UDP is different: reads/writes are atomic, and you are more likely to have back-to-back reads on the same connection. My first guess is to recommend having the number of worker threads equal to the number of CPUs plus one. That guess is based on the advice to parallelize GNU make to the number of CPUs plus one when compiling large C/C++ programs, because it does better at keeping all CPUs busy while the newest compile process is waiting for disk I/O to complete.

EDIT: Made the code samples less ambiguous.
Offline bt_dan

« Reply #20 - Posted 2003-05-15 20:57:05 »

Which select method have you guys used? select() vs. select(long) vs. selectNow()? Have you noticed any implementation issues between them? From my testing I have found that select() often blocks too long, so I switched to doing selectNow() followed by a Thread.sleep() to rest a bit, and saw a significant improvement in connection throughput. Anybody else seen something similar?
Offline blahblahblahh

« Reply #21 - Posted 2003-05-15 21:21:57 »

Quote

Anyway what is not free is the transition across a synchronization boundary. Once you're inside or outside there is no performance difference.
...
People sometimes assert that synchronization prevents the JVM from doing optimizations and that is true when the optimization would have been across a synchronization boundary.
...
If an optimization would happen entirely inside a synchronization boundary then the optimization is no different than any other.


Ah. You are not quite correct in thinking that there is no performance alteration once you are inside a sync'd block.

Once a thread is inside a sync block, various things happen. One of these is that it now cannot be pre-empted by any other thread that syncs on the same key. In a multi-threaded environment, that can potentially be the kiss of death - one way of looking at it is that you are in fact moving from a Windows NT/2k/XP scheduler to a Windows 3.1 scheduler (for those threads only). For the threads that sync-conflict, you deliberately disable the advantages of intelligent scheduling, and force your current thread to hog the system until it completes its block. This is a really, really good reason to use syncing sparingly!

Thus, especially in a highly MT environment, there is a performance loss IN SOME WAY proportional to the length of time you spend inside sync blocks. The proportion is governed mainly by how often other threads WOULD have been scheduled to have CPU time instead, but other factors can come into play; if the sync implementation is good, I suspect that external factors should have little effect, but I'm afraid I really don't know - back to that point of not being able to keep up with the state of the art in current VMs!

The best thing to do to avoid this problem is have synchronisation on as many different keys as possible - thereby reducing the frequency with which you get contention to use any particular sync-block, which hamstrings the scheduler. You should also spend as little time as possible inside sync'd blocks.

Depending upon what you actually DO inside the sync'd block, you could also temporarily screw your whole OS performance - if you did some blocking IO, for instance, which can easily tie up the CPU for many thousands of operations, then you rapidly waste CPU cycles. This particular situation CAN affect all threads in the system. In a perfect world, non-Java threads would NOT be affected. However, if a JVM uses the underlying OS's sync primitives - or the hardware sync primitives - then it is feasible.

One of the things that recent JVMs have been doing to improve sync performance MIGHT be that they implement their own sync, instead of using the OS's, but still using the hardware primitives. Shrug. Maybe.

Some more info on scheduling of threads (pretty simplistic, and sadly not authoritative on the details - prefers to leave them out, probably wisely!):
http://www.javaworld.com/javaworld/jw-07-2002/jw-0703-java101.html?

Whilst trying to find some facts and articles, I found an interesting quote by one of the inventors of the monitor for concurrent programming (which Java's sync is similar to, but not identical to):

"It is astounding to me that Java's insecure parallelism is taken seriously by the programming language community a quarter of a century after the invention of monitors and Concurrent Pascal. It has no merit."

(found in Sun's "High performance Java" book, chapter 4).

Offline blahblahblahh

« Reply #22 - Posted 2003-05-15 21:33:51 »

Quote

UDP is different: reads/writes are atomic, and you are more likely to have back-to-back reads on the same connection.


When you say UDP does "atomic" reads/writes, what do you mean? Do you mean you cannot do non-blocking UDP with NIO? A while ago I asked Sun whether you could or not, and got no reply; the API docs don't say either way.

PS: I'm hoping to have spare time (but won't get any for a while :( ) to look at your sync-based code carefully, but I have great problems with reading Java code that uses wait() and notify() - they are appallingly named methods, from a code-reading perspective. I'll need plenty of time and coffee to make sure I've read it properly :)...

Offline leknor

« Reply #23 - Posted 2003-05-16 01:42:34 »

Quote
Ah. You are not quite correct in thinking that there is no performance alteration once you are inside a sync'd block.

Once a thread is inside a sync block, various things happen. One of these is that it now cannot be pre-empted by any other thread that syncs on the same key. In a multi-threaded environment, that can potentially be the kiss of death - one way of looking at it is that you are in fact moving from a Windows NT/2k/XP scheduler to a Windows 3.1 scheduler (for those threads only). For the threads that sync-conflict, you deliberately disable the advantages of intelligent scheduling, and force your current thread to hog the system until it completes its block. This is a really, really good reason to use syncing sparingly!
Yea, this is called contention. It's when you try to sync on an Object but the lock is already taken. The Thread that failed to acquire the lock goes to sleep until the lock has been freed, which is the desired behavior. Saying "[it] cannot be pre-empted by any other thread that syncs on the same key" makes it sound like a horrible thing. It's not. I'm pretty sure it's the same effect as any other language's locking mechanism.

In my code there are three possible place for contention but I think it is rather rare that it will happen.

1: If a ConsumerThread instance has been put back to work before it has made it from the end of the while (!done) loop and back to the while (...) wait() loop, you could have some contention. The Thread that works the Selector would have to be in the ConsumerThread.consume(ChannelConsumer, Channel) method, after the if (...) wait() line in that method. It's a very small window. So small that I've pondered trying to not synchronize it, but I know that's Not the Right Thing To Do™ and can lead to subtle bugs.

2: The opposite of the above. If the ConsumerThread is in between the while (...) wait() line and the end of the synchronized (this) block, the Selector worker thread could have to contend for the lock when it calls ConsumerThread.consume(ChannelConsumer, Channel). Again, a very small window.

3: There could be contention at synchronized (consumer) in the ConsumerThread.run() method if the time it takes to process consumer.consume(channel) is longer than it takes for the next set of data to come in on the same channel and get dispatched to a different ConsumerThread. While this could happen, it gets less likely the more clients you have connected to your server and the faster you return from the consumer.consume(channel) method.


Quote
When you say UDP does "atomic" read/writes, what do you mean?

I mean each UDP packet results in a unique read. It's what I say in the 3rd post in:
http://www.java-gaming.org/cgi-bin/JGOForums/YaBB.cgi?board=Networking;action=display;num=1045163896

This is different from TCP, which is a stream of bytes; it just happens that we tend to interact with it in byte[] chunks. Since TCP is a stream, NIO (and your operating system) is free to optimize the transfer by combining smaller chunks into larger ones to reduce the number of separate packets or separate reads needed to move the data around.

I don't know if the above happens in NIO; as far as I know it isn't specified and would be an implementation detail. One possible example: a bunch of channels have data ready to be read, and before your code gets around to reading the data for a channel, more data comes in. Does NIO append the new data to the end of the existing to-be-read data? I think it would be more efficient if it did, because then we wouldn't have to go through the whole selection process again to get data that is already local.
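For anyone who hasn't played with both, the difference shows up right at the read call. This is just a throwaway sketch, not code from my project - the port and host are placeholders:

Code:
import java.net.InetSocketAddress;
import java.net.SocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.nio.channels.SocketChannel;

public class ReadSemantics {
    public static void main(String[] args) throws Exception {
        ByteBuffer buf = ByteBuffer.allocate(1500);

        // UDP: each receive() returns at most one datagram - never a partial
        // one and never two merged together. If buf is too small, the rest of
        // that datagram is silently dropped.
        DatagramChannel udp = DatagramChannel.open();
        udp.socket().bind(new InetSocketAddress(9999));
        SocketAddress sender = udp.receive(buf);   // one packet == one read

        // TCP: read() just returns whatever bytes happen to be available,
        // which may be half a "message" or several messages glued together.
        // Framing them back into messages is the application's job.
        SocketChannel tcp = SocketChannel.open(new InetSocketAddress("example.com", 80));
        buf.clear();
        int n = tcp.read(buf);                     // some bytes, no packet boundaries
    }
}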

BTW: that Java NIO book has had the answer to all the NIO questions I've had so far. While the author's writing style isn't to my taste, I can't complain about the quality of the content.

Quote
Do you mean you cannot do non-blocking UDP with NIO?
I'm currently doing non-blocking DatagramChannel I/O in a pet project of mine.
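The skeleton is nothing special - the only important lines are configureBlocking(false) and registering for OP_READ. Something along these lines (simplified from what I actually have; the port number is a placeholder):

Code:
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.util.Iterator;

public class NonBlockingUdp {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();

        DatagramChannel channel = DatagramChannel.open();
        channel.socket().bind(new InetSocketAddress(9999)); // placeholder port
        channel.configureBlocking(false);                   // the important bit
        channel.register(selector, SelectionKey.OP_READ);

        ByteBuffer buf = ByteBuffer.allocate(1500);
        while (true) {
            selector.select();                              // block until a datagram arrives
            Iterator it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = (SelectionKey) it.next();
                it.remove();
                if (key.isReadable()) {
                    buf.clear();
                    DatagramChannel dc = (DatagramChannel) key.channel();
                    dc.receive(buf);                        // never blocks in non-blocking mode
                    buf.flip();
                    // ... handle the datagram ...
                }
            }
        }
    }
}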

Quote
PS I'm hoping to have spare time (but not get any for a while Sad ) to look at your sync-based code carefully, but I have great problems with reading java code that uses wait() and notify()
I'll admit I'm not the strongest programmer when it comes to using those methods, but I'm reasonably sure I'm using them right in this rather simple case. I am assuming that there are no external threads interacting with the mentioned code fragments. If that were not the case, I believe it would be prudent to use notifyAll() instead.


As if this wasn't long enough already, here is one more quote, from O'Reilly's Java Performance Tuning, 2nd Edition. In the Synchronization Overhead section the author says a lot of bad things about the performance characteristics of synchronization, BUT at the end of that section, on page 291, he does say:
Quote
The 1.4 server-mode test is the only VM that shows negligible overhead from synchronized methods. [...] On the other hand, I shouldn't underplay the fact that the latest 1.3 and 1.4 VMs do very well in minimizing the synchronization overhead (especially the 1.4 server mode), so much so that synchronization overhead should not be an issue for most applications.
(The emphasis is mine.)

I feel like a broken record but I'll try to sum it all up one more time. If I'm wrong you can come over to my house and beat me up.

Synchronization without contention is cheap - not free, but cheap enough that the flexibility to pick a better algorithm may lead to a net performance gain. Synchronization with contention is expensive, because it effectively serializes what would otherwise be parallel tasks. Understanding the probable contention characteristics is the only real way to make an educated choice.
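If you'd rather see the difference yourself than take the book's word for it, a crude test like the one below will do. It's very rough - warm-up, clock resolution and so on all skew the numbers - so treat the output as indicative only:

Code:
public class SyncCost {
    private static int counter;
    private static final Object LOCK = new Object();

    private static void bump() {
        synchronized (LOCK) {
            counter++;
        }
    }

    public static void main(String[] args) throws Exception {
        final int N = 10000000;

        // Uncontended: a single thread takes and releases the lock N times.
        long t0 = System.currentTimeMillis();
        for (int i = 0; i < N; i++) bump();
        System.out.println("uncontended: " + (System.currentTimeMillis() - t0) + " ms");

        // Contended: several threads fight over the same lock.
        counter = 0;
        Thread[] workers = new Thread[4];
        long t1 = System.currentTimeMillis();
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread() {
                public void run() {
                    for (int j = 0; j < N / 4; j++) bump();
                }
            };
            workers[i].start();
        }
        for (int i = 0; i < workers.length; i++) workers[i].join();
        System.out.println("contended:   " + (System.currentTimeMillis() - t1) + " ms");
        System.out.println("counter = " + counter); // keep the work from being optimised away
    }
}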
Offline blahblahblahh

JGO Coder


Medals: 1


http://t-machine.org


« Reply #24 - Posted 2003-05-17 21:26:29 »

(slight digression, but relevant...) Having finally had time to read the theses and papers regarding SEDA, I seem to have been heading towards the same solution as them (a Staged Event-Driven Architecture - although I'd prefer to include the word Message in there, if I were naming it, given its heritage).

I did actually do a design for precisely such a system, but haven't implemented it yet; my current work (as described in this thread) is a cut-down version. The big question is how many of the benefits of SEDA-like systems my cut-down version achieves Smiley. If the answer turns out to be none/few, I'll probably start using the SEDA libraries Smiley.

http://www.eecs.harvard.edu/~mdw/proj/seda/index.html

I suggest reading the "SOSP'01 paper on SEDA" (link on that page, near the bottom of the main text) to start off with...
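For anyone who doesn't want to wade through the paper first: the core idea is just stages connected by event queues, each stage with its own (ideally self-tuning) thread pool. A toy version - nothing like the real SEDA libraries, just the shape, with all names invented here - looks something like this:

Code:
import java.util.LinkedList;

// A toy "stage": a queue of events plus a worker thread that drains it and
// pushes results to the next stage. Real SEDA adds per-stage thread pools,
// admission control and load shedding; this only shows the shape.
class Stage extends Thread {
    private final LinkedList queue = new LinkedList(); // pre-1.5, so no java.util.concurrent
    private final Stage next;                          // null for the last stage
    private final String name;

    Stage(String name, Stage next) {
        this.name = name;
        this.next = next;
    }

    // Called by the previous stage (or the selector thread) to enqueue an event.
    public void enqueue(Object event) {
        synchronized (queue) {
            queue.addLast(event);
            queue.notify();
        }
    }

    public void run() {
        while (true) {
            Object event;
            synchronized (queue) {
                while (queue.isEmpty()) {
                    try { queue.wait(); } catch (InterruptedException e) { return; }
                }
                event = queue.removeFirst();
            }
            Object result = process(event);   // stage-specific work, outside any lock
            if (next != null) next.enqueue(result);
        }
    }

    protected Object process(Object event) {
        return name + " handled " + event;    // placeholder work
    }
}

Wiring it up is then just something like new Stage("parse", new Stage("reply", null)), with the selector thread enqueueing onto the first stage.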

malloc will be first against the wall when the revolution comes...
Offline blahblahblahh

JGO Coder


Medals: 1


http://t-machine.org


« Reply #25 - Posted 2003-05-19 18:15:44 »

Alright, I've used NIO in anger a few times for commercial projects, and each time I've ended up with a slightly different architecture.

So, this time round, I decided I'd start documenting each of the different architectures I've tried, so that next time I can just do a pick-n-mix job. Hopefully I can also build up a library of do's and don'ts for each architecture. I'm trying to cover different strategies for using Buffers, and also strategies for network-server pipelines, and any and all combinations of the two.

Is anyone interested in seeing/sharing this stuff? Some of the responses on this thread suggest to me yes. If so, I'll put up a webpage, and volunteer to collate, edit, re-format and maintain architectures from different people, with comparative pros, cons, feedback, etc.

I'm only suggesting this because I'm fed up waiting for Sun to produce good-quality network NIO docs, and can't really afford to wait any longer Smiley. If there's a site out there that already does this, PLEASE let me know!  Cheesy

malloc will be first against the wall when the revolution comes...
Offline blahblahblahh

JGO Coder


Medals: 1


http://t-machine.org


« Reply #26 - Posted 2003-05-19 18:47:44 »

Quote
Which select method have you guys used? select() vs. select(long) vs. selectNow()?  Have you noticed any implementation issues between them?  From my testing I have found that select() often blocks too long, so I switched to doing selectNow() followed by a Thread.sleep() to rest a bit, and saw a significant improvement in connection throughput.  Anybody else seen something similar?


Only just got around to looking at this...

I would be really really surprised if select blocks "too long"... I often use it with no problems at all (instant response), but that's because I'm fond of high performance, and select(long) is always doomed to be a low-performance method for the vast majority of apps (it forces you into a poll-based approach; select() forces you into an event-driven approach - c.f. the SEDA reference I posted for info on the advantages of event-driven; it's the ED in SEDA Grin).

What you describe, using selectNow() combined with a sleep, is obviously a poll Smiley *yuck*.

If I were to hazard a guess, I'd say you're having a problem I had in a much more extreme version - so extreme, it blocked my first NIO app indefinitely. This was bad, so I dug further, and noticed that some (note: not ALL! This is why you don't necessarily find it!) of the NIO methods have notes about how they block on synchronized method calls in seemingly unrelated classes (sometimes the notes are pretty vague). Several methods block on not just one but two (or even three, IIRC?) other methods.

I.e., if you call method a() and then call b() or c(), they will block until a() returns. Usually a() is in a different class, and b() has two versions, e.g. b() and b( long blah ). In some places, ONLY the documentation for the version with the most arguments actually mentions the blocking behaviour; the other versions tell you to read that version (IIRC), so at least you are pointed in the right direction...

Look carefully at ALL the javadocs for the following methods, and then look at your source (there's a sketch of one common workaround after the list):

- register
- select
- interestOps (yes, really! And there's some pretty shitty hand-waving going on in this one...)
- keys (no comment on the method - you HAVE to read the class comment. This is really bad documenting style Sad)
- selectedKeys (no comment on the method - you HAVE to read the class comment. This is really bad documenting style Sad)
- readyOps
- selector
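And here is the workaround I alluded to: never call register() from another thread while the selector might be blocked in select(). Instead, queue the request, wakeup() the selector, and do the registration in the select loop itself. A stripped-down sketch - all the names are mine, and real code needs proper error handling per channel:

Code:
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.util.Iterator;
import java.util.LinkedList;

public class SelectorLoop extends Thread {
    private final Selector selector;
    private final LinkedList pending = new LinkedList(); // guarded by itself

    public SelectorLoop() throws Exception {
        selector = Selector.open();
    }

    // Safe to call from any thread: queue the registration and wake the selector.
    public void requestRegister(SelectableChannel channel, int ops, Object attachment) {
        synchronized (pending) {
            pending.add(new Object[] { channel, new Integer(ops), attachment });
        }
        selector.wakeup(); // make select() return so the registration happens promptly
    }

    public void run() {
        try {
            while (true) {
                // Drain pending registrations while no select() is in progress.
                synchronized (pending) {
                    while (!pending.isEmpty()) {
                        Object[] req = (Object[]) pending.removeFirst();
                        SelectableChannel ch = (SelectableChannel) req[0];
                        ch.configureBlocking(false);
                        ch.register(selector, ((Integer) req[1]).intValue(), req[2]);
                    }
                }
                selector.select();
                Iterator it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = (SelectionKey) it.next();
                    it.remove();
                    // ... dispatch on key.readyOps() ...
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

It's ugly from an OO point of view, as I said earlier, but it sidesteps the register()/select() deadlock entirely.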

malloc will be first against the wall when the revolution comes...
Offline GergisKhan

Junior Devvie




"C8 H10 N4 O2"


« Reply #27 - Posted 2003-05-20 01:48:07 »

Yes.  I would be VERY interested in seeing this, blahblahblah.

Having not found the docs I need to be able to produce good networking code, I can't tell you how much this would assist me.

gK

"Go.  Teach them not to mess with us."
          -- Cao Cao, Dynasty Warriors 3
Offline blahblahblahh

JGO Coder


Medals: 1


http://t-machine.org


« Reply #28 - Posted 2003-05-20 16:13:47 »

Quote
Yes.  I would be VERY interested in seeing this, blahblahblah.

Having not found the docs I need to be able to produce good networking code, I can't tell you how much this would assist me.


OK, I've put a poor initial version up at:

http://grexengine.com/sections/people/adam

I haven't had time to collate all my notes - or even to document anything more than the most recent approach I've used - but I've checked it over and it doesn't say anything that is actually WRONG at the moment (AFAIAA).

This could be a significant improvement over many of the existing resources on writing network code in Java Smiley, many of which are "ill-informed", to be polite Wink.

Actually, I might also include a pruned version of the TCP/UDP debate I waded into last month, because a lot of people seem to be reading outright lies about TCP/UDP - and end up doing foolish things through no fault of their own. If I cut, pasted, reordered and reformatted (for clarity) the "best bits" of that thread, do you think it would be worth adding too? I have no personal need for a TCP/UDP reference (this stuff hasn't changed from the pre-Java days), but perhaps other people might appreciate it?

malloc will be first against the wall when the revolution comes...
Offline GergisKhan

Junior Devvie




"C8 H10 N4 O2"


« Reply #29 - Posted 2003-05-20 16:19:28 »

Just a quick glance through indicates to me that this is the kind of stuff we REALLY need to be able to post on here, as part of the "WEBSITE" and not the "FORUMS."

Yes, I think we need to somehow get together and start seriously thinking about putting up GOOD articles on how to program games, covering all aspects, including networking.

I'll respond with more info after I have read your article in detail, and I may fire off an email to Chris about what the scoop is on the website itself.

gK

"Go.  Teach them not to mess with us."
          -- Cao Cao, Dynasty Warriors 3