
Re: [MiNT] State of the Union




I've seen those studies too. They had me convinced enough to go 
implement epoll() for OpenLDAP. Too bad their results were not 
applicable. They only model the behavior of a web server, like Apache. 
They are not general enough to represent all the types of client/server 
interaction one encounters on the internet.

Not all APIs are designed to work how you want them to. 

Don't be stupid. It is possible to design an event mechanism that 
efficiently handles all classes of application load, as I have 
described. Not doing so, and forcing all applications to work only one 
way, is ridiculous.

And yet, so many people have accepted this massive limitation.  There are so many ridiculous and stupid software designers out there.  It's really too bad that everyone else sucks so badly, huh?

I agree that the class of problems any MiNT program is likely to see is 
probably different from the problem I was faced with. In that regard, 
I'm willing to let most of this discussion drop.

Good.  We aren't likely to ever agree on equeue().

> Also, under Linux the trip across the user/kernel barrier is nothing.  

Nonsense. I have the benchmarks that prove this statement to be false.

I have benchmarks that prove the moon is made of cheese too!  It's proven!  One of the things that sets Linux apart from most other kernels is that the OS call interface has been extensively optimized and is orders of magnitude faster than most commercial OSs.  This is about as accepted a statement as you can get when it comes to Linux.

> The problem, as I see it, is that if you notify the OS of what file 
> handles you want, you are indeed taking a slight performance hit for the 
> extra OS call, but you are only going to be doing this once per new file 
> handle.  The most common case where this is critical is in web servers, 
> so this is done when a new socket is opened, and another call is done 
> when its closed.  The OS knows exactly what files its watching the rest 
> of the time.  This information stays static between calls.

Again, it's not up to you to dictate the connection behavior of a system 
/ protocol.

Funny.  I read almost the same thing as I just said above, but in a paper detailing kqueue, and I had never read the paper before in my life!  Odd that I and apparently quite a few other people have such similar ideas, and yet we're all horribly wrong.

So far, that's exactly what kqueue/epoll do. Now the added feature in 
equeue:
   the event structure is directly accessible from user and kernel 
context, it is not copied by the kernel. The user can set a MASK bit in 

Interesting return value from epoll_wait():

       EFAULT  The memory area pointed to by events is not accessible
               with write permissions.

See, you have to have write permission to that area because the kernel writes into it.  The event mask is kept in the kernel and modified by API calls, not in the calls to epoll_wait(), while the events themselves are written directly to the user-supplied buffer.  No copies.

the event structure at any time. If an event happens on a descriptor, 
and the kernel checks the event pointer, and sees the MASK bit set, it 
ignores the event. Like I said, this is similar to the sigprocmask concept.

Thus, you can change things in the shared memory region and they will 
take effect directly without requiring a system call.

I have no idea why you would want to temporarily ignore incoming events, but that is about the only thing you gain in efficiency.  Creating or removing events is only going to be possible through an OS call.  Secondly, the idea of changing some table that the kernel might look at, for any reason whatsoever, sounds iffy at best because I can smell a race condition.  It's not worth it.
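For what it's worth, the split between the registration call and the wait call looks roughly like this in practice - a minimal sketch assuming Linux epoll, with error handling left out and 'listen_fd' standing in for whatever descriptor you care about:

#include <sys/epoll.h>

/* Minimal sketch: the interest list lives in the kernel and is changed
 * with epoll_ctl(); epoll_wait() writes the ready events straight into
 * the array the caller hands it.  'listen_fd' is just an example
 * descriptor assumed to be open already. */
static void event_loop(int listen_fd)
{
    int epfd = epoll_create(64);
    struct epoll_event ev;

    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);   /* once per new fd */

    struct epoll_event ready[64];                     /* user-supplied buffer */
    for (;;) {
        int n = epoll_wait(epfd, ready, 64, -1);      /* once per iteration */
        for (int i = 0; i < n; i++) {
            /* ... service ready[i].data.fd ... */
        }
    }
}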

Once again, you're assuming that there's only one way to do things, and 
that your way is the right way for all purposes. Anybody who thinks that 
is pretty much always wrong.

Not what I said.  The interfaces were designed to be used in a certain way, and if you attempt to use them differently, they won't scale well and you will have the kind of performance issues you describe.  I'm pretty sure the problems you're having are best solved some way other than blaming the OS design.

> I don't want to check if something happened on file descriptor 
> #5.  I NEVER have to iterate through the returned structures looking for 
> descriptor #5.  This is where your logic breaks and why the existing 
> stuff doesn't work for you.  Userland isn't doing any linear searches!

Maybe you don't use prioritized events, but other applications do.

Or there are better ways to handle priorities than ever manually checking whether a particular file descriptor has changed.

Again, that introduces more unnecessary overhead. Your approach requires 
a complete roundtrip through the event list to find the priority events 
and place them on the queue, then a second roundtrip through the list to 
handle the normal events. Such an approach will never scale. This is 
exactly what I meant about pushing the linear search overhead into userland.

Roundtrip?  You talk about checking specific file handles, and it would seem from your description that you will check every high-priority file handle one after another, and then you turn around and tell me that when a small handful of events comes in, copying a single pointer to defer execution of that one event is a huge linear-search overhead?  Either we really don't understand each other, or you wrote some really crazy code and are trying to justify the algorithm.

You mention how horribly this method scales with the existing kqueue and epoll mechanisms, so I'm wondering: which operating systems does your algorithm actually scale well on?

No. I have an array of event structures, each element in the array 
corresponds to a descriptor of interest. My app happens to know that 
descriptor 5 is "special" and it happens to know that descriptor #5 is 
slot #3 in the array of structures. When the event call returns, I can 
immediately check slot #3 and test the flags to see whether an event was 
triggered. Once that's handled, I go through the list of offsets 
handling each remaining event in turn.

Why is #5 special?  How many special descriptors do you have?  You check descriptor 5, slot 3, every time you go through your event loop instead of letting the OS tell you whether there was an event?  Your indirection table is, it would seem, just a way of letting you prioritize the order in which events are handled, but what you describe is the first step in a linear search algorithm.

Set the limits of the algorithm as being unbounded.  Are you checking slot #4 after #3, and then slot #5, and then #6 and then #7?  This algorithm you describe of "just checking slot #3" is what you accuse my method of doing, except that my way of handling priorities (should I feel the need for them) doesn't kill the scalability that the OS provides, and can scale reasonably to large numbers of priorities.  I think this is just about a textbook approach, especially when I mentioned letting a second thread do the work: one thread grabs events and fills priority queues for the worker thread, with the threads synchronized via semaphore-locked queues whose memory is shared between threads.  This even allows you to use a ring buffer or something and silently drop low-priority events if you want.
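To make that concrete, here's a rough sketch of the two-thread scheme - the names, the queue size, and the single queue are made up for illustration; a real version would have one such queue per priority level, and sem_init(&q.items, 0, 0) is assumed to be called at startup:

#include <pthread.h>
#include <semaphore.h>
#include <sys/epoll.h>

#define QSIZE 1024

/* One shared, lock-protected ring buffer of ready descriptors. */
struct fd_queue {
    int             fds[QSIZE];
    int             head, tail;
    pthread_mutex_t lock;
    sem_t           items;
};
static struct fd_queue q = { .lock = PTHREAD_MUTEX_INITIALIZER };

static void queue_push(int fd)
{
    pthread_mutex_lock(&q.lock);
    int next = (q.tail + 1) % QSIZE;
    if (next == q.head) {                 /* full: silently drop the event */
        pthread_mutex_unlock(&q.lock);
        return;
    }
    q.fds[q.tail] = fd;
    q.tail = next;
    pthread_mutex_unlock(&q.lock);
    sem_post(&q.items);                   /* wake the worker */
}

static int queue_pop(void)
{
    sem_wait(&q.items);
    pthread_mutex_lock(&q.lock);
    int fd = q.fds[q.head];
    q.head = (q.head + 1) % QSIZE;
    pthread_mutex_unlock(&q.lock);
    return fd;
}

/* Thread 1: grabs events from the kernel and fills the queue. */
static void *dispatcher(void *arg)
{
    int epfd = *(int *)arg;
    struct epoll_event ev[64];
    for (;;) {
        int n = epoll_wait(epfd, ev, 64, -1);
        for (int i = 0; i < n; i++)
            queue_push(ev[i].data.fd);
    }
    return NULL;
}

/* Thread 2: drains the queue and does the actual work.  With one queue
 * per priority, it would simply drain the high-priority queue first. */
static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        int fd = queue_pop();
        /* ... service fd ... */
    }
    return NULL;
}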

I'm curious what would happen if you just removed all notion of priority and handled the events as they came in from the kernel.  I'm betting that the speed-up you get by simplifying this whole thing way down will actually give your higher-priority descriptors as much or more CPU time than trying to special-case them.  Simplify it and see how it performs.  You never know.

No, that would be stupid. You perform one system call when you want to 
wait on events. As part of that system call, you can also indicate 
whether the event table has changed. This is all explained in my equeue 
writeup.

Forgive me for not liking your equeue(), but I just don't.  Notify the kernel that a shared memory region has changed?  Now, either it re-reads the whole table, or you tell it exactly what changed.  The former is a linear search, and the latter is what we already have with kqueue and epoll.

It's OK to be opinionated, but better to have well-founded facts on 
which those opinions are based. It doesn't sound to me like you've ever 
written an application / server that handles thousands of clients and 
hundreds of thousands of operations per second. I have, many times.

Actually, I was doing a project that involved deep packet inspection and manipulation at gigabit speeds, operating from an IP-less bridge.  Not really a client/server app though, so you're right.

slapd is well known for its usefulness, but not for its stellar performance.  The performance of slapd is the one thing I see most often criticized in what is otherwise a really remarkable program.

Don't be like Sun and forget about signals. epoll() certainly is an 
improvement for some class of applications on Linux, but without a 
totally integrated kitchen-sink approach like kqueue/equeue you're going 
to go through all this churn and still have an unsolved event management 
problem.

Two sides to every coin.  There are some things that signals are great for, but signals are often somewhat limited as well.  One could very well argue that signals are SUPPOSED to be handled outside the regular flow of a program.  Many people want to dispatch signals from their event loops, and this is quite possible with epoll() or even select() - the signal handler sets a flag and returns, the system call is interrupted anyway, you see the error code, and you dispatch your code to handle what the signal told you.  It works, sure, but why not have a message queue or some file handle for signals if you are only going to look at them in your event loop anyway?  Signals are supposed to work outside the scope of program event loops.  They can interrupt anything and everything (just about).
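The flag-and-EINTR dance looks roughly like this - a sketch assuming Linux epoll and SIGTERM as the signal of interest, with everything non-essential stripped:

#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/epoll.h>

static volatile sig_atomic_t got_sigterm = 0;

/* The handler only sets a flag; nothing else here is async-signal-safe. */
static void on_sigterm(int sig)
{
    (void)sig;
    got_sigterm = 1;
}

static int run(int epfd)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigterm;        /* no SA_RESTART, so the blocked
                                          wait is interrupted with EINTR */
    sigaction(SIGTERM, &sa, NULL);

    struct epoll_event ev[64];
    for (;;) {
        int n = epoll_wait(epfd, ev, 64, -1);
        if (n < 0 && errno == EINTR) {
            if (got_sigterm)
                return 0;              /* the signal gets dispatched here,
                                          from the event loop */
            continue;
        }
        for (int i = 0; i < n; i++) {
            /* ... handle ev[i] ... */
        }
    }
}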

While I'm sure the idea of integrating signal handling with the event loop is an exciting prospect for some, others will say "what for?".  One just handles the same thing at a different point in the code, shifting slightly more or less into userspace with the associated trade-offs.  I'm not going to say one way is better or worse, but there isn't any real loss in not having signals integrated with epoll.

Now, having said all that, and being a big fan of Linux and not liking BSD nearly as much, I will say that I do like the idea of using kqueue for AES events in MiNT, although I'd like to see some more discussion of the details of the implementation plan.  I don't really see your equeue() as a good match for the needs of MiNT or the needs of the AES.