Assumed Audience: Anyone interested in OS design.
Epistemic Status: Confident, but without time.
Introduction
Nearly three months ago, I wrote some ideas for a better operating system. I complained about lack of time.
Well, I was talking about lack of time for programming, not thinking; I have insomnia, so I got plenty of thinking time.
So here’s another post.
There are a lot of misconceptions about microkernels, but I have now provided a link to dispel them.
That said, I want to talk about one that gets hammered: performance.
First of all, the first formally verified kernel has impressive performance.
Second, what if there was a way to eliminate the double kernel round trip?
Hardware Pipes
I have had an idea to do this for a long time, at least 4 years. I call it hardware pipes.
Unfortunately, the problem with my hacky implementation of “hardware” pipes is that there is no hardware support. In particular, signalling is expensive.
So what if there was hardware support for one userspace process signalling another? Could we implement “hardware” pipes with just that? I think we could, and it would make the resulting microkernel very fast.
As long as the receiver was already on a core.
But the problem is that there is no hardware support for cross-process userspace signalling.
Oh, wait; there is!
Using userspace interrupts, which Intel calls “user interprocessor interrupts” or UIPIs, I think it would be entirely possible to have applications and userspace servers communicate without the kernel being involved at all.
Of course, the receiver must currently be on a core, or the OS has to schedule it.
Likewise, the OS has to interfere if there is some sort of security boundary that must be maintained. Unfortunately, disk and network access do require security boundaries for permissions.
However, the OS could step in just to check permissions on open(2), and if permissions are good, then it could set up the hardware pipe between the app and server and get out of the way.
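To make that flow concrete, here is a minimal sketch as a toy Python model, not real kernel code. Everything in it is made up for illustration: `PermissionKernel`, `HardwarePipe`, `file_server_step`, and the shape of the ACL table are all assumptions. The only point is that the kernel is entered once, at open time, and never again.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class HardwarePipe:
    """Hypothetical stand-in for a buffer shared by an app and a server."""
    requests: deque = field(default_factory=deque)
    replies: deque = field(default_factory=deque)


class PermissionKernel:
    """Toy kernel whose only job is the permission check at open time."""

    def __init__(self, acl):
        self.acl = acl      # maps (user, path) -> set of allowed modes
        self.entries = 0    # how many times the kernel was entered

    def open(self, user, path, mode):
        self.entries += 1
        if mode not in self.acl.get((user, path), set()):
            raise PermissionError(f"{user} may not open {path} for {mode!r}")
        # Permissions are good: hand back the pipe and get out of the way.
        return HardwarePipe()


def file_server_step(pipe, storage):
    """Userspace server: handle one request straight from the pipe."""
    op, payload = pipe.requests.popleft()
    if op == "write":
        storage.extend(payload)
        pipe.replies.append(len(payload))
    elif op == "read":
        pipe.replies.append(bytes(storage[:payload]))


kernel = PermissionKernel({("gavin", "/notes"): {"rw"}})
pipe = kernel.open("gavin", "/notes", "rw")
storage = bytearray()
pipe.requests.append(("write", b"hello"))
file_server_step(pipe, storage)
pipe.requests.append(("read", 5))
file_server_step(pipe, storage)
```

After the two requests, `kernel.entries` is still 1: the write and the read went between the app and the server only.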
“What if permissions change, Gavin?”
I think it would be best if the OS created a new file with the new permissions, copied over the current data, and atomically replaced the old with the new. Sure, the old one would still exist, and processes that have it open could write to it, but they are not writing to the one that had its permissions changed, which avoids the security problem.
This does mean that there needs to be some way for the process to know if the current file is still at the same path it started with, but that’s an existing problem with current operating systems, so it would need to be solved anyway.
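Both halves of this can be tried out on an existing POSIX system, which is only an analogy for the proposed OS. The sketch below copies a file, applies the new mode to the copy, and renames it over the original with `os.replace`. It then compares device and inode numbers to tell whether an open handle still refers to the file at its path. The helper names `replace_with_new_permissions` and `still_at_path` are hypothetical.

```python
import os
import shutil
import tempfile


def replace_with_new_permissions(path, mode):
    """Copy `path`, apply `mode` to the copy, then swap it in atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)  # same directory, same filesystem
    try:
        with os.fdopen(fd, "wb") as dst, open(path, "rb") as src:
            shutil.copyfileobj(src, dst)
        os.chmod(tmp, mode)
        os.replace(tmp, path)  # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp)  # the swap did not happen, so clean up the copy
        raise


def still_at_path(open_file, path):
    """True if the open handle refers to the file currently at `path`."""
    try:
        on_disk = os.stat(path)
    except FileNotFoundError:
        return False
    mine = os.fstat(open_file.fileno())
    return (mine.st_dev, mine.st_ino) == (on_disk.st_dev, on_disk.st_ino)
```

A process that opened the file before the swap keeps writing to the old, now unlinked, file, and `still_at_path` returns False for its handle.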
Anyway, this means that read(2) and write(2) could happen entirely without the kernel, and apps would only need the kernel for open(2) and probably close(2).
It would also mean that there is no copying of data between buffers because the server could read directly into, and write directly from, the hardware pipe buffers.
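As a rough picture of what such a buffer could look like, here is a single-producer, single-consumer ring in Python. It is a sketch under loud assumptions: `PipeRing` and its methods are invented names, the `notify` callback stands in for the userspace interrupt (Python cannot send one), and a real version would live in memory shared between two processes and use atomic loads and stores for `head` and `tail`. The producer fills a slot in place and the consumer reads the same bytes through a `memoryview`, so no intermediate buffer is involved.

```python
class PipeRing:
    """Toy single-producer/single-consumer ring over one shared buffer."""

    def __init__(self, slots, slot_size, notify):
        self.view = memoryview(bytearray(slots * slot_size))
        self.slots = slots
        self.slot_size = slot_size
        self.lengths = [0] * slots
        self.head = 0          # next slot to read; only the consumer advances it
        self.tail = 0          # next slot to fill; only the producer advances it
        self.notify = notify   # hypothetical stand-in for a userspace interrupt

    def reserve(self):
        """Producer: writable view of the next free slot, or None if full."""
        if self.tail - self.head == self.slots:
            return None
        start = (self.tail % self.slots) * self.slot_size
        return self.view[start:start + self.slot_size]

    def commit(self, length):
        """Producer: publish the reserved slot and signal the consumer."""
        if not 0 <= length <= self.slot_size:
            raise ValueError("length does not fit in a slot")
        self.lengths[self.tail % self.slots] = length
        self.tail += 1
        self.notify()

    def peek(self):
        """Consumer: view of the oldest published slot, or None if empty."""
        if self.head == self.tail:
            return None
        index = self.head % self.slots
        start = index * self.slot_size
        return self.view[start:start + self.lengths[index]]

    def release(self):
        """Consumer: hand the oldest slot back to the producer."""
        if self.head == self.tail:
            raise IndexError("ring is empty")
        self.head += 1
```

A write is then `slot = ring.reserve()`, fill `slot` in place, `ring.commit(n)`. The server calls `ring.peek()`, works on the view directly, and calls `ring.release()` when it is done.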
In other words, poor microkernel performance is entirely because hardware is optimized for monolithic kernels. Linus Torvalds was wrong! Especially because that may be changing.
Batching
But we can go even further. What if there were only two syscalls?
“Uh, what two syscalls, Gavin? Don’t you need at least open(2), close(2), read(2) and write(2)? Even Plan 9 has 30 or more.”
It sure looks like you need at least that many, but that’s false because you can put almost every syscall under one, which I will call syscall_uring.
“That sounds almost like io_uring!”
Yep! That’s on purpose.
Imagine if Linus designed Linux today; would he really just add syscalls like he did by default around the time I was born?
I hope not; I hope he has learned a bit about OS design since then. Especially since I’m not the first to think of this.
And if he has, I suspect that he’d just make io_uring the one single syscall interface. Because if he did, then every syscall could be batched. And apps could batch their syscalls and then just wait on results to return when ready.
Oh, wait; apps would need to wait too!
And that’s the second syscall: something like epoll(7) or kqueue(2) for waiting on file descriptors (including syscall_uring ones). This is because besides submitting work to do, apps need to be able to wait on results.
So those are the only two syscalls an OS needs: a way to signal the kernel when syscalls are added to a queue, and a way to wait on events.
There may be one or two more for setting up and tearing down a syscall_uring, but that’s just details.
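Here is a toy model of that two-syscall interface, in Python rather than anything a kernel would run. `SyscallUring`, `ToyKernel`, `enter`, and `wait` are hypothetical names, and the handler table replaces real syscall implementations. One simplification to keep in mind: this `enter` runs the whole batch synchronously, so `wait` never actually blocks here, while a real kernel would complete requests asynchronously.

```python
from collections import deque


class SyscallUring:
    """Toy submission/completion queue pair owned by the application."""

    def __init__(self):
        self.submissions = deque()
        self.completions = deque()
        self.next_id = 0

    def push(self, op, *args):
        """Pure userspace: queue a request without entering the kernel."""
        request_id = self.next_id
        self.next_id += 1
        self.submissions.append((request_id, op, args))
        return request_id


class ToyKernel:
    def __init__(self, handlers):
        self.handlers = handlers  # maps operation name -> callable
        self.entries = 0          # how many times the kernel was entered

    def enter(self, ring):
        """Syscall 1: tell the kernel that the queue has work in it."""
        self.entries += 1
        while ring.submissions:
            request_id, op, args = ring.submissions.popleft()
            try:
                result = self.handlers[op](*args)
            except Exception as exc:  # report failures as results
                result = exc
            ring.completions.append((request_id, result))

    def wait(self, ring):
        """Syscall 2: collect whatever has completed so far."""
        self.entries += 1
        done = list(ring.completions)
        ring.completions.clear()
        return done
```

Pushing three requests and then calling `enter` and `wait` once each enters the kernel twice, not three times, and that ratio gets better as the batch grows.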
“But Gavin! I like my blocking syscalls!”
So do I, but that blocking could be implemented entirely in userspace, like so:
- The libc creates a syscall_uring for each thread.
- A thread that wants to block on a syscall submits that syscall to the thread syscall_uring.
- Then that thread would wait (epoll(7)/kqueue(2)) on the thread syscall_uring for the result.
- Then the thread would return the result.
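The steps above can be sketched with Python threads. This is a simulation, not a libc: `ThreadRing`, `kernel_loop`, and `blocking_syscall` are hypothetical names, a background thread plays the kernel, and `queue.Queue.get` stands in for the epoll(7)/kqueue(2)-style wait.

```python
import queue
import threading


class ThreadRing:
    """Toy per-thread syscall_uring: submit to the kernel, wait for a result."""

    def __init__(self, kernel_inbox):
        self.kernel_inbox = kernel_inbox
        self.completions = queue.Queue()

    def submit(self, op, *args):
        self.kernel_inbox.put((self, op, args))

    def wait(self):
        return self.completions.get()  # blocks until the kernel replies


def kernel_loop(inbox, handlers):
    """Stand-in for the kernel: run each request, post the result back."""
    while True:
        item = inbox.get()
        if item is None:  # shutdown marker used by this sketch only
            return
        ring, op, args = item
        try:
            result = handlers[op](*args)
        except Exception as exc:
            result = exc
        ring.completions.put(result)


_local = threading.local()


def blocking_syscall(inbox, op, *args):
    """What a libc wrapper could do: submit, wait, return the result."""
    ring = getattr(_local, "ring", None)
    if ring is None:  # step 1: one ring per thread, created lazily
        ring = _local.ring = ThreadRing(inbox)
    ring.submit(op, *args)  # step 2: submit to this thread's ring
    result = ring.wait()    # step 3: wait on that ring
    if isinstance(result, Exception):
        raise result
    return result           # step 4: return the result
```

From the caller's side, `blocking_syscall(inbox, "add", 2, 3)` looks like an ordinary blocking call, even though everything underneath is asynchronous.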
This is why those two syscalls are needed: they enable building all others.
Remember when I mentioned power in the last OS post? There was a reason: power enables simpler, yet more flexible, designs. This is a great example because more power enables both asynchronous and blocking syscalls.
And it gets better!
Linux’s io_uring(7) has a feature that a new OS should adopt: submission queue polling.
In essence, you can have the kernel constantly poll io_urings so that your app doesn’t even have to make a syscall to signal the kernel. It can just submit to the queue and keep working while the kernel detects the submission and takes care of it without stopping your app.
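A small simulation shows the shape of this. A background thread plays the kernel's poller and the app does nothing but append to a queue. `PolledRing` and `sq_poller` are hypothetical names. This poller never goes idle, whereas Linux puts its polling thread to sleep after an idle period and the app then has to wake it, which the sketch skips.

```python
import threading
import time
from collections import deque


class PolledRing:
    """Toy ring whose submission queue is polled by a 'kernel' thread."""

    def __init__(self):
        # deque.append and deque.popleft are thread-safe in CPython.
        self.submissions = deque()
        self.completions = deque()


def sq_poller(ring, handlers, stop):
    """Stand-in for the kernel poller: pick up work with no signal from the app."""
    while not stop.is_set():
        try:
            request_id, op, args = ring.submissions.popleft()
        except IndexError:
            time.sleep(0)  # nothing queued: yield and look again
            continue
        ring.completions.append((request_id, handlers[op](*args)))


ring = PolledRing()
stop = threading.Event()
poller = threading.Thread(
    target=sq_poller, args=(ring, {"double": lambda x: 2 * x}, stop)
)
poller.start()

# The app just writes to the queue and carries on; there is no enter call.
for i in range(3):
    ring.submissions.append((i, "double", (i,)))

# Give the poller a bounded amount of time, then shut it down.
deadline = time.monotonic() + 5
while len(ring.completions) < 3 and time.monotonic() < deadline:
    time.sleep(0.001)
stop.set()
poller.join()
```

The trade-off is a core kept busy polling in exchange for submissions that cost the app nothing more than a memory write.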
That is performance, and it’s entirely possible in microkernels!
Conclusion
So yeah, my design for an OS would not be sacrificing performance; not one bit. It wouldn’t have to, despite being a microkernel.
Security and safety don’t always have to hurt performance.
And despite the personality cult around Linus Torvalds, microkernels are better.