Introduction

tl;dr: The way we do computing is fundamentally broken, but I have a vision for how to fix it, and this post contains that vision.

Several months ago, a coworker asked me a question that most programmers would have struggled to answer.

He asked, “Why do race conditions exist?”

That quote may not be exact, but the gist is correct.

I am not most programmers; I did not struggle to answer the question. In fact, I answered immediately, off the cuff.

Because hardware does not provide enough guarantees.

With about 10 seconds' thought, I revised my answer.

Because hardware and operating systems do not provide enough guarantees.

And with the benefit of time, I have since further revised my answer.

Because computing platforms do not provide enough guarantees.

Guarantees

But to understand what I mean, you first need to understand what I mean by the word “guarantees.” In this case, it is a plural noun, not a conjugation of a verb.

Merriam-Webster defines the word “guarantee” like this:

guarantee
an assurance for the fulfillment of a condition.

That definition will do.

So what I mean is this:

Because computing platforms do not give enough assurances for fulfillment of necessary conditions.

You can see why I worded it the way I did originally; it’s more confusing now.

However, it is more confusing because now we need to know which conditions must be fulfilled, or are not being fulfilled, and that differs for every platform.

Examples

It’s kind of hard to put a finger on what I mean without examples, so let’s see a couple.

TOCTOU Race Conditions

On Unix and derivatives, there is a class of race conditions called Time of Check to Time of Use (TOCTOU) race conditions. They can be tricky, but they are actually among the easiest race conditions to spot.

One example is using stat() on a file before opening it; the file could be deleted in the small window of time between the stat() call and the open() call. That can be solved, but it is an example of a missing guarantee. Here, the guarantee could be:

If you stat() a file then open it, the file will always exist.

Unfortunately, actually providing this guarantee is impossible. The OS does provide another guarantee though:

If a file is deleted while a process has it open, the file and its contents are not removed until the file is closed by all processes that have it open.

This guarantee is what provides the solution I linked to, and it is a good guarantee.

However, that raises the question: can we use this guarantee to make the TOCTOU race condition impossible? The answer is no…unless we change the requirements.

Right now, stat() can be called on an unopened file. What if we completely removed the stat() call and made every program open the file and then use its companion fstat()?

This works because fstat() needs an open file descriptor to work. This means that the file is already open and cannot disappear.
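
A minimal C sketch of the two patterns (assuming a POSIX system) makes the difference concrete:

#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Racy: the file can be deleted or replaced between stat() and open(). */
int size_racy(const char *path, off_t *size) {
	struct stat st;
	if (stat(path, &st) != 0) return -1;   /* check */
	int fd = open(path, O_RDONLY);         /* use: the race window is here */
	if (fd < 0) return -1;
	*size = st.st_size;                    /* may describe a file that no longer exists */
	close(fd);
	return 0;
}

/* Safer: open first, then fstat() the descriptor we actually hold. */
int size_safe(const char *path, off_t *size) {
	int fd = open(path, O_RDONLY);
	if (fd < 0) return -1;
	struct stat st;
	if (fstat(fd, &st) != 0) { close(fd); return -1; }
	*size = st.st_size;                    /* describes the open file, which cannot disappear */
	close(fd);
	return 0;
}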

Interrupted Syscalls

System calls, also known as syscalls, are meant to ask the OS for services, and most services are provided through them.

However, some services are provided without a program asking for them. One such service is signals: not only are they delivered regardless of whether a program asked for them, they can be delivered at any time.

This means that if a signal arrives, Linux will interrupt a program in order to deliver the signal.

The reality is far more complex than the above sentence, but the gist is that signals can arrive at any time.

What happens if a program is in the middle of a syscall when a signal arrives? According to Linux and other Unix derivatives, the answer is: it depends.

What does it depend on? The syscall, the current state of signal handling in the program, and many other things. Linux, for example, may not restart the system call, even though it is completely capable of doing so.

This is another example of a missing guarantee. If programs could rely on Linux to restart syscalls when necessary, they would not need extensive and brittle error checking to see whether a syscall was interrupted; but they cannot, so they do.
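
As a sketch of the brittle boilerplate this forces on programs, here is the retry loop (POSIX C) that ends up wrapped around nearly every blocking call:

#include <errno.h>
#include <unistd.h>

/* Without a restart guarantee, every blocking syscall needs this dance. */
ssize_t read_retry(int fd, void *buf, size_t count) {
	for (;;) {
		ssize_t n = read(fd, buf, count);
		if (n >= 0) return n;           /* success (or end of file) */
		if (errno != EINTR) return -1;  /* a real error */
		/* Interrupted by a signal before any data arrived: try again. */
	}
}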

Fixes Are Broken

But that’s just one part of the story. Platforms are broken, yes, but the way we try to fix the problem is itself a problem.

Cybersecurity Is Failing

Can you name the last data breach that happened? Equifax? Facebook? No, you can’t. They happen every day. I call this the Cybersecurity Crisis.

And that is all I need to say to prove that computing is broken, but I will go further.

Obviously, cybersecurity researchers and specialists, including some of my relatives, spend their working lives trying to fix the broken pieces of our computerized world. And might I say, their efforts are heroic in many cases.

But they are chasing the wrong thing. They are all focused on finding attack vectors before the bad guys do and then mitigating the effects of the attacks they find, rather than trying to eliminate the possibility of attack.

Mitigations

Let’s look at some of the mitigations.

Dynamic Linking

Dynamic Linking is a complex topic, but suffice it to say that it allows users to download a new version of a library and have the bug fixes in that version of the library apply immediately to all software that uses that library. It also allows the library to only have one copy on the system, not multiple copies all statically linked to every program that needs them.

It’s a good idea in principle, but in practice, it causes problems.

Why does it cause problems? There are many reasons, but the biggest is that the linker/loader is yet another piece of software that can have bugs. Worse, it is a piece of software that the OS invokes inside another program in order to start that program. If the linker gets something even slightly wrong, huge problems are sure to result.

Address Space Layout Randomization

ASLR is used to prevent attackers from being able to find and use specific pieces of code in an already-existing binary that will do what they want.

And it can be bypassed. In more than one way. In fact, the attacker doesn’t even need the binary!

Control Flow Integrity

CFI is used to make sure that attackers can’t redirect execution of a program.

And it can also be bypassed.

The Vicious Cycle

Basically, we have many mitigations, but they all have problems. This makes sense, since the definition of “mitigation” is:

The process or result of making something less severe, dangerous, painful, harsh, or damaging.

What this means is that the threat is not eliminated; it is merely alleviated.

But it doesn’t stop there.

If any of the software implementing the above mitigations is itself vulnerable, or depends on vulnerable software, it is as though the mitigations don’t exist. The best example is when the dynamic linker removes protections on constant data, which can then defeat any existing Control Flow Integrity mitigations.

On top of that, the more we look, the more attacks we find, which means we add yet more software to mitigate them. But the more software you have, the bigger your attack surface, since all software has bugs.

Thus begins a vicious cycle: attack vectors are found, mitigations introduced, bugs are found in the mitigations (or more attack vectors are found), more mitigations are introduced, and the situation just keeps getting worse.

As Bruce Schneier and Niels Ferguson said,

Security’s worst enemy is complexity.

Broken Chips

What’s even worse is that all of the above doesn’t even consider whether our chips are exploitable. As we have found out in the past couple years, they are. In so many ways.

In other words, speculative execution, the technique that chip makers have used to get more performance out of their chips since the lunch bill came due, has made our computer chips insecure.

My opinion is that speculative execution is like the software mitigations above: it is fundamentally broken because, almost by definition, it can leak information through side channels.

And that is bad news: if we keep it, our computers will never be secure, and if we drop it, we can take a performance hit of up to 40 percent.

How to Fix It

There are ways to fix this problem. They are difficult, but absolutely necessary.

Yes, I am revealing my ideas. I am doing this on purpose for several reasons.

First of all, “Ideas are cheap, execution is everything.” (Chris Sacca)

Second, my true ultimate goal is not to make money; it’s to eliminate the Cybersecurity Crisis.

And third, as an aspiring professional, I have a duty to do what is best for my users, not for myself.

Starting Over

The gist of the solution is this: start over and do it right this time.

Sounds simple enough, but as Chris Sacca said, “execution is everything.” We can’t just start over, we must also do it right.

And it is even more complicated, since we have to start over with hardware and software, not just one or the other.

So, what does doing it right look like?

I don’t know; no one does. But I do have a design for a computing system that, in my opinion, could be the answer.

With that said, let’s think about a design for a non-broken computing system.

Designing a Computing System

Before we can design a computing system, we have to figure out what a computing system is.

For this post, I will use the Wikipedia definitions of computing and system. This means that I will use this definition:

computing system
a computer system designed to manage, process, and communicate information.

In other words, we are designing a system to connect users with information.

Requirements

Beyond figuring out what we are designing, the first step in design is always to identify the requirements.

General-Purpose

The first question we have to answer is this: what kind of computing system are we designing?

For the purposes of this post, I am going to focus on designing a computing system that will fulfill the same purposes as the systems now served by Windows desktops, Linux and other Unix-derivative desktops, macOS, iOS, and Android.

In other words, this will be a general-purpose computing system; we will ignore custom-designed systems, especially embedded systems.

Smartphones do not count as embedded systems in this post because they are meant to be used as general-purpose computing platforms.

Performant

The next requirement is that the computing system must be performant.

For many years, we have been getting away with waste because users have had no other choice. But they will always want to process more data, render bigger scenes, or simulate more physics and weather.

We also have the problem of trying to unseat computing systems that are well-established and familiar.

This means that, in order to outcompete the established players, we can’t waste computing power unless it is to ensure security. And if we do not need to spend cycles to ensure a particular aspect of security, we shouldn’t.

Speaking of security…

Secure

The other requirement, which is obvious if we are trying to solve the Cybersecurity Crisis, is that the computing system must be secure.

How do we do this?

Well, we have to design for security, but that is not all we have to do.

Need for Formal Methods

In every computing platform, there is some subset of the hardware and software that, if compromised, will compromise the entire system. This subset is called the Trusted Computing Base (TCB).

Because any vulnerability in the TCB will compromise the entire system, special care must be taken when writing anything in the TCB. In fact, I would argue that everything in the TCB must be proved correct using formal methods.

For the rest of this post, if I say that something should be considered to be part of the TCB, that means it should be proved correct.

Need for Open Source

Beyond using formal methods on everything in the TCB, there is one more thing we must do: we must open source everything in the TCB.

Why? Because “given enough eyeballs, all bugs are shallow.”

What that quote means is that bugs cannot hide forever in open source software, and that is what we want.

But wait…if we prove the software correct, why do we need to open source the software to find bugs? Haven’t we already proven it’s free of bugs?

No, and there are a few reasons why:

  1. If we don’t open source the software, the proofs of correctness are useless for users; they need the code to check the proofs.
  2. Proofs of correctness only prove what they are meant to prove. If something important is left out of the proof, then there could still be bugs, because a proof cannot see those bugs; it can only detect bugs that violate what it is trying to prove.
  3. Having more eyeballs on the code will make it possible to spot important things missing from the proofs.

This is important enough that I will repeat it twice:

Proving software correct does not prove that it’s perfect!

Using formal methods and creating proofs for software only proves what the proofs were meant to prove, and the proofs may overlook different kinds of bugs.

In fact, when I was debating whether to use the terms “proof” and “prove” in this post, because most people might assume that a proof is infallible, an acquaintance of mine on IRC said,

In the end, “proof” is the appropriate term here. It’s just important to keep in mind that you’re not proving that <xy> is secure, but rather that “this formal model of <xy> isn’t vulnerable to these formal models of a few potential attacks we considered” with the obvious limitation that not all of those models might match reality, or that you forgot to account for some classes of possible attacks.

— Luis Ressel (edited for punctuation)

So in the rest of the post, when I say that something is in the TCB, I mean that it should be proved, and its source should be opened.

Software

So what would such a computing system look like? Let’s start with the software.

Requirements

What are the requirements for the software of a computing system?

Obviously, we have to limit what software we are talking about, since a computer can do much more than we care about in this post, so when I say “software” in this post, I mean the software required to use the computing system and to program the computing system.

I include the software needed to program the computing system because that is necessary for the computer to be used to its full potential; if there is no way to program it, then programmers cannot make it do useful things for users.

And since we only care about what software programmers need to make the system useful for users, our users are programmers. So the software we care about is the software used by programmers.

So what software do programmers need?

Going back to the definition of a computing system above: when using a computing system, what do programmers do? They manage, process, and communicate information.

What kind of information? Software.

So the software we care about is the software that helps programmers manage, process, and communicate software.

But we are not quite there yet. Software that programmers use has to help them manage something; what things do programmers want their software to help them manage?

What Programmers Manage

Programmers generally have two concerns: time and space.

What is space?

To a programmer, the following is a good definition of space:

space
any limited resource where instances of the resource can be used concurrently with other instances of the same resource.

That seems like a weird definition, but it’s intentionally broad. Before I explain why, I will define time:

time
any resource that cannot be used concurrently with other instances of the resource.

Okay, so those are weird definitions, but there is a reason for them. Inside a computer there are many different resources, but there are really only two important ways of dealing with them, and those two ways are hinted at in the definitions above.

Thus, when I say “space,” I am really referring to computational resources that need to be managed in one of those ways, and when I say “time,” I am referring to computational resources that need to be managed in the other.

The first way of managing resources, the “space” way, is to allocate a subset of the relevant resource to the entity that asked for it and wait for it to be returned; the allocation is said to take up space.

Examples of space-type resources include memory, I/O ports, and file handles.

The second way of managing resources, the “time” way, is to give the entity that asked for the resource exclusive access to the resource for some limited amount of time.

Examples of time-type resources include CPU cycles, GPU cycles, and using peripheral devices.

There are some resources that fall into both camps, but we won’t worry too much about them in this post.

So now we know what programmers care about managing when using a computing system; they care about space and time. And our job is to make a computing system that makes it easy to manage those two things.

But what exactly is it about space and time that programmers care about?

For space, we care about how much memory something takes. We also care about how many resources, like threads and files, we are using, and about copying data from one kind of space (disk, for example) to another (memory).

For time, we usually worry about the performance of our code, but that’s about all.

Now that we have a good idea of what programmers want and need from a computing system, we can start thinking about what software they need.

Programming Language

The first piece of software is a programming language and its distribution, including the compiler/interpreter, since that is the number one tool programmers have for managing software.

So let’s think about what is needed in a programming language.

Obviously, for security reasons, all of the languages in use on the computing system will have to be safe languages. However, in order to write system software, there needs to be a language that provides a way to break its own rules while keeping as many guarantees as possible.

How the language, Yao, will keep as many guarantees as possible even when using its own version of Rust’s unsafe is all dependent on the compiler backend that Yao will use.

Because much of the security of the system will depend on the base programming language keeping its guarantees, everything in the distribution for that language, including the compiler, runtime, and standard library, will be considered to be in the TCB, and thus, must be open source and proved correct.

In addition, the languages will have to be designed such that they make it easy to fall into the Pit of Success. How to do that is well outside the scope of this post, but as an example, this:

if (x = 5) do_stuff();

should be a compiler error, and so should many other similar mistakes.

Another thing about the base language is that it should be extensible, allowing the language to grow with the computing system and allowing it to be used for problems of the future, problems that we haven’t even imagined. In addition, this will allow it to grow into other languages which can then use it and its distribution as a base to work from.

Yet another requirement is that the language should be compilable, and the reason is performance. Interpreting a language can easily cost an order of magnitude in performance, and we need a performant system; we can’t afford that waste.

Real-Time Control

When it comes to managing space, any programming language worth its salt will allow the programmer to carefully manage how he uses the space and resources at his disposal.

Unfortunately, many modern programming languages are not worth their salt. I am looking at you, Java, Ruby, Python, JavaScript, and PHP.

Some of the tools a language can provide for managing space include the following (a small C sketch of the first few follows the list):

  1. Being able to specify the memory layout of data structures.
  2. Embedding data inside of data structures instead of using a pointer.
  3. Packing data.
  4. The ability to build intrusive data structures.

And many more.
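
As a rough C illustration of the first few items (layout control is compiler-specific; the packed attribute below is the GCC/Clang spelling):

#include <stdint.h>

/* Embedding: the point lives inside the node, not behind a pointer. */
struct point { int32_t x, y; };

struct node {
	struct point pos;    /* embedded: no extra allocation, no indirection */
	struct node *next;   /* intrusive link: the node carries its own list hook */
};

/* Packing: remove padding so the struct matches an on-disk or on-wire layout. */
struct __attribute__((packed)) wire_header {
	uint8_t  version;
	uint16_t length;
	uint32_t checksum;
};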

Of course, the languages we use could always use more tools to manage space, and they will have them if I have anything to say about it.

But languages nowadays have very poor tools for managing time. Right now, the only real tools we have to do so are profilers, and they only help us improve performance.

But, wait, isn’t performance the only thing we care about when it comes to time?

Absolutely not!

You see, I know there is one group of programmers that angrily face-desked when they read what I wrote above. They are the programmers that build embedded systems.

Embedded systems programmers often deal with something called real-time computing, which is a type of computing where the computer has to beat certain deadlines or bad things happen.

Imagine this: there is a computer in the airliner you are a passenger in, and the computer is running the autopilot. If the computer fails to hit a deadline to respond to a signal from a sensor that detects a microburst, the plane could crash, so the computer must hit that deadline.

And often, those deadlines are within milliseconds, far faster than human reaction times.

In fact, there is a programming language that does have tools for real-time computing: HAL/S. And it was the language used by the programmers who wrote the most bug-free program ever. I don’t think that is a coincidence.

So our language must have tools for real-time computing, probably based on the designs in HAL/S.

But wait! Didn’t I say that we are not designing a computing system for embedded systems?

Yes, I did. However, some of the pieces of our computing system can be designed to work everywhere, including embedded systems, without compromising on the design. Those pieces are the pieces that embedded systems need as well, by default, and programming languages are one of those pieces. Another one is the compiler backend because embedded systems still need compilers.

I am also not ignoring embedded systems in this design because programmers for embedded systems need better tools themselves, and the programmer’s toolkit often directly corresponds to their programming language. If a tool is in the language and its distribution, the programmer generally has it. And I want embedded programmers to have as many tools as they can use because I want to fly in planes that don’t crash.

And even if I were to ignore embedded systems completely, the type of computing system we are designing will need real-time capability as well. It’s necessary to run everything from audio (must output the audio signal to the devices on time or the music will break) to video (must output the next frame on time or the video looks terrible).

And even if I were to ignore that, there is still one thing I will not ignore: cryptography.

Most, if not all, cryptography implementations have to avoid leaking data about the secret key through a side channel, and the number one side channel is time. This means that crypto code has to use constant-time implementations whenever it touches a secret key in any way. If it doesn’t, information about the secret key can be recovered from the amount of time the algorithm took.

It is imperative that cryptographers have the tools they need to ensure that the relevant parts of their code will always run in constant time.
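
To make the point concrete, here is a minimal C sketch (not taken from any particular library) of the difference between a comparison that leaks timing and one that does not:

#include <stddef.h>
#include <stdint.h>

/* Leaky: returns at the first mismatch, so the running time reveals
   how many leading bytes of the secret matched. */
int tag_equal_leaky(const uint8_t *a, const uint8_t *b, size_t len) {
	for (size_t i = 0; i < len; i++)
		if (a[i] != b[i]) return 0;
	return 1;
}

/* Constant-time: always touches every byte; time depends only on len. */
int tag_equal_ct(const uint8_t *a, const uint8_t *b, size_t len) {
	uint8_t diff = 0;
	for (size_t i = 0; i < len; i++)
		diff |= a[i] ^ b[i];
	return diff == 0;
}

Even the constant-time version relies on the compiler not optimizing the loop back into an early exit, which is exactly why the language and its backend need first-class semantics for time.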

So at least one, if not all, of the programming languages should have the tools to manage time.

Signals

Another thing HAL/S had that other programming languages do not is the direct ability to manage signals.

I am sure that there are scoffers among my readers who are saying, “What do you mean? C and POSIX sure give you ways of handling signals.”

That is not the kind of signal I mean here. I mean anything from POSIX signals to interrupts, to any kind of asynchronous delivery of information. In this case, the word “signals” is a term referring to all of those things.

And no, there is no language out there, besides HAL/S, that gives solid guarantees of what happens when signals are sent or received.

Basically, the language should provide tools for handling how, and when, the program responds to receiving outside signals, as well as how to send such signals. It should also provide semantics for when signals can arrive, to prevent race conditions.
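
POSIX does offer crude versions of some of these tools, and a hedged C sketch shows the flavor of “deciding when signals may arrive”; the point is that the language should make such guarantees first-class rather than bolted on:

#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t got_signal = 0;

static void on_sigusr1(int sig) { (void)sig; got_signal = 1; }

int main(void) {
	struct sigaction sa = {0};
	sa.sa_handler = on_sigusr1;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_RESTART;             /* ask for syscall restart where supported */
	sigaction(SIGUSR1, &sa, NULL);

	/* Block the signal around a critical section so it cannot arrive mid-update. */
	sigset_t block, old;
	sigemptyset(&block);
	sigaddset(&block, SIGUSR1);
	sigprocmask(SIG_BLOCK, &block, &old);
	/* ... update state that must not be interrupted ... */
	sigprocmask(SIG_SETMASK, &old, NULL); /* the signal may be delivered here, not earlier */

	if (got_signal) puts("the signal arrived at a point we chose");
	return 0;
}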

Formal Methods

Because we will need to use Formal Methods to provide proofs of correctness for much of the software and hardware, we need some easy way to do it, preferably as part of the programming language and its distribution.

Therefore, the programming language will be built to make formal analysis easy with dependent types, and its distribution must include a proof checker, which will be considered to be in the TCB, and thus, needs to be open source and proved correct itself.

Eliminating Race Conditions

Except for data races, the language must eliminate race conditions by careful design of the APIs in its standard library. Data races will be taken care of by the compiler backend.

Standard Library

The standard library should include all of the utilities needed to write any software that only needs the basics. That doesn’t sound like much, but nowadays, the basics include everything from graphics to audio, from cryptography to data structures, from providing an API for OS services to providing a way to extend software with plugins.

And the standard library should be in the TCB, for reasons that will become clear later.

Runtime

The runtime of the language must also be in the TCB, for reasons that will become clear later.

Shell

One thing is abundantly clear about programming languages today: they fail at being shells. Shell languages don’t, but they fail at being languages.

The programming language we need has to have some way of making it easy to use it as a shell.

And so it will: using the $ character as a prefix, the language will then parse until the next semicolon and run the result as a shell command. So this code:

$ ls;

will list the current working directory.

We should be able to assign the result to a variable to get access to the stdout, stderr, and return code:

p := $ ls;

if p.retcode {
	print(p.stderr);
}
else {
	print(p.stdout);
}

Once we have that, all the language needs to be a shell is a REPL that automatically inserts $ characters for the user, which should be easy because current shells can already do that. And if there is a way for the user to take away the automatically inserted $ character, the user basically has a full and well-designed programming language as their shell.

Compiler Backend

I want to talk about the compiler backend because it is the crux of the entire design.

Like the JVM and LLVM, using a common compiler backend reduces the work for everyone, so I want such a compiler backend.

I am considering the JVM a compiler backend because it has a common form that compilers target, Java bytecode, and it translates that form to machine code by way of executing it. And that is what compiler backends do.

Could I just use one of those two?

Well, I can’t use the JVM because it’s too slow. And I cannot use LLVM because it does not provide good enough guarantees. Thus, I need to design my own.

As should be obvious, I have already built up a design for one, called Yvm. You can read more about it at that link.

Semantics of Time

What does it need?

Well, the first thing it needs is a way to manage time; if that cannot be expressed in the compiler backend, then implementing it in the base language will not guarantee that it will be easy to add to other languages.

Second, it needs enough guarantees to eliminate data races. How this is done is explained in the Yvm docs, but suffice it to say that the semantics of instructions have been defined to require automatic locking on all pieces of memory during use.

That is just the semantics of the instructions; if a compiler can prove that a piece of memory is not used outside one thread, something that the programming language should make easy, then it can avoid all of that locking.

A great side effect of the design is that it can also eliminate deadlocks.

Semantics of Syscalls

One thing that Yvm must provide is a standard way of specifying the semantics of syscalls.

Why? It is needed for applying Formal Methods.

Semantics of Signals

Yvm will also need careful design for the semantics of signals because it is also imperative that programmers be able to prove their code is correct even with the possibility of external signals.

Formal Methods

For all of my talk about using the language to apply Formal Methods, I left out one key detail: the proofs of correctness should apply to the code generated by the compiler, not the code written by the programmer.

So when proofs are applied to the code written by the programmer, the proofs will need to apply all of the way down the chain until machine code is reached.

The way this will be done is that the proofs, and the dependent types they use, will also be translated into Yvm, a translation done by code that is itself in the TCB. After that is done, the proofs and dependent types will use Yvm semantics to figure out what the code does and prove that it is correct. Machine code is then produced by optimization and code generation stages that have been proven to preserve the semantics of Yvm.

Thus, by proving that the code written by the programmer translates correctly into Yvm, then proving that the Yvm code is correct and that it is translated correctly into machine code, we can indirectly prove that the machine code is correct.

To do this, every bit of Yvm semantics must be carefully defined because any undefined behavior would mean that Yvm could not prove the behavior of any software that uses it.

This means that Yvm must carefully define the semantics of time, dependent types, syscalls, and signals, among other things.

And it is also the reason that data races and race conditions must be eliminated; Yvm and its prover could not work if there was anything that could behave non-deterministically.

In fact, if there is one thing to learn from this post, it is that this computing system, and all other computing systems, should remove the possibility of non-deterministic behavior.

Intermediate Representation

For reasons that will become clear later, the compiler backend must have a standard language to represent code as the backend sees it, just as LLVM has its own intermediate representation (IR).

This IR must have the ability to interoperate with (be linked to) C code because C is still the gold standard for interoperability.

The IR also must have a way to access all of the compiler backend’s features and semantics.

And for reasons that will become obvious later, the IR must have the ability to include platform-specific code for multiple platforms at once, and it should also have the ability to store optimization routines for the code.

Operating System

For obvious reasons, the OS of our computing system must be considered to be in the TCB.

But wait! Aren’t operating systems HUGE? How can we possibly prove an entire OS correct?

Operating systems do not have to be huge. There are operating systems that have comparatively tiny codebases; they are called microkernels.

Not only that, but there already is a microkernel that has been proven correct, and it is fast.

And it can be made even faster, but to do that, we need better hardware. Stay tuned until the performance subsection of the hardware section to learn how microkernels can become faster than monolithic kernels despite being slower historically.

And as a bonus: using a microkernel would also solve the Thirty Million Line Problem.

Purpose of the OS

Obviously, an operating system is supposed to manage the resources of a computer against the competing interests of all of the processes running on it, and do it in such a way that the user stays happy.

Okay, so if an OS has to manage resources, what are the resources?

That is easy: time and space. Those are the only two kinds of resources.

Here, input and output fall under the “space” category. There are two reasons for this:

  1. There are only so many I/O ports, which can be considered space.
  2. I/O ports can operate concurrently with each other.

But there are two more purposes for the OS: security and communication. The OS should make sure that no process can negatively affect another, and it also needs to give processes a way of communicating with the kernel and with each other.

With all of that, let’s start taking a look at what the OS must be able to do.

Virtual Memory

The first resource that an OS must manage is memory. This one is doubly critical since managing memory correctly is required for ensuring the security of processes.

Because of this, the OS should be the one directly managing the allocation and partitioning of memory. In short, it should be the only one allocating and granting permissions to sections of memory.

This also means that the OS should enforce process isolation.

Scheduling

The second resource the OS must manage is CPU time. It should be able to ensure that no program can claim more CPU time than it should, nor should any program be able to take away CPU time from any other process without the permission of the OS.

Message Passing

To enable processes to talk to each other, the OS should also take care of message passing.

This one is especially important in a microkernel because drivers are not part of the kernel; they are normal programs that the OS grants special permissions so they can communicate with hardware devices. And in order for the information from devices to make it to the programs that want it, the driver must be able to communicate with other programs.

Speaking of which…

Controlling Access

The OS should handle controlling access to devices. It should decide what drivers will be allowed to run and what devices they can communicate with.

Security

Besides making sure that processes don’t stomp on each other, the OS must be tasked with maintaining the security policies of the system.

This includes:

  1. Ensuring that users only have access to what they should have access to.
  2. Ensuring that programs only have access to what they should have access to.
  3. Not allowing users without credentials to use the system.

And many other things.

Hints from Plan 9

The design of the OS should be as simple as possible, and there is already an OS that showed the way: Plan 9 from Bell Labs.

Everything Is a File

In Plan 9, everything is a file. This makes the interface to everything very simple.

In our OS, it will work like this: drivers and other programs will expose their interfaces as certain device files, much like the /dev files in Linux, and when an app opens a file, the OS will open a message passing channel between the driver and the app.

For example, to get inputs, an app might open /dev/mouse for reading, and the driver running the mouse will simply write to that file, which will show up in the app, allowing it to process the inputs.
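
As a rough sketch of what that might look like from the app’s side, in C (the device path matches the example above, but the event layout here is hypothetical; it would really be defined by the mouse driver):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical event format; the real one would be part of the driver's interface. */
struct mouse_event { int dx, dy; unsigned buttons; };

int main(void) {
	int fd = open("/dev/mouse", O_RDONLY);   /* the OS opens a message channel to the driver */
	if (fd < 0) { perror("open"); return 1; }

	struct mouse_event ev;
	while (read(fd, &ev, sizeof ev) == sizeof ev)
		printf("moved (%d, %d), buttons %#x\n", ev.dx, ev.dy, ev.buttons);

	close(fd);
	return 0;
}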

Namespaces

Plan 9 also had something our OS should have: namespaces. This allows better isolation between users, their programs, and their environments.

This also enables better package management, something that, on Linux at least, can only be done with hacks and a large piece of software.

Capabilities

To control access to files and other resources on a per-process basis, the OS should require that processes give capabilities to drivers, and the drivers must use those capabilities to access hardware devices.

Since this will require hardware support, I will say more in the section about hardware security.

Package Manager

It is important that software be easy to install on our computing system, so there will be a package manager. In fact, the package manager will be one of the most complicated parts of the whole system.

Despite that, it must be in the TCB for the same reasons that the language standard library and runtime must be in the TCB.

And it is time for me to reveal what that reason is.

The reason is complex to explain, so I will start at the beginning.

Remember that dynamic linking does have some advantages:

  1. Updating a library automatically updates the library for every program that uses it.
  2. There is only one copy of the library on the system, not one for every program that uses the library.
  3. Only one copy of the library needs to be in memory for programs to use it.
  4. Dynamic linking enables the use of plugins.

These advantages can be significant. However, dynamic linking also has some disadvantages:

  1. It’s harder to distribute a program that depends on a particular dynamic library because you have to ensure that the library is installed on the system.
  2. A dynamically-linked executable can take much longer to start up.
  3. A dynamically-linked executable pays an extra cost on every function call.
  4. If only a few programs need a library, and they don’t need much of it, a dynamic library can actually waste space because the whole library must be present, whereas a statically-linked program can have the unused code removed.

If possible, it would be nice to come up with a solution that, for the most part, has the advantages of both and the disadvantages of neither. But most importantly, it should not allow any security holes.

And while we are at it, if we can get rid of the need for ASLR and Control Flow Integrity, so much the better.

In order to enable security, we should start with static linking.

Let’s first figure out how to get the first advantage of dynamic linking with our version of static linking, which I will call “distribution linking.”

There is a subtle hint as to what to do first in the name I chose: we should make sure that linking happens when a library, or a new version of a library, is distributed.

How can we do that?

Well, we have a common compiler backend. What if, when we build packages, we compile the software only to the point where it is in the intermediate representation, which, as we established, can include platform-specific code and routines to optimize the code?

Then we place that code, in IR form, in the package and distribute it in that form.

Then, when the package manager is installing the program, it also downloads the IR for any libraries it needs, resolving dependencies in the way that package managers usually do. Then once everything is downloaded, it runs the optimization routines on all of the code, generates the machine code, and links it all statically together.

Then, when a library is updated, the package manager knows all of the programs that depend on that library, so it can relink them with the updated copy of the library.

Cool. We are off to a good start. However, our packaging system does not have the rest of the advantages; let’s see if we can add advantages 2 and 3.

The biggest thing to notice is that 2 and 3 are advantages only when there is more than one program that needs the library, and sometimes only when there are more than two or three.

If the package manager knows that more than one program needs the library, perhaps it can mark the library in a special way that the OS will recognize, and generate code for the library such that it can be memory-mapped into a specific spot in the address space that is not already taken.

And then, it should mark the executables that need that library in a special way, so they will be recognized by the OS. Then it uses the addresses that the library will be memory-mapped to in order to resolve the missing addresses in the executable, thus “linking” it statically, while still allowing only one copy to exist.

Then, when an executable is run that needs a marked library, the library will be memory-mapped by the OS into that spot before the executable is run.

In this way, there is only one copy of the library on disk and in memory. And it is still statically-linked…in a way.
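
On today’s POSIX systems, the closest analogue to this scheme is mapping a prelinked library image at an address agreed on in advance. Here is a rough user-space sketch in C; the path and the address are made up, and in the proposed design the OS itself would perform this mapping before the executable starts:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Address agreed on at package-install time; purely illustrative. */
#define STDLIB_BASE ((void *)0x7f0000000000)
#define STDLIB_SIZE (16 * 1024 * 1024)

int main(void) {
	int fd = open("/system/lib/std.img", O_RDONLY);   /* hypothetical library image */
	if (fd < 0) { perror("open"); return 1; }

	/* MAP_FIXED places the library at the address the "distribution-linked"
	   executables already assume. */
	void *lib = mmap(STDLIB_BASE, STDLIB_SIZE, PROT_READ | PROT_EXEC,
	                 MAP_PRIVATE | MAP_FIXED, fd, 0);
	if (lib == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

	printf("library mapped at %p, where the executables expect it\n", lib);
	return 0;
}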

This system would have the greatest effect when the library being used is the standard library, the library used by the most programs. For that library, the space savings could be big, but startup times would still be small, and function calls would not have any extra cost.

However, there is a problem with that solution: with libraries that are memory-mapped into the same spot in memory, a hacker can find and use gadgets to exploit the machine with return-oriented programming (ROP).

That said, if we can find a defense that eliminates the possibility of ROP, we will still be fine.

And we can. ROP needs a memory bug, such as a stack buffer overflow, in order to work. If we can find a way to eliminate that bug, and all other memory bugs, we will be fine.

And those are the very same bugs eliminated by safe languages, which is why the languages we use will be safe.

However, if there is a bug in the safety mechanisms, we will still be vulnerable to those attacks, so we need to ensure that there are no bugs in the safety mechanisms.

This is the reason the standard library and runtime of the programming language need to be in the TCB, which means they will be formally verified: they will contain the safety mechanisms against memory bugs, and if those safety mechanisms are verified correct, there is no vulnerability to ROP, which means that we can safely memory-map libraries into specific spots.

And as an extra bonus, since we have gotten rid of ROP, which is exactly what ASLR was meant to protect against, we don’t need ASLR or its complicated code and possible bugs.

As yet another bonus, with no indirect function calls or memory bugs, it will also be impossible to redirect the control flow of a program, so Control Flow Integrity, with its bugs and complications, is also not needed.

And there is yet another bonus: all packages can be distributed in a platform-agnostic way, yet still be compiled specifically for the CPU in the machine, making all software more efficient and performant and packaging more consistent and efficient as well.

But that still leaves plugins; how will we enable that?

Easy: we have a compiler backend, and if it can do JIT compilation, programs can just use the JIT compiler and run the plugins that way.

Hardware

Now it’s time for the hardware.

I am not an electrical engineer, so I may not have a good idea of what good hardware can be like. I am not as confident in my hardware design as I am in my design for the software.

Security

The first step to ensuring security is to put all of the hardware in the TCB, which means it will be proven correct and open sourced.

I bet some of you are wondering how a hardware company can still make money with an open source hardware design. The answer is easy: normal people can’t build fabs, so they will buy what the company produces.

And users who really care about security can run the hardware tests and proofs, also open sourced, as well as their own tests and proofs, to ensure that the hardware is built exactly as the manufacturer says.

There is already an open hardware specification, RISC-V, that might fit the bill. And even if RISC-V is not chosen, the hardware probably should be RISC, because that simplifies the design, which will make it much easier to prove correct.

Eliminating Speculative Execution

As we saw above, speculative execution is the cause of many security bugs in hardware. I don’t think that is a coincidence. In fact, I think it’s inevitable.

Why? Because not only is speculative execution fiendishly complicated, it does things whose stated purpose is to change the state of the processor in ways that should not be visible to the program that the CPU is running.

Unfortunately, any extraneous state change can be detected somehow.

And if there are detectable state changes that are supposed to never happen, but do happen, you have a side channel through which attackers can get information and break the security of the system.

So speculative execution is gone.

Separating the OS

What other security problems do we need to eliminate?

Well, the Meltdown vulnerability made it possible for attackers to read kernel memory, and while there is a mitigation, we learned above that mitigations do not go far enough.

What can we do to fix the problem of processes being able to read the kernel’s memory?

We can separate the memory that the OS runs on from the memory that processes use. And if we do that, why don’t we just run the OS on a separate processor entirely?

You see, operating systems are in an interesting position. They themselves have to run on CPUs, and the way computers are currently designed, they have to give up the CPU to a process before the process can run. Thus, they cede control of the CPU.

In order to regain control, an OS has to tell the hardware to stop the process every so often, and the hardware has to do that even if the OS decides to let the process run some more, which means the stop was really for nothing.

And the stop, called a context switch, is expensive.

Technically, there are two context switches whenever the OS regains control: once to start executing the OS, and another to start executing the process again.

Maybe we can put the OS on a separate core entirely, one with its own memory, so it does not interrupt a process until it actually needs to.

Obviously, there needs to be a way for the OS to interrupt a process, even if it’s on a different core, so we will have to create a way for the OS to tell the hardware to do so.

Okay, cool. Not only do we have a way of protecting the OS from bad processes, we also have a way of eliminating useless context switches, so we can get back some performance that we lost from eliminating speculative execution.

Hardware-Supported Capabilities

Before, I talked about the OS using capabilities to restrict access on a per-process basis. This will be used to implement things like chroots, jails, and other sandboxing.

This will be done by requiring drivers to present capabilities in order to access hardware devices. If a capability does not have the authority to access something, the hardware should trap and tell the OS, which should be able to stop the driver.
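
None of this hardware exists yet, so the following C sketch is purely illustrative: the names, the capability structure, and the check are all made up, and in the real design the check and the trap would happen in silicon rather than in software.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum { CAP_READ = 1u << 0, CAP_WRITE = 1u << 1 };

typedef struct {
	uint64_t device_id;   /* which device this capability names */
	uint32_t rights;      /* CAP_READ and/or CAP_WRITE */
} capability_t;

/* Models the rule: every device access must present a capability that covers it. */
static bool device_write(capability_t cap, uint64_t device, uint32_t value) {
	if (cap.device_id != device || !(cap.rights & CAP_WRITE)) {
		fprintf(stderr, "trap: no write capability for device %llu\n",
		        (unsigned long long)device);
		return false;     /* the OS would now decide whether to stop the driver */
	}
	printf("wrote %u to device %llu\n", value, (unsigned long long)device);
	return true;
}

int main(void) {
	capability_t cap = { .device_id = 7, .rights = CAP_READ };
	device_write(cap, 7, 42);   /* rejected: the capability is read-only */
	return 0;
}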

Performance

Speaking of performance, how much have we lost?

Probably most of it. And we need to get it back because otherwise, our computing system will not be used, despite having better security.

Why do consumers choose performance over security? Because they can see the effects of bad performance, but the effects of bad security are less clear. In fact, they may remain invisible indefinitely.

Well, we have regained some performance by getting rid of useless context switches, so that’s good, but it will not be enough.

Thinking about what we could do will be easier if we also think about what else we have lost: die area and power consumption, which means we now have some unused power budget and area.

How much of a processor’s area and power are taken up by speculative execution, excluding the caches?

An answer I got from IRC was,

To a first approximation, 100%.

What can we use that extra power budget and area for?

Before we can answer that question, let’s see what other expensive things we can get rid of.

Reducing Context Switches

We got rid of useless context switches, but there are still some when the OS has to put another process on a core that is already being used. Perhaps we can minimize those kinds of context switches.

We can do that by adding more cores. Since we eliminated speculative execution, each core is really small, so we can have a lot of them.

How many should we have? Well, let’s look to operating systems and their timer interrupts for guidance.

On my work machine, Linux is configured to interrupt the processor 300 times every second, and it is capable of interrupting up to 4016 times per second.

Let’s use the lower figure as a guide, just in case. If we go with the closest power of 2, because reasons, we end up with 256.

So let’s put 256 tiny cores, plus one OS core, into our computer.

That many cores should mean that a lot of processes can be attached to a core on startup, and the OS should never have to move them, or should not have to move them much, minimizing the number of context switches needed for scheduling.

Will that amount of cores really make it so most programs can be attached to one core?

Yes. Running ps -ef | wc -l on my machine counts 325 processes, and 40 of them belong to my web browser alone.

That should get us back an enormous amount of performance, since context switches cost a lot.

However, we can still do better.

Reducing Address Translation

In conventional machines, there is something called the Translation Lookaside Buffer (TLB). It exists because the address of a piece of memory that a program sees may not be the true address of the memory, and the TLB is a cache that speeds up translating between the two.

There is an upcoming computer architecture, the Mill, that has moved the TLB such that it is only used when a cache line is evicted, or sent to main memory. On the page about memory in the Mill, the designers claim that 20-30% of cycles and power budget can be spent on the TLB. If we move it to the same place they have moved it, we can claim some of that power budget and performance while reducing complexity.

Seems like a win-win-win, so we’ll do that.

There are patents on the Mill architecture, which basically makes all of their ideas unusable, for now. However, the earliest I could build a chip like the one we are designing now is about when their patents expire, so we’ll still take some of their ideas.

Unfortunately, doing that requires us to do something else the Mill does: use a single address space. Doing so is only feasible on 64-bit hardware, so that’s all that we are designing.

Limiting ourselves to 64-bit is okay because, in my opinion, everything in a consumer’s hand, in his pocket, on his desk, or in the cloud should be 64-bit.

We are, after all, ignoring embedded systems.

Reducing Memory Traffic

While we are at it, let’s look at other good things the Mill has.

Implicit Zero and Virtual Zero look like they will make the computing system more secure while at the same time reducing memory traffic, which is great because memory latency is bad. They also do not seem to add too much complexity.

So let’s steal those.

The Mill also uses something called Backless Memory and lazy memory allocation. If they end up being simple enough to implement and prove correct, we should use those as well because they reduce memory traffic even more.

Reducing Kernel Interaction

So far we have greatly reduced the memory traffic, TLB thrashing, and context switching on our hardware. Is that enough performance?

Possibly. But there is one thing we haven’t considered yet: we are going to be using a microkernel, not a monolithic kernel, and monolithic kernels are known to be faster, in general, than microkernels.

How do we make up the performance difference of using a microkernel?

Well, let’s first look at why microkernels are generally slower.

In a microkernel, device drivers run in user mode, and when a process wants data from a device, or wants to send data to the device, it must send a message to the driver.

In traditional microkernels, sending a message to another process consists of these steps:

  1. Call the kernel with the data to pass.
  2. The kernel takes control and figures out where to pass the data.
  3. The kernel passes the data to the recipient and gives control to the recipient.
  4. The recipient takes control and does something with the data.

In those four steps are two context switches, which makes everything even slower than it looks, especially since a monolithic kernel just has to receive a call from a process and go right to the device.

And it gets even worse because to send a message back, you need those four steps in reverse with another two context switches while a monolithic kernel only needs one more context switch.

Now, we have made those context switches unnecessary, but the processes still need to call the kernel and wait for data. What if we could get rid of that waiting for the kernel?

Let’s take a simple data structure, the circular buffer, and assume that our OS can set up a circular buffer for passing messages from one process to another.
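
Here is a minimal C sketch of what such a single-producer/single-consumer buffer might look like in a shared memory region (the layout and names are illustrative, not a real API); the hardware version described below would update the head and tail and signal the two sides itself:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 1024   /* a power of two, so we can mask instead of using modulo */

struct ring {
	_Atomic uint32_t head;        /* advanced by the producer */
	_Atomic uint32_t tail;        /* advanced by the consumer */
	uint8_t data[RING_SIZE];
};

static bool ring_put(struct ring *r, uint8_t byte) {
	uint32_t head = atomic_load(&r->head);
	uint32_t tail = atomic_load(&r->tail);
	if (head - tail == RING_SIZE) return false;   /* full: the producer must wait */
	r->data[head & (RING_SIZE - 1)] = byte;
	atomic_store(&r->head, head + 1);
	return true;
}

static bool ring_get(struct ring *r, uint8_t *byte) {
	uint32_t head = atomic_load(&r->head);
	uint32_t tail = atomic_load(&r->tail);
	if (head == tail) return false;               /* empty: the consumer must wait */
	*byte = r->data[tail & (RING_SIZE - 1)];
	atomic_store(&r->tail, tail + 1);
	return true;
}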

For passing messages the other way, the OS could create another circular buffer.

We’ve made progress, but all we have done is added a buffer so that both can be reading and writing at the same time; the kernel still has to get involved.

And here is where the hardware can help us.

So let’s extend our hardware so that some memory can be allocated as a circular buffer: the process on one end gets a signal when there is data, the process on the other end gets a signal when the buffer empties, and whenever either process reads or writes data, the hardware itself updates the circular buffer.

The hardware will also have to ensure that the processes cannot read the memory directly and can only access the data by certain instructions.

And here is the payoff: the only OS involvement is setting up the hardware circular buffer correctly. From then on, the processes can communicate with each other without involving the kernel at all.

Here, we have not only matched monolithic kernels, we have surpassed them! The reason for that is that to invoke a device driver in a monolithic kernel, you have to make one context switch, and to go back, you need another, but with hardware circular buffers and enough cores, we don’t need any context switches at all. In fact, if a process is already waiting for the other, it can start as soon as it receives the signal!

It also might be possible to allow one writer and multiple readers of such a buffer. That way, info can be passed to multiple processes at once. However, this could mean that a misbehaving process could fail to read and thus block the writer and all other readers, so I don’t know if this is a good idea.

But before we move on, we should stop and reconsider one of our decisions.

Since we have reduced kernel interactions, the OS core should be able to handle supervising more than 256 cores. Also, we should be able to increase the number of cores for another reason: most chips nowadays have 4 physical cores, which means we could probably fit 4 times as many of our tiny cores on the chip anyway.

So I think we could increase it to 1024 cores, and I will explain later why we use that number.

If 1024 is too much for the power budget, we can still go with 256 cores.

Improving Compiler Latitude

And that’s not all we can do for performance; there are small details we can change to improve it as well.

For example, we can make sure there is as little global state as possible. Both RISC-V and the Mill architecture attempt to do this (in different ways), and we should follow that example because it doesn’t just reduce data hazards, it allows the compiler wider optimization latitude by allowing it more freedom in scheduling instructions.

Another good idea that comes from the Mill is deferred loads. These are loads whose latency is exposed to the compiler: the load is issued early and its result arrives a known number of cycles later. This allows the compiler even wider freedom in scheduling instructions because it can issue loads well before the instructions that consume their results.

And the most important part of deferred loads is that they are defined to return the value as it is when the load retires. This does two things: it gives the compiler even wider latitude, and it sidesteps aliasing problems, which allows the compiler to be even more aggressive.

Yes, there will be a lot of responsibility on compilers to produce good code for this architecture. However, the sorts of optimizations that we are asking compilers to do here are the sorts of optimizations they are already good at; they can do these things efficiently and with great results.

In other words, the optimizations here are already well-known; we are just making them even easier for compilers.

Exposed Pipeline

And speaking of guaranteeing the length of time that a load takes, there is another guarantee we want the hardware to give.

In order to have real-time guarantees, the compiler must know the latency of all instructions. This is known as having an “exposed pipeline.”

Allocating Cache

Another thing we could do that could possibly improve performance is to have some way, in the hardware, for programs to allocate chunks of the cache for data they know will be used often.

This gives two benefits: first, the CPU will not have to guess what will be needed next, and second, it can make real-time control easier if the data needed for responding to interrupts is always in cache.
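As a sketch of what this might look like to a program, assume a pair of hypothetical calls, cache_reserve() and cache_release(); the real mechanism could be dedicated instructions, a syscall, or something else entirely.

    /* cache_reserve() and cache_release() are hypothetical; the point is only
     * that a program can say up front which data must stay resident, such as
     * the state touched by a real-time interrupt handler. */

    #include <stddef.h>

    typedef struct cache_region cache_region_t;  /* opaque handle, hypothetical */

    extern cache_region_t *cache_reserve(void *addr, size_t len); /* pin into cache */
    extern void cache_release(cache_region_t *r);                 /* give it back   */

    struct control_state {
        double setpoint;
        double last_error;
        double integral;
    };

    static struct control_state ctl;
    static cache_region_t *ctl_region;

    void realtime_setup(void)
    {
        /* Keep the control loop's state resident so the interrupt handler
         * never has to wait on a miss to main memory. */
        ctl_region = cache_reserve(&ctl, sizeof ctl);
    }

    void realtime_teardown(void)
    {
        cache_release(ctl_region);   /* return the cache space when done */
    }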

Addressing Dark Silicon

Now, we come to the difficult part of the design; while I have little confidence in my hardware design in general, this is the part that I am least confident in, period.

We need to deal with the problem of dark silicon.

In short, the problem of dark silicon is that, as chips get smaller and with more transistors, more and more of the chip needs to be unpowered in order to stay within a power budget.

So, if we have a chip with a lot of powered cores, how do we make sure that enough of it stays unpowered (dark silicon)?

I have several answers.

First, if we were to take current chips, remove the speculative-execution machinery, replace it with lots of simple cores, and keep the caches as-is, we should be able to stay within the power budget, because caches actually count as dark silicon.

That may not be a satisfactory answer, but I think that it is still important to consider.

And if that is not enough, we have a few other things we can do.

Concurrency Means Waiting

As the design of this chip currently stands, we can safely assume that a good chunk of the cores will be dark silicon at any one time. This is because the vast majority of programs will spend most of their time waiting, whether on devices or other processes that are supposed to give them data.

Waiting makes a core dark silicon only if the chip is designed to be “switching-limited.” (See the P.S. of this post.) I don’t know if designing a chip with so many switching-limited cores is possible. If it’s not, then my design falls flat on its face.

However, if it can be done, then all of the waiting that processes will do will make this a viable design.

I suspect that the amount of waiting that will be done will be sufficient to make enough of the chip dark.

Before I move on, I would like to address a possible criticism.

Linus Torvalds thinks (correctly) that parallelizing everything is a waste of time. However, he forgot that making something concurrent is entirely different. And concurrent loads are exactly the kind of load that this chip design should be good at.

The GPU CPU

But we have yet another few things we can do.

To start with, we now have lots of cores. That sounds a lot like a GPU.

Right now, the GPU takes up space and power on the chip; what if we could take that power and space back for the CPU and make it possible for the CPU to do the job of both a GPU and CPU?

This thought is radical, yes, but it might have a practical effect. You see, according to a post I linked to earlier, “The Bright Side of Dark Silicon”, if we can find a way to make symmetric multiprocessing (SMP) work, we will get:

  1. Better load balancing
  2. With less effort
  3. And more redundancy.

You see, what the computing industry has found out is that there are three main kinds of heavy computing loads: sequential/serial loads, parallel loads, and concurrent loads.

Right now, we use CPU’s to take care of sequential and concurrent loads, and we use GPU’s to take care of parallel loads. If we could somehow build a chip that was capable of all three, we could get rid of the vast majority of asymmetric multiprocessing (AMP).

There will still be some AMP; for example, quantum computing can’t really be put on the same chip as classical computers. And there are probably more such extreme specialized chips in our future.

However, the author of “The Bright Side of Dark Silicon,” Yossi Kreinin, goes on to talk about why SMP will inevitably lose to AMP. He says that it’s because of dark silicon.

Once again, dark silicon rears its head.

To be honest, I think he is right, if we don’t figure out what to do about it.

But that’s a very big if, so let’s see what we can figure out.

The first problem is that the GPU is a separate chip entirely; that makes it easier to bleed heat than it would be if the GPU and CPU were on the same chip.

Here’s a radical idea: what if we had separate chips as well? We would just make it so they would be able to communicate through memory, but they could all have their own caches, their own OS core, their own cooling, and their own set of 1024 (or 256) user space cores.

But that doesn’t quite solve the problem of making parallel loads possible; if all of the cores are running at once, we might exceed our thermal budget.

This problem is a little harder to solve, but luckily for us, we can take a page out of the book of GPU design.

In GPU’s, threads/cores are arranged in blocks, and we can probably do the same in each of our chips.

Since we have 1024 cores, we could split them up into 32 blocks of 32 threads (16 blocks of 16 if we only have 256). Each chip can have its own L3 cache, each block can have its own L2 cache, and each core can have its own L1 data and instruction caches.
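To keep the arithmetic straight, here is that layout written out as constants; the cache notes and the little helper function are purely illustrative, not new design decisions.

    /* The proposed per-chip topology, written out so the numbers are explicit. */

    enum {
        CORES_PER_CHIP  = 1024,                             /* or 256 in the smaller variant */
        CORES_PER_BLOCK = 32,                               /* GPU-style block of cores      */
        BLOCKS_PER_CHIP = CORES_PER_CHIP / CORES_PER_BLOCK  /* = 32                          */
    };

    /* Per core:  private L1 instruction and L1 data caches.
     * Per block: a shared L2 cache.
     * Per chip:  a shared L3 cache, plus the OS core supervising the rest. */

    /* Map a flat core index to its block and its lane within that block. */
    static inline void core_position(int core, int *block, int *lane)
    {
        *block = core / CORES_PER_BLOCK;
        *lane  = core % CORES_PER_BLOCK;
    }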

Then when a process is running something in parallel, it is given exclusive access to all of the cores in one or more blocks. When doing so, we have three options to reduce the power usage of the cores running in parallel:

  1. We can reduce their clock speeds.
  2. If we don’t need all of the threads in a block, we can switch the unneeded ones off.
  3. We can make all of the cores run in lock-step.

Number 1 is the best option, if it’s good enough, because it means that all of the cores retain their flexibility and can run as fast as possible to completion, which will also save power.

Reducing core frequency is also the most broadly useful option because it can be applied whenever we exceed the power budget, even when the cores are running different processes. If we can reduce their frequency, we can prevent thermal failure even in concurrent mode.

If that is still not enough, and/or the process needs more serial performance and fewer threads, we can instead run the necessary cores at full frequency while keeping the unneeded cores switched off.

However, if reducing frequency and turning off cores are not enough, then number 3 will be necessary.

What this means is that all of the cores in a block will share the same fetch and decode logic. If there are instructions that only some cores need to execute, those cores will execute them while the other cores wait.

If you think this sounds like what GPU’s do, you are correct.

The advantage is that sharing one set of fetch and decode logic cuts power usage significantly; the disadvantage is that our cores have now lost their flexibility.
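To show what lock-step execution costs when a branch diverges, here is a tiny software model of one block: a single instruction stream, with a mask deciding which lanes actually execute each step, exactly as on a GPU. The 32-lane block and the toy kernel are just examples, not part of the proposal.

    /* A software model of lock-step execution across one block of cores.
     * All lanes see every instruction; a mask says which lanes act on it,
     * so the two sides of a divergent branch execute one after the other. */

    #include <stdint.h>
    #include <stdio.h>

    #define LANES 32   /* one block's worth of cores */

    int main(void)
    {
        int32_t  x[LANES];
        uint32_t mask;

        for (int i = 0; i < LANES; i++)     /* some lanes negative, some not */
            x[i] = (i % 3) - 1;

        /* "if (x < 0)": build the mask of lanes that take the branch. */
        uint32_t neg = 0;
        for (int i = 0; i < LANES; i++)
            if (x[i] < 0)
                neg |= 1u << i;

        /* Then-side: only the lanes in `neg` execute; the rest sit idle. */
        mask = neg;
        for (int i = 0; i < LANES; i++)
            if (mask & (1u << i))
                x[i] = -x[i];

        /* Else-side: the complementary lanes execute while the others wait. */
        mask = ~neg;
        for (int i = 0; i < LANES; i++)
            if (mask & (1u << i))
                x[i] *= 2;

        for (int i = 0; i < LANES; i++)
            printf("%d ", x[i]);
        printf("\n");
        return 0;
    }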

However, we should still be within the power budget, especially if each core can have its own cooling apparatus.

Not only that, but with such a design, if a motherboard supports it, a user could just add more chips, giving them a configurable amount of SMP power.

If there are multiple chips, one of the OS cores in one of the chips will need to act as an OS supervisor, telling each chip what its job is. It would probably be good for the supervisor to be on the chip running the most parallel, and therefore least OS-intensive, load of all: driving the display.

What makes this design great, at least in my head, is that it is flexible enough that it can handle all of the loads normally thrown at a user-facing computer, whether in their pocket, on their desk, or running the website they are accessing.

For example, say a user is doing normal work. They require enough GPU capability to drive their display, and perhaps a few blocks are doing that, in parallel, as the other blocks drive application programs like their text editor, browser, etc.

Then, they take a lunch break, and during their lunch break, they play video games. When they boot up a game and start playing, the OS realizes that more power has to be given to driving the display since the video game has power hungry graphics. It moves the now-unused text editor and browser processes off of their cores, along with other unused processes, and allocates a few more blocks to drive the display, along with a block or two to drive the physics engine of the game.

With this system, you have as much GPU capability as you need and as much CPU capability as you need. And if you ever find yourself in need of more, just add another chip with 1024 more cores.

Constant Memory

GPU’s have another trick that we might find useful: constant memory. We could allow processes to mark certain portions of memory as constant, which would mean that other processes could cache the data without worrying about mutation.
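As a sketch of what that might look like from a process’s point of view, assume a hypothetical mark_constant() call; the real mechanism could just as well be a flag supplied when the memory is mapped.

    /* mark_constant() is a hypothetical call; the idea is only that once a
     * region is declared immutable, other processes can cache it freely,
     * because it can never change out from under them. */

    #include <stddef.h>

    extern int mark_constant(void *addr, size_t len);   /* hypothetical */

    struct lookup_table {
        double coeff[256];
    };

    int publish(struct lookup_table *t)
    {
        /* Fill the table once... */
        for (int i = 0; i < 256; i++)
            t->coeff[i] = (double)i / 255.0;

        /* ...then freeze it. After this point any write to the region would
         * fault, so readers can cache the data with no coherence traffic. */
        return mark_constant(t, sizeof *t);
    }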

Conclusion

So that, in a nutshell, is my design for a new computing system.

If there is one theme running throughout this entire design, it’s that building and programming a computing system needs to be easy to do right.

That is why I am trying to give programmers tools that they need to control the resources at their disposal. That is why I am trying to make SMP viable. It is also why I am trying to use hardware to increase the performance of a microkernel.

Once again, I am confident that the software portions are good, but since I am not a hardware designer, I cannot tell if those portions are feasible; I could be completely off my rocker there.

If you have comments, suggestions, or critiques, please contact me.