Assumed Audience: Programmers, especially Rustaceans and those interested in static analysis.

Epistemic Status: Fairly confident, but not as confident as I’d like because I do not have much time to implement these ideas.

This post is partially an ad!

Introduction

On the ninth anniversary of Rust 1.0, its creator, Graydon Hoare, appeared and wrote a blog post expanding on another by without.boats.

You see, these brilliant people are struggling with a problem: how do you maximize the static analysis capabilities and usability of a language?

Rust is a great step. But people want more.

Spoiler: the answer is structured concurrency.

In fact, the definition in that link is too loose; you need threads to be scoped to an actual scope, not just a parent object or something. And I would even restrict structured concurrency so that threadsets/nurseries/whatever are not first-class.

I’ll call this restricted structured concurrency, or RSC for short.

Sound crazy? It is, so let’s go back to the beginning.

Leakpocalypse

Leakpocalypse is a term that only advanced Rustaceans (and language nerds like me) recognize, and I suspect it holds a special place in the Hades of the heart for members of the Rust team around 1.0.

I won’t go into detail since that has been done, but here’s the Leakpocalypse tl;dr: it turns out that some APIs are actually unsafe if Rust can’t guarantee that destructors will be called.

This month is the ninth anniversary of Rust 1.0, so that was May 2015. Check the date of the bug report that caused Leakpocalypse: April 10, 2015.

If you were on the Rust team in April 2015, you had less than two months to fix a design issue that could cause safe code to become, well, unsafe!

That lack of time required a shortcut, so the final decision was that Rust would not guarantee that destructors will be called.

I think this was the correct decision at the time; allowing memory to possibly leak on purpose seems like a small price to pay to keep Rust’s safety guarantees.

However, it means that Rust is suboptimal, and Rust’s failure is my opportunity.

shared^mut

That said, Leakpocalypse did not take away from Rust’s biggest successes. As much as I hate Rust, I certainly acknowledge that it is a leap ahead of my favorite usable language, which is C.

Graydon Hoare, and I think without.boats as well, correctly identify the reason behind Rust’s success: the “shared-xor-mutable” rule.

For ease, I’ll call it shared^mut.

Yep! Rust’s greatest success is not the borrow checker; the borrow checker is merely enabled by shared^mut.
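
To make the rule concrete, here is a minimal Rust sketch of shared^mut in action (the variable names are just for illustration):

```rust
fn main() {
    let mut x = 42;

    let shared_a = &x; // any number of shared (read-only) references...
    let shared_b = &x;
    println!("{shared_a} {shared_b}");

    let exclusive = &mut x; // ...XOR exactly one mutable reference
    *exclusive += 1;

    // Using `shared_a` here, while `exclusive` is still live, would be a
    // compile error: shared and mutable references may not overlap.
    // println!("{shared_a}");

    println!("{x}");
}
```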

As a C programmer, I used to think shared^mut was one of Rust’s most stupid decisions; after all, I’m used to using pointers wherever and whenever.

Though I am more disciplined than that.

But I read without.boats carefully, and I read Graydon Hoare carefully. They generally know what they are talking about, even if they tend to accept more complexity than I would.

And that is what I thought shared^mut added: complexity.

I mean, it does in some sense, but that’s mostly design complexity, and once you learn the patterns, it’s easy to start with a proper design.

Kinda like how Rustaceans change their design habits to satisfy the borrow checker.

Anyway, they convinced me. As of seven days ago. My language will have shared^mut.

To learn more about why, read both of their posts, first the one by boats, and then the one by Graydon.

But if you want a tl;dr, Graydon has one:

TL;DR: support for local reasoning is a big factor in the ability to do automated reasoning about programs. Formal verification involves such reasoning. Rust supports it better than many other imperative systems languages – even some impure functional ones!

And he even has support from a formal verification hacker:

rust’s avoidance of shared mutable state has deep consequences; when we formally verify programs in Rust we can use FOL and avoid separation logic since the type system protects us from mutable aliasing, while this is not true in caml despite being ‘functional’

Xavier Denis

Xavier has since expanded on that tweet.

Local Reasoning

So let’s talk about local reasoning.

Graydon says it’s awesome for static analysis, but why is it awesome for static analysis?

Well, let’s go back to my post about Turing-completeness:

The true importance of Turing-completeness is that there is at least one property that cannot be proven!

You may be able to prove some properties of a program, but if you cannot prove them all, that is the fault of Turing-completeness. And sometimes, you may be able to prove a property you care about, but if you cannot prove it every time, that is the fault of Turing-completeness.

Because of Turing-completeness, proving a property across an entire program is hard! Stupendously hard!

seL4, the formally verified microkernel, took 2.2 person-years of effort and 20 times as much proof code as implementation code, for merely 10k LoC.

And in case you think the actual implementation was the biggest part, that only took two person months!

Famously, seL4 is written in C, and the only way to prove properties about C is to look at the whole program. But what if we could prove properties for smaller units, like functions, and be confident that those properties hold for the entire program?

We can! And Rust is the proof.

This is what boats and Graydon are saying: in certain cases, if you have “local reasoning”, you can reason about, or analyze, a program one function at a time. And then you use mathematical induction to prove that the properties hold across the entire program.

“Surely it’s not that easy!”

Generally, no, but you can design a language to make it so.

Here’s why: each function call is, essentially, its own program.

Think of the runtime function stack. What if you recorded the actual runtime values that the function receives and then wrote a program to execute just that function with just those runtime arguments?

Would that function execute just the same as it does when it is part of a larger program?

“No, it wouldn’t, Gavin. There might be global variables, other threads, any kind of global state.”

You are correct.

However! What if you could take all of that out? What if the function touches none of that stuff?

If you could, then a function call would be its own subprogram.

This is why programmers don’t like global variables. This is why they don’t like threading. This is why they don’t like global state.

“Local reasoning” is just a fancy term for “functions are their own subprograms.”
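
Here is a minimal Rust sketch of the difference (these two functions are hypothetical examples of mine, not from either post):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static COUNTER: AtomicU64 = AtomicU64::new(0);

// NOT its own subprogram: the result depends on global state that other
// callers and other threads can change, so recording the arguments is not
// enough to replay this call.
fn next_id() -> u64 {
    COUNTER.fetch_add(1, Ordering::Relaxed)
}

// Its own subprogram: replaying it with the same recorded argument always
// reproduces the same behavior, so it can be analyzed one call at a time.
fn checksum(bytes: &[u8]) -> u64 {
    bytes.iter().map(|&b| u64::from(b)).sum()
}

fn main() {
    println!("{} {}", next_id(), checksum(b"hello"));
}
```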

shared^mut and Local Reasoning

And local reasoning is not binary, either! It depends on the property.

Rust has local reasoning about some properties, but not others. boats and Graydon are lamenting that it doesn’t have local reasoning for more of them. They also think, and I agree, that the industry should move towards more local reasoning.

Anyway, most languages have poor local reasoning, and boats claims shared^mut is Rust’s secret sauce for having more.

This is why he refers to mutable references as jumps: if the value at a reference changes from underneath your code, a single function could find itself taking a different path than if the value did not change. In essence, that function jumped somewhere different because of the reference.
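
Here is a small Rust sketch of that “jump”, using Cell to opt back into shared mutability (the function names are mine):

```rust
use std::cell::Cell;

// With a plain shared reference, the value cannot change behind this
// function's back: both reads see the same number, so reasoning stays local.
fn no_jump(x: &i32, f: impl Fn()) -> bool {
    let before = *x;
    f();
    before == *x // always true
}

// With interior mutability (an explicit opt-out of shared^mut), the callback
// can mutate the value through an alias, and this function's path can change
// out from under it: the "jump".
fn jump(x: &Cell<i32>, f: impl Fn()) -> bool {
    let before = x.get();
    f();
    before == x.get() // may be false
}

fn main() {
    let a = 1;
    println!("{}", no_jump(&a, || {})); // true

    let b = Cell::new(1);
    println!("{}", jump(&b, || b.set(2))); // false
}
```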

The arguments from boats and Graydon are sound; I can find no fault with them. So I am publicly saying that I was wrong and that shared^mut belongs in every new language.

Okay, not every new language. Feel free to play around!

But I would like to see it in every new mainstream language that does not have to do what C has to do. And maybe even in languages that have to do what C has to do.

The Problem

But we have a problem: notice that threads ruin the assumptions of functions as subprograms.

“Wait, how do they ruin functions as subprograms, Gavin?”

Say you have a function. That function starts a thread, then returns. That new thread continues running after the function returns.

In essence, part of the subprogram’s execution, the thread, is executing after the function returns, so the function itself isn’t the whole of the subprogram.

Threads leak out of functions.
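
In Rust terms, the leak looks like this (a minimal sketch; the sleeps just make the ordering visible):

```rust
use std::thread;
use std::time::Duration;

// Part of this function's "subprogram" keeps running after it returns.
fn leaks_a_thread() {
    thread::spawn(|| {
        thread::sleep(Duration::from_millis(50));
        println!("still running after `leaks_a_thread` returned");
    });
}

fn main() {
    leaks_a_thread();
    println!("`leaks_a_thread` has returned");
    // Give the leaked thread time to finish before the process exits.
    thread::sleep(Duration::from_millis(100));
}
```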

So threads are right out, but we need some concurrency. How do we get it back?

Async, right?

Wrong!

Async functions return futures or promises, which represent code that will be executed later. So code in the subprogram will still execute after the function returns.

Async leaks out of functions too.

Grr!

The Rust async book mentions three more methods:

  • Event-driven programming, but that doesn’t allow parallel execution without one of the other methods.
  • Coroutines, but they also leak out of functions.
  • The actor model, but actors can outlive the code that spawned them, which means they can also leak out of functions.

Uh…

So they all leak, or can only use one CPU at a time. Great. Just great.

The Solution

Are we going to give up there? Of course not! Because someone already came up with the answer.

That someone, as far as I can tell,1 is Nathaniel J. Smith, who woke up one day and thought, “You know, what if I just made functions that didn’t leak threads?”

Yep! That’s the solution: just don’t leak threads out of the function that creates them. It really is that simple.
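
Rust’s std::thread::scope is the closest standard-library example I can point at: no thread escapes the function, because the scope joins every spawned thread before it returns (a minimal sketch):

```rust
use std::thread;

fn no_leaks() {
    let data = vec![1, 2, 3];
    thread::scope(|s| {
        // Both threads may borrow `data`, because neither can outlive it.
        s.spawn(|| println!("sum: {}", data.iter().sum::<i32>()));
        s.spawn(|| println!("len: {}", data.len()));
    }); // both threads are joined here
}

fn main() {
    no_leaks();
    println!("`no_leaks` has returned; no threads remain");
}
```

Note that this is structured concurrency, not RSC: the scope handle s is still a value that code inside the closure can pass around.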

And we only know this because then Nathaniel J. Smith thought, “I should tell people about this,” so he called it structured concurrency and wrote the seminal source on the what and why.

That post is so important to programming that he should be recognized by his initials, NJS, just like RMS, DJB, ESR, EWD, PG, RSC (not restricted structured concurrency!), and DHH.

Now, in his original definition, and my later definition, structured concurrency allows for a first-class value, which NJS calls a nursery and I call a threadset, to be passed around.

I have since realized that is a problem; after all, if a function returns a threadset in some way, just like an async function returns a future, then the threads inside will leak.

And that is why restricted structured concurrency is the missing piece: it lets all functions be their own subprograms (with concurrency!), so local reasoning holds and properties can be proved inductively.

Why Destructors Must Run

But for RSC to work, destructors must be guaranteed to run.

You can use a reference cycle to leak a JoinGuard and then the scoped thread can access freed memory

Leakpocalypse bug report

In other words, if destructors will not run, then RSC can cause use-after-free, one of the worst kinds of bugs.
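
Rust eventually solved that class of problem with the std::thread::scope API shown earlier (stable since Rust 1.63): the join is tied to the scope itself rather than to a destructor, so even leaking the handle cannot leak the thread (a minimal sketch):

```rust
use std::thread;

fn main() {
    let data = vec![1, 2, 3];

    thread::scope(|s| {
        let handle = s.spawn(|| println!("{:?}", &data));

        // Leaking the handle does NOT leak the thread: the join happens in
        // `scope` itself before it returns, not in the handle's destructor,
        // so the borrow of `data` can never outlive `data`.
        std::mem::forget(handle);
    });

    println!("all scoped threads were joined before this line");
}
```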

This is another reason that threadsets cannot be first-class values and must be tied to a specific scope/lifetime.

This is hard to do if a language supports more concurrency primitives than RSC, but if RSC is the only way, then it’s actually easy to guarantee destructors will run: just add scope-based resource management (SBRM).

Like normal structured concurrency, normal RAII is too loose; you need hard SBRM.

Then, even in the presence of exceptions, destructors will always run.
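
A Rust analogue makes the idea concrete: destructors run as the stack unwinds, so cleanup happens even when a scope exits via an “exception” (a minimal sketch; with panic=abort this guarantee disappears):

```rust
// With unwinding panics, `Drop` runs as the stack unwinds, so the cleanup
// below happens even though the scope exits via an "exception".
struct Cleanup(&'static str);

impl Drop for Cleanup {
    fn drop(&mut self) {
        println!("destructor ran for {}", self.0);
    }
}

fn main() {
    let result = std::panic::catch_unwind(|| {
        let _guard = Cleanup("inner scope");
        panic!("an 'exception'");
    });
    // "destructor ran for inner scope" was printed before this line.
    println!("panic caught: {}", result.is_err());
}
```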

This isn’t theoretical either! I’ve written a form of SBRM in C, and I’ve tied all of my threadsets to it. I’ve even added setjmp and longjmp for exceptions, and signals for thread cancellation.

With all of that, as long as I don’t allocate something outside of my SBRM (which I currently do and need to fix), all destructors will run. Deterministically too.

Of course, you do need owning references, but I’ve even managed to write a generic linked list that owns its data, so it’s possible.

Though you should not use setjmp and longjmp at home. Or use them with signals. You do not want to be forever cursed with that knowledge like I am!

But it does work for my bc as well, and POSIX says that the way I do it is safe.

The Good

However, once you have RSC and guaranteed destruction, you eliminate two important problems: function colors and the asynchronous clean-up problem.

If every function call is its own subprogram, then you just don’t have function colors. And if you don’t have async, you don’t need to worry about async clean-up.

These problems have been recognized in Rust, with some saying that async Rust just doesn’t work, and I agree.

Though I am incredibly impressed by the brilliance of the Rust team in getting as close as they are!

But that’s not all!

Rust is a great leap, but it is also limiting: the Rust compiler proves the same fixed set of properties for every function across the whole program.

And while those properties are the most important, there aren’t many of them.

The point of static analysis in a compiler is to prove as many properties as easily as possible, so it is suboptimal if the set of provable properties is fixed.

What if we wanted to prove more properties, and what if we wanted to prove only some properties for part of the program?

With RSC, both are possible.

In my language, Yao, libraries will be able to define property annotations; to do so, they must provide functions that the compiler runs on each annotated function to prove those properties under the subprogram assumption.

And then programmers can apply those annotations to whatever functions they wish.

Those annotations will be a part of the function’s type, and an annotated function can only call functions that carry at least the same annotations.

This is what will fulfill the subprogram assumption.
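
Rust’s const fn is an existing, smaller example of the same idea: the “evaluable at compile time” property is part of the function’s type, a const fn may only call other const fns, and so the property holds by induction over calls:

```rust
const fn square(x: u64) -> u64 {
    x * x
}

const fn sum_of_squares(a: u64, b: u64) -> u64 {
    // OK: `square` carries the same annotation.
    // Calling a non-`const` function here would be a compile error.
    square(a) + square(b)
}

// Checked and evaluated at compile time.
const TOTAL: u64 = sum_of_squares(3, 4);

fn main() {
    println!("{TOTAL}"); // 25
}
```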

This will work at runtime too; if a function needs to run a callback, then it just needs to ensure that the callback parameter has the same annotations.

For example, Yao’s standard library will define pure and total. The first will mean that the function is a purely functional function, and the second will mean that the function will be guaranteed to terminate.

If you have a library that must be purely functional but needs to take a callback, just add pure to the callback parameter!

This will be an easy way for programmers to ensure they follow the functional core, imperative shell pattern!

And then, if Yao programmers want to prove more properties, they just need to write function provers in Yao, not separate formal verifiers, and distribute them. Even lazier Yao programmers will then be able to simply import a library, and they’ll have more formal verification.

I don’t know of any formal verification that is as simple as importing a library and fixing compile errors!

Rust is beloved because it shifts software development left. Compiling Rust is formal verification.

So now we know why writing correct programs is hard: because it has to be. But while we cannot verify all programs all the time — regardless of how they’re written — there’s nothing stopping us from verifying some programs some of the time for some properties.

Ron Pressler, “Why Writing Correct Software Is Hard”

Or all programs in one language all of the time for some properties.

But Yao will go even further; it will verify some subprograms in one language all of the time for more properties.

And Yao will still take less time to compile! In a non-optimized debug build, Yao compiles the equivalent of a Rust crate in less than a second, even in the presence of Yao’s answer to Rust’s macros.

The Bad

Unfortunately, Yao’s formal verification will not be perfect.

There does not exist a generally useful programming language where every program is feasibly verifiable [for all properties].

Ron Pressler, “Why Writing Correct Software Is Hard”

I want Yao to be generally useful, so Yao programs won’t be completely verifiable.

But that is just an extension of Turing-completeness, which I talked about before. Some things just cannot be done at compile time.

Ron Pressler even preaches against sound formal verification for this reason.

I both agree and disagree.

I disagree because Rust has shown that formal verification in a compiler is worth it. If we can add more for little cost, and I think I can, we should.

In addition, the reason he argues against sound methods is that they do not compose:

The main result is that the verification of most interesting properties that we’d like to verify does not compose. I.e. if we can prove a property for components P1…Pn, then proving that property for P1 ○ … ○ Pn (where ○ is some composition such as a subroutine call or message passing) does not just grow super-polynomially in n (which would be good!), but super-polynomially in the size of each of the components, i.e. it would be just as hard to prove if everything was not decomposed. I.e. correctness does not decompose.

Ron Pressler, Hacker News comment

And he is right, in the general case.

But we can design a programming language to make that easier, which is what I have done with function calls as subprograms in Yao and RSC.

That does mean there is a limitation on what kinds of properties we can prove; to be provable in Yao, a property must be provable by induction.

This means that for any other properties, you do need to do as Ron Pressler says and use unsound verification techniques, like testing, fuzzing, testing, mutation testing, testing, chaos engineering, and did I mention testing?

Seriously, people, test your code!

But that is not all of the disadvantages of RSC.

Another is that it takes extra memory.

Each thread takes memory (a control block of about 4 KB on Linux, plus a stack), and RSC will have more live threads than other concurrency methods.

Since each thread needs to outlive its children, its data lives longer than necessary, which takes memory.

And finally, I don’t know that it would be more efficient than things like async Rust, for example.

But in any case, it needs to be paired with event-driven programming to be efficient at scale.

The Ugly

Let’s talk about that last point because that’s where RSC gets ugly.

There are several patterns that people use for concurrent code now, and RSC does make them harder to implement.

Producer/Consumer Pattern

Take the producer/consumer pattern.

A naive implementation would look like this:

Naive producer consumer pattern under restricted structured concurrency

Diagrams in this post use:

  • A rectangle for the root of a thread.
  • A circle for a function call on a thread.
  • A solid line for a function call.
  • A dashed line for the spawning of a child thread.
  • A dashed and dotted line for a dynamic DAG edge between stack frames (more on that later).

But that won’t work!

The problem is that the data produced by the producer needs to be passed to the consumer, but it would own that data, not the consumer.

If you struggle to figure out how to implement this in plain RSC, you’re not alone. In fact, I believe that without some sort of change plain RSC is incapable of expressing a lot of useful patterns.

So what needs to change? Recall that RSC removes the ability to pass threadsets around as first-class values; in other words, RSC is totally static, a purely compile-time construct, and it produces a tree of threads.

A tree is too limiting. What if we could make a DAG?

Say you have two threads. They may or may not be related (one a descendant of the other), but what if they could somehow be connected?

We would need to do it in such a way that the subprogram assumption holds, of course.

To do that, we could have one of those two threads send a value to the other thread, a value that will block in its destructor until the sending thread returns from the stack frame that sent the value. The sending stack frame, not the receiving thread, then becomes a “descendant” of the receiving stack frame.

This means that although the message with the value goes from the sender to the receiver, the arrow in a diagram would go from the receiver to the sender.

If that’s confusing, just think that the receiver would send a request for that value first.

It is crucial that the value is tied to the particular stack frames too! This makes it so both threads can continue to work. But it also means that anything above the receiving stack frame in the receiving thread’s stack (speaking of stacks that grow downward) is accessible in the sending stack frame and all of its descendants.

Yes, this does mean that there are two subprograms that share the same subprogram (i.e., the “Allocator” stack frame shares a subprogram with the “Producer Function” stack frame), but as it turns out, this is still safe; it is still a subprogram, even if it’s a shared subprogram.
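
Here is one possible way to model such an edge in Rust, purely as a sketch; the type names (DagEdge, FrameGuard, send_with_edge) are mine, and this is channels-and-destructors, not how Yao will actually implement it:

```rust
use std::sync::mpsc;

// The received value's destructor blocks until the sending stack frame has
// returned, so the sending frame behaves like a descendant of the receiver.
struct DagEdge<T> {
    value: T,
    sender_done: mpsc::Receiver<()>,
}

impl<T> Drop for DagEdge<T> {
    fn drop(&mut self) {
        // Returns (with an error) only once the sender's `FrameGuard` is gone.
        let _ = self.sender_done.recv();
    }
}

// Held in the sending stack frame; dropping it (by returning from the frame)
// releases the receiver.
struct FrameGuard {
    _done: mpsc::Sender<()>,
}

fn send_with_edge<T>(value: T) -> (DagEdge<T>, FrameGuard) {
    let (tx, rx) = mpsc::channel();
    (DagEdge { value, sender_done: rx }, FrameGuard { _done: tx })
}

fn main() {
    std::thread::scope(|s| {
        let (edge_tx, edge_rx) = mpsc::channel::<DagEdge<String>>();

        // Sending thread: this closure's frame holds the guard.
        s.spawn(move || {
            let (edge, _guard) = send_with_edge(String::from("hello"));
            edge_tx.send(edge).expect("receiver is alive");
            // ...more work; `_guard` drops when this frame returns.
        });

        // Receiving thread: uses the value, then waits in `edge`'s destructor
        // until the sending frame has returned.
        s.spawn(move || {
            let edge = edge_rx.recv().expect("sender is alive");
            println!("received {}", edge.value);
        });
    });
}
```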

And yes, there needs to be synchronization, another disadvantage of RSC, but as boats showed in “The Scoped Task Trilemma”:

Any sound API can only provide at most two of the following three desirable properties:

  1. Concurrency: Child tasks proceed concurrently with the parent.
  2. Parallelizability: Child tasks can be made to proceed in parallel with the parent.
  3. Borrowing: Child tasks can borrow data from the parent without synchronization.

without.boats, “The Scoped Task Trilemma”

Anyway, for the producer/consumer pattern, adding those stack frame edges would look like this:

The proper producer/consumer pattern under restricted structured concurrency

The owner of the data is the stack frame labelled “Allocator”; it allocates the space for the data. Then it passes those allocations to the “Producer Function” stack frames in the producers.

Yes, this means that the destructor needs to know whether the data has been initialized by the producer or not; this is another weakness of RSC.

But once the data is initialized by producers, the consumers can use those allocations in the “Consumer Function” stack frame because the “Allocator” stack frame is an ancestor.

And then the “Allocator” stack frame can free/destroy the data after it is used.
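
For a feel of the shape, here is a rough Rust sketch using scoped threads and a channel; Rust’s ownership lets the produced values simply move from producer to consumer, so this sketch skips the separate “Allocator” bookkeeping and the dynamic DAG edges and only shows that nothing outlives the scope:

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // The outer frame plays the "Allocator" role: it owns the channel for
    // the whole scope, and nothing below outlives `thread::scope`.
    let (tx, rx) = mpsc::channel::<String>();

    thread::scope(|s| {
        // "Producer Function" frame: produces values and sends them.
        s.spawn(move || {
            for i in 0..5 {
                tx.send(format!("item {i}")).expect("consumer is alive");
            }
            // `tx` drops here, closing the channel.
        });

        // "Consumer Function" frame: consumes until the channel closes.
        s.spawn(move || {
            for item in rx {
                println!("consumed {item}");
            }
        });
    }); // both threads are joined here, before the outer frame returns
}
```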

Thread Pool Pattern

But there is an even harder problem: the C10M Problem. Is it possible to use RSC to handle 10 million connections, of which 1 million need some processing every second?

To be honest, I don’t know; I think it would need to be implemented and load tested to find out. Personally, I don’t think RSC can get that far because the original C10M Problem author thinks that you have to avoid calling the OS:

Today with C10M, we start with asynchronous code, but instead of better kernels, we move our code completely out of the kernel.

Robert Graham, “The C10M Problem”

This is another bad thing about RSC: because there are more threads, there will be more calling into the OS.

But I’ll give a sketch for a design that might get close enough for government work.

Yes, this means that for hyperscaling, you’ll still want to use C, C++, or maybe Rust; RSC and Yao will not serve well enough.

When I first tried to solve this problem, I failed, even with the static edges between threads that the producer/consumer pattern uses.

Matt Kline, the author of “Async Rust Is a Bad Language,” has a theory why:

Not every program can be expressed as a DAG…

Matt Kline, “Async Rust Is a Bad Language”.

I think he’s right.

And to beat the dead horse again, runtime is more powerful than comptime! RSC will always be painfully hampered if it is a comptime-only thing.

So the solution is to make it dynamic, to have some runtime component.

See, Matt Kline was probably thinking about static DAGs; what if we had dynamic DAGs? What if, instead of spawning threads and edges between them once and leaving it as-is, we allowed ourselves to spawn edges dynamically?

We’ll call this version of RSC Dynamic Restricted Structured Concurrency (DRSC).

How might it work?

First, you spawn a bunch of worker threads from a parent:

Thread pool spawning

There can be multiple worker threads, of course. The diagram only has one for clarity.

The parent thread then calls a function that runs a multiplexer that implements event-driven programming, receiving messages from the worker threads and also accepting new connections through the main socket:

Thread pool after the parent starts accepting connections

Meanwhile, each worker thread sets up its own multiplexer, which will listen for a new connection from the parent thread, as well as handle any connections that the worker is responsible for. It would look like this:

Thread pool after the worker threads start multiplexing

Then when the parent thread accepts a new connection, it acts as a load balancer, perhaps using the power of two random choices, and tells a worker thread to handle the connection.

Now, if each connection is completely stateless, this is all that needs to be done! Each connection file descriptor can be passed off to a worker thread, and that thread can multiplex on it, then close it when it is done.

But let’s say that the parent thread actually has to allocate some data for each connection, and when a connection ends, that data needs to be added to a database or log or something.

Yes, the worker thread could probably do it, but this is an example to show how DRSC can be used for anything. So let’s assume the worst.

When the parent thread sends a connection to a worker, it establishes a dynamic DAG edge:

Thread pool after the parent thread creates a dynamic DAG edge

It can do this for each connection, in any order. In other words, there can be multiple DAG edges:

Thread pool after the parent thread creates multiple dynamic DAG edges

And the edges can be created and destroyed in any order.

This is the dynamism we are looking for, and it’s what makes DRSC able to implement any pattern.

At that point, the only thing holding us back from closing in on C10M is purely hardware!

If the parent thread can handle any event in a microsecond, it can handle 1,000,000 events per second: say, 500,000 new connections and 500,000 events from worker threads each second.

Then you just need 100 worker threads (say on a 128 core machine), each handling 10,000 connections (C10K), and while we haven’t reached C10M, we do have a million, with a load balancer, and that’s good enough.

By the way, if you think the multiplexer I mentioned is theoretical, it’s not; I implemented it, and it works for multiplexing on child threads and signals.
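
And here is a drastically simplified Rust simulation of the thread pool’s shape: no sockets, no multiplexer, no dynamic DAG edges, and round-robin instead of the power of two random choices, just enough to show that the workers and their connections never leak past the parent’s scope:

```rust
use std::sync::mpsc;
use std::thread;

// A stand-in for a connection; a real server would wrap a socket and
// multiplex (epoll/kqueue) instead of blocking on a channel.
struct Conn(u32);

fn main() {
    const WORKERS: usize = 4;

    thread::scope(|s| {
        // Parent frame: one channel per worker stands in for the edges.
        let mut senders = Vec::with_capacity(WORKERS);

        for id in 0..WORKERS {
            let (tx, rx) = mpsc::channel::<Conn>();
            senders.push(tx);

            // Worker frame: handles every connection it is handed.
            s.spawn(move || {
                for conn in rx {
                    println!("worker {id} handling connection {}", conn.0);
                }
            });
        }

        // The parent accepts "connections" and load-balances them.
        for n in 0..20u32 {
            senders[n as usize % WORKERS]
                .send(Conn(n))
                .expect("worker is alive");
        }

        // Dropping the senders closes the channels; the workers drain their
        // queues, exit, and are joined before the scope returns.
        drop(senders);
    });
}
```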

DRSC Is a Pain

“Ugh, Gavin! That seems like too much work!”

Yeah, I get it. If you have to implement all of that yourself, it’s too much.

However, because of the subprogram assumption, both patterns can be their own subprograms. In fact, any pattern could totally be packaged into APIs! You would just give them your callbacks, and they would handle the rest.

This is a key point: yes, DRSC is a pain, but that pain gives you great abstractions with minimal leaks. Only one crazy (me!) has to implement each pattern; after that, you can just enjoy using them, including with proper thread cancellation!

Of course, that’s not the only pain; Callback Hell is back, and that can only be alleviated with syntax sugar, not eliminated.

Also, your code would need to be broken into whole functions, not combined in one function with awaits. I’m used to that, but I understand that most are not.

Is It Worth It?

If DRSC were properly implemented, would it be worth it?

I don’t know.

I certainly think so, and I think the industry would be better off with it, but people have their preferences.

We could have had fast and memory-safe languages before Rust came along, but it was not until Rust that someone put in the engineering effort to make it work.

The above statement is from a blog post I once saw and lost. Can someone point me to the original source?

Until Rust, people just preferred C and C++. I’m sure that people will prefer Rust for a while, but preferences can still change.

So the only way to answer whether it will be worth it is to make it and let everyone else decide.

Conclusion

If I have convinced any of the Rust 1.0 team of the value of DRSC, they are depressed right now, and that’s because of some Rusty irony.

You see, Leakpocalypse was caused by a design bug found by the use of what Rust called thread::scoped(). This was, in essence, an early form of structured concurrency! And they dropped it for the easy solution!

If they had done the hard thing and delayed Rust 1.0 to truly fix the problem, Rust might have become the Über-language!

But again, Rust’s failure is my opportunity. I hope I can seize it.


  1. There might be someone before, but I haven’t been able to find one. ↩︎