I was on IRC recently, and I had a bunch of people encourage me to continue testing the feasibility of hardware pipes, an idea I talked about in my “Computing Is Broken” post.
So I am going to.
I first tried to be more faithful to the design in my head. In my head, the data about the pipe was stored along with the pipe itself.
Doing this yielded this code.
There are a few things to note about this version.
First, I added a way to specifically test both the kernel and shared memory versions with and without CPU affinity.
Second, I also added a mutex to the shared memory version to be more fair to the kernel code.
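To make that concrete, here is a rough sketch (with my own names and sizes, not the actual benchmark code) of the shape of this version: the pipe’s counters live in the same shared mapping as its buffer, a process-shared mutex guards them, and the affinity runs pin each process to a CPU.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

#define PIPE_BUF_SIZE 4096 /* my placeholder size, not the benchmark's */

struct hwpipe {
    pthread_mutex_t lock;              /* guards the two counters below */
    size_t written;                    /* bytes produced by the writer  */
    size_t read;                       /* bytes consumed by the reader  */
    unsigned char data[PIPE_BUF_SIZE]; /* the pipe's buffer, same page  */
};

/* The struct lives in shared memory, so the mutex must be process-shared. */
static void hwpipe_init(struct hwpipe *p) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    pthread_mutex_init(&p->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    p->written = p->read = 0;
}

/* For the affinity runs, pin the calling process to a single CPU. */
static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);
}
```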
The result (with CFLAGS="-O3 -DNDEBUG" and using clang) is subpar:
SHMEM: 0.000000198016364374
KERNEL: 0.000000378083386707
AFFINITY SHMEM: 0.000000361057426823
AFFINITY KERNEL: 0.000000724044889828
Why does affinity make things worse for both of them? Yikes!
Alright, it looks like we can’t store the information in the same page of memory. Let’s try it without, using this code.
SHMEM: 0.000000211047351682
KERNEL: 0.000000378078009437
AFFINITY SHMEM: 0.000000384025538680
AFFINITY KERNEL: 0.000000757074890688
Oh, it’s worse. Let’s revert that and try again.
Maybe it would help to have two separate mutexes, one for the amount read and one for the amount sent. That gives us this code and these results:
SHMEM: 0.000000146007962696
KERNEL: 0.000000371084888967
AFFINITY SHMEM: 0.000000358001039768
AFFINITY KERNEL: 0.000000742095705799
Better, but it is still not good enough.
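For reference, the two-mutex layout just described might look roughly like this (my names again; both mutexes would still be process-shared):

```c
#include <pthread.h>
#include <stddef.h>

struct hwpipe2 {
    pthread_mutex_t read_lock;  /* taken by the reader; guards `read`    */
    pthread_mutex_t write_lock; /* taken by the writer; guards `written` */
    size_t read;                /* bytes consumed so far                 */
    size_t written;             /* bytes produced so far                 */
    unsigned char data[4096];   /* the pipe's buffer                     */
};
```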
One thing that we must remember is that this will be implemented in hardware. Hardware doesn’t use mutexes; it uses atomics. Let’s try that.
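Concretely, the change is along these lines (again a sketch, not the linked code): the counters become C11 atomics and the mutexes disappear.

```c
#include <stdatomic.h>
#include <stddef.h>

struct hwpipe_atomic {
    atomic_size_t written;    /* only ever stored by the writer */
    atomic_size_t read;       /* only ever stored by the reader */
    unsigned char data[4096]; /* the pipe's buffer              */
};

/* Writer side: spin until there is room, copy the payload in, then publish it
 * by bumping `written`. atomic_load() and atomic_store() default to the
 * sequentially consistent ordering. The reader side is the mirror image. */
static void publish(struct hwpipe_atomic *p, size_t n) {
    while (atomic_load(&p->written) - atomic_load(&p->read) + n > sizeof(p->data))
        ; /* wait for the reader to catch up */
    /* ...memcpy the n bytes into p->data here... */
    atomic_store(&p->written, atomic_load(&p->written) + n);
}
```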
That gives us this code and these results:
SHMEM: 0.000000053080943571
KERNEL: 0.000000381063954475
AFFINITY SHMEM: 0.000000105049110971
AFFINITY KERNEL: 0.000000722012004386
That’s starting to look better. But I think we can do even better.
I used C11’s atomic_store() and atomic_load(), and the documentation says that they use the sequentially consistent memory ordering. That may be too strict for our needs. Because each thread only reads the value that the other thread writes, each value is only ever written from one thread. That means we may be able to get away with the relaxed memory ordering.
I am not a concurrency expert, so I may be completely wrong here.
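A sketch of what I mean, using C11’s explicit-ordering functions (my guess at the accessors’ shape, and with the caveat above repeated: the textbook-safe pairing for publishing data from one thread to another is a release store matched with an acquire load):

```c
#include <stdatomic.h>
#include <stddef.h>

static size_t get_val(atomic_size_t *v) {
    return atomic_load_explicit(v, memory_order_relaxed);
}

static void set_val(atomic_size_t *v, size_t x) {
    atomic_store_explicit(v, x, memory_order_relaxed);
}
```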
Using the relaxed memory ordering with this code, we get the following:
SHMEM: 0.000000017011882865
KERNEL: 0.000000388088790698
AFFINITY SHMEM: 0.000000030054539958
AFFINITY KERNEL: 0.000000753015660365
This is starting to look like what we had before. Let’s go ahead and make sure that get_val() and set_val() are inlined with this code and get these results (be sure to use gcc or clang):
SHMEM: 0.000000015078903156
KERNEL: 0.000000377010279314
AFFINITY SHMEM: 0.000000035017057365
AFFINITY KERNEL: 0.000000755089473185
A little better, but not much (as expected).
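For reference, the only change here is roughly the following: marking the accessors static inline (and keeping them in a header) so that gcc and clang can see their bodies at every call site.

```c
#include <stdatomic.h>
#include <stddef.h>

static inline size_t get_val(atomic_size_t *v) {
    return atomic_load_explicit(v, memory_order_relaxed);
}

static inline void set_val(atomic_size_t *v, size_t x) {
    atomic_store_explicit(v, x, memory_order_relaxed);
}
```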
Let’s try once again with splitting the metadata from the data. That results in this code and these results:
SHMEM: 0.000000008056027505
KERNEL: 0.000000383023109761
AFFINITY SHMEM: 0.000000017025504902
AFFINITY KERNEL: 0.000000737008527431
Even faster. In fact, the code is transferring one size_t per iteration, which is 8 bytes on my x86_64 machine, and each iteration now takes roughly 8 nanoseconds. This means that, on average, hardware pipes could transfer one byte per nanosecond. That is truly fast!
But I should probably fix a small problem with this code. The results:
SHMEM: 0.000000023040892243
KERNEL: 0.000000392021246317
AFFINITY SHMEM: 0.000000009036547987
AFFINITY KERNEL: 0.000000383008791012
Ah, that was the reason affinity was weird…
But there is one more hangup: signals. The code I have used does not implement signalling between the processes. Let’s test how fast signals are by themselves first.
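To show the kind of thing being timed, here is a minimal signal ping-pong between a parent and child process (a sketch with my own structure, not the linked benchmark):

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    /* Block SIGUSR1 so both processes can receive it synchronously with
     * sigwait(); the blocked mask is inherited across fork(). */
    sigprocmask(SIG_BLOCK, &set, NULL);

    pid_t child = fork();
    if (child == 0) {
        /* Child: wait for a ping, answer with a pong. */
        for (int i = 0; i < 100000; i++) {
            int sig;
            sigwait(&set, &sig);
            kill(getppid(), SIGUSR1);
        }
        return 0;
    }

    /* Parent: send a ping, wait for the pong. The timing would wrap this
     * loop and divide by the iteration count. */
    for (int i = 0; i < 100000; i++) {
        int sig;
        kill(child, SIGUSR1);
        sigwait(&set, &sig);
    }
    waitpid(child, NULL, 0);
    return 0;
}
```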
With this code and running sudo make run_signals, we get:
SIGNALS: 0.000001706003786423
AFFINITY SIGNALS: 0.000001677064512179
That is terrible. We need to find a better method of signalling.
How about Linux futexes?
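For context, a futex is little more than a kernel wait queue keyed on the address of a 32-bit integer, and glibc offers no wrapper, so you call it directly. A minimal sketch:

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sleep until a wake arrives, but only if *addr still equals `expected`;
 * otherwise return immediately. */
static long futex_wait(uint32_t *addr, uint32_t expected) {
    return syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/* Wake up to `n` processes sleeping on addr. */
static long futex_wake(uint32_t *addr, int n) {
    return syscall(SYS_futex, addr, FUTEX_WAKE, n, NULL, NULL, 0);
}
```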
I am terrible at concurrency, because my first try, with this code, failed spectacularly. Not only did it not work (it failed the consistency check I have in place to catch race conditions), it also took over a minute to run the benchmark, which translates to unacceptable performance.
Let’s try again.
Samuel Holland, an acquaintance on IRC, had a clever idea that I want to try.
It goes something like this: instead of storing the counts separately, or at the beginning of the buffer, I could store them inline. So the first bytes would be the write count, which the reader would wait on with a futex. The next bytes, however many are needed, would be the data. Once the data is written, the writer would update the count the reader is waiting on. Then it would wait on the spot just past the last item it wrote into the array. When the reader finishes, it writes its read count into that spot, and so they continue.
Unfortunately, I failed again, with this code. It failed the benchmark, as well as the consistency check. Oh well.
But without the hardware to enforce security, what good does all this do? Well, the fact is that by separating pipes into their own pages, the OS can use the MMU to enforce security. This does mean that we cannot put the data for several pipes into one page, because information would leak between them, but it also means that hardware pipes are implementable on existing hardware.
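In other words, the allocation side would look something like this sketch (names are mine): every pipe gets its own whole page or pages from mmap(), so the kernel can map them into exactly the processes that own the pipe.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static void *alloc_pipe_pages(size_t want) {
    long page = sysconf(_SC_PAGESIZE);
    /* Round up to a whole number of pages so a pipe never shares a page
     * (and therefore never shares MMU protection) with another pipe. */
    size_t len = ((want + (size_t)page - 1) / (size_t)page) * (size_t)page;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```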
Anyway, I learned a lot, like the fact that I fail at concurrency.