Please see the disclaimer.
So I am going to give it a try.
I first tried to be faithful to the design in my head, where the metadata about the pipe is stored along with the pipe’s data itself.
Doing this yielded this code.
There are a few things to note about this version.
First, I added a way to specifically test both kernel pipes and shared memory, with and without CPU affinity.
Second, I also added a mutex to the shared memory version, to be fairer to the kernel code.
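The code itself is behind the link, but the rough shape of this first version is something like the sketch below. (The struct name, field names, and sizes are my own illustration, not necessarily what the benchmark uses.)

```c
#include <pthread.h>
#include <stddef.h>

/* Rough sketch of the first layout: the pipe's metadata lives in the same
 * shared page as the data it carries. All names here are illustrative. */
#define SHPIPE_BUF_SIZE 4096

typedef struct shpipe {
	pthread_mutex_t lock; /* process-shared mutex guarding the counters */
	size_t nwritten;      /* total bytes the writer has produced */
	size_t nread;         /* total bytes the reader has consumed */
	unsigned char buf[SHPIPE_BUF_SIZE]; /* the data, in the same mapping */
} shpipe;

/* The mutex must be marked process-shared, since the reader and writer are
 * separate processes sharing this memory. */
static int shpipe_init(shpipe *p) {
	pthread_mutexattr_t attr;
	if (pthread_mutexattr_init(&attr)) return -1;
	if (pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED)) return -1;
	if (pthread_mutex_init(&p->lock, &attr)) return -1;
	p->nwritten = 0;
	p->nread = 0;
	return 0;
}
```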
The result (with CFLAGS="-O3 -DNDEBUG" and using clang) is subpar:
SHMEM:           0.000000198016364374
KERNEL:          0.000000378083386707
AFFINITY SHMEM:  0.000000361057426823
AFFINITY KERNEL: 0.000000724044889828
Why does affinity make things worse for both of them? Yikes!
Alright, it looks like we can’t store the metadata in the same page of memory as the data. Let’s try keeping them separate, using this code.
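Concretely, “not in the same page” just means giving the metadata and the data buffer their own shared mappings, roughly like this sketch (illustrative names; it assumes the backing file is already at least two pages long):

```c
#include <sys/mman.h>
#include <unistd.h>

/* Illustrative sketch: map the counters and the data buffer separately, so
 * they no longer share a page. Assumes fd has already been ftruncate()d to
 * at least two pages. */
struct shpipe_split {
	void *meta; /* the mutex and counters live here */
	void *data; /* the byte buffer lives here, one page later */
};

static int shpipe_split_map(struct shpipe_split *p, int fd) {
	long page = sysconf(_SC_PAGESIZE);
	p->meta = mmap(NULL, (size_t) page, PROT_READ | PROT_WRITE,
	               MAP_SHARED, fd, 0);
	p->data = mmap(NULL, (size_t) page, PROT_READ | PROT_WRITE,
	               MAP_SHARED, fd, page);
	return (p->meta == MAP_FAILED || p->data == MAP_FAILED) ? -1 : 0;
}
```

The results with that change: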
SHMEM:           0.000000211047351682
KERNEL:          0.000000378078009437
AFFINITY SHMEM:  0.000000384025538680
AFFINITY KERNEL: 0.000000757074890688
Oh, it’s worse. Let’s revert that and try again.
Maybe it will help if we have two separate mutexes, one for the amount read and one for the amount sent. That gives us this code and these results (a rough sketch of the layout follows the results):
SHMEM:           0.000000146007962696
KERNEL:          0.000000371084888967
AFFINITY SHMEM:  0.000000358001039768
AFFINITY KERNEL: 0.000000742095705799
Better, but it is still not good enough.
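For reference, the two-mutex layout I mean is roughly this (illustrative names again):

```c
#include <pthread.h>
#include <stddef.h>

/* Illustrative layout for the two-mutex attempt: each counter gets its own
 * process-shared lock, so the reader and writer no longer contend on one. */
typedef struct shpipe2 {
	pthread_mutex_t write_lock; /* guards nwritten; taken around writes */
	size_t nwritten;            /* total bytes sent by the writer */
	pthread_mutex_t read_lock;  /* guards nread; taken around reads */
	size_t nread;               /* total bytes consumed by the reader */
	unsigned char buf[4096];    /* the data itself */
} shpipe2;
```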
One thing that we must remember is that this will be implemented in hardware. Hardware doesn’t use mutexes; it uses atomics. Let’s try that.
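A minimal sketch of what I mean, using C11 atomics in place of the two mutexes (the defaults here are sequentially consistent, which matters in a moment):

```c
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative sketch: the two counters become C11 atomics, and the mutexes
 * go away. atomic_load()/atomic_store() default to seq_cst ordering. */
typedef struct shpipe_atomic {
	_Atomic size_t nwritten; /* only the writer ever stores to this */
	_Atomic size_t nread;    /* only the reader ever stores to this */
	unsigned char buf[4096];
} shpipe_atomic;

/* Writer side: after filling the buffer, publish the new write count. */
static void writer_publish(shpipe_atomic *p, size_t n) {
	atomic_store(&p->nwritten, atomic_load(&p->nwritten) + n);
}

/* Reader side: how many bytes the writer is ahead of us. */
static size_t reader_available(shpipe_atomic *p) {
	return atomic_load(&p->nwritten) - atomic_load(&p->nread);
}
```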
That gives us this code and these results:
SHMEM:           0.000000053080943571
KERNEL:          0.000000381063954475
AFFINITY SHMEM:  0.000000105049110971
AFFINITY KERNEL: 0.000000722012004386
That’s starting to look better. But I think we can do even better.
I used C11’s atomic_load(), and the documentation says that it uses the sequentially consistent memory ordering by default. That may be too strict for our needs. Because each thread only reads from the value that the other writes to, each value is only ever written to from one thread. That means that we may be able to get away with the relaxed memory ordering.
I am not a concurrency expert, so I may be completely wrong here.
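Concretely, the change amounts to switching to the explicit variants of the load and store, roughly like this (a sketch; the accessors and their names are illustrative):

```c
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative accessors using relaxed ordering. Each counter is only ever
 * stored to by one side, which is the argument for relaxing the ordering.
 * Note that relaxed ordering says nothing about how the counter updates are
 * ordered against the buffer writes themselves, which is part of why I am
 * not sure this is actually correct. */
static size_t get_val(_Atomic size_t *v) {
	return atomic_load_explicit(v, memory_order_relaxed);
}

static void set_val(_Atomic size_t *v, size_t x) {
	atomic_store_explicit(v, x, memory_order_relaxed);
}
```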
Using the relaxed memory ordering with this code, we get the following:
SHMEM:           0.000000017011882865
KERNEL:          0.000000388088790698
AFFINITY SHMEM:  0.000000030054539958
AFFINITY KERNEL: 0.000000753015660365
This is starting to look like what we had before. Let’s go ahead and make sure the accessors, including set_val(), are inlined with this code and get these results:
SHMEM:           0.000000015078903156
KERNEL:          0.000000377010279314
AFFINITY SHMEM:  0.000000035017057365
AFFINITY KERNEL: 0.000000755089473185
A little better, but not much (as expected).
Let’s try splitting the metadata from the data once again. That results in this code and these results:
SHMEM:           0.000000008056027505
KERNEL:          0.000000383023109761
AFFINITY SHMEM:  0.000000017025504902
AFFINITY KERNEL: 0.000000737008527431
Even faster. In fact, the code is transferring one size_t per iteration, which is 8 bytes on my x86_64 machine, and each iteration takes about 8 nanoseconds. This means that, on average, hardware pipes could transfer roughly one byte per nanosecond. That is truly fast!
But I should probably fix a small problem with this code. The results:
SHMEM:           0.000000023040892243
KERNEL:          0.000000392021246317
AFFINITY SHMEM:  0.000000009036547987
AFFINITY KERNEL: 0.000000383008791012
Ah, that was the reason affinity was weird…
But there is one more hangup: signals. The code I have used does not implement signalling between the processes. Let’s test how fast signals are by themselves first.
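The test boils down to a signal ping-pong between the two processes, roughly like this sketch (one side runs ping(); the other runs the mirror image, waiting first and then signalling back):

```c
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Illustrative signal ping-pong: send SIGUSR1 to the peer, then block until
 * the peer sends it back. The average round-trip time is what gets measured.
 * SIGUSR1 must be blocked so that sigwait() can receive it synchronously. */
static void ping(pid_t peer, int iters) {
	sigset_t set;
	sigemptyset(&set);
	sigaddset(&set, SIGUSR1);
	sigprocmask(SIG_BLOCK, &set, NULL);

	for (int i = 0; i < iters; ++i) {
		int sig;
		kill(peer, SIGUSR1); /* poke the other process */
		sigwait(&set, &sig); /* wait for it to poke us back */
	}
}

/* The peer runs the same loop with the two calls swapped: sigwait() first,
 * then kill(). */
```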
With this code and running sudo make run_signals, we get:
SIGNALS:          0.000001706003786423
AFFINITY SIGNALS: 0.000001677064512179
That is terrible. We need to find a better method of signalling.
How about Linux futexes?
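There is no libc wrapper for futex(2), so the experiment needs a small shim around the raw syscall. The shim itself is straightforward (sketch below); the protocol built on top of it is where things went wrong for me.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Minimal futex shim: there is no libc wrapper, so call the syscall directly.
 * The futex word must live in memory shared by both processes, so we must
 * not use FUTEX_PRIVATE_FLAG here. */
static long futex_wait(uint32_t *addr, uint32_t expected) {
	/* Sleeps only if *addr still equals expected when the kernel checks. */
	return syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

static long futex_wake(uint32_t *addr, int nwaiters) {
	/* Wakes up to nwaiters waiters blocked on this futex word. */
	return syscall(SYS_futex, addr, FUTEX_WAKE, nwaiters, NULL, NULL, 0);
}
```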
I am terrible at concurrency: my first try, with this code, failed spectacularly. Not only did it not work (it failed the consistency check I have in place to catch race conditions), it also took over a minute to run the benchmark, which is unacceptable performance.
Let’s try again.
Samuel Holland, an acquaintance on IRC, had a clever idea that I want to try.
It goes something like this: instead of storing the counts separately, or at the beginning of the buffer, I could store them inline. The first bytes would be the write count, which the reader would wait on with a futex. The next bytes, however many are needed, would be the data. Once the data is written, the writer would update the count the reader is waiting on. Then it would wait on the spot just past the last items it wrote into the array. When the reader finishes, it writes its read count into that spot, and so they continue.
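My (quite possibly wrong) understanding of the layout is something like the following sketch; the cell struct is purely illustrative.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative picture of the inline-count idea: the counts live in the
 * shared buffer itself, each one just ahead of the data it describes. */
typedef struct cell {
	_Atomic uint32_t count; /* futex word: one side publishes, the other waits */
	unsigned char data[];   /* the payload this count covers */
} cell;

/* Writer: fill data, store the new write count, futex_wake() the reader,
 * then futex_wait() on the cell just past the data it wrote. Reader: wake,
 * consume the data, write its read count into that next cell, and
 * futex_wake() the writer. The two sides leapfrog through the buffer. */
```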
Unfortunately, I failed again, with this code. It failed the benchmark, as well as the consistency check. Oh well.
But without the hardware to enforce security, what good does this all do? Well, the fact is that by separating pipes into their own pages, the OS can use the MMU to enforce security. This does mean that we cannot put data for several pipes into one page because information will leak, but it does mean that hardware pipes are implementable on existing hardware.
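Concretely, “their own pages” just means rounding each pipe’s backing object up to page granularity and mapping it only into the two processes that should see it, something like this sketch (names are illustrative):

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Illustrative: back each pipe with its own page-granular shared object, so
 * the MMU (via the page tables) is what keeps every other process out. */
static void *pipe_pages_create(const char *name, size_t size) {
	long page = sysconf(_SC_PAGESIZE);
	size_t len = ((size + (size_t) page - 1) / (size_t) page) * (size_t) page;

	int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
	if (fd < 0) return NULL;
	if (ftruncate(fd, (off_t) len) != 0) { close(fd); return NULL; }

	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd); /* the mapping keeps the object alive */
	return p == MAP_FAILED ? NULL : p;
}
```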
Anyway, I learned a lot, like the fact that I fail at concurrency.