tl;dr: Link-time optimization (LTO) beats a single-file amalgamation. It produces faster code, and if you have more than one core, it also builds faster. When you only have one core, it is not significantly slower to build, either.

I recently had a conversation on IRC with an acquaintance about a blog post he wrote concerning SQLite and the decision by SQLite’s authors to put basically all of the C code into one file. The SQLite authors call it the amalgamation.

My acquaintance is the maintainer of a small Linux distro, so compile times are important to him, and the amalgamation of SQLite increased its build time considerably.

We debated whether link-time optimization (LTO) was better until I decided that measuring was the better idea.

I had two things I wanted to measure:

  1. Compile time
  2. Runtime of the generated code

Fortunately, I have a small C project handy that would make a useful test subject, especially since it is sensitive to optimization, including LTO.

So I started. First, I decided that I would run each test or benchmark (the commands with time -p in them) four times: once to make sure Linux cached relevant files, and three more times to actually measure.

Second, using bc version 2.1.3, I created, by hand, a single amalgamated bc.c by carefully copying all text from each header and C source file, removing all license headers (but one), and removing redundant #include statements. Then I created a Makefile for it by modifying a Makefile generated by configure.sh.
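
This was all done manually, but the spirit of the operation is roughly the following. It is only a sketch: it assumes the project’s usual include/ and src/ layout, and the real file needed the headers in dependency order plus manual cleanup of license headers and duplicate #include lines.

```sh
# Concatenate every header, then every C source file, into one big
# translation unit. The real amalgamation was assembled and cleaned by hand.
{
    cat include/*.h
    cat src/*.c
} > bc.c
```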

Then (after running configure.sh with the same options for the normal bc), I ran the following command:

```sh
time -p make > /dev/null
```

To make the single-file bc work, I also had to copy link.sh from the source tree into the folder where the bc.c and Makefile files were. On top of that, I ran make clean between each invocation.
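
Putting the procedure together, one measurement of the single-file build looked roughly like this. It is a sketch: the path to the source tree is a placeholder, and the first, untimed build is the cache warm-up mentioned earlier.

```sh
cp ../bc-2.1.3/link.sh .        # placeholder path to the original source tree

make clean && make > /dev/null  # untimed warm-up so Linux caches the files

for run in 1 2 3; do            # three measured builds
    make clean
    time -p make > /dev/null
done
```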

The three times for a single-file bc:

real 3.51
user 3.40
sys 0.07

real 3.45
user 3.37
sys 0.05

real 3.44
user 3.39
sys 0.03

The three times for an LTO-optimized, many-file bc:

real 3.62
user 3.34
sys 0.24

real 3.62
user 3.33
sys 0.25

real 3.66
user 3.37
sys 0.24

It is obvious that a single file has the advantage in compile time, but what about runtime?

I ran the following command on both bc’s (where the test scripts were the ones in the source tree):

```sh
echo "halt" | \
time -p bin/bc -lq tests/bc/scripts/add.bc      \
                   tests/bc/scripts/subtract.bc \
                   tests/bc/scripts/multiply.bc \
                   tests/bc/scripts/divide.bc   \
                   > /dev/null
```

The three times for a single-file bc:

real 4.84
user 4.81
sys 0.01

real 4.80
user 4.78
sys 0.01

real 4.79
user 4.78
sys 0.00

The three times for an LTO-optimized, many-file bc:

real 4.50
user 4.48
sys 0.01

real 4.28
user 4.27
sys 0.01

real 4.32
user 4.31
sys 0.00

In a surprising twist, LTO is better!

What I thought was happening is that, since the optimizer runs later in the process with LTO, it could optimize better for the particular machine that I ran the benchmarks on.

Let’s remove the -march=native flag to see if that makes a difference in the runtime. To do this, I removed -march=native from the single-file Makefile and re-ran ./configure.sh without that flag for the normal bc.
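
Concretely, that amounted to something like the following sketch. It assumes that -march=native appears literally in the single-file Makefile and that extra flags are handed to configure.sh through CFLAGS; the -O3 stands in for whatever other flags the normal build was configured with.

```sh
# Single-file bc: strip the flag out of the hand-written Makefile.
sed -i 's/ -march=native//g' Makefile

# Many-file bc: re-run configure.sh with the same options as before, minus
# -march=native (assumption: extra flags are passed in through CFLAGS).
CFLAGS="-O3" ./configure.sh
```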

The three compile times for single-file bc without -march=native:

real 3.34
user 3.28
sys 0.04

real 3.40
user 3.32
sys 0.05

real 3.40
user 3.32
sys 0.05

The three compile times for many-file bc without -march=native:

real 3.57
user 3.29
sys 0.24

real 3.54
user 3.22
sys 0.28

real 3.52
user 3.25
sys 0.22

The many-file bc’s compile time improved a little.

The three runtimes for the single-file bc:

real 4.73
user 4.73
sys 0.00

real 4.75
user 4.75
sys 0.00

real 4.72
user 4.70
sys 0.01

In another astonishing twist, the single-file bc runs somewhat faster when not using -march=native!

And the three runtimes for the many-file bc:

real 4.16
user 4.14
sys 0.01

real 4.24
user 4.21
sys 0.01

real 4.18
user 4.16
sys 0.01

Once again, compiling without -march=native made bc faster! But more importantly, LTO is still faster. Why? That is still a mystery to me.

However, these compile tests are not fair.

You see, the normal bc is at a disadvantage when it comes to building because not only does it need to build all of the source files, it also needs to generate four files:

  1. The bc help
  2. The dc help
  3. The bc library
  4. The extended bc library

Each of these is non-trivial to generate, especially since bc also has to compile and run a small C program to do the generating. That means that the many-file bc has a disadvantage of at least five extra steps.

So to make things more fair, let’s try compiling again, this time with -march=native and once again with make clean in between.

The difference will be that after make clean for the normal, many-file bc, I will also run, without timing them, the following commands:

```sh
make gen/strgen
make gen/bc_help.c
make gen/dc_help.c
make gen/lib.c
make gen/lib2.c
```

These commands will generate the C code that will then be fed to the rest of the build.
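
In other words, the measured part of the many-file build now looks roughly like this sketch, using exactly the targets listed above:

```sh
make clean

# Untimed: run the five generation steps listed above (this also builds
# and runs the small generator program, gen/strgen).
make gen/strgen gen/bc_help.c gen/dc_help.c gen/lib.c gen/lib2.c

# Timed: only the remaining compilation and the LTO link get measured.
time -p make > /dev/null
```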

The three times for the single-file bc:

real 3.46
user 3.41
sys 0.03

real 3.51
user 3.43
sys 0.05

real 3.48
user 3.39
sys 0.07

The three times for the many-file bc:

real 3.61
user 3.36
sys 0.22

real 3.54
user 3.28
sys 0.24

real 3.54
user 3.27
sys 0.24

That is almost the same. In fact, it’s close enough to say that using multiple files with LTO is not significantly slower than using a single file. And since LTO produces better code, it should probably be done every time.

The sole exception that I can think of would be for libraries that are distributed as a single header.
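
For reference, enabling LTO with GCC (Clang behaves the same way) is just a matter of passing -flto at both compile time and link time. Here is a minimal sketch with made-up file names:

```sh
# -flto at compile time makes the compiler store its intermediate
# representation in the object files; -flto at link time lets it
# optimize across all of them at once.
gcc -O3 -flto -c foo.c -o foo.o
gcc -O3 -flto -c bar.c -o bar.o
gcc -O3 -flto foo.o bar.o -o prog
```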

But there is still one angle that we haven’t considered: building with multiple cores. And it is only fair to do so, since multi-core builds are now the norm.

Let’s compile once again, this time with make -j2 (and with -march=native) and once again with make clean in between. Also, this time I will not run the extra make commands for the normal bc.
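
Each measured build is the same as before, just with the job count added (a sketch; the later four-core runs use -j4 in exactly the same way):

```sh
make clean
time -p make -j2 > /dev/null
```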

The three times for single-file bc:

real 3.47
user 3.39
sys 0.04

real 3.51
user 3.43
sys 0.06

real 3.54
user 3.42
sys 0.06

The three times for many-file bc:

real 2.77
user 3.41
sys 0.25

real 2.72
user 3.34
sys 0.24

real 2.76
user 3.41
sys 0.25

The normal, many-file bc suddenly comes out on top, in a big way. It seems that if you have more than one core, LTO is faster.

So not only does LTO produce better quality code for runtime, it is faster at compile time if you use more than one core.

But I think the discrepancy can get even bigger. Let’s try four cores.

Once again, each command is run with make clean between, but this time with make -j4.

The three times for single-file bc:

real 3.47
user 3.40
sys 0.05

real 3.44
user 3.39
sys 0.03

real 3.49
user 3.43
sys 0.03

The three times for many-file bc:

real 2.57
user 3.61
sys 0.35

real 2.56
user 3.49
sys 0.31

real 2.57
user 3.61
sys 0.31

Eh, it’s only a slight improvement. That doesn’t change the result, though.