tl;dr: Link-time optimization is better. It produces faster code, and if you have more than one core, it also compiles faster. Even when you only have one core, it is not significantly slower.
I had a conversation with an acquaintance on IRC recently about a blog post he wrote concerning SQLite and its authors’ decision to put basically all of the C code into one file. The SQLite authors call this the amalgamation.
My acquaintance is the maintainer of a small Linux distro, so compile times are important to him, and the amalgamation of SQLite increased its build time considerably.
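For those unfamiliar with the term: an amalgamation is, conceptually, the entire project concatenated into a single translation unit. A naive sketch of the idea, with made-up paths (the real SQLite script, like my by-hand version below, also deduplicates `#include`s and license headers):

```sh
# Naive amalgamation: glue all headers and sources into one file and
# compile that single translation unit. (A real amalgamation also
# strips duplicate #includes and license headers.)
cat src/*.h src/*.c > amalgamation.c
cc -O2 -c amalgamation.c -o amalgamation.o
```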
We debated about whether link-time optimization (LTO) was better or not until I decided that measuring was the better idea.
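For context, LTO takes the opposite approach: the source files stay separate, but most optimization is deferred to link time, so the compiler can still inline and optimize across file boundaries. With GCC or Clang, it boils down to passing `-flto` at both compile and link time:

```sh
# LTO with GCC or Clang: -flto must appear in both the compile and the
# link steps. The object files carry compiler IR, and cross-file
# optimization happens when they are linked together.
cc -flto -O2 -c a.c
cc -flto -O2 -c b.c
cc -flto -O2 a.o b.o -o prog
```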
I had two things I wanted to measure:
- Compile time
- Runtime of the generated code
Fortunately, I have a small C project handy that would make a useful test subject, especially since it is sensitive to optimization, including LTO.
So I started. First, I decided that I would run each test or benchmark (the commands with `time -p` in them) four times: once to make sure Linux cached the relevant files, and three more times to actually measure.
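In other words, each measurement looked roughly like this sketch (not my exact commands):

```sh
# One untimed warm-up build to populate the file cache, then three
# measured builds, cleaning in between.
make clean && make > /dev/null
for i in 1 2 3; do
    make clean
    time -p make > /dev/null
done
```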
Second, using version 2.1.3 of `bc`, I created this file by hand, carefully copying all of the text from each header and C source file, removing all license headers (but one), and removing redundant `#include` statements. Then I created this file by modifying a `Makefile` generated by `configure.sh`.
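In essence, the single-file build reduces to a single compiler invocation over the amalgamated file, something like this (the actual flags come from the `configure.sh`-generated `Makefile`, and at this stage they include `-march=native`):

```sh
# The whole single-file build: one translation unit, one invocation.
# (The flags here are illustrative, not the exact set.)
cc -O3 -march=native -o bin/bc bc.c
```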
Then (after running `configure.sh` with the same options for the normal `bc`), I ran the following command:
```sh
time -p make > /dev/null
```
To make the single-file `bc` work, I also had to copy `link.sh` from the source tree into the folder where the `bc.c` and `Makefile` files were. On top of that, I ran `make clean` between each invocation.
The three times for a single-file `bc`:
```
real 3.51
user 3.40
sys 0.07

real 3.45
user 3.37
sys 0.05

real 3.44
user 3.39
sys 0.03
```
The three times for an LTO-optimized, many-file `bc`:
```
real 3.62
user 3.34
sys 0.24

real 3.62
user 3.33
sys 0.25

real 3.66
user 3.37
sys 0.24
```
It is obvious that a single file has the advantage in compile time, but what about runtime?
I ran the following command on both `bc`’s (where the test scripts were the ones in the source tree):
```sh
echo "halt" | \
time -p bin/bc -lq tests/bc/scripts/add.bc \
    tests/bc/scripts/subtract.bc \
    tests/bc/scripts/multiply.bc \
    tests/bc/scripts/divide.bc \
    > /dev/null
```
The three times for a single-file `bc`:
```
real 4.84
user 4.81
sys 0.01

real 4.80
user 4.78
sys 0.01

real 4.79
user 4.78
sys 0.00
```
The three times for an LTO-optimized, many-file `bc`:
```
real 4.50
user 4.48
sys 0.01

real 4.28
user 4.27
sys 0.01

real 4.32
user 4.31
sys 0.00
```
In a surprising twist, LTO is better!
What I thought was happening was that, since the optimizer runs later in the process, it could optimize better for the particular machine that I ran the benchmarks on.
Let’s remove the `-march=native` to see if that makes a difference in the runtime. To do this, I removed `-march=native` from the single-file `Makefile` and re-ran `./configure.sh` without that flag for the normal `bc`.
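(If you are curious what `-march=native` actually selects on a given machine, GCC can tell you:)

```sh
# Print the -march/-mtune values that -march=native resolves to on
# this machine (GCC; output format varies by version).
gcc -march=native -Q --help=target | grep -E 'march|mtune'
```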
The three compile times for the single-file `bc` without `-march=native`:
```
real 3.34
user 3.28
sys 0.04

real 3.40
user 3.32
sys 0.05

real 3.40
user 3.32
sys 0.05
```
The three compile times for the many-file `bc` without `-march=native`:
```
real 3.57
user 3.29
sys 0.24

real 3.54
user 3.22
sys 0.28

real 3.52
user 3.25
sys 0.22
```
The many-file `bc` improved a little.
The three runtimes of the single-file `bc`:
```
real 4.73
user 4.73
sys 0.00

real 4.75
user 4.75
sys 0.00

real 4.72
user 4.70
sys 0.01
```
In another astonishing twist, the single-file `bc` runs somewhat faster when not using `-march=native`!
And the three runtimes of the many-file `bc`:
```
real 4.16
user 4.14
sys 0.01

real 4.24
user 4.21
sys 0.01

real 4.18
user 4.16
sys 0.01
```
Once again, compiling without `-march=native` made `bc` faster! But more importantly, LTO is still faster. Why? That is still a mystery to me.
However, these compile tests are not fair.
You see, the normal `bc` is at a disadvantage when it comes to building: not only does it need to build all of the source files, it also needs to generate four files:
- The `bc` help
- The `dc` help
- The `bc` library
- The extended `bc` library
Each of these is non-trivial to generate, especially since `bc` also has to compile and run a small C program to do the generating (a sketch of such a generator is below). That means that the many-file `bc` has a disadvantage of at least five extra build steps.
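Here is a hypothetical sketch of what such a generator looks like; the real `gen/strgen.c` is more involved, but the idea is the same: read a text file and write a C source file that holds the text as a string constant, which then gets compiled into `bc`.

```c
/* Hypothetical sketch of a text-to-C generator in the spirit of
 * gen/strgen (the real one differs). It reads a text file and emits
 * a C file containing the text as a string constant. */
#include <stdio.h>

int main(int argc, char *argv[]) {

	FILE *in, *out;
	int c;

	if (argc != 4) {
		fprintf(stderr, "usage: %s input output name\n", argv[0]);
		return 1;
	}

	in = fopen(argv[1], "r");
	out = fopen(argv[2], "w");
	if (in == NULL || out == NULL) return 1;

	fprintf(out, "const char %s[] =\n\"", argv[3]);

	/* Escape the characters that cannot appear raw in a C string
	 * literal; emit everything else as-is. */
	while ((c = fgetc(in)) != EOF) {
		if (c == '\n') fprintf(out, "\\n\"\n\"");
		else if (c == '"') fprintf(out, "\\\"");
		else if (c == '\\') fprintf(out, "\\\\");
		else fputc(c, out);
	}

	fprintf(out, "\";\n");

	fclose(in);
	fclose(out);

	return 0;
}
```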
So to make things more fair, let’s try compiling again, this time with `-march=native` and once again with `make clean` in between. The difference will be that after `make clean` for the normal, many-file `bc`, I will also run, and skip timing, the following commands:
```sh
make gen/strgen
make gen/bc_help.c
make gen/dc_help.c
make gen/lib.c
make gen/lib2.c
```
These commands will generate the C code that will then be fed to the rest of the build.
The three times for the single-file `bc`:
```
real 3.46
user 3.41
sys 0.03

real 3.51
user 3.43
sys 0.05

real 3.48
user 3.39
sys 0.07
```
The three times for the many-file `bc`:
```
real 3.61
user 3.36
sys 0.22

real 3.54
user 3.28
sys 0.24

real 3.54
user 3.27
sys 0.24
```
That is almost the same. In fact, it’s close enough to say that using multiple files with LTO is not significantly slower than using a single file. And since LTO produces better code, it should probably be done every time.
The sole exception that I can think of would be for libraries that are distributed as a single header.
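Such libraries look roughly like this hypothetical example: every user includes the header, and exactly one translation unit defines an implementation macro to get the function bodies. Since the whole library already lives in the includer’s translation unit, LTO has nothing to add across files.

```c
/* foo.h -- a hypothetical single-header library. Every file that uses
 * it does #include "foo.h"; exactly one file also defines
 * FOO_IMPLEMENTATION first, which compiles in the function bodies. */
#ifndef FOO_H
#define FOO_H

int foo_add(int a, int b);

#ifdef FOO_IMPLEMENTATION
int foo_add(int a, int b) {
	return a + b;
}
#endif

#endif /* FOO_H */
```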
But there is still one angle that we haven’t considered: building with multiple cores. And it is only fair to do so, since multi-core builds are now the norm.
Let’s compile once again, this time with `make -j2` (and with `-march=native`) and once again with `make clean` in between. Also, this time I will not run the extra `make` commands for the normal `bc`.
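So each measured run was, roughly:

```sh
make clean
time -p make -j2 > /dev/null
```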
The three times for single-file `bc`:
```
real 3.47
user 3.39
sys 0.04

real 3.51
user 3.43
sys 0.06

real 3.54
user 3.42
sys 0.06
```
The three times for many-file `bc`:
```
real 2.77
user 3.41
sys 0.25

real 2.72
user 3.34
sys 0.24

real 2.76
user 3.41
sys 0.25
```
The normal, many-file `bc` suddenly comes out on top, in a big way. It seems that if you have more than one core, LTO is faster. That makes sense: the many translation units can compile in parallel, while the single file serializes the whole build onto one core.
So not only does LTO produce better quality code at runtime, it is also faster at compile time if you use more than one core.
But I think the discrepancy can get even bigger. Let’s try four cores.
Once again, each command is run with `make clean` in between, but this time with `make -j4`.
The three times for single-file `bc`:
```
real 3.47
user 3.40
sys 0.05

real 3.44
user 3.39
sys 0.03

real 3.49
user 3.43
sys 0.03
```
The three times for many-file `bc`:
```
real 2.57
user 3.61
sys 0.35

real 2.56
user 3.49
sys 0.31

real 2.57
user 3.61
sys 0.31
```
Eh, it’s only a slight improvement. That doesn’t change the result, though.
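If you want to apply all of this to your own C builds, the recipe is short. A sketch (assuming a build that honors `CFLAGS`/`LDFLAGS` from the environment; `nproc` is from Linux coreutils):

```sh
# Enable LTO in both compile and link flags, then build in parallel
# with one job per core.
CFLAGS="-O3 -flto" LDFLAGS="-flto" ./configure.sh
make -j"$(nproc)"
```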