This is a follow-up to my previous post. In that post, I conjectured that turning on LTO (Link Time Optimization) and raising the optimization level would increase the binary size. My thinking was that the optimizer would inline code more aggressively, since function calls add performance overhead. After speaking with a few individuals about this, I realized I might be wrong (which wouldn’t be the first time by a long shot 🙂 ).

Some may say this is common sense, but when coming from higher-level languages it’s not always obvious. Like many things, Rust gives you full control over your code and lets you pick which parts you care about. Do you care about raw speed, or binary size? In this post we’ll talk about how we can tune our binaries to fit our individual needs without changing the code itself!

So let’s test this!

As a quick aside: @RazrFalcon has been amazing, adding even more options to cargo-bloat! There are options to filter specific crates, split the stdlib, pass through unstable rustc flags, and more! They have also added a new column to the output which lists each entry as a percentage of the entire binary size, alongside its percentage of the .text section! This tool is getting better and better!
Preface

We’ll be talking about the different optimization levels in Rust, as well as mentioning LTO quite frequently. A super quick rundown: rustc offers six optimization levels, 0, 1, 2, 3, s, and z. Level 0 applies no optimizations (it is the default for debug builds), and levels 1 through 3 range from the fewest optimizations to “Do everything you possibly can to make this code as fast as possible!” Levels s and z optimize for size: s means Size and z means SizeMin (even smaller), which is good to know if all you care about is the absolute smallest binary size.

LTO, on the other hand, performs additional optimizations at link time (as the name might suggest). This allows the compiler to take all the libraries and crates into account and optimize them as a single unit, rather than individually. For example, things like inlining across crate boundaries become possible. This typically increases performance, but in some circumstances drastically increases compile time.

I should probably also define inlining real quick. I’ve used the term a lot, even in the last post. Inlining is just placing the code from a function inline with the calling code (i.e. making it as if the code was copy/pasted directly at the call site, instead of going through a function call). This is quicker because there are no register or stack operations for the call, and the compiler has a more complete picture of the code. But it also means there’s lots of code duplication.
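
To make that a bit more concrete, here is a minimal sketch using Rust’s `#[inline(always)]` and `#[inline(never)]` attributes. The function names are made up for illustration, and the attributes are only hints to the compiler, so treat this as a way to experiment rather than a guarantee of what the optimizer will do.

```rust
// Hypothetical helper: ask the compiler to copy this body into every caller.
#[inline(always)]
fn square_inlined(x: u64) -> u64 {
    x * x
}

// Hypothetical helper: ask the compiler to keep this as a real function call.
#[inline(never)]
fn square_called(x: u64) -> u64 {
    x * x
}

// Both callers compute the same result; only the generated code may differ.
fn sum_of_squares_inlined(values: &[u64]) -> u64 {
    values.iter().map(|&v| square_inlined(v)).sum()
}

fn sum_of_squares_called(values: &[u64]) -> u64 {
    values.iter().map(|&v| square_called(v)).sum()
}

fn main() {
    let values: [u64; 4] = [1, 2, 3, 4];
    println!("inlined: {}", sum_of_squares_inlined(&values));
    println!("called:  {}", sum_of_squares_called(&values));
}
```

Both versions return the same value. Building something like this in release mode and running cargo-bloat on it is one way to see whether the inlined body gets duplicated into its callers.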

There has been some recent talk on ThinLTO as well, which allows some light LTO-like optimizations between the various codegen-units (parallel compilation units, which speed up compilation but reduce possible optimizations). In this post we don’t test ThinLTO, nor do we test the impact of setting different codegen-units values. We’ll let the curious reader continue our testing if interested.
Getting our Baseline

Again, I’ll be using ripgrep as the test binary, and clap as the crate to measure.

Rust lets you set many of your build options in your Cargo.toml, so we’ll be doing that, since we’re using a third-party cargo plugin and can’t pass the build options directly on the command line.

Inside ripgrep’s Cargo.toml we see the section for setting build options when building in release mode:

[profile.release]
debug = true

This says, “include debug symbols even though we’re compiling in release mode.” As we saw in the earlier post, this increases the binary size but also has tons of added benefits, like easier debugging when things go wrong. Plus the increase in size is pretty minuscule by today’s standards.

So we want to add two more fields to this table, one for turning LTO on or off, and one for the various optimization levels that rustc offers. We’ll start from the most performant and work our way down the list to see what impact it has on the code size of individual crates.

Here’s what we’ll start with:

[profile.release]
opt-level = 3
lto = true
debug = true

Now when we run cargo-bloat this is what we see:

kevin@beefcake: ~/Projects/ripgrep
➜ cargo bloat --release --crates
   Compiling ripgrep v0.7.1 (file:///home/kevin/Projects/ripgrep)
    Finished release [optimized + debuginfo] target(s) in 14.63 secs
File  .text     Size Name
2.5%  33.2% 618.4KiB [Unknown]
2.3%  30.2% 563.3KiB std
0.8%  10.9% 202.3KiB regex
0.8%  10.6% 197.4KiB clap
0.3%   4.1%  76.9KiB ignore
0.2%   2.9%  53.5KiB regex_syntax
0.1%   2.0%  36.8KiB globset
0.1%   1.7%  31.5KiB encoding_rs
0.1%   0.9%  17.3KiB aho_corasick
0.1%   0.8%  15.2KiB grep
0.0%   0.5%   8.9KiB walkdir
0.0%   0.4%   8.2KiB crossbeam
0.0%   0.3%   5.0KiB textwrap
0.0%   0.2%   4.0KiB env_logger
0.0%   0.2%   3.9KiB termcolor
0.0%   0.2%   3.3KiB thread_local
0.0%   0.2%   2.9KiB ansi_term
0.0%   0.1%   2.7KiB same_file
0.0%   0.1%   2.0KiB strsim
0.0%   0.1%   2.0KiB vec_map
7.5% 100.0%   1.8MiB .text section size, the file size is 24.2MiB

Note the clap row above (197.4KiB). (Also notice the awesome new column showing percentages of the total binary size, and the summary line at the bottom which differentiates between the .text section and the entire binary! I love it!)
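
If you want to compare runs without eyeballing them, the rows are easy to pick apart. Below is a small sketch that parses one row of the text output shown above into its four columns. The `BloatRow` type and `parse_row` function are hypothetical names of my own, not part of cargo-bloat, and the parser assumes the exact column layout printed in this post (sizes in KiB only).

```rust
/// One crate row from the `cargo bloat --crates` text output shown above.
#[derive(Debug, PartialEq)]
struct BloatRow {
    file_percent: f64, // share of the entire binary
    text_percent: f64, // share of the .text section
    size_kib: f64,     // size of the crate's code in KiB
    name: String,      // crate name, e.g. "clap"
}

/// Parse a row such as "0.8%  10.6% 197.4KiB clap".
/// Returns None for the header, the summary line, or anything else
/// that does not match the assumed layout.
fn parse_row(line: &str) -> Option<BloatRow> {
    let mut parts = line.split_whitespace();
    let file_percent = parts.next()?.strip_suffix('%')?.parse::<f64>().ok()?;
    let text_percent = parts.next()?.strip_suffix('%')?.parse::<f64>().ok()?;
    let size_kib = parts.next()?.strip_suffix("KiB")?.parse::<f64>().ok()?;
    let name = parts.next()?.to_string();
    Some(BloatRow { file_percent, text_percent, size_kib, name })
}

fn main() {
    let row = parse_row("0.8%  10.6% 197.4KiB clap");
    println!("{:?}", row);
}
```

The header line and the final summary line (which reports MiB) simply come back as `None`, so you can feed the whole output through `parse_row` line by line and keep only the crate rows.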

“Wait a second!” you say. “In your last post clap was 189.3KiB! What changed?!”

Well, Rust by default doesn’t turn on LTO. When I was listing the output in the previous post I was using the Rust defaults for release mode (plus debug=true), which are actually opt-level=3 and lto=false. Since that is our very next opt-level and LTO combination, let’s test it and make sure!

Changing the Cargo.toml to the following is all we have to do:

[profile.release]
opt-level = 3
lto = false
debug = true

Fingers crossed for the 189.3KiB we saw earlier…

kevin@beefcake: ~/Projects/ripgrep 
➜ cargo bloat --release --crates
   Compiling ripgrep v0.7.1 (file:///home/kevin/Projects/ripgrep)
    Finished release [optimized + debuginfo] target(s) in 7.32 secs
File  .text     Size Name
1.7%  31.4% 567.0KiB [Unknown]
1.7%  31.4% 566.8KiB std
0.6%  10.9% 196.0KiB regex
0.6%  10.5% 189.3KiB clap
0.2%   4.0%  72.6KiB ignore
0.2%   3.6%  65.0KiB regex_syntax
0.1%   1.8%  32.1KiB globset
0.1%   1.8%  32.0KiB encoding_rs
0.0%   0.9%  16.1KiB aho_corasick
0.0%   0.8%  13.7KiB grep
0.0%   0.5%   9.5KiB crossbeam
0.0%   0.5%   9.0KiB walkdir
0.0%   0.3%   5.1KiB termcolor
0.0%   0.3%   4.9KiB textwrap
0.0%   0.2%   4.2KiB env_logger
0.0%   0.2%   3.3KiB thread_local
0.0%   0.2%   3.2KiB ansi_term
0.0%   0.1%   2.7KiB same_file
0.0%   0.1%   2.1KiB strsim
0.0%   0.1%   2.0KiB vec_map
5.5% 100.0%   1.8MiB .text section size, the file size is 31.8MiB

Yesss! OK, so the default is opt-level=3, lto=false for release builds. We can check that by simply removing those two lines from the Cargo.toml and rebuilding, but I’ll leave that to you.
Fast Forward Please

So that we’re not looking at the same output over and over, I’ve gone ahead and compiled a table of the different opt-levels and LTO combinations.

:drumroll:

Size (in KiB)  LTO    Opt-Level
197.4          True   3
189.3          False  3
186.5          True   2
183.9          False  2
137.3          True   1
139.3          False  1
152.5          True   s
155.0          False  s
122.9          True   z
125.5          False  z

It does indeed look like turning up the optimization level from 1 to 3 increases clap’s size, while z gives the smallest result (interestingly, s comes out larger than 1 here). LTO seems to increase the size a little at the higher optimization levels (2 and 3), and decrease it slightly at the others (1, s, and z). I’m not sure why there is a decrease though, as that seems counterintuitive.
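
To see the LTO effect per level at a glance, here is a small sketch that subtracts the non-LTO size from the LTO size for each row pair in the table. The numbers are copied straight from the table; the `Measurement` struct and `lto_delta_kib` function are just names I made up for the illustration.

```rust
/// clap's size in KiB at one opt-level, with and without LTO.
struct Measurement {
    opt_level: &'static str,
    lto_kib: f64,
    no_lto_kib: f64,
}

// Values copied from the table above.
const MEASUREMENTS: [Measurement; 5] = [
    Measurement { opt_level: "3", lto_kib: 197.4, no_lto_kib: 189.3 },
    Measurement { opt_level: "2", lto_kib: 186.5, no_lto_kib: 183.9 },
    Measurement { opt_level: "1", lto_kib: 137.3, no_lto_kib: 139.3 },
    Measurement { opt_level: "s", lto_kib: 152.5, no_lto_kib: 155.0 },
    Measurement { opt_level: "z", lto_kib: 122.9, no_lto_kib: 125.5 },
];

/// Size change from enabling LTO, in KiB (positive means LTO made clap larger).
fn lto_delta_kib(m: &Measurement) -> f64 {
    m.lto_kib - m.no_lto_kib
}

fn main() {
    for m in MEASUREMENTS.iter() {
        println!(
            "opt-level={}: LTO changes clap by {:+.1} KiB",
            m.opt_level,
            lto_delta_kib(m)
        );
    }
}
```

The delta is positive for levels 3 and 2 and negative for 1, s, and z, which is the sign flip described above.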
Conclusion

You should now feel comfortable testing your code in order to decide whether you care more about size or speed. There are many cases where the speed of the code is far less important than its size: perhaps the code is I/O bound or network bound, or maybe the code needs to be sent across the network prior to execution, etc.

An interesting note about the examples: turning LTO on increased clap’s size, but pretty drastically decreased the entire binary size in the first two examples above (24.2MiB with LTO versus 31.8MiB without). This may be because the .text section is dwarfed by the debug symbols, and some of those symbols no longer need to be included once their functions have been inlined? I’m not sure, but it’s still an interesting note. The point is, optimize and test for your entire use case, not just a single crate 😉
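
For a rough sense of scale, here is the arithmetic as a tiny sketch, using the clap sizes and total file sizes from the two opt-level=3 runs above. `percent_change` is a helper name I picked for the example.

```rust
/// Percentage change going from `before` to `after` (negative means smaller).
fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

fn main() {
    // Numbers copied from the two cargo-bloat runs above (opt-level=3).
    let clap_no_lto_kib = 189.3;
    let clap_lto_kib = 197.4;
    let file_no_lto_mib = 31.8;
    let file_lto_mib = 24.2;

    println!("clap: {:+.1}%", percent_change(clap_no_lto_kib, clap_lto_kib));
    println!("file: {:+.1}%", percent_change(file_no_lto_mib, file_lto_mib));
}
```

With these numbers, clap grows by roughly 4% while the whole file shrinks by roughly 24%, which is why looking at a single crate can be misleading.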

I am curious how much changing the codegen-units and using ThinLTO affects the various settings as well, but I’ll have to leave that for another day.

Kevin
