This is a follow-up to my previous post. In that post, I conjectured that turning on LTO (Link Time Optimization) and increasing the optimization level would increase the binary size. My thinking was that the optimizer would more aggressively inline code, since function calls add performance overhead, but after speaking with a few individuals about this I realized I might be wrong (which wouldn’t be the first time by a long shot 🙂 ).

Some may say this is common sense, but coming from higher level languages it’s not always obvious. Like many things, Rust gives you full control over your code and lets you pick which parts you care about: do you care about raw speed, or binary size? In this post we’ll talk about how we can tune our binaries to fit our individual needs without changing the code itself!

So let’s test this!

As a quick aside: @RazrFalcon has been amazing and has added even more options to cargo-bloat! There are options to filter specific crates, split out the stdlib, pass through unstable rustc flags, and more! They have also added a new column to the output which lists each crate’s percentage of the entire binary size along with the .text section! This tool is getting better and better!

Preface

We’ll be talking about the different optimization levels in Rust, as well as mentioning LTO quite frequently. A super quick rundown: rustc offers five optimization levels (beyond the default of no optimization): 1, 2, 3, s, and z. Levels 1 through 3 range from the fewest optimizations up to, “Do everything you possibly can to make this code as fast as possible!”, while levels s and z mean optimize for Size and SizeMin (even smaller size), which is good to know if all you care about is the absolute smallest binary size.

LTO, on the other hand, performs additional optimizations at link time (as the name might suggest). This allows the compiler to take all the libraries and crates into account when optimizing, treating them as a single unit rather than individually, so things like inlining across crate boundaries become possible. This typically increases performance, but in some circumstances drastically increases compile time.
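For reference, these same knobs exist as raw rustc flags. We can’t use them directly here since we’re going through a cargo plugin, but here’s a quick sketch of what they look like:

```sh
# Sketch only: compiling a single file directly with rustc,
# asking for the smallest-size optimization level plus LTO.
rustc -C opt-level=z -C lto main.rs
```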

I should probably also define inlining real quick, since I’ve used the term a lot, even in the last post. Inlining is just placing the code from a function call inline with the calling code (i.e. making it as if the code was copy/pasted directly at the call site instead of going through a function call). This is quicker because there are no register or stack operations for the call, the compiler has a more complete picture of the code, etc. But it also means there can be lots of code duplication.
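To make that concrete, here’s a tiny hypothetical example (names made up) of what the optimizer is effectively doing when it inlines a call:

```rust
// Hypothetical example just to illustrate inlining.
fn add_one(x: u32) -> u32 {
    x + 1
}

fn main() {
    // As written: a call to add_one (call overhead, less context for the optimizer).
    let a = add_one(41);

    // What the optimizer effectively produces once the call is inlined:
    let b = 41 + 1;

    println!("{} {}", a, b);
}
```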

There has been some recent talk about ThinLTO as well, which allows some light LTO-like optimizations between the various codegen-units (parallel compilation units, which speed up compilation but reduce possible optimizations). In this post we don’t test that, nor do we test the impact of setting different codegen-units; we’ll let the curious reader continue our testing if interested.
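For the curious, those knobs look roughly like this in a Cargo.toml (a sketch only; we don’t test these combinations in this post):

```toml
# Sketch: not measured below. ThinLTO and the number of codegen
# units can both be set in the release profile.
[profile.release]
lto = "thin"
codegen-units = 16
```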

Getting our Baseline

Again, I’ll be using ripgrep  as the test binary, and clap  as the crate to measure.

Rust lets you set many of your build options in your Cargo.toml, so we’ll be doing that, since we’re using a third party cargo plugin and can’t pass the build options directly on the command line.

Inside ripgrep‘s Cargo.toml we see the section for setting build options when building in release mode.
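At the time of writing it looks roughly like this:

```toml
[profile.release]
debug = true
```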

This says, “include debug symbols even though we’re compiling in release mode,” which, as we saw in the earlier post, increases the binary size but also has tons of added benefits like easier debugging when things go wrong. Plus the increase in size is pretty minuscule by today’s standards.

So we want to add two more fields to this table, one for turning LTO on or off, and one for the various optimization levels that rustc  offers. We’ll start from the most performant and work our way down the list to see what impact it has on the code size of individual crates.

Here’s what we’ll start with:
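Roughly, the relevant section becomes:

```toml
[profile.release]
debug = true      # keep debug symbols, as before
lto = true        # turn on link time optimization
opt-level = 3     # optimize as aggressively as possible for speed
```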

Now when we run cargo-bloat, this is what we see:
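For reference, the invocation we’re using is something along these lines (flags may differ slightly between cargo-bloat versions):

```sh
# Show the size contribution of each crate in the release build
cargo bloat --release --crates
```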

clap  is highlighted above. (Also notice the awesome new column showing percentages of the total binary size, and the description at the bottom which differentiates between the .text section and entire binary! I love it!)

“Wait a second!” you say. “In your last post clap  was 189.3KiB! What changed?!”

Well, Rust by default doesn’t turn on LTO. When I was listing the output in the previous post I was using the Rust defaults for release mode (plus debug=true ), which is actually opt-level=3  and lto=false . Since that is our very next opt-level  and LTO combination, let’s test it and make sure!

Changing the Cargo.toml  to the following is all we have to do:
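That is, roughly:

```toml
[profile.release]
debug = true
lto = false       # back to the default: no LTO
opt-level = 3     # still the default speed-focused level
```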

Fingers crossed for the 189.3KiB we saw earlier…

Yesss! OK, so the default for release builds is opt-level=3, lto=false. We can check that by simply removing those two lines from the Cargo.toml and re-building, but I’ll leave that to you.

Fast Forward Please

So that we’re not looking at the same output over and over, I’ve gone ahead and compiled a table of the different opt-levels and LTO combinations.

:drumroll:

Size (in KiB) | LTO   | Opt-Level
------------- | ----- | ---------
197.4         | True  | 3
189.3         | False | 3
186.5         | True  | 2
183.9         | False | 2
137.3         | True  | 1
139.3         | False | 1
152.5         | True  | s
155.0         | False | s
122.9         | True  | z
125.5         | False | z

It does indeed look like turning up the optimization level increases the size. LTO seems to increase the size a little at the higher optimization levels, and decrease it slightly at the lower levels. I’m not sure why there is a decrease, though, as that seems counterintuitive.

Conclusion

You should now feel comfortable tuning, or at least testing, your code in order to decide whether you care more about size or speed. There are many cases where the speed of the code is far less important than its size: perhaps the code is I/O bound or network bound, or maybe it needs to be sent across the network prior to execution, etc.
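For example, based on the table above, a size-focused release profile would look something like this (a sketch; measure it against your own binary):

```toml
# Sketch: the combination that produced the smallest clap in these tests.
[profile.release]
opt-level = "z"   # optimize for minimal size
lto = true        # LTO shaved off a bit more at the lower levels
```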

An interesting note about the examples: turning LTO on increased clap‘s size, but pretty drastically decreased the entire binary size in the first two examples above. This may be because certain parts are dwarfed by the debug symbols, which no longer need to be included once their functions have been inlined? I’m not sure, but it’s still interesting. The point is: optimize and test for your entire use case, not just a single crate 😉

I am curious how much changing the codegen-units  and using ThinLTO affects the various settings as well, but I’ll have to leave that for another day.

Kevin

4 Comments

  • Daniel

    I find it totally obvious why LTO very often decreases size: the generated files not only contain the compiled object code (ready for linking) but also a low level representation of the code itself. During the final step (usually) the compiler looks up that representation and can make a lot of optimisation decisions, e.g. if it knows that when a function is called with certain parameters it will yield a certain result. That enables dead-code removal but also offers new inlining possibilities which is exactly the reason why at higher optimisation levels the size may increase: due to increased inlining (and in case of Rust even more monomorphisation).

    Due to that LTO is often a MUST-have feature for microcontroller code, especially if more or less generic (C/C++) frameworks are used.

    • kbknapp

      In the general case, it seems to *increase* code size. After the explanations on reddit and yours, I do see how it could decrease code size in some circumstances (small function inlining across crate bounds, dead code elimination across crate bounds as in your example, etc.). Thanks for the explanations!

  • Arne Brix

    As I understand it, LTO can remove globally unused code from the binary. That should usually result in a significant reduction of the final executable size.

    • kbknapp

      True dead code should already be eliminated by rustc prior to any LTO taking place. However, as Daniel points out above, there could be circumstances where the same function is called with the same parameters repeatedly across crate bounds, and the compiler could simply replace the call with the function’s result. I’d be surprised if this was a common enough occurrence to make a significant difference, though.

      There are also cases, as pointed out by /u/SelfDistinction on reddit [1] that inlining of small functions across crate bounds can also reduce code size. Again, I’d suspect this is somewhat rare, but still happens.

      For the most part, LTO seems to increase code size in the general case.

      [1]: https://www.reddit.com/r/rust/comments/7s7rja/tuning_your_weight_loss_vs_performance/dt2rq66/
