Tuning your Weight Loss vs Performance

clap, Rust

This is a follow-up to my previous post. In that post, I conjectured that turning on LTO (Link Time Optimization) and increasing the optimization level would increase the binary size. My thinking was that the optimizer would more aggressively inline the code, since function calls add performance overhead, but after speaking with a few individuals about this I realized I might be wrong (which wouldn’t be the first time by a long shot 🙂 ).

Some may say this is common sense, but coming from higher-level languages it’s not always obvious. Like many things, Rust gives you full control over your code and lets you pick which parts you care about. Do you care about raw speed, or binary size? In this post we’ll talk about how we can tune our binaries to fit our individual needs without changing the code itself!

So let’s test this!

As a quick aside: @RazrFalcon has been amazing and has added even more options to cargo-bloat! There are options to filter specific crates, split out the stdlib, pass through unstable rustc flags, and more! They have also added a new column to the output which lists percentages of the entire binary size along with the .text section! This tool is getting better and better!

Preface

We’ll be talking about the different optimization levels in Rust, as well as mentioning LTO quite frequently. A super quick rundown: beyond the default of 0 (no optimization), rustc offers 5 optimization levels: 1, 2, 3, s, and z. Levels 1 through 3 run from the fewest optimizations up to, “Do everything you possibly can to make this code as fast as possible!”, while levels s and z mean optimize for Size and SizeMin (even smaller size), which is good to know if all you care about is the absolute smallest binary size.
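These map directly onto rustc’s -C opt-level flag; as an illustration (with Cargo we’ll be setting this in Cargo.toml instead, as we do below):

```sh
# Compile a single file with rustc directly, optimizing for minimal size
rustc -C opt-level=z main.rs
```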

LTO, on the other hand, performs additional optimizations at link time (as the name might suggest). This allows the compiler to take all the libraries and crates into account when optimizing, treating them as a single unit rather than individually, i.e. things like inlining across crate boundaries become possible. This typically increases performance, but in some circumstances drastically increases compile time.

I should probably also define inlining real quick, since I’ve used the term a lot, even in the last post. Inlining is just placing the code from a function call inline with the calling code (i.e. making it as if the code had been copy/pasted directly at the call site, instead of going through a function call). This is quicker because there are no register or stack operations for the call, and the compiler has a more complete picture of the code. But it also means there can be lots of code duplication.
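A toy Rust illustration (the attribute here is just a hint I’m adding to make the point; the optimizer normally decides this on its own):

```rust
#[inline(always)] // ask the compiler to inline every call to this function
fn add_one(x: u32) -> u32 {
    x + 1
}

fn main() {
    // After inlining, this compiles roughly as if we had written
    // `let y = 5 + 1;` right here: no call instruction, no new stack frame.
    let y = add_one(5);
    println!("{}", y);
}
```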

There has been some recent talk about ThinLTO as well, which allows some light LTO-like optimizations between the various codegen-units (parallel compilation units, which speed up compilation but reduce possible optimizations). In this post we don’t test that, nor do we test the impact of setting different codegen-units; I’ll let the curious reader continue our testing if interested.
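For the curious, both knobs live in the same [profile.release] table we use below; this is just an untested sketch (the lto = "thin" string requires a new enough toolchain):

```toml
[profile.release]
codegen-units = 16   # more parallel units: faster builds, fewer optimizations
lto = "thin"         # ThinLTO: light LTO-like optimization across units
```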

Getting our Baseline

Again, I’ll be using ripgrep  as the test binary, and clap  as the crate to measure.

Rust lets you set many of your build options in your Cargo.toml, so we’ll be doing that, since we’re using a third-party cargo plugin and can’t pass the build options directly on the command line.

Inside ripgrep’s Cargo.toml we see the section for setting build options when building in release mode:
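```toml
[profile.release]
debug = true
```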

This says, “include debug symbols even though we’re compiling in release mode,” which, as we saw in the earlier post, increases the binary size but also has tons of added benefits, like easier debugging when things go wrong. Plus the increase in size is pretty minuscule by today’s standards.

So we want to add two more fields to this table, one for turning LTO on or off, and one for the various optimization levels that rustc  offers. We’ll start from the most performant and work our way down the list to see what impact it has on the code size of individual crates.

Here’s what we’ll start with:
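```toml
[profile.release]
debug = true
lto = true        # start with the most performant combination
opt-level = 3
```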

Now when we run cargo-bloat, this is what we see:

clap  is highlighted above. (Also notice the awesome new column showing percentages of the total binary size, and the description at the bottom which differentiates between the .text section and entire binary! I love it!)

“Wait a second!” you say. “In your last post clap  was 189.3KiB! What changed?!”

Well, Rust by default doesn’t turn on LTO. When I was listing the output in the previous post I was using the Rust defaults for release mode (plus debug=true), which is actually opt-level=3 and lto=false. Since that is our very next opt-level and LTO combination, let’s test it and make sure!

Changing the Cargo.toml  to the following is all we have to do:
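```toml
[profile.release]
debug = true
lto = false       # the presumed default for release builds
opt-level = 3     # likewise the presumed default
```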

Fingers crossed for the 189.3KiB we saw earlier…

Yesss! OK, so the default for release builds is opt-level=3, lto=false. We can check that by simply removing those two lines from the Cargo.toml and re-building, but I’ll leave that to you.

Fast Forward Please

So that we’re not looking at the same output over and over, I’ve gone ahead and compiled a table of the different opt-levels and LTO combinations.

:drumroll:

Size (in KiB)    LTO      Opt-Level
197.4            True     3
189.3            False    3
186.5            True     2
183.9            False    2
137.3            True     1
139.3            False    1
152.5            True     s
155.0            False    s
122.9            True     z
125.5            False    z

It does indeed look like turning up the optimization level increases the size. LTO seems to increase the size a little at higher optimization levels, and decrease it slightly at lower levels. I’m not sure why there is a decrease, though, as that seems counterintuitive.

Conclusion

You should now feel comfortable deciding, or at least testing, whether you care more about size or speed. There are many cases where the speed of the code is far less important than its size: perhaps the code is I/O bound or network bound, or maybe the code needs to be sent across the network prior to execution, etc.

An interesting note about the examples: turning LTO on increased clap’s size, but pretty drastically decreased the entire binary size in the first two examples above. This may be because certain parts are dwarfed by their debug symbols, and those no longer need to be included once the functions have been inlined? I’m not sure, but it’s still an interesting note. The point is: optimize and test for your entire use case, not just a single crate 😉

I am curious how much changing the codegen-units  and using ThinLTO affects the various settings as well, but I’ll have to leave that for another day.

Kevin

New Year’s Weight Loss

clap, Rust

Preface: This post is somewhat hard to write, because it contains code that I’m not proud of. It’s messy…bloated…it’s real. I’m exposing something that isn’t great, in a project that I am proud of. Sometimes this is hard and painful. But I’m also a firm believer that this is how we grow. I think what I’ve learned while going through this experience will make me better in the future. Maybe someone else can learn something from me and become better too.

It’s common practice in the U.S. to make a New Year’s resolution at the start of every calendar year. At the top of many people’s lists is losing some unwanted weight. For 2018, clap is starting to lose some weight too!

Recently, @razrfalcon released a new tool to the Rust community in the form of a cargo  plugin called cargo-bloat which is inspired by google/bloaty. It looks at the .text  section of ELF binaries and adds up the total “weight” (in code size) associated with the various functions. It can then combine all functions from a particular crate, and give the sum size of said crate in relation to other crates.

So let’s take it for a spin and see how we can start this resolution off right!

The Weigh In

First, we need a binary that depends on clap. Of course we could make one, but to keep the tests simple and real-world I chose to use the popular search tool ripgrep. Let’s fire it up!

We need to install cargo-bloat, which at the time of this writing requires installing from its git repository.
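Something like this should do it (pointing cargo install at cargo-bloat’s GitHub repository; the exact invocation may have changed since):

```sh
cargo install --git https://github.com/RazrFalcon/cargo-bloat
```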

Next we can check how big clap  is.
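Based on the options described below, the check looks like this:

```sh
cargo bloat --release --crates -n 0
```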

By default, cargo-bloat only displays the top 20 lines; -n 0 says to display them all. The --crates flag tells cargo-bloat to sum all functions per crate so we can see total crate sizes, and --release says to compile the project in release mode. Release mode probably makes the code a little bigger, due to inlining and depending on whether LTO is turned on. But that’s what we want to see. (In the next post, I address this claim and see whether it’s true.)

The highlighted line above shows clap at 457KiB, ouch! It should be noted that these percentages are of the .text section, not the entire binary. This is one thing I’m not a huge fan of with cargo-bloat; I’d prefer the percentages either be of the whole binary, not exist, or state what they’re calculating. (2018-01-15 update: @RazrFalcon has released a new version which lists percentages of both the entire binary and the .text section!) But I digress. Note that 100% is 2MiB, and we can see that ripgrep is about 35MB on my machine (because by default debug symbols are included with the binary, which on modern hardware is no issue).

And we can see that, as we said, most of the size is debug symbols, and the .text section is indeed ~2MiB.

Making the Diet Plan

So now that we know how much clap  weighs, how can we find some easy ways to trim some fat? Remember how I said cargo-bloat  will display the size of each function? Let’s start there!

I don’t like the mangling that Rust includes in function names (well, I like why it’s there, but I don’t like how it looks in user-consumable output), so let’s clean that up.
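Re-running with one extra flag, and dropping --crates and -n 0 so we get just the top functions:

```sh
cargo bloat --release --trim-fn
```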

Much nicer! The --trim-fn  does exactly what we wanted! It just removes the hash mangle.

OK, so now we’re starting to see some functions we could take a look at. But really, we’re just concerned with clap, so let’s clear out everything else. By default only the top 20 lines are displayed, which is probably fine since we’re looking for easy wins and not nitpicking functions apart. But simply (rip)grepping through those 20 may not give us functions we can work with. I’d like to look at the top 25 clap-only functions. So to see all the output, then (rip)grep through it, we run the following:
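```sh
# -n 0 prints every function; then keep the first 25 clap lines
# (the tail of this pipeline is a stand-in -- any top-25 filter works)
cargo bloat --release -n 0 --trim-fn | rg clap | head -n 25
```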

A few things jumped out at me. I expected clap::app::parser::Parser::get_matches_with to be the biggest, because it’s the meat and potatoes of the parsing, but it wasn’t! Another thing I didn’t expect was for several smaller functions to be so big. Since I work on clap all the time, and am familiar with how it works internally, I noticed that the highlighted lines above shouldn’t have been so big, since those functions are meant to be small…so what gives?

Notice there are also what appear to be duplicates. These come from using generics: Rust builds a separate copy of a function for each concrete generic type it’s used with, a.k.a. monomorphization and static dispatch.

When I took a look at  clap::app::parser::Parser::add_defaults  I saw something that caught my eye…macros.

The full code is here. But essentially, this is what my eyes saw:
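(What follows is a runnable toy reconstruction, not clap’s real source; im_bigger! and friends are made-up stand-ins:)

```rust
// Made-up stand-ins, not clap's source. Each expands into yet more code.
macro_rules! validate_stuff {
    ($arg:expr) => {
        // ...imagine validation logic expanding into many lines here...
        println!("validating {}", $arg);
    };
}

macro_rules! im_bigger {
    ($arg:expr) => {
        // a macro calling more macros, again and again
        validate_stuff!($arg);
        validate_stuff!($arg);
    };
}

fn main() {
    let has_default = true;
    let arg = "--verbose";
    if has_default {
        im_bigger!(arg); // this whole expansion...
        println!("applying default value for {}", arg);
    } else {
        im_bigger!(arg); // ...is duplicated wholesale in this branch too
    }
}
```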

Now this isn’t the actual code, because the above could be solved in a multitude of trivial ways. The point I’m making is that I defined a macro which needed to be called over two distinct types, and I didn’t want to duplicate the code. But inside that macro, I called even more macros, and more macros, again and again. It was like macro-ception! I thought this would be simple to fix!

So…no more macros?

Macros themselves aren’t the issue. They’re extremely handy. I tend to use them instead of duplicating code when borrowck complains about the exact same code living in a function (because borrowck can’t peek into functions). The problem with doing this is that it’s basically SUPER aggressive inlining.

It was like a gateway drug. Copy one line and everything works? Sure! Turns out that one line expands into several hundred…

Looking at this code got me thinking about my use of macros, and about what is actually being expanded. With this newfound suspicion of macros, I decided to look at my use cases more carefully. I started with the ones inside add_defaults. The most used macro happened to be arg_post_processing!, which did things like validating values, number of occurrences, group requirements, etc. I noticed it was in some redundant places: much like the bogus example above, where im_bigger! could simply be lifted out of the if/else block, arg_post_processing! could be too (see the sketch below).
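Roughly this shape of change, reusing the toy stand-in from before:

```rust
macro_rules! im_bigger {
    ($arg:expr) => {
        // stands in for a huge macro expansion
        println!("post-processing {}", $arg);
    };
}

fn main() {
    let has_default = true;
    let arg = "--verbose";
    // branch-specific work stays inside the branch...
    if has_default {
        println!("applying default value for {}", arg);
    }
    // ...but the big expansion is hoisted out and now happens exactly once
    im_bigger!(arg);
}
```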

So I pulled those out. Easy win. But what else? Well if I had some redundant code already, there could be more.

…wait…a…second…:lightbulb:

Cutting Carbs

I noticed something more sinister than macros, something which, when combined with a macro’s uncanny ability to expand into hundreds of lines of code, could quickly up clap’s weight. I was validating each argument/value as it was parsed, i.e. trying to produce an early error if there was one. This is exactly what arg_post_processing! was doing. So why not just accept all args and values, then validate them once at the end?
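In schematic form (a toy sketch of the idea, not clap’s actual parser):

```rust
// Lazy validation: accept everything during the parse loop, then run a
// single validation pass at the very end instead of after every argument.
fn parse_lazy(args: &[&str]) -> Result<Vec<String>, String> {
    let mut matched = Vec::new();
    for a in args {
        matched.push(a.to_string()); // no per-argument validation here
    }
    validate(&matched)?; // one validation pass, once, at the end
    Ok(matched)
}

fn validate(matched: &[String]) -> Result<(), String> {
    // stands in for checking values, occurrences, group requirements, etc.
    if matched.iter().any(|a| a.is_empty()) {
        return Err("empty argument".into());
    }
    Ok(())
}

fn main() {
    println!("{:?}", parse_lazy(&["--verbose", "file.txt"]));
}
```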

By not trying to validate arguments early, every use of arg_post_processing! was eliminated, as was every macro it called. It turned out those top functions I listed a while ago as “should have been small but weren’t” were also users of arg_post_processing! and company.

There were some refactoring pains as I moved to this new paradigm of lazy validation, but overall it was pretty straightforward.

Let’s see if it worked! :fingerscrossed:

Woooo! That was pretty painless! But can we do more?

Remember how I pointed out the generics earlier (all the duplicate functions)? When I looked back at the list of all clap functions (which is too long to post here), I noticed how many clap::errors::Error::* functions there were! These are also all in the failure path, so there isn’t a huge reason to keep the performance benefits of monomorphization/static dispatch and generics. Why not just use trait objects and dynamic dispatch, which would reduce the code duplication? Where there were three copies of the same function, we’d go down to a single function, and all of those savings added together could equal another easy win.

Trait Objects

I’ve heard it said that the Rust community is far too afraid of dynamic dispatch and heap allocation. I think this is probably true, but usually not a bad thing. In my case, however, an argument parsing library should probably try to be small, and optimizing the failure case is probably silly. I’ve always leaned away from dynamic dispatch because <reasons> . So I’m guilty of it too. Maybe I could try this out and see if it’s a place that could benefit.

To be clear, there are issues when working with trait objects, but this particular case is about as simple as it gets, so there shouldn’t be any trouble. I’ve also heard it said that there are times when dynamic dispatch can outperform static dispatch because of instruction caches. But again, my particular case is in the failure path, so performance isn’t the paramount concern.

So I set out to change my error functions from something like this…
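A self-contained sketch of the old shape (not clap’s exact source; AnyArg here is a stand-in for clap’s internal trait of the same name):

```rust
// Stand-in for clap's internal AnyArg trait.
trait AnyArg {
    fn name(&self) -> &str;
}

// Generic version: rustc monomorphizes a separate copy of this function
// for every concrete `A` it gets called with.
fn argument_conflict<A: AnyArg>(arg: &A, other: &str) -> String {
    format!("The argument '{}' cannot be used with '{}'", arg.name(), other)
}
```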

…to this…
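Same stand-in trait as above, but taking a trait object instead:

```rust
// Trait-object version: one compiled copy; the call to `name` goes through
// a vtable at runtime instead of being resolved at compile time.
fn argument_conflict(arg: &AnyArg, other: &str) -> String {
    format!("The argument '{}' cannot be used with '{}'", arg.name(), other)
}
```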

Notice the removal of the generic type A and its trait bound. Instead, there is a trait object &AnyArg, which essentially does the same thing from the user’s point of view, but instead of a concrete type compiled into a specialized function, we’ll be using two pointers (one to the data and one to a vtable) at runtime. Sometimes this is referred to as ‘type erasure’ because it erases the compiler’s knowledge of the type that will be used, and instead relies on runtime behavior.

There is a great writeup on Trait Objects here.

Current Weight

Alright, so with these fixes complete…what did we weigh in at?

Nice! That’s nearly a 60% decrease in size! Also, because we’re running far fewer instructions, we’ve actually received a slight performance boost too! Add to this, we really didn’t have to do that much work!

Still to Come

So what’s left? TONS! This was just two quick issues, I’m sure there is SO. MUCH. MORE. that could be optimized and reduced. In the back of my mind I still have the v3 changes where I know a lot of this has already been fixed and de-duplicated too. I’m excited to see what else I can do to tune this library.

If you’re interested in helping, stop by the repository or Gitter chat! I’d love to help mentor some new contributors!

Conclusion

This is all a big thanks to cargo-bloat . It pointed me in the right direction to find some logic duplication and caused me to look at some crazy macro expansions. I’ll still use macros, and probably still duplicate logic from time to time, but cargo-bloat really, really helped point this out to me. I’m still looking at the function list output to see if there are any more easy wins, and those will come with time.

For now, you can use v2.29.1 to get the latest and greatest lean mean clap .

I encourage everyone to take a look at their projects and see if they too can lose a little weight at the start of this year 😉

Kevin