Preface: This post is somewhat hard to write, because it contains code that I’m not proud of. It’s messy…bloated…it’s real. I’m exposing something, in a project that I’m proud of, that isn’t great. Sometimes this is hard and painful. But I’m also a firm believer that this is how we grow. I think what I’ve learned while going through this experience will make me better in the future. Maybe someone else can learn something from me and become better too.

It’s common practice in the U.S. to make a New Year’s resolution at the start of every calendar year. At the top of many people’s lists is losing some unwanted weight. For 2018, clap is starting to lose some weight too!

Recently, @RazrFalcon released a new tool to the Rust community in the form of a cargo plugin called cargo-bloat, which is inspired by google/bloaty. It looks at the .text section of ELF binaries and adds up the total “weight” (in code size) associated with the various functions. It can then combine all functions from a particular crate, and give the total size of said crate in relation to other crates.

So let’s take it for a spin and see how we can start this resolution off right!

The Weigh In

First, we need a binary that depends on clap. Of course we could make one, but to keep tests simple and real-world I chose to use the popular search tool ripgrep. Let’s fire it up!

We need to install cargo-bloat, which at the time of this writing requires using a git repository.

Next we can check how big clap is.

By default, cargo-bloat only displays the top 20 lines; -n 0 says to display them all. The --crates flag tells cargo-bloat to sum all functions per crate to show total crate size, and --release says to compile the project in release mode. Release mode probably makes the code a little bigger, depending on inlining and on whether LTO is turned on. But that’s what we want to see. (In the next post, I address this claim and see if it’s true or not.)

The highlighted line above shows clap at 457KiB, ouch! It should be noted that these percentages are of the .text section, not the entire binary. This is one thing I’m not a huge fan of with cargo-bloat; I’d prefer the percentages either be of the whole binary, not exist, or state what they’re calculating. (2018-01-15 update: @RazrFalcon has released a new version which lists the percentages of both the entire binary and the .text section!) But I digress. Note that 100% is 2MiB, while ripgrep is about 35MB on my machine (because by default debug symbols are included with the binary, which on modern hardware is no issue).

And we can see that, as we said, most of the size is the debug symbols, and the .text section is indeed ~2MiB.

Making the Diet Plan

So now that we know how much clap weighs, how can we find some easy ways to trim some fat? Remember how I said cargo-bloat will display the size of each function? Let’s start there!

I don’t like the mangling that Rust includes in functions (well, I like why it’s there, but I don’t like how it looks in user consumable output), so let’s clean that up.

Much nicer! The --trim-fn flag does exactly what we wanted! It just removes the hash mangle.

Ok, so now we’re starting to see some functions we could take a look at. But really, we’re just concerned with clap, so let’s clear out everything else. By default only the top 20 lines are displayed, which is probably fine since we’re looking for easy wins and not nitpicking functions apart. But simply (rip)grepping through those 20 lines may not give us functions we can work with. I’d like to look at the top 25 clap-only functions. So to see all output, then (rip)grep through it, we run the following:

A few things jumped out at me. I expected clap::app::parser::Parser::get_matches_with to be the biggest, because it’s the meat and potatoes of the parsing, but it wasn’t! Another thing I didn’t expect was for several smaller functions to be so big. Since I work on clap all the time, and am familiar with how it works internally, I noticed the highlighted lines above shouldn’t have been so big, since they’re meant to be small…so what gives?

Notice there are also what appear to be duplicates. These come from generics: Rust builds a separate function for each concrete type a generic function is used with, a.k.a. monomorphization and static dispatch.

When I took a look at clap::app::parser::Parser::add_defaults I saw something that caught my eye…macros.
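To make the "duplicates" concrete, here is a minimal sketch of monomorphization. The function names are made up for illustration and have nothing to do with clap's internals; the point is only that one generic function in the source can become several functions in the binary.

```rust
use std::fmt::Display;

// One generic function in the source...
fn describe<T: Display>(value: T) -> String {
    format!("value = {}", value)
}

// ...but it is called with three different concrete types, so the compiler
// may emit three separate copies of it. Each copy adds to the .text section
// (unless the optimizer inlines or merges them).
fn describe_all() -> Vec<String> {
    vec![
        describe(42u32),  // instantiates describe::<u32>
        describe("clap"), // instantiates describe::<&str>
        describe(1.5f64), // instantiates describe::<f64>
    ]
}
```

In a tool like cargo-bloat, instantiations like these would be expected to show up as several similarly named entries, which is the kind of repetition visible in the list above.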

The full code is here. But essentially, this is what my eyes saw:

Now this isn’t the actual code, because the above could be solved in a multitude of trivial ways. The point I’m making is that I defined a macro because the code needed to run over two distinct types and I didn’t want to duplicate it. But inside that macro, I called even more macros, and more macros, again, and again. It was like macro-ception! I thought this would be simple to fix!

So…no more macros?

Macros themselves aren’t the issue. They’re extremely handy. I tend to use them instead of duplicating code, when borrowck complains about the exact same code living in a function (because borrowck can’t peek into functions). The problem with doing this is that it’s basically SUPER aggressive inlining.

It was like a gateway drug. Copy one line and everything works? Sure! Turns out that one line expands into several hundred…

Looking at this code got me thinking about my use of macros. It made me actually consider what is being expanded. With this newfound suspicion of macros, I decided to look at my use cases more carefully. I started with the ones inside add_defaults. The most used macro happened to be arg_post_processing!, which did things like validate values, number of occurrences, group requirements, etc. I noticed it was in some redundant places: much like the bogus example above, where im_bigger! could simply be lifted out of the if / else block, so could arg_post_processing!.
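As a rough sketch of that kind of lift (hypothetical code, not clap's actual source; the body of im_bigger! here is invented purely to stand in for "a macro that expands to a lot of code"):

```rust
// Hypothetical stand-in for a macro with a large expansion. Every call site
// gets its own full copy of these statements pasted in.
macro_rules! im_bigger {
    ($log:expr, $name:expr) => {{
        $log.push(format!("validate values of {}", $name));
        $log.push(format!("check occurrences of {}", $name));
        $log.push(format!("check groups of {}", $name));
    }};
}

// Before: the macro is invoked in both branches, so its body is expanded twice.
fn process_before(is_flag: bool, name: &str) -> Vec<String> {
    let mut log = Vec::new();
    if is_flag {
        log.push(format!("parsed flag {}", name));
        im_bigger!(log, name);
    } else {
        log.push(format!("parsed option {}", name));
        im_bigger!(log, name);
    }
    log
}

// After: same behavior, but the macro is lifted out of the if / else and
// expanded only once.
fn process_after(is_flag: bool, name: &str) -> Vec<String> {
    let mut log = Vec::new();
    if is_flag {
        log.push(format!("parsed flag {}", name));
    } else {
        log.push(format!("parsed option {}", name));
    }
    im_bigger!(log, name);
    log
}
```

Both functions produce the same result; the second simply contains one expansion of the macro instead of two. With a three-line macro that hardly matters, but with a macro that itself calls more macros the savings multiply.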

So I pulled those out. Easy win. But what else? Well if I had some redundant code already, there could be more.

…wait…a…second…:lightbulb:

Cutting Carbs

I noticed something more sinister than macros, something that, combined with macros’ uncanny ability to expand into hundreds of lines of code, could quickly add to clap’s weight. I was validating each argument/value as it was parsed, i.e. trying to get an early error if there was one. This is exactly what arg_post_processing! was doing. So why not just accept all args and values, then validate them once at the end?

By not trying to validate arguments early, every use of arg_post_processing! was eliminated, as was every macro it called. It turned out those top functions I listed a while ago as “should have been small but weren’t” were also users of arg_post_processing! and company.

There were some refactoring pains as I moved to this new paradigm of lazy validation, but overall it was pretty straightforward.
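The shape of the change looks roughly like the sketch below. This is a heavily simplified, hypothetical model (the Rule struct, its fields, and the function name are assumptions for illustration, not clap's API): parsing only records what it saw, and a single validation pass runs at the end.

```rust
use std::collections::HashMap;

// Hypothetical, stripped-down rule: an argument name and how many times it
// may occur. Real validation covers far more (values, groups, etc.).
struct Rule {
    name: &'static str,
    max_occurs: usize,
}

fn parse_then_validate(
    rules: &[Rule],
    argv: &[&str],
) -> Result<HashMap<String, usize>, String> {
    // Phase 1: accept everything and just count occurrences.
    // No validation code is expanded at each parsing site.
    let mut seen: HashMap<String, usize> = HashMap::new();
    for arg in argv {
        *seen.entry(arg.to_string()).or_insert(0) += 1;
    }

    // Phase 2: validate once, in one place, after parsing is finished.
    for rule in rules {
        let count = seen.get(rule.name).copied().unwrap_or(0);
        if count > rule.max_occurs {
            return Err(format!(
                "'{}' was given {} times, but at most {} allowed",
                rule.name, count, rule.max_occurs
            ));
        }
    }
    Ok(seen)
}
```

The trade-off is that an error is reported slightly later than with eager validation, but since it is an error path anyway, that cost is small compared to having the validation logic live in exactly one spot.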

Let’s see if it worked! :fingerscrossed:

Woooo! That was pretty painless! But can we do more?

Remember how I pointed out the generics earlier (all the duplicate functions)? When I looked back at the list of all clap functions (which is too long to post here), I noticed how many clap::errors::Error::* functions there were! These are also all in the failure path, so there isn’t a huge reason to chase the performance benefits of monomorphization/static dispatch and generics. Why not just use trait objects and dynamic dispatch, which would reduce the code duplication? Where there are three copies of the same function, it would go down to a single function, and all those added together could equal another easy win.

Trait Objects

I’ve heard it said that the Rust community is far too afraid of dynamic dispatch and heap allocation. I think this is probably true, but usually not a bad thing. In my case, however, an argument parsing library should probably try to be small, and optimizing the failure case is probably silly. I’ve always leaned away from dynamic dispatch because <reasons>. So I’m guilty of it too. Maybe I could try this out and see if it’s a place that could benefit.

To be clear, there are issues when working with trait objects, but this particular case is about as simple a case as it gets. So there shouldn’t be troubles. I’ve also heard it said that there are times when dynamic dispatch can outperform static dispatch because of instruction caches. But again, my particular case is in the failure path, so performance isn’t the paramount concern.

So I set out to change my error functions from something like this…

…to this…

Notice the removal of the generic type A and its trait bound. Instead, there is a trait object &AnyArg which essentially does the same thing from the user’s point of view, but instead of a concrete type compiled into a specialized function, we’ll be using two pointers (one to the “data” and one to a vtable) at runtime. Sometimes this is referred to as ‘type erasure’ because it erases the compiler’s knowledge of the type that will be used, and instead relies on runtime behavior.
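For readers who want to see the two styles side by side, here is a minimal sketch. The AnyArg trait below is a tiny hypothetical stand-in (the real trait in clap is much larger), the function names are invented, and the example uses the newer &dyn AnyArg spelling, which means the same thing as &AnyArg above.

```rust
// Hypothetical, minimal stand-in for clap's internal AnyArg trait.
trait AnyArg {
    fn name(&self) -> &str;
}

struct Flag {
    name: String,
}

struct Opt {
    name: String,
}

impl AnyArg for Flag {
    fn name(&self) -> &str {
        &self.name
    }
}

impl AnyArg for Opt {
    fn name(&self) -> &str {
        &self.name
    }
}

// Static dispatch: one copy of this function is compiled for every concrete
// type A it is called with (here potentially one for Flag and one for Opt).
fn empty_value_generic<A: AnyArg>(arg: &A) -> String {
    format!("error: '{}' requires a value", arg.name())
}

// Dynamic dispatch: a single compiled function. The concrete type is erased
// and `name` is looked up through the vtable at runtime.
fn empty_value_dyn(arg: &dyn AnyArg) -> String {
    format!("error: '{}' requires a value", arg.name())
}
```

Callers barely change: empty_value_dyn(&flag) works wherever empty_value_generic(&flag) did, because a reference to a concrete type coerces to a trait object reference automatically.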

There is a great writeup on Trait Objects here.

Current Weight

Alright, so with these fixes complete…what did we weigh in at?

Nice! That’s nearly a 60% decrease in size! Also, because we’re running far fewer instructions, we’ve actually received a slight performance boost too! Add to this, we really didn’t have to do that much work!

Still to Come

So what’s left? TONS! These were just two quick issues; I’m sure there is SO. MUCH. MORE. that could be optimized and reduced. In the back of my mind I still have the v3 changes where I know a lot of this has already been fixed and de-duplicated too. I’m excited to see what else I can do to tune this library.

If you’re interested in helping, stop by the repository or Gitter chat! I’d love to help mentor some new contributors!

Conclusion

This is all a big thanks to cargo-bloat. It pointed me in the right direction to find some logic duplication and caused me to look at some crazy macro expansions. I’ll still use macros, and probably still duplicate logic from time to time, but cargo-bloat really, really helped point this out to me. I’m still looking at the function list output to see if there are any more easy wins, and those will come with time.

For now, you can use v2.29.1 to get the latest and greatest lean mean clap.

I encourage everyone to take a look at their projects and see if they too can lose a little weight at the start of this year 😉

Kevin

7 Comments

  • Nikolai

    Hello,
    have you made any performance tests before and after? Not sure if clap performance is critical, but still, it’s interesting to know the performance impact of these changes…

    • kbknapp

      There is only a slight performance boost, I’m assuming from running fewer instructions. However, what made the biggest difference was de-duplicating the logic. Smaller binaries do help though, as every little bit adds up!

  • Allegretto

    I really enjoyed reading this post, but I have to admit that the part that gave me the most was the preface. I’ve wanted to start a blog for a while now, but since I’m usually working on prototypes my code usually isn’t very pretty. Your preface showed me that it doesn’t have to be well-polished. Thank you, keep it up!
