Preface: This post is somewhat hard to write, because it contains code that I’m not proud of, it’s messy…bloated…it’s real. I’m exposing something, in a project that I’m proud of, that isn’t great. Sometimes this is hard and painful. But I’m also a firm believer that this is how we grow. I think what I’ve learned while going through this experience will make me better in the future. Maybe someone else can learn something from me and become better too.

It’s common practice in the U.S. to make a New Year’s resolution at the start of every calendar year. At the top of many people’s lists is losing some unwanted weight. For 2018, clap is starting to lose some weight too!

Recently, @razrfalcon released a new tool to the Rust community in the form of a cargo plugin called cargo-bloat, which is inspired by google/bloaty. It looks at the .text section of ELF binaries and adds up the total “weight” (in code size) associated with the various functions. It can then combine all functions from a particular crate and give the total size of said crate in relation to other crates.

So let’s take it for a spin and see how we can start this resolution off right!
The Weigh In

First, we need a binary that depends on clap. Of course we could make one, but to keep the test simple and real-world I chose to use the popular search tool ripgrep. Let’s fire it up!

We need to install cargo-bloat, which at the time of this writing requires installing from its git repository.

$ cargo install --force --git https://github.com/RazrFalcon/cargo-bloat.git

Next we can check how big clap is.

$ git clone https://github.com/BurntSushi/ripgrep
  [..snip]
$ cd ripgrep
$ cargo bloat --release --crates -n 0
   [..snip compiling..]
 27.7% 574.3KiB [Unknown]
 27.3% 565.4KiB std
 22.0% 457.3KiB clap
  9.4% 196.0KiB regex
  3.4%  70.0KiB ignore
  3.1%  65.0KiB regex_syntax
  1.5%  32.1KiB encoding_rs
  1.5%  31.6KiB globset
  0.8%  16.0KiB aho_corasick
  0.7%  13.7KiB grep
  0.5%   9.5KiB crossbeam
  0.4%   9.0KiB walkdir
  0.2%   5.1KiB termcolor
  0.2%   4.9KiB textwrap
  0.2%   4.2KiB env_logger
  0.2%   3.3KiB thread_local
  0.2%   3.2KiB ansi_term
  0.1%   2.6KiB same_file
  0.1%   2.1KiB strsim
  0.1%   2.0KiB vec_map
100.0%   2.0MiB Total

By default, cargo-bloat only displays the top 20 lines; -n 0 says to display them all. The --crates flag tells cargo-bloat to sum all functions per crate so we can see total crate sizes, and --release says to compile the project in release mode. Release mode probably makes the code a little bigger, due to inlining and depending on whether LTO is turned on or not, but that’s what we want to see. (In the next post, I address this claim and see if it’s true or not.)

The highlighted line above shows clap at 457KiB, ouch! It should be noted that these percentages are of the .text section, not the entire binary. This is one thing I’m not a huge fan of with cargo-bloat; I’d prefer the percentages either be of the whole binary, not exist, or state what they’re calculating. (2018-01-15 update: @RazrFalcon has released a new version which lists the percentages of both the entire binary and the .text section!) But I digress. Note that 100% is 2MiB, yet ripgrep is about 35MB on my machine (because by default debug symbols are included in the binary, which on modern hardware is a non-issue).

$ ls -lh target/release/rg 
-rwxrwxr-x 2 kevin kevin 35M Jan  9 10:57 target/release/rg

And we can see that, as stated above, most of that size is debug symbols, and the .text section is indeed ~2MiB:

$ size -A -d -t ./target/release/rg
./target/release/rg  :
section                  size      addr
.interp                    28       624
.note.ABI-tag              32       652
.note.gnu.build-id         36       684
.gnu.hash                 176       720
.dynsym                  3048       896
.dynstr                  1924      3944
.gnu.version              254      5868
.gnu.version_r            336      6128
.rela.dyn               97296      6464
.rela.plt                2568    103760
.init                      23    106328
.plt                     1728    106352
.plt.got                   16    108080
.text                 2161744    108096
.fini                       9   2269840
.rodata                402249   2269888
.debug_gdb_scripts         34   2672137
.eh_frame_hdr           39396   2672172
.eh_frame              168336   2711568
.gcc_except_table      164788   2879904
.tdata                    328   5141984
.init_array                16   5142312
.fini_array                 8   5142328
.data.rel.ro            74392   5142336
.dynamic                  576   5216728
.got                      984   5217304
.data                     881   5218304
.bss                     4544   5219200
.comment                   96         0
.debug_aranges           8896         0
.debug_pubnames       2375721         0
.debug_info          11389765         0
.debug_abbrev          230412         0
.debug_line           1839239         0
.debug_frame             1192         0
.debug_str            3996720         0
.debug_loc            6097180         0
.debug_macinfo            312         0
.debug_pubtypes       3138431         0
.debug_ranges         3462160         0
.debug_macro            81253         0
Total                35747127

Making the Diet Plan

So now that we know how much clap weighs, how can we find some easy ways to trim some fat? Remember how I said cargo-bloat will display the size of each function? Let’s start there!

$ cargo bloat --release
    Finished release [optimized + debuginfo] target(s) in 0.0 secs
 72.5%   1.5MiB [4909 Others]
  5.0% 103.0KiB clap::app::parser::Parser::add_defaults::h34c24eccb249c227
  3.2%  67.1KiB clap::app::parser::Parser::add_env::hbdc05d6940ebd42c
  2.4%  50.7KiB <regex::exec::ExecNoSync<'c> as regex::re_trait::RegularExpression>::read_captures_at::he754a06aa1736839
  1.9%  39.8KiB clap::app::parser::Parser::parse_short_arg::hc6f00d002e1ff1b7
  1.9%  39.0KiB clap::app::parser::Parser::get_matches_with::hf320109d7e2d7221
  1.8%  38.2KiB clap::app::parser::Parser::parse_long_arg::hae4f86cc567f19d8
  1.7%  34.4KiB rg::args::ArgMatches::to_args::h293ff234cea2e1ac
  1.4%  28.1KiB _ZN12regex_syntax6parser6Parser10parse_expr17he8c342fc4e87f2b8E.llvm.70AE5790
  1.2%  25.4KiB rg::main::h7646776e40b270db
  0.8%  16.5KiB <regex::re_trait::Matches<'t, R> as core::iter::iterator::Iterator>::next::h049d1f466bcc1721
  0.8%  16.1KiB regex::re_bytes::Regex::find_at::h0b7f4ba22741e50d
  0.7%  13.7KiB clap::app::help::Help::write_arg::h9fc1e6aeef3340a6
  0.7%  13.7KiB rg::app::app::h61988c69e077e563
  0.6%  12.9KiB clap::app::validator::Validator::validate::hc27d7700402e9434
  0.6%  12.8KiB _ZN4clap3app4help4Help10write_help17hdd91a18346a72322E.llvm.25366C
  0.6%  12.7KiB clap::app::validator::Validator::validate_matched_args::h172897e493e55d96
  0.6%  12.6KiB je_arena_boot
  0.6%  11.7KiB clap::app::usage::get_required_usage_from::hff617cf2388f9237
  0.6%  11.6KiB stats_arena_print
  0.5%  11.0KiB regex::re_bytes::Regex::shortest_match_at::h2b1f11351112fb90
100.0%   2.0MiB Total

I don’t like the mangling that Rust includes in function names (well, I like why it’s there, but I don’t like how it looks in user-consumable output), so let’s clean that up.

$ cargo bloat --release --trim-fn 
    Finished release [optimized + debuginfo] target(s) in 0.0 secs
 72.5%   1.5MiB [4909 Others]
  5.0% 103.0KiB clap::app::parser::Parser::add_defaults
  3.2%  67.1KiB clap::app::parser::Parser::add_env
  2.4%  50.7KiB <regex::exec::ExecNoSync<'c> as regex::re_trait::RegularExpression>::read_captures_at
  1.9%  39.8KiB clap::app::parser::Parser::parse_short_arg
  1.9%  39.0KiB clap::app::parser::Parser::get_matches_with
  1.8%  38.2KiB clap::app::parser::Parser::parse_long_arg
  1.7%  34.4KiB rg::args::ArgMatches::to_args
  1.4%  28.1KiB _ZN12regex_syntax6parser6Parser10parse_expr17he8c342fc4e87f2b8E.llvm.70AE5790
  1.2%  25.4KiB rg::main
  0.8%  16.5KiB <regex::re_trait::Matches<'t, R> as core::iter::iterator::Iterator>::next
  0.8%  16.1KiB regex::re_bytes::Regex::find_at
  0.7%  13.7KiB clap::app::help::Help::write_arg
  0.7%  13.7KiB rg::app::app
  0.6%  12.9KiB clap::app::validator::Validator::validate
  0.6%  12.8KiB _ZN4clap3app4help4Help10write_help17hdd91a18346a72322E.llvm.25366C
  0.6%  12.7KiB clap::app::validator::Validator::validate_matched_args
  0.6%  12.6KiB je_arena_boot
  0.6%  11.7KiB clap::app::usage::get_required_usage_from
  0.6%  11.6KiB stats_arena_print
  0.5%  11.0KiB regex::re_bytes::Regex::shortest_match_at
100.0%   2.0MiB Total

Much nicer! The --trim-fn flag does exactly what we wanted: it simply removes the hash mangle.

OK, so now we’re starting to see some functions we could take a look at. But really, we’re only concerned with clap, so let’s clear out everything else. By default only the top 20 lines are displayed, which is probably fine since we’re looking for easy wins and not nitpicking functions apart, but simply (rip)grepping through those 20 may not give us functions we can work with. I’d like to look at the top 25 clap-only functions. So to see all the output, then (rip)grep through it, we run the following:

$ cargo bloat --release --trim-fn -n 0 | rg clap | head -n 25
    Finished release [optimized + debuginfo] target(s) in 0.0 secs
  5.0% 103.0KiB clap::app::parser::Parser::add_defaults
  3.2%  67.1KiB clap::app::parser::Parser::add_env
  1.9%  39.8KiB clap::app::parser::Parser::parse_short_arg
  1.9%  39.0KiB clap::app::parser::Parser::get_matches_with
  1.8%  38.2KiB clap::app::parser::Parser::parse_long_arg
  0.7%  13.7KiB clap::app::help::Help::write_arg
  0.6%  12.9KiB clap::app::validator::Validator::validate
  0.6%  12.8KiB _ZN4clap3app4help4Help10write_help17hdd91a18346a72322E.llvm.25366C
  0.6%  12.7KiB clap::app::validator::Validator::validate_matched_args
  0.6%  11.7KiB clap::app::usage::get_required_usage_from
  0.4%   7.5KiB clap::app::usage::create_help_usage
  0.2%   4.4KiB clap::app::parser::Parser::_version
  0.2%   4.3KiB clap::app::help::Help::write_all_args
  0.2%   3.6KiB clap::app::validator::Validator::validate_required
  0.2%   3.3KiB clap::app::parser::Parser::args_in_group
  0.1%   2.1KiB <clap::args::arg::Arg<'a, 'b> as core::convert::From<&'z clap::args::arg::Arg<'a, 'b>>>::from
  0.1%   2.1KiB clap::errors::Error::too_few_values
  0.1%   2.1KiB clap::errors::Error::too_few_values
  0.1%   2.1KiB clap::errors::Error::wrong_number_of_values
  0.1%   2.1KiB clap::errors::Error::wrong_number_of_values
  0.1%   2.0KiB clap::app::help::Help::write_args
  0.1%   2.0KiB clap::suggestions::did_you_mean
  0.1%   2.0KiB clap::suggestions::did_you_mean
  0.1%   2.0KiB clap::suggestions::did_you_mean
  0.1%   2.0KiB clap::errors::Error::invalid_value
  0.1%   2.0KiB clap::errors::Error::invalid_value
  0.1%   1.9KiB clap::app::help::Help::write_bin_name
  0.1%   1.8KiB clap::suggestions::did_you_mean_flag_suffix
  0.1%   1.8KiB clap::app::parser::Parser::create_help_and_version
  0.1%   1.8KiB clap::app::parser::Parser::add_arg_ref

A few things jumped out at me. I expected clap::app::parser::Parser::get_matches_with to be the biggest, because it’s the meat and potatoes of the parsing, but it wasn’t! Another thing I didn’t expect was for several supposedly small functions to be so big. Since I work on clap all the time and am familiar with how it works internally, I knew the highlighted lines above shouldn’t have been so big; they’re meant to be small…so what gives?

Notice there are also what appear to be duplicates. These come from using generics: Rust compiles a separate copy of a generic function for each concrete type it gets used with, a.k.a. monomorphization and static dispatch.
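Here’s a minimal sketch of what that means (nothing to do with clap’s real code; describe is just a made-up helper):

use std::fmt::Display;

// The compiler emits a separate, specialized copy of a generic function for
// every concrete type it's called with, which is why the same helper can show
// up more than once in cargo-bloat's output.
fn describe<T: Display>(value: T) -> String {
    format!("value: {}", value)
}

fn main() {
    // Two instantiations -> two copies of `describe` in the .text section.
    println!("{}", describe("flag"));
    println!("{}", describe(42));
}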

When I took a look at clap::app::parser::Parser::add_defaults I saw something that caught my eye…macros.

The full code is here. But essentially, this is what my eyes saw:

fn add_defaults(&mut self, matcher: &mut ArgMatcher<'a>) -> ClapResult<()> {
    macro_rules! im_huge {
        ($foo:ident) => {
            /*
                statements
            */
            if bar {
                im_bigger!($foo)
            } else {
                im_bigger!($foo)
            }
            /*
                statements
            */
        }
    }
    for i in 0..10 {
        im_huge!(i);
    }
    for i in &some_vec {
        im_huge!(i);
    }
    Ok(())
}

Now this isn’t the actual code, because the above could be solved in a multitude of trivial ways. The point I’m making is that I defined a macro because it needed to be called over two distinct types and I didn’t want to duplicate the code. But inside that macro, I called even more macros, and more macros, again, and again. It was like macro-ception! I thought this would be simple to fix!
So…no more macros?

Macros themselves aren’t the issue. They’re extremely handy. I tend to use them instead of duplicating code, when borrowck complains about the exact same code living in a function (because borrowck can’t peek into functions). The problem with doing this is that it’s basically SUPER aggressive inlining.
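To make that concrete, here’s a toy sketch of what “a macro instead of a function” costs (the post_process! macro below is made up; it is not clap’s real arg_post_processing!):

// Every invocation of a macro pastes its entire body into the call site,
// like inlining you can't opt out of.
macro_rules! post_process {
    ($val:expr) => {{
        // Imagine dozens of lines of validation here; each call site below
        // gets its own full copy of them in the compiled output.
        if $val.is_empty() {
            println!("error: empty value");
        } else {
            println!("ok: {}", $val);
        }
    }};
}

fn main() {
    let short_arg = "s";
    let long_arg = "long";
    // Two invocations -> two expanded copies of the body.
    post_process!(short_arg);
    post_process!(long_arg);
}

A plain function would keep one copy of that body in the binary; the macro duplicates it at every use.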

It was like a gateway drug. Copy one line and everything works? Sure! Turns out that one line expands into several hundred…

Looking at this code got me thinking about my use of macros, and about what is actually being expanded. With this newfound suspicion of macros, I decided to look at my use cases more carefully, starting with the ones inside add_defaults. The most used macro happened to be arg_post_processing!, which did things like validating values, the number of occurrences, group requirements, etc. I noticed it was in some redundant places; much like im_bigger! in the bogus example above could simply be lifted out of the if/else block, so could arg_post_processing!.

So I pulled those out. Easy win. But what else? Well if I had some redundant code already, there could be more.

…wait…a…second… 💡
Cutting Carbs

I noticed something more sinister than macros, something that, when combined with a macro’s uncanny ability to expand into hundreds of lines of code, could quickly pile on clap’s weight. I was validating each argument/value as it was parsed, i.e. trying to produce an error as early as possible if there was one. This is exactly what arg_post_processing! was doing. So why not just accept all args and values, then validate them once at the end?
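Here’s a rough sketch of that idea (parse_lazily is a made-up stand-in, not clap’s actual parser):

// Instead of running validation inside the parse loop for every argument
// (eager), collect everything first and validate in one pass at the end
// (lazy), so the validation code only needs to exist in one place.
fn parse_lazily(raw_args: &[&str]) -> Result<Vec<String>, String> {
    // Parse phase: just accept everything we see.
    let matched: Vec<String> = raw_args.iter().map(|s| s.to_string()).collect();

    // Validation phase: a single pass over the collected matches at the end.
    for value in &matched {
        if value.is_empty() {
            return Err(String::from("empty value supplied"));
        }
    }
    Ok(matched)
}

fn main() {
    match parse_lazily(&["--context", "2", "pattern"]) {
        Ok(args) => println!("accepted {} args", args.len()),
        Err(msg) => eprintln!("error: {}", msg),
    }
}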

By not trying to validate arguments early, every use of arg_post_processing! was eliminated, as was every macro it called in turn. It turned out those top functions I listed a while ago as “should have been small but weren’t” were also users of arg_post_processing! and company.

There were some refactoring pains as I moved to this new paradigm of lazy validation, but overall it was pretty straightforward.

Let’s see if it worked! 🤞

$ cargo bloat --release --crates 
    Finished release [optimized + debuginfo] target(s) in 0.0 secs
 31.8% 574.0KiB [Unknown]
 31.3% 564.9KiB std
 12.3% 205.0KiB clap
 10.9% 196.0KiB regex
  3.9%  70.0KiB ignore
  3.6%  65.0KiB regex_syntax
  1.8%  32.1KiB encoding_rs
  1.7%  31.6KiB globset
  0.9%  16.0KiB aho_corasick
  0.8%  13.7KiB grep
  0.5%   9.5KiB crossbeam
  0.5%   9.0KiB walkdir
  0.3%   5.1KiB termcolor
  0.3%   4.9KiB textwrap
  0.2%   4.2KiB env_logger
  0.2%   3.3KiB thread_local
  0.2%   3.2KiB ansi_term
  0.1%   2.6KiB same_file
  0.1%   2.1KiB strsim
  0.1%   2.0KiB vec_map
100.0%   1.8MiB Total

Woooo! That was pretty painless! But can we do more?

Remember how I pointed out the generics earlier (all the duplicate functions)? When I looked back at the list of all clap functions (which is too long to post here), I noticed how many clap::errors::Error::* functions there were! These are also all in the failure path, so there isn’t a huge reason to chase the performance benefits of monomorphization/static dispatch and generics. Why not just use trait objects and dynamic dispatch, which would reduce the code duplication? Where there are three copies of the same function, it would go down to a single function, and all of those added together could equal another easy win.
Trait Objects

I’ve heard it said that the Rust community is far too afraid of dynamic dispatch and heap allocation. I think this is probably true, but usually not a bad thing. In my case, however, an argument parsing library should probably try to be small, and optimizing the failure case for speed is probably silly. I’ve always leaned away from dynamic dispatch out of those same fears, so I’m guilty of it too. Maybe I could try it out here and see if it’s a place that could benefit.

To be clear, there are issues when working with trait objects, but this particular case is about as simple as it gets, so there shouldn’t be any trouble. I’ve also heard it said that there are times when dynamic dispatch can outperform static dispatch because of instruction caches. But again, my particular case is in the failure path, so performance isn’t the paramount concern.

So I set out to change my error functions from something like this…

#[doc(hidden)]
    pub fn empty_value<'a, 'b, A, U>(arg: &A, usage: U, color: ColorWhen) -> Self
    where
        A: AnyArg<'a, 'b> + Display,
        U: Display,
    {
        let c = Colorizer::new(ColorizerOption {
            use_stderr: true,
            when: color,
        });
        Error {
            message: format!(
                "{} The argument '{}' requires a value but none was supplied\
                 \n\n\
                 {}\n\n\
                 For more information try {}",
                c.error("error:"),
                c.warning(arg.to_string()),
                usage,
                c.good("--help")
            ),
            kind: ErrorKind::EmptyValue,
            info: Some(vec![arg.name().to_owned()]),
        }
}

…to this…

#[doc(hidden)]
    pub fn empty_value<'a, 'b, U>(arg: &AnyArg, usage: U, color: ColorWhen) -> Self
    where
        U: Display,
    {
        let c = Colorizer::new(ColorizerOption {
            use_stderr: true,
            when: color,
        });
        Error {
            message: format!(
                "{} The argument '{}' requires a value but none was supplied\
                 \n\n\
                 {}\n\n\
                 For more information try {}",
                c.error("error:"),
                c.warning(arg.to_string()),
                usage,
                c.good("--help")
            ),
            kind: ErrorKind::EmptyValue,
            info: Some(vec![arg.name().to_owned()]),
        }
}

Notice the removal of the generic type A and its trait bound. Instead, there is a trait object &AnyArg, which essentially does the same thing from the user’s point of view, but instead of a concrete type being compiled into a specialized function, we’ll be using two pointers (one to the “data” and one to a vtable) at runtime. Sometimes this is referred to as ‘type erasure’ because it erases the compiler’s knowledge of the type that will be used and instead relies on runtime behavior.
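As a small illustration of that “two pointers” point (not clap’s code, and written with the newer dyn syntax, which the clap snippet above predates):

use std::fmt::Display;
use std::mem::size_of;

// A trait object reference is a fat pointer: one pointer to the data and one
// to the vtable, and the call below is resolved through the vtable at runtime.
fn print_it(value: &dyn Display) {
    // One compiled copy of this function serves every type implementing Display.
    println!("{}", value);
}

fn main() {
    print_it(&42);
    print_it(&"hello");

    // On a typical 64-bit target, a plain reference is one pointer wide while
    // a trait object reference is two.
    println!("&i32:         {} bytes", size_of::<&i32>());
    println!("&dyn Display: {} bytes", size_of::<&dyn Display>());
}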

There is a great writeup on Trait Objects here.
Current Weight

Alright, so with these fixes complete…what did we weigh in at?

$ cargo bloat --release --crates
    Finished release [optimized + debuginfo] target(s) in 0.0 secs
 31.8% 574.0KiB [Unknown]
 31.3% 564.9KiB std
 10.9% 196.0KiB regex
 10.5% 189.1KiB clap
  3.9%  70.0KiB ignore
  3.6%  65.0KiB regex_syntax
  1.8%  32.1KiB encoding_rs
  1.7%  31.6KiB globset
  0.9%  16.0KiB aho_corasick
  0.8%  13.7KiB grep
  0.5%   9.5KiB crossbeam
  0.5%   9.0KiB walkdir
  0.3%   5.1KiB termcolor
  0.3%   4.9KiB textwrap
  0.2%   4.2KiB env_logger
  0.2%   3.3KiB thread_local
  0.2%   3.2KiB ansi_term
  0.1%   2.6KiB same_file
  0.1%   2.1KiB strsim
  0.1%   2.0KiB vec_map
100.0%   1.8MiB Total

Nice! That’s nearly a 60% decrease in clap’s size! Also, because we’re running far fewer instructions, we’ve actually picked up a slight performance boost too! On top of that, we really didn’t have to do that much work!
Still to Come

So what’s left? TONS! These were just two quick issues; I’m sure there is SO. MUCH. MORE. that could be optimized and reduced. In the back of my mind I still have the v3 changes, where I know a lot of this has already been fixed and de-duplicated too. I’m excited to see what else I can do to tune this library.

If you’re interested in helping, stop by the repository or Gitter chat! I’d love to help mentor some new contributors!
Conclusion

This all comes down to a big thanks to cargo-bloat. It pointed me in the right direction to find some logic duplication and made me look at some crazy macro expansions. I’ll still use macros, and probably still duplicate logic from time to time, but cargo-bloat really, really helped point this out to me. I’m still looking through the function list output to see if there are any more easy wins, and those will come with time.

For now, you can use v2.29.1 to get the latest and greatest lean, mean clap.

I encourage everyone to take a look at their projects and see if they too can lose a little weight at the start of this year 😉

Kevin
