Saturday, September 6, 2025

Trying out import std

Since C++ compilers are starting to support import std, I ran a few experiments to see where things currently stand. GCC 15 on the latest Ubuntu was used for all of the following.

The goal

One of the main goals of a working module implementation is to be able to support the following workflow:

  • Suppose we have an executable E
  • It uses a library L
  • L and E are made by different people and have different Git repositories and all that
  • We want to take the unaltered source code of L, put it inside E and build the whole thing (in Meson parlance this is known as a subproject)
  • Build files do not need to be edited, i.e. the source of L is immutable
  • Make the build as fast as reasonably possible

The simple start

We'll start with a helloworld example.
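
A minimal standalone.cpp along these lines will do (the exact text printed is of course arbitrary, the file name matches the compile command below):

// standalone.cpp
import std;

int main() {
    std::println("Hello, world!");
    return 0;
}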

This requires two compiler invocations.

g++-15 -std=c++26 -c -fmodules -fmodule-only -fsearch-include-path bits/std.cc

g++-15 -std=c++26 -fmodules standalone.cpp -o standalone

The first invocation compiles the std module and the second one uses it. There is already some wonkiness here. For example, the documentation for -fmodule-only says that it only produces the module output, not an object file. However, it also tries to link the result into an executable, so you have to give it the -c argument to tell it to only create the object file, which the other flag then tells it not to create.

Building the std module takes 3 seconds and the program itself takes 0.65 seconds. Compiling without modules takes about a second, but only 0.2 seconds if you use iostream instead of println.

The module file itself goes to a directory called gcm.cache in the current working directory, which in this case means something like gcm.cache/std.gcm.

All in all this is fairly painless so far.

So ... ship it?

Not so fast. Let's see what happens if you build the module with a different language standard version than the consuming executable.
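
For example, keeping the commands from above but compiling the module as C++23 and the consumer as C++26 (any mismatching pair will do):

g++-15 -std=c++23 -c -fmodules -fmodule-only -fsearch-include-path bits/std.cc

g++-15 -std=c++26 -fmodules standalone.cpp -o standalone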

It detects the mismatch and errors out. Which is good, but also raises questions. For example, what happens if you build the module without any extra definitions but the consuming app with -DNDEBUG? In my testing it worked, but is it just a case of getting lucky with the UB slot machine? I don't know. What should happen? I don't know that either. Unfortunately there is an even bigger issue lurking about.

Clash of File Name Clans

If you are compiling with Ninja (and you should), all compiler invocations are made from the same directory (the build tree root). GCC also does not seem to provide a compiler flag to change the location of the gcm.cache directory (or at least it does not seem to be in the docs). Thus if you have two targets that both use import std, their compiled modules get the same output file name. They would clobber each other, so Ninja will refuse to build them (Make probably ignores this, so the files end up clobbering each other and, if you are very lucky, that only causes a build error).

Assuming that you can detect this and deduplicate building the std module, the end result still has a major limitation: you can only ever have one standard library module across all of your build targets. Personally I would be all for forcing this over the entire build tree, but it is a limitation that can't really be imposed on existing projects. Sadly I know this from experience. People are doing weird things out there and they want to keep on weirding on. Sometimes even for valid technical reasons.

Even if this issue were fixed, it would not really help. As you could probably tell, this clashing will happen for all modules. So if your build ever has two modules called utils, no matter where they are or who wrote them, they will both try to write gcm.cache/utils.gcm and either fail to build, fail on import or invoke UB.
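
To make that concrete (a made-up sketch, the file and function names are invented), imagine two unrelated subprojects that each ship a module interface file like this:

// somelib/utils.cpp in one subproject, otherlib/utils.cpp in the other
export module utils;

export int helper(int x) {
    return x + 1;
}

Both compilations, when invoked from the build tree root, will want to write gcm.cache/utils.gcm.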

Having the build system work around this by changing the working directory to implicitly make the cache directory go elsewhere (and repoint all paths at the same time) is not an option. All process invocations must be doable from the top level directory. This is the hill I will die on if I must!

Instead what is needed is something like the target private directory proposal I made ages ago. With that you'd end up with command line arguments roughly like this:

g++-15 <other args> --target-private-dir=path/to/foo.priv --project-private-dir=toplevel.priv

The build system would guarantee that all compilations for a single target (library, exe, etc) have the same target private directory and all compilations in the build tree get the same top level private directory. This allows the compiler to do some build optimizations behind the scenes. For example, if it needs to build a std module, it could copy it into the top level private directory and other targets could copy it from there instead of building it from scratch (assuming it is compatible and all that).

Sunday, August 31, 2025

We need to seriously think about what to do with C++ modules

Note: Everything that follows is purely my personal opinion as an individual. It should not be seen as any sort of policy of the Meson build system or any other person or organization. It is also not my intention to throw anyone involved in this work under a bus. Many people have worked to the best of their abilities on C++ modules, but that does not mean we can't analyze the current situation with a critical eye.

The lead on this post is a bit pessimistic, so let's just get it out of the way.

If C++ modules cannot show a 5× compilation time speedup (preferably 10×) on multiple existing open source code bases, modules should be killed and taken out of the standard. Without this speedup, pouring any more resources into modules is just feeding the sunk cost fallacy.

That seems like a harsh thing to say about such a massive undertaking that promises to make things so much better. It is not something that you can just belt out and then mic drop yourself out. So let's examine the whole thing in unnecessarily deep detail. You might want to grab a cup of $beverage before continuing; this is going to take a while.

What do we want?

For the average developer the main visible advantages would be the following, ordered from the most important to the least.

  1. Much faster compilation times.
If you look at old presentations and posts from back in the day when modules were voted in (approximately 2018-2019), this is the big talking point. This makes perfect sense, as the "header inclusion" way is an O(N²) algorithm (every translation unit re-parses more or less the same headers over and over) and parsing C++ source code is slow. Splitting the code between source and header files is busywork one could do without. The core idea behind modules is that if you can store the "headery" bit in a preprocessed binary format that can be loaded from disk, things become massively faster.

Then, little by little, build speed seems to have fallen by the wayside and the focus has shifted towards "build isolation". This means avoiding bugs caused by things like macro leakage, weird namespace lookup issues and so on. Performance is still kind of there, but the numbers are a lot smaller, spoken aloud much more rarely and often omitted entirely. Now, getting rid of these sorts of bugs is fundamentally a good thing. However it might not be the most efficient use of resources. Compiler developer time is, sadly, a zero sum game, so we should focus their skills and effort on the things that provide the best results.

Macro leakage and other related issues are icky, but they are on average fairly rare. I have encountered a bug caused by them maybe once or twice a year. They are just not that common for the average developer. Things are probably different for people doing deep low level metaprogramming hackery, but those people are a minuscule fraction of the total developer base. On the other hand, slow build times are the bane of every single C++ developer's existence every single day. It is, without question, the narrowest bottleneck for developer productivity today and the main issue modules were designed to solve. They don't seem to be doing that nowadays.

How did we end up here in the first place?

C++ modules were a C++ 20 feature. If a feature takes over five years of implementation work to get even somewhat working, you might ponder how it was accepted in the standard in the first place. As I was not there when it happened, I do not really know. However I have spoken to people who were present at the actual meetings where things were discussed and voted on. Their comments have been enlightening to say the least.

Apparently there were people who knew about the implementation difficulty and other fundamental problems and were quite vocal that modules as specified are borderline unimplementable. They were shot down by a group of "higher up" people saying that "modules are such an important feature that we absolutely must have them in C++ 20".

One person who was present told me: "that happened seven years ago [there is a fair bit of lead time in ISO standards] and [in practice] we still have nothing. In another seven years, if we are very lucky, we might have something that sort of works".

The integration task from hell

What sets modules apart from almost all other features is that they require very tight integration between compilers and build systems. This means coming up with schemes for things like what module files actually contain, how they are named, how they are organized in big projects and how to best divide the work between the different tools. Given that the ISO standard does not even acknowledge the fact that source code might reside in a file, none of this is in its purview. It is not in anybody's purview.

The end result of all that is that everybody has gone into their own corner, done the bits that are the easiest for them and hoped for the best. To illustrate how bad things are, I have been in discussions with compiler developers about this. In said discussions various avenues were considered on how to get things actually working, but one compiler developer replied "we do not want to turn the compiler into a build system" to every single proposal, no matter what it was. The experience was not unlike talking to a brick wall. My guess is that the compiler team in question did not have the resources to change their implementation, so vetoing everything became the sensible approach for them (though not for the module world in general).

The last time I looked into adding module support to Meson, things were so mind-bogglingly terrible that you needed to create additional compiler flags during compilation, store them in temp files and pass them along to subsequent compilation commands. I wish I was kidding, but I am not. It's quite astounding that the module work started basically from Fortran modules, which are simple and work (in production even), and ended up in their current state, a kafkaesque nightmare of complexity which does not work.

If we look at the whole thing from a project management viewpoint, the reason for this failure is fairly obvious. This is a big change across multiple isolated organizations. The only real way to get those done is to have a product owner who a) is extremely good at their job b) is tasked with and paid to get the thing done properly c) has sufficient stripes to give orders to the individual teams and d) has no problems slapping people on metaphorical wrists if they try to weasel out of doing their part.

Such a person does not exist in the modules space. It is arguable whether such a person could exist even in theory. Because of this modules can never become good, even though "good" is a reasonable bar to expect a foundational piece of technology to reach.

The design that went backwards

If there is one golden rule of software design, it is "Do not do a grand design up front". This is mirrored in the C++ committee's guideline of "standardize existing practice".

C++ modules may be the grandest up-frontest design the computing world has ever seen. There were no implementations (one might argue there still aren't, but I digress), no test code, no prototypes, nothing. Merely a strong opinion of "we need this and we need it yesterday".

For the benefit of future generations, a better way to approach the task would have gone something like this. First you implement enough in the compiler to be able to produce one module file and then consume it in a different compilation unit. Keep it as simple as possible. It's fine to only serialize a subset of functionality and error out if someone tries to go outside the lines. Then take a build system and make it drive that. Then expand that to support a simple project, say, one that has ten source files and produces one executable. Implement features in the module file until you can compile the whole thing. Then measure the output. If you do not see performance increases, stop further development until you either find out why that is or you can fix your code to work better. Now you update the API so that no part of the integration makes people's eyes bleed in horror. Then scale the prototype to handle a project with 100 sources. Measure again. Improve again. Then do a pair of 100-source projects, one that produces a library and one that creates an executable using the library. Measure again. Improve again. Then do 1000 sources in 10 subprojects. Repeat.

If the gains are there, great, now you have a base implementation that has been proven to work with real world code and which can be expanded to a full implementation. If the implementation can't be made fast and clean, that is a sign that there is a fundamental design flaw somewhere. Throw your code away and either start from scratch or declare the problem too difficult and work on something else instead.

Hacking on an existing C++ compiler is really difficult and it takes months of work to even get started. If someone wants to try to work on modules but does not want to dive into compiler development, I have implemented a "module playground", which consists of a fake C++ compiler, a fake build system and a fake module scanner all in ~300 lines of Python.

The promise of import std

There is a second way of doing modules in an iterative fashion and it is actually being pursued by C++ implementers, namely import std. This is a very good approach in several different ways. First of all, the most difficult part of modules is the way compilations must be ordered. For the standard library this is not an issue, because it has no dependencies and you can generate all of it in one go. The second thing is the fact that most of the slowness of typical C++ development comes from the standard library. For reference, merely doing an #include<vector> brings in 27 000 lines of code, and that is a fairly small amount compared to many other common headers.
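
If you want to verify that number on your own setup, counting the preprocessed output is enough (the exact figure depends on the compiler and standard library version):

echo '#include <vector>' | g++-15 -std=c++26 -E -x c++ - | wc -l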

What sort of an improvement can we expect from this on real world code bases? Implementations are still in flux, so let's estimate using the information we have. The way import std is used depends on the compiler, but roughly:

  1. Replace all #include statements for standard library headers with import std.
  2. Run the compiler in a special mode.
  3. The compiler parses headers of the standard library and produces some sort of a binary representation of them.
  4. The representation is written to disk.
  5. When compiling normally, add compiler flags that tell the compiler to load the file in question before processing actual source code.

If you are thinking "wait a minute, if we remove step #1, this is exactly how precompiled headers work", you are correct. Conceptually it is pretty much the same and I have been told (but have not verified myself) that in GCC at least module files are just repurposed precompiled headers with all the same limitations (e.g. you must use all the same compiler flags to use a module file as you did when you created it).
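
For reference, the classic GCC precompiled header workflow that this mirrors looks roughly like the following (the header name is made up for illustration):

g++-15 -std=c++26 -x c++-header common.hpp   # produces common.hpp.gch
g++-15 -std=c++26 -c main.cpp                # an #include "common.hpp" in main.cpp now loads the .gch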

Barring a major breakthrough in compiler data structure serialization, the expected speedup should be roughly equivalent to the speedup you get from precompiled headers. Which is to say, maybe 10-20% with Visual Studio and a few percentage points on Clang and GCC. OTOH if such a serialization improvement has occurred, it could probably be adapted to be usable in precompiled headers, too. Until someone provides verifiable measurements proving otherwise, we must assume that is the level of achievable improvement.

For reference, here is a Reddit thread where people report improvements in the 10-20% range.

But why 5×?

A reasonable requirement for the speedup would be "better than can be achieved using currently available tools and technologies". As an experiment I wrote a custom standard library (deliberately not API compatible with the ISO one) whose main design goal was to be fast to compile. I then took an existing library, converted it to use the new library and measured. The code compiled four times faster. In addition the binary it produced was smaller and, unexpectedly, ran faster. Details can be found in this blog post.

Given that 4× is already achievable (though, granted, only tested on one project, not proven in general), 5× seems like a reasonable target.

But what's in it for you?

The C++ standard committee has done a lot of great (and highly underappreciated) work to improve the language. On several occasions Herb Sutter has presented new functionality with "all you have to do is to recompile your code with a new compiler and the end result runs faster and is safer". It takes a ton of work to get these kinds of results, and it is exactly where you want to be.

Modules are not there. In fact they are in the exact opposite corner.

Using modules brings with it the following disadvantages:

  1. Need to rewrite (possibly refactor) your code.
  2. Loss of portability.
  3. Module binary files (with the exception of MSVC) are not portable so you need to provide header files for libraries in any case.
  4. The project build setup becomes more complicated.
  5. Only the newest toolchain versions work at all (at the time of writing Apple's module support is listed as "partial").
In exchange for all this you, the regular developer-about-town, get the following advantages:

  1. Nothing.

Tuesday, August 26, 2025

Reimplementing argparse in Pystd

One of the pieces of the Python standard library I tend to use the most is argparse. It is really convenient, so I chose to implement it in Pystd. The conversion was fairly simple, with one exception. As C++ is not duck typed, adapting the behaviour to be strictly typed while still feeling "the same" took some thinking. Parsing command line options is also quite complicated and filled with weird edge cases. For example, if you have short options -a and -b, then according to some command line parsers (but not others) an argument of -ab is valid (but only sometimes).

I chose to ignore all the hard bits and instead focus on the core parts that I use most of the time, meaning:

  • Short and long command line arguments
  • Arguments can be of type bool, int, string or array of strings
  • Full command line help
  • Same actions as in argparse
The main missing feature is subcommands (and all of the edge cases, obviously). That being said, approximately 400 lines of code later, we have this:

The only cheat is that I did the column alignment by hand due to laziness. Using the Pystd implementation resembles regular argparse (full code is here). First you create the parser and options:
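
Something along these lines; the names here are invented for illustration and the real Pystd API may differ in details:

// Hypothetical sketch, not the verbatim Pystd API.
pystd::ArgParse parser("testprog", "A small demo application.");
parser.add_argument("-o", "--output", pystd::ArgumentType::String, "Output file name.");
parser.add_argument("-v", "--verbose", pystd::ArgumentType::Boolean, "Print extra information.");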

Then you do the parse and use the results:
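
Again just a hypothetical sketch of the shape of the code rather than the exact API:

// Hypothetical sketch, not the verbatim Pystd API.
auto args = parser.parse_args(argc, argv);
if(args.get_bool("verbose")) {
    // ... be chatty ...
}
auto output_file = args.get_string("output");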

Compiling this test application and all support code from scratch takes approximately 0.2 seconds.

Wednesday, August 6, 2025

Let's properly analyze an AI article for once

Recently the CEO of Github wrote a blog post called Developers reinvented. It was reposted with various clickbait headings like GitHub CEO Thomas Dohmke Warns Developers: "Either Embrace AI or Get Out of This Career" (that one feels like an LLM generated summary of the actual post, which would be ironic if it wasn't awful). To my great misfortune I read both of these. Even if we ignore whether AI is useful or not, the writings contain some of the absolute worst reasoning and most stretched logical leaps I have seen in years, maybe decades. If you are ever in need of finding out how not to write a "scientific" text on any given subject, this is the disaster area for you.

But before we begin, a detour to the east.

Statistics and the Soviet Union

One of the great wonders of statistical science of the previous century was without a doubt the Soviet Union. They managed to invent and perfect dozens of ways to turn data to their liking, no matter the reality. Almost every official statistic issued by the USSR was a lie. Most people know this. But even most of those people do not grasp just how much the stats differed from reality. I sure didn't until I read this book. Let's look at some examples.

Only ever report percentages

The USSR's glorious statistics tended to be of the type "manufacturing of shoes grew over 600% this five year period". That certainly sounds a lot better than "In the last five years our factory made 700 pairs of shoes as opposed to 100" or even "7 shoes instead of 1". If you are really forward thinking, you can even cut down shoe production in the five year periods when you are not being measured. It makes the stats even more impressive, even though in reality many people have no shoes at all.

The USSR classified the real numbers as state secrets because the truth would have made them look bad. If a corporation only gives you percentages, they may be doing the same thing. Apply skepticism as needed.

Creative comparisons

The previous section said the manufacturing of shoes has grown. Can you tell what it is not saying? That's right, growth over what? It is implied that the comparison is to the previous five year plan. But it is not. Apparently a common comparison in these cases was the production amounts of the year 1913. This "best practice" was not only used in the early part of the 1900s, it was used far into the 1980s.

Some of you might wonder why 1913 and not 1916, which was the last year before the Bolsheviks took over. Simply because 1913 was the century's worst year for Russia as a whole. So if you encounter a claim that "car manufacturing was up 3700%" some year in 1980s Soviet Union, now you know what that actually meant.

"Better" measurements

According to official propaganda, the USSR was the world's leading country in wheat production. In this case they even listed out the production in absolute tonnes. In reality it was all fake. The established way of measuring wheat yields is to measure the "dry weight", that is, the mass of final processed grains. When it became apparent that the USSR could not compete with imperial scum, they changed their measurements to "wet weight". This included the mass of everything that came out from the nozzle of a harvester, such as stalks, rats, mud, rain water, dissidents and so on.

Some people outside the iron curtain even believed those numbers. Add your own analogy between those people and modern VC investors here.

To business then

The actual blog post starts with this thing that can be considered a picture.

What message would this choice of image tell about the person using it in their blog post?

  1. Said person does not have sufficient technical understanding to grasp the fact that children's toy blocks should, in fact, be affected by gravity (or that perspective is a thing, but we'll let that pass).
  2. Said person does not give a shit about whether things are correct or could even work, as long as they look "somewhat plausible".
Are these the sort of traits a person in charge of the largest software development platform on Earth should have? No, they are not.

To add insult to injury, the image seems to have been created with the Studio Ghibli image generator, which Hayao Miyazaki described as an abomination on art itself. Cultural misappropriation is high on the list of core values at Github HQ it seems.

With that let's move on to the actual content, which is this post from Twitter (to quote Matthew Garrett, I will respect their name change once Elon Musk starts respecting his child's).

Oh, wow! A field study. That makes things clear. With evidence and all! How can we possibly argue against that?

Easily. As with a child.

Let's look at this "study" (and I'm using the word in its loosest possible sense here) and its details with an actual critical eye. The first thing is statistical representativeness. The sample size is 22. According to this sample size calculator I found, the required sample size for a population of just one thousand people would be 278, but, you know, one order of magnitude one way or another, who cares about those? Certainly not business big shot movers and shakers. Like Stockton Rush, for example.
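
For the record, that 278 is just the stock formula with the usual defaults, i.e. a 95% confidence level and a 5% margin of error: n0 = 1.96² × 0.25 / 0.05² ≈ 384, and with the finite population correction n = n0 / (1 + (n0 - 1) / 1000) ≈ 278.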

The math above assumes an unbiased sampling. The post does not even attempt to answer whether that is the case. It would mean getting answers to questions like:

  • How were the 22 people chosen?
  • How many different companies, skill levels, nationalities, genders, age groups etc were represented?
  • Did they have any personal financial incentive on making their new AI tools look good?
  • Were they under any sort of duress to produce the "correct" answers?
  • What was/were the exact phrase(s) that was asked?
  • Were they the same for all participants?
  • Was the test run multiple times until it produced the desired result?
The last of these is an age old trick where you run a test with random results over and over on small groups. Eventually you will get a run that points the way you want. Then you drop the earlier measurements and publish the last one. In "the circles" this is known as data set selection.

Just to be sure, I'm not saying that is what they did. But if someone drove a dump truck full of money to my house and asked me to create a "study" that produced these results, that is exactly how I would do it. (I would not actually do it because I have a spine.)

Moving on. The main headline grabber is "Either you embrace AI or get out of this career". If you actually read the post (I know), what you find is that this is actually a quote from one of the participants. It's a bit difficult to decipher from the phrasing but my reading is that this is not a grandstanding hurrah of all things AI, but more of a "I guess this is something I'll have to get used to" kind of submission. That is not evidence, certainly not of the clear type. It is an opinion.

The post then goes on a buzzword salad tour of statements that range from the incomprehensible to the puzzling. Perhaps the weirdest is this nugget on education:

Teaching [programming] in a way that evaluates rote syntax or memorization of APIs is becoming obsolete.

It is not "becoming obsolete". It has been considered the wrong thing to do for as long as computer science has existed. Learning the syntax of most programming languages takes a few lessons, the rest of the semester is spent on actually using the language to solve problems. Any curriculum not doing that is just plain bad. Even worse than CS education in Russia in 1913.

You might also ponder that if the author is so out of touch with reality in this simple issue, how completely off base the rest of his statements might be. In fact the statement is so wrong at such a fundamental level that it has probably been generated with an LLM.

A magician's shuffle

As nonsensical as the Twitter post is, we have not yet even mentioned the biggest misdirection in it. You might not even have noticed it yet. I certainly did not until I read the actual post. See if you can spot it.

Ready? Let's go.

The actual fruit of this "study" boils down to this snippet.

Developers rarely mentioned “time saved” as the core benefit of working in this new way with agents. They were all about increasing ambition.

Let that sink in. For the last several years the main supposed advantage of AI tools has been the fact that they save massive amounts of developer time. This has led to the "fire all your developers and replace them with AI bots" trend sweeping the nation. Now even this AI advertisement of a "study" cannot find any such advantages and starts backpedaling into something completely different. Just like we have always been at war with Eastasia, AI has never been about "productivity". No. No. It is all about "increased ambition", whatever that is. The post then carries on with this even more baffling statement.

When you move from thinking about reducing effort to expanding scope, only the most advanced agentic capabilities will do.

Really? Only the most advanced agentics you say? That is a bold statement to make given that the leading reason for software project failure is scope creep. This is the one area where human beings have decades long track record for beating any artificial system. Even if machines were able to do it better, "Make your project failures more probable! Faster! Spectacularer!" is a tough rallying cry to sell. 

 To conclude, the actual findings of this "study" seem to be that:

  1. AI does not improve developer productivity or skills
  2. AI does increase developer ambition
This is strictly worse than the current state of affairs.

Wednesday, July 23, 2025

Comparing a red-black tree to a B-tree

 In an earlier blog post we found that optimizing the memory layout of a red-black tree does not seem to work. A different way of implementing an ordered container is to use a B-tree. It was originally designed to be used for on-disk data. The design principle was that memory access is "instant" while disk access is slow. Nowadays this applies to memory access as well, as cache hits are "instant" and uncached memory is slow.

I implemented a B-tree in Pystd. Here is how all the various containers compare. For test data we used numbers from zero to one million in a random order.


As we can see, an unordered map is massively faster than any ordered container. If your data does not need to be ordered, that is the one you should use. For ordered data, the B-tree is clearly faster than either red-black tree implementation.

Tuning the B-tree

B-trees have one main tunable parameter, namely the spread factor of the nodes. In the test above it was five, but for on-disk purposes the recommended value is "in the thousands". Here's how altering the value affects performance.


The sweet spot seems to be in the 256-512 range, where the operations are 60% faster than with std::set. As the spread factor grows towards infinity, the B-tree reduces to just storing all data in a single sorted array. Inserting into that is an O(N²) algorithm, as can be seen here.
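
To make the tunable concrete, this is roughly where the spread factor lives in the data structure (an assumed shape for illustration, not the actual Pystd code):

#include <cstddef>
#include <vector>

// Sketch: the spread factor caps how many keys a node may hold before it has
// to be split in two (exact split rules vary between implementations).
template<typename Key, std::size_t SpreadFactor>
struct BTreeNode {
    std::vector<Key> keys;            // kept sorted, at most SpreadFactor entries
    std::vector<BTreeNode*> children; // empty for leaves, keys.size() + 1 otherwise

    bool needs_split() const { return keys.size() > SpreadFactor; }
};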

Getting weird

The B-tree implementation has many assert calls to verify the internal structures. We can compile the code with -DNDEBUG to make all those asserts disappear. Removing redundant code should make things faster, so let's try it.
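
As a reminder of what the flag does, here is a generic example (not Pystd code): the whole check vanishes from the generated code when NDEBUG is defined.

#include <cassert>

int get_checked(const int *data, int size, int index) {
    assert(index >= 0 && index < size); // compiled to nothing with -DNDEBUG
    return data[index];
}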

There are 13 measurements in total and disabling asserts (i.e. enabling NDEBUG) makes the code run slower in 8 of those cases. Let's state that again, since it is very unexpected: in this particular measurement, code with assertions enabled runs faster than the same code without them. This should not be happening. What could be causing it?

I don't know for sure, so here is some speculation instead.

First of all, a result of 8/13 is probably not statistically significant enough to say that enabling assertions makes things faster. OTOH it does mean that enabling them does not make the code run noticeably slower. So I guess we can say that both ways of building the code are approximately equally fast.

As to why that is, things get trickier. Maybe GCC's optimizer is just really good at removing unnecessary checks. It might even be that the assertions give the compiler more information so it can skip generating code for things that can never happen. I'm not a compiler engineer, so I'll refrain from speculating further; it would probably be wrong in any case.

Saturday, July 5, 2025

Deoptimizing a red-black tree

An ordered map is typically slower than a hash map, but it is needed every now and then. Thus I implemented one in Pystd. This implementation does not use individually allocated nodes, but instead stores all data in a single contiguous array.

Implementing the basics was not particularly difficult. Debugging it to actually work took ages of staring at the debugger, drawing trees by hand on paper, printfing things out in Graphviz format and copypasting the output to a visualiser. But eventually I got it working. Performance measurements showed that my implementation is faster than std::map but slower than std::unordered_map.

So far so good.

The test application creates a tree with a million random integers. This means that the nodes are most likely in a random order in the backing store and searching through them causes a lot of cache misses. Having all nodes in an array means we can rearrange them for better memory access patterns.

I wrote some code to reorganize the nodes so that the root is at the first spot and the remaining nodes are stored layer by layer. In this way the query always processes memory "left to right", making things easier for the branch predictor.
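
A minimal sketch of that reordering (the node layout here is assumed for illustration, not the actual Pystd code): walk the tree breadth first, copy nodes into a new array in visit order and remap the child indices afterwards.

#include <cstdint>
#include <queue>
#include <unordered_map>
#include <vector>

struct Node {
    int key;
    int32_t left = -1;  // index into the backing array, -1 means no child
    int32_t right = -1;
};

std::vector<Node> layer_by_layer(const std::vector<Node>& nodes, int32_t root) {
    std::vector<Node> result;
    std::unordered_map<int32_t, int32_t> new_index; // old index -> new index
    std::queue<int32_t> work;
    if(root >= 0)
        work.push(root);
    while(!work.empty()) {
        const int32_t old = work.front();
        work.pop();
        new_index[old] = (int32_t)result.size();
        result.push_back(nodes[old]);
        if(nodes[old].left >= 0)
            work.push(nodes[old].left);
        if(nodes[old].right >= 0)
            work.push(nodes[old].right);
    }
    for(auto& n : result) { // second pass: repoint children at their new locations
        if(n.left >= 0)
            n.left = new_index[n.left];
        if(n.right >= 0)
            n.right = new_index[n.right];
    }
    return result;
}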

Or that's the theory anyway. In practice this made things slower. And not even a bit slower, the "optimized version" was more than ten percent slower. Why? I don't know. Back to the drawing board.

Maybe interleaving the left and right child node next to each other is the problem? That places two mutually exclusive pieces of data on the same cache line. An alternative would be to place the entire left subtree in one memory area and the right one in a separate one. After thinking about this for a while, I realized this can be accomplished by storing the nodes in tree traversal order, i.e. in numerical order.

I did that. It was also slower than a random layout. Why? Again: no idea.

Time to focus on something else, I guess.