"C is not how the computer works" can lead to inefficient code
A little over a year ago, I wrote “Should you learn C to ‘learn how the computer works’”. It was a bit controversial. I had promised two follow-up posts. It’s taken me a year, but here’s the first one.
In that post, I argued that C doesn’t “work like the computer”; instead, it works according to a concept called the “C abstract machine.” That machine happens to be close to how computers operate in some sense, but when you write C, you’re writing for the abstract machine, not the computer itself. Here’s the trick, though: the C abstract machine doesn’t define every single aspect of how computation happens. And so sometimes, you can write code in two different ways that the C abstract machine says are equivalent, yet that are not equivalent in performance. This means that, even if you buy into my original point, you still need to know the boundaries of this definition, and you can exploit those boundaries in important ways.
Let’s start with a made-up example, and then we’ll show how this happens in real code. Imagine that we have a very simple programming language, built on an abstract machine. It defines only the semantics for the multiplication of two 32-bit integers. It also ignores things like overflow. You could write a program like “3 * 4” and this machine would say “hey, the result is 12.” Let’s also imagine we have a very simple computer: it has a single instruction, “add two numbers together.” This takes one clock cycle to execute. If we wrote a compiler for this machine, and didn’t apply any optimizations, it would execute “4 + 4”, and then take that result and add four to it. We have two add operations, and so our program takes two clock cycles.
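To make that concrete, here’s a sketch in C of the shape of the code our naive compiler would produce (the function name is mine, just for illustration):

```c
/* What "3 * 4" becomes on the add-only machine, with no optimizations:
 * multiplication lowered to repeated addition. Two adds, two cycles. */
int three_times_four(void) {
    int result = 4;      /* one copy of 4            */
    result = result + 4; /* add #1: result is now 8  */
    result = result + 4; /* add #2: result is now 12 */
    return result;
}
```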
Let’s imagine a new version of our computer comes out. It adds a new instruction: multiplication. This also takes one cycle, but can multiply two numbers directly. Our compiler adds a new back-end, and when compiling our “3 * 4” program for this fancy new machine, it emits a single multiplication instruction. Our program now takes one clock cycle.
Now imagine a new version of our compiler. It is able to do “constant folding”, an optimization: since both operands are known at compile time, the compiler can compute the result itself. Our program does no computation at run time at all; it just produces a “12” directly. This now means our program takes zero cycles on both versions of the hardware.
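In C terms, the effect is something like this (both function names are illustrative):

```c
/* These two functions are equivalent under the abstract machine, so an
 * optimizing compiler is free to turn the first into the second, and no
 * arithmetic happens at all when the program runs. */
int what_we_wrote(void)           { return 3 * 4; }
int what_the_compiler_emits(void) { return 12; }
```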
Is all of this okay?
Yes! The machine only defines how the `*` operator works on numbers in a “what number comes out the other side” sense. It doesn’t specify how to actually accomplish this task. Compilers for our little language are free to implement the details in any way they choose, as long as they respect the rules of the abstract machine. Other details are fair game. As long as the program “3 * 4” produces “12”, the rest of the computational process doesn’t actually matter.
This is how you can write “portable” code that is portable in one sense, but not at all in another sense. As long as you respect the rules of the abstract machine, you’ll be guaranteed (we live in a world with no compiler bugs, of course) to get the correct result. If it takes too long, or takes no time at all, that’s not the language’s concern, strictly speaking.
If this idea seems silly to you, well, I can assure you it’s real. For example, let’s consider the actual C language, and some actual hardware, specifically, x86.
Imagine we have an array of arrays, four by four.
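In C, that might be declared like this (the name `a`, the element type, and the `process` helper below are all just for illustration):

```c
/* A four-by-four array of arrays:
 *
 *   a[0][0]  a[0][1]  a[0][2]  a[0][3]
 *   a[1][0]  a[1][1]  a[1][2]  a[1][3]
 *   a[2][0]  a[2][1]  a[2][2]  a[2][3]
 *   a[3][0]  a[3][1]  a[3][2]  a[3][3]
 */
int a[4][4];

/* A hypothetical stand-in for whatever work we do with each element. */
void process(int value);
```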
There are two naive ways to process it: row by row:
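(A sketch, using the illustrative declarations from above; the function name is mine.)

```c
/* Row by row: finish all of row 0, then row 1, and so on. Elements are
 * touched in the order they're laid out in memory: a[0][0], a[0][1], ... */
void walk_row_by_row(void) {
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            process(a[i][j]);
        }
    }
}
```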
or column by column:
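(The same loop with the two indices swapped; again, the function name is mine.)

```c
/* Column by column: finish all of column 0, then column 1, and so on.
 * Each step jumps a whole row ahead in memory: a[0][0], a[1][0], ... */
void walk_column_by_column(void) {
    for (int j = 0; j < 4; j++) {
        for (int i = 0; i < 4; i++) {
            process(a[i][j]);
        }
    }
}
```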
Here’s a fun question: is one of these faster than the other?
At this small size, probably not. But there’s a way to do this that’s faster than either of these ways, and that’s called “blocking” or “tiling”. It means that you process a subset, or “block”, of the array each time. Something like this:
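(A sketch with two-by-two blocks; the block size and function name are just for illustration, and real code would pick the block size to match the cache.)

```c
/* Blocking/tiling: process one 2x2 block completely before moving on to
 * the next, so the data a block touches can stay in cache while we use it. */
void walk_blocked(void) {
    for (int ii = 0; ii < 4; ii += 2) {
        for (int jj = 0; jj < 4; jj += 2) {
            for (int i = ii; i < ii + 2; i++) {
                for (int j = jj; j < jj + 2; j++) {
                    process(a[i][j]);
                }
            }
        }
    }
}
```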
On x86, this is meaningfully faster in many circumstances. This has to do with the way that the CPU caches information. For more, see this page from Intel, including a visualization that’s probably better than mine.
This raises an interesting point: is this code portable? Imagine a machine that doesn’t have the cache behavior of x86 chips. This code would still run on it, but it might be 14 times slower. Strictly speaking, this program is “portable” in the sense that you’ll get the right computation in the end, but maybe that speed is part of the user requirements for this software to be successful, and so being super slow means that it “doesn’t work” according to those requirements, even though from a programming-language point of view, it’s all good.
Because C’s abstract machine is so thin, these kinds of details can really, really matter, as we’ve seen. And this is where the good part of the “C teaches you how the computer works” meme comes in. Because the abstract machine is thinner than most, you can learn more about the underlying details, and as long as they’re not details that the C abstract machine cares about, you can exploit them. There’s a good reason I mentioned an optimizing compiler above; it’s also the job of compiler authors to understand the difference between a given computer and the C abstract machine, and then exploit that difference to make your code go as fast as possible. But this really blurs the line between the abstract machine and the physical machine. And that’s why we all argue online about this all the time.
I have a third post to wrap up this series. I hope it won’t take me another year, but no guarantees…