Ian's Blog: 2024.05.31: Faster is not always better

2024.05.31: Faster is not always better

2024-08-02 03:48 by Ian

I've spent more than 75% of my life writing computer programs. In my considered opinion, the most difficult thing about my craft is Time and timing. Every aspect of Time in a computer program is contingent on chains of premises so long that most mortals won't even see the whole chain piecewise in their lives, let alone all at once to be able to solve a deep timing problem effectively.

But I think the difficulty is not the subject matter (Intrinsic complexity of the concept of Time). I think the worst of the problem is the same as it is for rotation:
People (not just programmers) have a strong tendency to trivialize the concept of Time in the first place, because their brains fool them into thinking that they understand how it works.

I cite the excellent post "Falsehoods programmers believe about time". These are strictly consequences of human cognitive bottlenecks and the fact that computers make terrible clocks. Which isn't the topic of this post. They are only being brought up because they will occasionally become enmeshed with the thing I do want to talk about: How choices about the specific arrangements of atoms that we use to perform computation impacts Time, as experienced by a computer program running on those atoms.

My qualifications to opine

I spent enough time in infosec to have an excellent understanding of timing side-channel attacks. And side-channel attacks against the power-domain (proximately), which (due to clock-scaling and other efficiency measures by hardware vendors, compilers, and the authors of top-level implementations) often amounts to a timing side-channel (ultimately).

I've also written a large quantity of firmware for machines that have timing constraints of varying severity against Nature, and must be able to report or react to stimulus in a timely manner. Much of the time, the code is so much faster than the external world that timing doesn't apparently matter. But no one stays lucky forever. Timing is critical in basically every non-trivial program that ends up embedded, and I've had numerous occasions to write code (in many languages) that had to run on regimented timescales anywhere in the range of 14 nanoseconds to 30 years.

Finally, I've debugged an enormous number of concurrency problems at various scales of complexity. Usually, I have found myself to be the author of my own worst concurrency bugs. But occasionally, very subtle timing impacts that have nothing to do with my code have caused me weeks of grief.

FINAB

"Faster" (whatever that means) is not always better.
I will use this initialism again. And it is used to drive home the recurring bifurcation of values surrounding software timing: Do you want low (or constant) latency, or maximum throughput?
Because (hardware held constant), you can't have it both ways. Anyone who has touched networks, audio production, or any number of other fields, has their own homologous trade-off near the center of their craft, because it is a fact of information theory (or physics?) shared by those domains. A feature of Nature.
You can't have it both ways.
Only a compromise between two extrema.
FINAB.

Some applications will and should value low jitter over maximal throughput. Timing attacks are possible against any conceivable flavor of code I've ever encountered. Cryptographic, network stacks, `malloc()`, and even microcode (the foundation of reality for many programmers). All of it is subject to exploit to the extent that it is not constant-time, and formally-closed.

So going forward, when I say "improving timings", or "better timing", my standard is constancy unless noted. Which might also (but needn't) include overall runtimes. If I set out to do jitter control, I almost certainly improved overall runtime as a side-effect.
Example: If I make changes to a program that causes it to run 1% slower, but reduces the standard deviation of runtime (given equal-input) by 90%, I would almost certainly say that I improved the program's timing.
If pressed for detail, I would tell another engineer that I increased runtime and decreased jitter.

The sand

People who build silicon care a lot about Watts, and don't want to use too many of them. They care because we all care. It costs less to build and operate machines that use fewer Watts. This is a fundamental value (constant time or not). And some of the best ways to reduce Wattage are to reduce clock rates, add caching layers, and pipeline their ALUs.

Every single one of those sensible measures will have a tremendous impact on your program's timing. The silicon vendors today are as good as I've ever seen them. True sorcerers, the lot of them. They form the immovable extreme of the perspective that "better timing implies higher ops-per-Watt" (or ops-per-second, which is equivalent here). Getting things done faster is ipso facto better, even if it means workload jitter goes up. And most of the time, they are right.

Thus: If you don't know the detailed nature of your platform, you should assume that it has been optimized (from the assembly language, down) for threaded throughput at the possible expense of jitter.

The Threads

Reality at our scale is not synchronous. But computers are. Since a sophisticated program often needs to be doing several things at once, and since the real-world cannot be paused while the CPU does them, any program reacting to real-world stimulus is obliged to have a means of chunking up the work it can't manage at the moment, and returning to it when time is available. Solving this problem of time/memory-allocation is the basic design goal of any operating system. And these concerns apply not only to operating systems, but also stand-alone programs that employ explicit time-partitioning tactics.

Depending on the program's purpose and arrangement, threading, fork(), and asynchronous design patterns will all have tremendous impact on timing quality at the millisecond scale. When people talk about "realtime operating systems", I usually take that to mean: Operating systems with context-switching periods on the order of 1 millisecond.

Threads are great for solving certain kinds of problems. Notably, those that involve a continual stream of input that needs to be transformed into a continual stream of output. Threads all operate essentially the same way:
Carve up the available memory into separate stacks. and use a hardware faculty to periodically switch stacks (also called context-switching). If you only need latency control down to a few milliseconds, and burning some Wattage and RAM to get it is acceptable, high-frequency threading might be a good way to go. The hardware assistance needed to mediate context switching also assures the reliable partitioning of time into millisecond-wide windows, within which a program can completely disregard its own timing (and even block execution indefinitely) without impacting the latency assurances given to other duties in the broader program.

For a great deal of my work, threading isn't a good option without a tremendous amount of hand-crafted effort put into the threading system itself. And sometimes no amount of effort would allow threads to be viable. If you need to carve time into more granularity than several hundreds of microseconds, the Wattage (time) spent context-switching begins to overtake that of solving the core problem of the program, and failure cascades begin. Often with maddening intermittency.

In such cases, I will write programs that have super-loops composed of coroutines with sub-microsecond response times and deterministic ordering (which threading tends to trample). Not so much constant-time (except in certain workflows), but very low latency and overhead, and with well-bounded concurrency concerns. In most cases, I can stop the execution clock entirely while I dwell in the _WFI() function and wait for the external world to impinge and force me to burn Watts. Which I am able to do in less time than it takes light to cross the street.

The tactics and advice below applies generally, and should have equal validity under threads or not. Threads have jitter concerns, too. But the timing impacts would be more likely to be masked or swamped under the noise floor of the threading itself (or God forbid, an operating system with other processes), and thus only visible on a Watt meter (if at all).

Remember: the entire stack values throughput over jitter.

What to do when FINAB

TL;DR: There is no quick solution.
You are going to have to learn something new. Probably every time you use a new platform or library.
Probably for decades.
Timing is a realm of tuning and magecraft, and you may never be done studying it.

Bad news for C programmers: timing closure on our compiled code is practically unattainable. The VHDL guys have that capability. And some vendors can give you silicon that is tunable to very tight tolerances, or has special hardware registers that allow programs to compensate cheaply for timing jitter that is innate to the platform. But the point is: if you are going to write software, you will probably never have the timing behavior that you would expect unless you actively work to achieve it. Cache arrangements on modern CPUs essentially assure so.
NOTE: At one point in 2017, I was aware of a project that was trying to change this. It involved Haskell and VHDL, and I cannot for the life of me find it. Please email me if you know how to get any amount of formal timing closure on a given arch for C/C++, and I will happily eat my words.

I'll get to embedded later, but if we are to take the simple case of a program running under Linux, there are a number of timing improvement measures that can be achieved by-structure without appealing to the OS for special treatment. Below are the fine-grained things that I start looking at when the program needs response latency measured in nanoseconds:

(Code): Reducing non-determinism in the program (IE, by eliminating conditionals, making things const, etc). Every branch is a potential cache miss. And every non-const is a potential bus wait for reading.
(Code): Even if you can't reduce branching, you can often simplify the compiler's notion of it so that it can generate pipeline-friendly code. Using switch() instead of if()/else chains will prevent considerations of the condition code register, and allows constant-time evaluation of the branch address, which is likely nearby.
(Code): If you have the stack to burn, migrate shared state (in statics or instance data) into ephemeral stack data. This might cost memory, and reduce runtime, but will eliminate the state so migrated as a possible concurrency concern. And that means it doesn't need a semaphore or mutex (either of which reliably ruins jitter performance).
(Code / Linker): Taking advantage of the fact that cache page sizes typically exceed the sizes of single functions. You can use the linker to place related functions near one-another in the final executable. When you incur a cache miss upon branching to such code, if the deeper sub-functions are nearby, they might be fetched on accident on the first cache refresh (thus avoiding a second cache miss). This is especially useful for small base classes in C++. Public functions at the head of the page boundary, and privates behind them is a simple default pattern to start with.
(Toolchain): You might be shocked to know how many people that use GCC have no idea what is going on under the hood. Nor do they bother ever examining the toolchain's output. GCC v10 had a tremendous number of timing improvements versus v8, for instance. Check your compiler version, RTFM, and objdump your binaries from two separate versions and diff the linker-generated MAP files alongside them. I promise you, you will learn something every time.
(Code / RTFM): Go dig in the sand, and write some assembly. If you can leverage the hardware's built-in SIMD pipelines, there might be single instructions that can replace a dozen more-general instructions. Not every platform assures that single instructions are atomic. Many will begin the process of interrupt service (or context switching, for threads) while the interrupted instruction is still winding through the ALU. But this is a terribly technical edge-case that I won't digress into further. Regardless, fewer instructions means fewer opportunities for concurrency gaps.
(Toolchain): GCC will segregate executable code and const data into separate sections of its output (.text, .data, respectively (at minimum). Programs with non-trivial timing requirements (of any kind) will often come packaged with linker scripts to specify the layout of their ticker-tape to take maximal advantage of the stack below them. Read those linker scripts, and learn from them. For some programs, it is sufficient (and preferable) to give GCC hints about specific functions and data by tagging them with attributes intended to influence section ordering or optimization level (cold, hot, etc).

Special concerns and strategies for embedded

I've worked for people who haven't programmed a microcontroller since the 8051 in assembly, and still hold an unshakable belief that deterministic timing on a microcontroller is still the default state of affairs in 2018. As far as I can tell, that hasn't been true since I moved to the MCP5206 (ColdFire), and embedded DRAM controllers started becoming a thing around the year 2000.

If you ask for memory access while a DRAM refresh is happening, your access will be stalled until the refresh completes, and the memory becomes accessible again. Those refreshes are deterministic. But I don't synchronize my code against that periodic event. Do you?

So what constitutes "deterministic execution timing" (IE, zero jitter) will be a question of error-bars on the jitter figure, and will be achieved when it falls below some systemic lower-bound defined by our platform. For my MCP5206e @25MHz with code executing out of 60ns DRAM and no enabled interrupts, those error bars would need to be above 80ns (two instructions), since that would be the innate jitter of the bus being used for execution itself, and thus represents the most-granular timing assurance I could possibly give from software alone.

Even in cases where execution is happening from zero waitstate SRAM, if the memory bus connecting that RAM to the ALU is not exclusive to those two architectural units (basically always the case), then the timing of execution will become contingent on the other potential traffic on the bus. Which might be any memory access for data (as would be the case for Von Neumann-bottlenecked CPUs). Thus, the specific code being executed impacts the timing of the code around it.
And combined with the fact that the actions on the bus by other bus masters (such as DMA) are partially contingent on the outside world, specific aspects of program timing quickly become impossible to predict with any reasonable foresight. Even without interrupts, threading, memory timing anisotropy, or any other such concern.

Most of the programs I write end up running on sand that doesn't have the Von Neumann bottleneck. At least: not innately. But I've seen other embedded devs struggle pointlessly with timing for months because I couldn't get them to understand i-bus versus d-bus, and how to use the linker to segregate memory in such a way as to respect the hardware divisions. So they unintentionally gave themselves a Von Neumann bottleneck on an MCU that was designed to be immune from them.

Special kinds of brain-damage caused by XiP

Some platforms (especially embedded platforms) have abnormally high penalties for cache misses, from the point-of-view of a non-embedded programmer. Noteworthy among these case are those platforms that employ a technique called eXecution In Place, or XiP. These platforms store all of their executable code in a serial flash chip that is separate from the MCU, which generally contains a peripheral for seemlessly and transparently integrating the flash chip into its own memory map. Many top-end MCUs from Espressif and NXP are organized this way, as are some low-end SoC designs that use MicroSD cards for non-volatile storage.

Vendors do this because it allows integrators to choose the flash features, and saves the costs of putting flash on the die. The payoff is a cheaper more flexible MCU that often has more RAM than it would otherwise. And even if most of that extra RAM goes toward caching of the memory map represented in the flash chip, the gambit still pays off. An on-die cache at full bus rate is cheaper (in terms of die area) than the equivalent number of bytes of flash, and the flash also requires special die regions to be given over for charge-pumps.

But if you don't carefully tune your timings in such systems, you stand to have trash-tier timing performance, with your program stalling for dozens of microseconds while waiting on a cache fetch for code or const data. When cache-miss penalties get this high, and a system becomes I/O-bound, much conventional wisdom surrounding performance in C becomes inverted.

Short-circuit eval becomes slower because it costs more program space and a branch.
Inline of functions results in slower runtimes due to the replication of 36-bytes that must be loaded from flash, versus a honest branch to those same 36-bytes that were already cached elsewhere.
Invoking the compiler with -Os yields faster code than with -O2, which is itself faster than -O3.
Don't forget to relocate your vtables into a place that is perma-cached (if you can). Your program's runtimes will improve tremendously, as will their jitter.
Make sure you are not becoming I/O bound by hidden RO data associated with switch(). Relocating from .text sections might not be sufficient to avoid cache-misses in critical sections. Usually, nothing within the linker-generated MAP file will reveal this condition.
Ultimately, the final arbiter of the success or failure of your efforts at avoiding cache-misses will be a logic analyzer on the flash chip. If the chip-select line is bouncing when you think it ought to be silent, you've missed a spot.

Previous: 2024.06.07: Beekeeping, year #2
Next: 2024.08.03: My thoughts on testing and doc

Ian's Blog

Oh god... how did this get here? I am not good with computer.