Synchronous IO is fast?!

Thursday, March 20, 2025

When Cloudflare announced SQLite-backed Durable Objects, the claim was made that synchronous IO can actually complete faster than asynchronous IO. This comes as a surprise to most of the engineers I speak with, so let us substantiate the claim.

To do this, we will run twelve benchmarks covering read-only, write-only, and 80% read / 20% write workloads. Fittingly, approximately 80% of row operations on Cloudflare Durable Objects are reads, so I am not blindly applying the 80/20 rule. We will run each workload with sequential and random operations and, most importantly, with synchronous and asynchronous operations.

In the previous post, I walked through how to benchmark storage performance using fio. This post focuses primarily on results; the benchmarks run on the same hardware as before, though this matters less when the goal is to compare synchronous and asynchronous IO against each other.
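
Before getting to the numbers, it is worth pinning down what the two IO styles mean at the API level. The sketch below is not part of the benchmark and is only illustrative (fio's ioengine=sync and ioengine=posixaio engines are more involved, and the file name here is a placeholder): it contrasts the blocking read that synchronous IO boils down to with a POSIX AIO request that is submitted and then waited on.

#include <aio.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("benchmark.dat", O_RDONLY); // placeholder file name
    if (fd < 0) { perror("open"); return 1; }
    char buf[4096];

    // Synchronous: the calling thread blocks until the data has been copied
    // into buf, so exactly one operation is ever outstanding.
    ssize_t n = read(fd, buf, sizeof buf);

    // Asynchronous (POSIX AIO, roughly what ioengine=posixaio drives): submit
    // the request, optionally do other work or submit more requests, then
    // wait for completion. fio keeps up to iodepth of these in flight.
    aiocb cb{};
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof buf;
    cb.aio_offset = 0;
    if (aio_read(&cb) == 0) {
        const aiocb* pending[1] = { &cb };
        aio_suspend(pending, 1, nullptr); // block until the request completes
        n = aio_return(&cb);              // bytes read, like read()'s return value
    }

    (void)n;
    close(fd);
    return 0;
}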

Results part 1

Latency

There is a noticeable reduction in latency, roughly 20 to 50 times depending on the workload, for all types of synchronous operations. Another interesting detail is that synchronous sequential operations show slightly lower latency than synchronous random operations, but this does not carry over to asynchronous operations. For the mixed workload, the table reports read and write latencies separately (read / write).

Latency (ns)

                                  Synchronous      Asynchronous
Read                  Random      943              31,410
                      Sequential  657              33,910
Write                 Random      1,968            43,230
                      Sequential  1,675            42,030
80% Read / 20% Write  Random      961 / 2,000      35,740 / 37,770
                      Sequential  680 / 1,785      34,100 / 36,870

Throughput

Similarly to the latencies seen above, synchronous IO is noticeably faster in terms of throughput: roughly two to four times that of asynchronous IO, depending on the workload. This can be seen in terms of either IOPS or bandwidth, given that every operation in this benchmark uses the same 4 KiB block size. Both results are shown below, with the 80% read and 20% write workload reported as the sum of its reads and writes.

IOPS

                                  Synchronous      Asynchronous
Read                  Random      762,000          307,000
                      Sequential  1,320,000        311,000
Write                 Random      424,000          237,000
                      Sequential  561,000          245,000
80% Read / 20% Write  Random      644,000          290,000
                      Sequential  918,000          304,000

Bandwidth (MiB/s)

                                  Synchronous      Asynchronous
Read                  Random      2,975            1,199
                      Sequential  5,156            1,213
Write                 Random      1,656            927
                      Sequential  2,190            958
80% Read / 20% Write  Random      2,514            1,133
                      Sequential  3,583            1,184
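
As a quick sanity check, multiplying IOPS by the 4 KiB block size reproduces the bandwidth table. For example, 762,000 synchronous random read IOPS * 4 KiB ≈ 2,977 MiB/s, which lines up with the 2,975 MiB/s reported above.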

Results part 2

The reason the previous benchmarks favor synchronous IO is partly that operating systems buffer IO operations in memory by default. These benchmarks were not configured to bypass that buffering, so most of these read and write "IO operations" are served by memory rather than the device. Importantly, this is not cheating! Most real workloads are going to benefit from buffering in the same way.

However, there is more nuance once buffering is disabled, so let us repeat the benchmark with buffering turned off, that is, with direct=1 set in the fio initialization file. Results are shown below in the same format, but this time only for a read-only workload.
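
fio flips this switch for us with direct=1, but for intuition, here is a minimal sketch of what opting out of the cache looks like from a program's point of view. It makes a couple of assumptions: the file name is a placeholder, macOS is handled via fcntl(F_NOCACHE), and the non-Apple branch assumes Linux, where the rough equivalent is opening with O_DIRECT (which additionally requires aligned buffers and sizes).

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
#if defined(__APPLE__)
    int fd = open("benchmark.dat", O_RDONLY);             // placeholder file name
    if (fd < 0) { perror("open"); return 1; }
    fcntl(fd, F_NOCACHE, 1);                              // stop caching IO on this descriptor
#else
    int fd = open("benchmark.dat", O_RDONLY | O_DIRECT);  // Linux: bypass the page cache entirely
    if (fd < 0) { perror("open"); return 1; }
#endif

    // The same 4 KiB reads as before, but now each one has to reach the
    // device instead of potentially being served from memory.
    alignas(4096) static char buf[4096];
    while (read(fd, buf, sizeof buf) > 0) {
        // consume buf
    }

    close(fd);
    return 0;
}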

Latency

There are a couple of things that are different about the latency results with buffering disabled. First, latencies are orders of magnitude higher across the board: while a buffered synchronous sequential operation completes in several hundred nanoseconds, an unbuffered IO operation takes tens of microseconds. Second, synchronous IO is still faster.

Latency (ns)

                                  Synchronous      Asynchronous
Read                  Random      84,700           362,640
                      Sequential  26,450           144,270

Throughput

It turns out that synchronous IO is not better across the board, though. With buffering disabled, asynchronous IO provides better throughput because it keeps many operations in flight at once (iodepth=32 in the setup below), whereas synchronous IO submits one operation at a time. Sequential synchronous reads come close to the throughput of random asynchronous reads, but do not quite reach it. Disabling buffering also shows the expected performance gap between sequential and random IO more clearly.

IOPS

                                  Synchronous      Asynchronous
Read                  Random      12,000           44,000
                      Sequential  38,000           110,000

Bandwidth (MiB/s)

                                  Synchronous      Asynchronous
Read                  Random      46               172
                      Sequential  147              430
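
These throughput numbers are consistent with the latencies above. With synchronous IO only one operation is ever in flight, and 1 / 84.7 μs ≈ 11,800 operations per second, which matches the 12,000 random read IOPS. With asynchronous IO, 44,000 IOPS * 362.64 μs works out to roughly 16 operations in flight on average, which is how a higher per-operation latency can still deliver higher throughput once the page cache is out of the picture.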

If you have made it this far, hopefully you now believe the claim that synchronous IO can be faster than asynchronous IO. Beyond that, buffered IO is awesome.

Setup

The benchmarks in this post can be recreated using the initialization file below.

$ cat benchmark.ini
[global]
size=1G           ; File size of 1 GiB per job
direct=0          ; Use buffered IO operations (toggle this for the second set of benchmarks)
bs=4k             ; Block size of 4 KiB
numjobs=1         ; Single job to isolate IO method
runtime=60        ; Run for 60 seconds
time_based        ; Ensure the test runs for the full duration
stonewall         ; Ensure jobs run sequentially

[sync-read]
ioengine=sync     ; Synchronous IO engine
rw=read           ; Sequential reads (replicate with: randread, write, randwrite, rw, and randrw)

[async-read]
ioengine=posixaio ; POSIX asynchronous IO engine
iodepth=32        ; Queue depth for asynchronous I/O
rw=read           ; Sequential reads (replicate with: randread, write, randwrite, rw, and randrw)

Measuring storage performance

Wednesday, March 19, 2025

I would like to collect a baseline for the performance of a modern filesystem and SSD. While this information is freely available on the internet, it is not difficult to collect first hand.

For my test I will be using fio on a 2022 M2 MacBook Air that I own. This MacBook uses the APFS filesystem and has 256 GB of storage. I mention the storage capacity only because the 256 GB M2 MacBook Air was repeatedly called out for having slow IO, since it uses a single storage chip rather than the two 128 GB chips found in the equivalent M1 MacBook Air.

Setup

Let us start by pulling some information about the disk.

$ diskutil info /
   Device Identifier:         disk3s1s1
   Device Node:               /dev/disk3s1s1
   Whole:                     No
   Part of Whole:             disk3

   Volume Name:               Macintosh HD
   Mounted:                   Yes
   Mount Point:               /

   Partition Type:            41504653-0000-11AA-AA11-00306543ECAC
   File System Personality:   APFS
   Type (Bundle):             apfs
   Name (User Visible):       APFS
   Owners:                    Enabled

   OS Can Be Installed:       No
   Booter Disk:               disk3s2
   Recovery Disk:             disk3s3
   Media Type:                Generic
   Protocol:                  Apple Fabric
   SMART Status:              Verified
   Volume UUID:               DC0B9406-4DDE-4A3B-B63A-53CD93A2E08F
   Disk / Partition UUID:     DC0B9406-4DDE-4A3B-B63A-53CD93A2E08F

   Disk Size:                 245.1 GB (245107195904 Bytes) (exactly 478724992 512-Byte-Units)
   Device Block Size:         4096 Bytes
   ...

To configure the benchmark we need to write an initialization file, benchmark.ini. This specifies that storage operations will be asynchronous, will not be cached, and so on. It also makes use of some of the information about our disk from above.

$ cat benchmark.ini
[global]
ioengine=posixaio      ; Use asynchronous I/O for "Better performance"
direct=1               ; Do not use buffered IO operations
bs=4k                  ; Block size of 4 KiB, matching the APFS block size (above)
size=1G                ; File size of 1 GiB per job
runtime=60             ; Run for 60 seconds
time_based             ; Ensure the test runs for the full duration
group_reporting        ; Aggregate results across jobs

[job1]
rw=randrw              ; Perform random reads and writes
rwmixread=80           ; 80% reads, 20% writes
iodepth=32             ; Queue depth for asynchronous I/O
numjobs=1              ; A single process will perform reads and writes
filename=/...          ; A file the user has write access to below the APFS mount point (above)

Results

After configuring the benchmark, it can be run with the straightforward invocation of fio below.

$ fio benchmark.ini
job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=32
fio-3.39
Starting 1 process
Jobs: 1 (f=1): [m(1)][100.0%][r=92.8MiB/s,w=23.8MiB/s][r=23.8k,w=6084 IOPS][eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=23272: Wed Mar 19 10:42:33 2025
  read: IOPS=28.1k, BW=110MiB/s (115MB/s)(6598MiB/60003msec)
    slat (nsec): min=0, max=45000, avg=346.23, stdev=700.18
    clat (usec): min=80, max=72139, avg=446.26, stdev=688.56
     lat (usec): min=80, max=72140, avg=446.61, stdev=688.57
    clat percentiles (usec):
     |  1.00th=[  212],  5.00th=[  247], 10.00th=[  265], 20.00th=[  285],
     | 30.00th=[  297], 40.00th=[  314], 50.00th=[  330], 60.00th=[  343],
     | 70.00th=[  363], 80.00th=[  392], 90.00th=[  453], 95.00th=[ 1303],
     | 99.00th=[ 2507], 99.50th=[ 3851], 99.90th=[ 4228], 99.95th=[ 4490],
     | 99.99th=[29492]
   bw (  KiB/s): min=20796, max=130828, per=100.00%, avg=112743.83, stdev=15552.87, samples=119
   iops        : min= 5199, max=32707, avg=28185.59, stdev=3888.20, samples=119
  write: IOPS=7046, BW=27.5MiB/s (28.9MB/s)(1652MiB/60003msec); 0 zone resets
    slat (nsec): min=0, max=47000, avg=492.36, stdev=849.72
    clat (usec): min=90, max=69969, avg=482.18, stdev=679.77
     lat (usec): min=90, max=69973, avg=482.68, stdev=679.81
    clat percentiles (usec):
     |  1.00th=[  245],  5.00th=[  281], 10.00th=[  297], 20.00th=[  314],
     | 30.00th=[  330], 40.00th=[  343], 50.00th=[  359], 60.00th=[  375],
     | 70.00th=[  396], 80.00th=[  424], 90.00th=[  494], 95.00th=[ 1450],
     | 99.00th=[ 2769], 99.50th=[ 3916], 99.90th=[ 4228], 99.95th=[ 4490],
     | 99.99th=[23725]
   bw (  KiB/s): min= 5346, max=32456, per=100.00%, avg=28222.54, stdev=3855.00, samples=119
   iops        : min= 1336, max= 8114, avg=7055.30, stdev=963.78, samples=119
  lat (usec)   : 100=0.01%, 250=4.63%, 500=86.82%, 750=1.83%, 1000=0.67%
  lat (msec)   : 2=3.29%, 4=2.49%, 10=0.25%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%
  cpu          : usr=7.90%, sys=7.99%, ctx=1539978, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=51.1%, 16=48.9%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=98.9%, 8=1.1%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=1688969,422787,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=110MiB/s (115MB/s), 110MiB/s-110MiB/s (115MB/s-115MB/s), io=6598MiB (6918MB), run=60003-60003msec
  WRITE: bw=27.5MiB/s (28.9MB/s), 27.5MiB/s-27.5MiB/s (28.9MB/s-28.9MB/s), io=1652MiB (1732MB), run=60003-60003msec

Input/Output Operations Per Second (IOPS)

IOPS measure the number of IO operations completed per second over the course of the benchmark. We see 28,100 read IOPS and 7,046 write IOPS, which matches the benchmark's 80% read mix (79.95% reads, to be precise). In total, we can expect this MacBook to do roughly 35,000 IOPS under this workload.

For reads and writes of a fixed size, which is the case in this benchmark, IOPS provide the same information as bandwidth, which tracks throughput in bytes. The benchmark reported a combined bandwidth of 137.5 MiB per second. This works out to roughly the number of IOPS times the block size: 35,000 * 4 KiB ≈ 137 MiB per second.

If the benchmark were instead writing chunks smaller than the filesystem block size, you could expect bandwidth to drop. For example, with 2 KiB chunks each operation moves half as much data while still costing an IO operation, so we should expect roughly half the bandwidth of our benchmark. Verifying this is left as an exercise for the reader.

Latency

There are three measures of latency found in the report:

  1. Submission latency (slat): The time it takes to submit the IO operation to the OS IO queue.
  2. Completion latency (clat): The time it takes for the OS to process the IO operation and remove it from the queue.
  3. Total latency (lat): The sum of completion and submission latencies.

With that in mind, we can expect each read operation to take 446.61 μs on average and each write operation to take 482.68 μs on average. It is also useful to view the latency percentiles for clat. For example, the median (50th percentile) write operation takes 359 μs, well below the 482.18 μs average, so we know the latency distribution skews high: a small number of slow operations pulls the average up.
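
As a sanity check, these measures add up: for reads, the 346.23 ns average submission latency plus the 446.26 μs average completion latency gives the 446.61 μs average total latency reported above.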

How anonymous namespaces work

Sunday, March 2, 2025

In the immediately preceding post I discussed how the linker enforces that a symbol is only accessible from within a single translation unit. In that post I emphasized the static keyword because it is the approach shared between C and C++. A more idiomatic approach in C++ is to use an anonymous namespace. Usage of the static keyword for this purpose was actually deprecated in the C++ standard and later undeprecated. An example making use of an anonymous namespace is below:

static int count = 0;

// instead of static void incrementCount
namespace {
    void incrementCount() {
        count++;
    }
}

// Force the compiler not to skip incrementCount
void callIncrementCount() { incrementCount(); }

In light of the preceding post, a reasonable question to ask could be: "How does this work?" There are not many satisfactory answers available. Going straight to the primary source of authoritative but unsatisfactory answers on C++, cppreference.com states:

[An anonymous namespace definition] is treated as a definition of a namespace with unique name ... The unique name is unique over the entire program, but within a translation unit each [anonymous namespace] definition maps to the same unique name: multiple [anonymous namespace] definitions in the same scope denote the same [anonymous namespace].

Pardon the punctuation in the quote. cppreference.com also refers to anonymous namespaces as unnamed namespaces, which is a term that I have never heard used in practice. Beyond that, the definition is also incomplete. To see this, we can use the same approach as in the preceding post.

$ clang++ -c lib.cpp
$ nm lib.o
0000000000000000 T __Z18callIncrementCountv
0000000000000068 b __ZL5count
0000000000000014 t __ZN12_GLOBAL__N_114incrementCountEv
0000000000000000 t ltmp0
0000000000000068 b ltmp1
0000000000000028 s ltmp2

The line 0000000000000014 t __ZN12_GLOBAL__N_114incrementCountEv shows the symbol's address, type, and name, respectively. The lowercase type t indicates that the incrementCount function is not an exported symbol, unlike when it was not wrapped in an anonymous namespace.

Interestingly, this is different than if we were to use a regular namespace. For example:

static int count = 0;

namespace foo {
    void incrementCount() {
        count++;
    }
}

void callIncrementCount() { foo::incrementCount(); }

The corresponding symbols are below:

$ clang++ -c lib.cpp
$ nm lib.o
0000000000000014 T __Z18callIncrementCountv
0000000000000068 b __ZL5count
0000000000000000 T __ZN3foo14incrementCountEv
0000000000000000 t ltmp0
0000000000000068 b ltmp1
0000000000000028 s ltmp2

So, interestingly, anonymous namespaces do not just assign a unique name to a symbol; they also stop the symbol from being exported by the object file.
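
To see the practical consequence, consider a hypothetical second translation unit, main.cpp, that declares incrementCount and tries to call it:

// main.cpp (hypothetical second translation unit)

// Declare incrementCount in the hope that it is defined elsewhere
void incrementCount();

int main(int argc, char** argv) {
    incrementCount();
    return 0;
}

Linking main.cpp against lib.o should fail with an undefined symbol error: the declaration refers to __Z14incrementCountv, while the definition inside the anonymous namespace is the local symbol __ZN12_GLOBAL__N_114incrementCountEv, so the linker never matches the two.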

Another fun thing that I discovered while trying to prettify symbols is that extern "C" silently changes the behavior of anonymous namespaces, while producing an error when combined with the static keyword. The example:

static int count = 0;

namespace {
    extern "C" void incrementCount() {
        count++;
    }
}

void callIncrementCount() { incrementCount(); }

Compiles to the symbols:

$ clang++ -c lib.cpp
$ nm lib.o
0000000000000014 T __Z18callIncrementCountv
0000000000000068 b __ZL5count
0000000000000000 T _incrementCount
0000000000000000 t ltmp0
0000000000000068 b ltmp1
0000000000000028 s ltmp2

Whereas the example:

static int count = 0;

extern "C" static void incrementCount() {
    count++;
}

Fails to compile with the error:

clang++ -c lib.cpp
lib.cpp:3:12: error: cannot combine with previous 'extern' declaration specifier
    3 | extern "C" static void incrementCount() {
      |            ^
1 error generated.

This is kind of a toy example, but as I said, it is fun. The nice thing about the results of this post is that you can use anonymous namespaces as a drop-in replacement for the static keyword and trust that they behave the same for pretty much all intents and purposes (outside of the extern "C" interaction above).

How linkage works

Saturday, March 1, 2025

Sometimes you want to limit access to a symbol, for example a variable, to a single translation unit. The approach shared between C and C++ is to declare the symbol as static within that translation unit. An example is shown below:

// In lib.cpp

// Prevent count from being accessed outside of this translation unit
static int count = 0;

...

// In main.cpp

// Tell the compiler that count is defined in another translation unit 
extern int count;

int main(int argc, char** argv) {
    return count;
}

This example will throw an error when attempting to compile it. Specifically:

$ clang++ -c lib.cpp
$ clang++ main.cpp lib.o
Undefined symbols for architecture arm64:
  "_count", referenced from:
      _main in main-4c6fc0.o
ld: symbol(s) not found for architecture arm64
clang++: error: linker command failed with exit code 1 (use -v to see invocation)

I have probably read errors like this thousands of times, but I have never actually thought about the fact that this error is generated from the linker and not the compiler. How does the linker know that the C or C++ code indicates that the symbol should not be accessed from outside of the translation unit? Let us take a look at some examples.

First let us take an example with a non-static function:

static int count = 0;

void incrementCount() {
    count++;
}

We can compile this example and then show the symbols present in the corresponding object file. These object files are what the linker sees, so their symbol tables determine what can be accessed from other translation units.

$ clang++ -c lib.cpp
$ nm lib.o
0000000000000000 T __Z14incrementCountv
0000000000000038 b __ZL5count
0000000000000000 t ltmp0
0000000000000038 b ltmp1
0000000000000018 s ltmp2

The line 0000000000000000 T __Z14incrementCountv shows the symbol's address, type, and name, respectively. The type T indicates that the incrementCount function is part of the text section of the program, and the fact that the type is uppercase indicates that the symbol is exported.

If we consider another example with a static function:

static int count = 0;

static void incrementCount() {
    count++;
}

// Force the compiler not to skip incrementCount
void callIncrementCount() {
    incrementCount();
}

We can then follow the same compilation process and show the symbols as we did previously:

$ clang++ -c lib.cpp
$ nm lib.o
0000000000000000 T __Z18callIncrementCountv
0000000000000014 t __ZL14incrementCountv
0000000000000068 b __ZL5count
0000000000000000 t ltmp0
0000000000000068 b ltmp1
0000000000000028 s ltmp2

The line 0000000000000014 t __ZL14incrementCountv is similar to what we saw previously, with the exception that the type t is now lowercase. This means the symbol is no longer exported, though it still belongs to the text section of the object file.

Not only is this interesting, but it also explains where the terminology internal linkage and external linkage comes from. Internal linkage is when a symbol cannot be accessed from outside of the translation unit in which it is defined. External linkage, the default behavior, is when a symbol can be accessed from any translation unit.

C++ Trait Bounds

Sunday, October 16, 2022

High-performance generic programming is a killer feature in both Rust and C++. In each language the compiler can generate a concrete implementation of a generic program, which results in zero runtime overhead. However, the Rust compiler also allows you to specify the functionality that a concrete type must possess in order to be used in a generic context. This safety feature is referred to as a trait bound, and we can use similar ideas in C++ to achieve the same safety. Consider the following Rust code:

trait Shape {
    fn draw(self);
}

struct Circle;

impl Shape for Circle {
    fn draw(self) {
        ...
    }
}

We have an abstract method Shape::draw. Assume we have a generic function exec_draw that calls draw on the passed parameter. Rust requires that a bound be placed on the parameter's generic type by specifying a trait, in this case Shape, which means exec_draw can only be instantiated with concrete types that implement the Shape trait. This trait bound is shown below:

fn exec_draw<T: Shape>(c: T)
{
    T::draw(c);
}

The benefit here is that a different type, say Picasso, which does not explicitly implement Shape, cannot be passed to the exec_draw function without generating a compile-time error. This prevents accidental duck typing, and it is still a compile-time check with zero runtime overhead.

...

struct Picasso;

impl Picasso {
    fn draw(self) {
        ...
    }
}

...

fn main() {
    // Works as expected
    let circle = Circle;
    exec_draw(circle);

    // Generates compiler error "the trait `Shape` is not implemented for `Picasso`"
    let picasso = Picasso;
    exec_draw(picasso);
}

By default, the C++ compiler does not provide a similar check. See a comparable implementation in C++ below:

class Shape {
public:
    virtual void draw() = 0;
};

class Circle: public Shape {
public:
    void draw() {
        ...
    }
};

class Picasso {
public:
    void draw() {
        ...
    }
};

template<class T>
void exec_draw(T& x) {
    x->draw();
}

int main(int argc, char** argv) {
    // Works as expected
    auto circle = new Circle();
    exec_draw(circle);

    // Also works; the C++ compiler doesn't care
    auto picasso = new Picasso();
    exec_draw(picasso);
}

Maybe allowing duck typing is desirable behavior, but getting the safety that we have in Rust is fairly straightforward in C++. One might be inclined to do something like the following:

void exec_draw(Shape* x) {
    x->draw();
}

Although this works, it is not guaranteed to have zero runtime overhead. In this toy program the compiler is able to devirtualize and remove the virtual function call, but as additional shapes are added to the program this optimization can fail, and the result is poorer performance.

To get the same behavior with zero runtime overhead we can go back to templates and define a class called Can_copy, with a static constraints function containing the behavior that we want to test and a constructor that assigns a function pointer to that static function. The class is used only for encapsulation. In this case, the requirement is that T must be a Shape*, a pointer to a subclass of Shape, or a type with a user-defined conversion to one of the former.

...

template<class T1, class T2> struct Can_copy {
    static void constraints(T1 a, T2 b) { T2 c = a; b = a; }
    Can_copy() { void(*p)(T1,T2) = constraints; }
};

...

template<class T>
void exec_draw(T& x) {
    // Instantiating Can_copy<T, Shape*> forces its constraints function to be
    // type checked at compile time; the temporary object itself does nothing
    Can_copy<T, Shape*>();

    x->draw();
}

int main(int argc, char** argv) {
    // Works as expected
    auto circle = new Circle();
    exec_draw(circle);

    // Generates compiler error "error: incompatible pointer types assigning to 'Shape *' from 'Picasso *'"
    auto picasso = new Picasso();
    exec_draw(picasso);
}

This approach can be straightforwardly extended to arbitrary compile-time checks on the types used in generic code.
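
As an aside, if a C++20 compiler is available, the same bound can be expressed directly as a concept. The sketch below is written under that assumption, reusing the Shape, Circle, and Picasso classes from above; ShapePointer is simply a name chosen here for the constraint.

#include <concepts>

// Satisfied by any type that converts to Shape*, for example Circle* or a
// type with a user-defined conversion to Shape*.
template<class T>
concept ShapePointer = std::convertible_to<T, Shape*>;

template<ShapePointer T>
void exec_draw(T& x) {
    x->draw();
}

int main(int argc, char** argv) {
    // Works as expected
    auto circle = new Circle();
    exec_draw(circle);

    // Fails the ShapePointer constraint at compile time
    auto picasso = new Picasso();
    exec_draw(picasso);
}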

Note: This example is drawn from Bjarne Stroustrup's C++ Style and Technique FAQ. A key difference is that the approach is discussed there as a technique for generating better compiler errors, rather than for preventing accidental duck typing when using generics. Presenting the idea in the context of program safety is, in my opinion, more important.