Synchronous IO is fast?!

Thursday, March 20, 2025

When Cloudflare announced SQLite-backed Durable Objects, the claim was made that synchronous IO can actually complete faster than asynchronous IO. This comes as a surprise to most of the engineers I speak with, so let us substantiate the claim.

To do this, we will run twelve benchmarks covering read-only, write-only, and 80% read / 20% write workloads. Fittingly, approximately 80% of row operations on Cloudflare Durable Objects are reads, so I am not blindly applying the 80/20 rule. We will run each workload with sequential and random operations and, most importantly, with synchronous and asynchronous operations.

In the previous post, I walked through how to benchmark storage performance using fio. This post focuses primarily on results; the benchmarks run on the same hardware as before, though this matters less when the goal is to compare synchronous and asynchronous IO against each other.
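
Before getting to the numbers, it is worth pinning down what the two IO styles mean at the API level. The sketch below is not part of the benchmark and is only illustrative (fio's ioengine=sync and ioengine=posixaio engines are more involved, and the file name here is a placeholder): it contrasts the blocking read that synchronous IO boils down to with a POSIX AIO request that is submitted and then waited on.

#include <aio.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("benchmark.dat", O_RDONLY); // placeholder file name
    if (fd < 0) { perror("open"); return 1; }
    char buf[4096];

    // Synchronous: the calling thread blocks until the data has been copied
    // into buf, so exactly one operation is ever outstanding.
    ssize_t n = read(fd, buf, sizeof buf);

    // Asynchronous (POSIX AIO, roughly what ioengine=posixaio drives): submit
    // the request, optionally do other work or submit more requests, then
    // wait for completion. fio keeps up to iodepth of these in flight.
    aiocb cb{};
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof buf;
    cb.aio_offset = 0;
    if (aio_read(&cb) == 0) {
        const aiocb* pending[1] = { &cb };
        aio_suspend(pending, 1, nullptr); // block until the request completes
        n = aio_return(&cb);              // bytes read, like read()'s return value
    }

    (void)n;
    close(fd);
    return 0;
}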

Results part 1

Latency

There is a noticeable reduction in latency, roughly 20 to 50 times depending on the workload, for all types of synchronous operations. Another interesting detail is that synchronous sequential operations show slightly lower latency than synchronous random operations, but this does not carry over to asynchronous operations. For the mixed workload, the table reports read and write latencies separately (read / write).

Latency (ns)

                                  Synchronous      Asynchronous
Read                  Random      943              31,410
                      Sequential  657              33,910
Write                 Random      1,968            43,230
                      Sequential  1,675            42,030
80% Read / 20% Write  Random      961 / 2,000      35,740 / 37,770
                      Sequential  680 / 1,785      34,100 / 36,870

Throughput

Similarly to the latencies seen above, synchronous IO is noticeably faster in terms of throughput: roughly two to four times that of asynchronous IO, depending on the workload. This can be seen in terms of either IOPS or bandwidth, given that every operation in this benchmark uses the same 4 KiB block size. Both results are shown below, with the 80% read and 20% write workload reported as the sum of its reads and writes.

IOPS

                                  Synchronous      Asynchronous
Read                  Random      762,000          307,000
                      Sequential  1,320,000        311,000
Write                 Random      424,000          237,000
                      Sequential  561,000          245,000
80% Read / 20% Write  Random      644,000          290,000
                      Sequential  918,000          304,000

Bandwidth (MiB/s)

                                  Synchronous      Asynchronous
Read                  Random      2,975            1,199
                      Sequential  5,156            1,213
Write                 Random      1,656            927
                      Sequential  2,190            958
80% Read / 20% Write  Random      2,514            1,133
                      Sequential  3,583            1,184
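
As a quick sanity check, multiplying IOPS by the 4 KiB block size reproduces the bandwidth table. For example, 762,000 synchronous random read IOPS * 4 KiB ≈ 2,977 MiB/s, which lines up with the 2,975 MiB/s reported above.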

Results part 2

The reason the previous benchmarks favor synchronous IO is partly that operating systems buffer IO operations in memory by default. These benchmarks were not configured to bypass that buffering, so most of these read and write "IO operations" are served by memory rather than the device. Importantly, this is not cheating! Most real workloads are going to benefit from buffering in the same way.

However, there is more nuance once buffering is disabled, so let us repeat the benchmark with buffering turned off, that is, with direct=1 set in the fio initialization file. Results are shown below in the same format, but this time only for a read-only workload.
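
fio flips this switch for us with direct=1, but for intuition, here is a minimal sketch of what opting out of the cache looks like from a program's point of view. It makes a couple of assumptions: the file name is a placeholder, macOS is handled via fcntl(F_NOCACHE), and the non-Apple branch assumes Linux, where the rough equivalent is opening with O_DIRECT (which additionally requires aligned buffers and sizes).

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
#if defined(__APPLE__)
    int fd = open("benchmark.dat", O_RDONLY);             // placeholder file name
    if (fd < 0) { perror("open"); return 1; }
    fcntl(fd, F_NOCACHE, 1);                              // stop caching IO on this descriptor
#else
    int fd = open("benchmark.dat", O_RDONLY | O_DIRECT);  // Linux: bypass the page cache entirely
    if (fd < 0) { perror("open"); return 1; }
#endif

    // The same 4 KiB reads as before, but now each one has to reach the
    // device instead of potentially being served from memory.
    alignas(4096) static char buf[4096];
    while (read(fd, buf, sizeof buf) > 0) {
        // consume buf
    }

    close(fd);
    return 0;
}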

Latency

There are a couple of things that are different about the latency results with buffering disabled. First, latencies are orders of magnitude higher across the board: while a buffered synchronous sequential operation completes in several hundred nanoseconds, an unbuffered IO operation takes tens of microseconds. Second, synchronous IO is still faster.

Latency (ns)

                                  Synchronous      Asynchronous
Read                  Random      84,700           362,640
                      Sequential  26,450           144,270

Throughput

It turns out that synchronous IO is not better across the board, though. With buffering disabled, asynchronous IO provides better throughput because it keeps many operations in flight at once (iodepth=32 in the setup below), whereas synchronous IO submits one operation at a time. Sequential synchronous reads come close to the throughput of random asynchronous reads, but do not quite reach it. Disabling buffering also shows the expected performance gap between sequential and random IO more clearly.

IOPS

                                  Synchronous      Asynchronous
Read                  Random      12,000           44,000
                      Sequential  38,000           110,000

Bandwidth (MiB/s)

                                  Synchronous      Asynchronous
Read                  Random      46               172
                      Sequential  147              430
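
These throughput numbers are consistent with the latencies above. With synchronous IO only one operation is ever in flight, and 1 / 84.7 μs ≈ 11,800 operations per second, which matches the 12,000 random read IOPS. With asynchronous IO, 44,000 IOPS * 362.64 μs works out to roughly 16 operations in flight on average, which is how a higher per-operation latency can still deliver higher throughput once the page cache is out of the picture.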

If you have made it this far, hopefully you now believe the claim that synchronous IO can be faster than asynchronous IO. Beyond that, buffered IO is awesome.

Setup

The benchmarks in this post can be recreated using the initialization file below.

$ cat benchmark.ini
[global]
size=1G           ; File size of 1 GiB per job
direct=0          ; Use buffered IO operations (toggle this for the second set of benchmarks)
bs=4k             ; Block size of 4 KiB
numjobs=1         ; Single job to isolate IO method
runtime=60        ; Run for 60 seconds
time_based        ; Ensure the test runs for the full duration
stonewall         ; Ensure jobs run sequentially

[sync-read]
ioengine=sync     ; Synchronous IO engine
rw=read           ; Sequential reads (replicate with: randread, write, randwrite, rw, and randrw)

[async-read]
ioengine=posixaio ; POSIX asynchronous IO engine
iodepth=32        ; Queue depth for asynchronous I/O
rw=read           ; Sequential reads (replicate with: randread, write, randwrite, rw, and randrw)

Measuring storage performance

Wednesday, March 19, 2025

I would like to collect a baseline for the performance of a modern filesystem and SSD. While this information is freely available on the internet, it is not difficult to collect first hand.

For my test I will be using fio on a 2022 M2 MacBook Air that I own. This MacBook uses the APFS filesystem and has 256 GB of storage. I mention the storage capacity only because the 256 GB M2 MacBook Air was repeatedly called out for having slow IO, since it uses a single storage chip rather than the two 128 GB chips found in the equivalent M1 MacBook Air.

Setup

Let us start by pulling some information about the disk.

$ diskutil info /
   Device Identifier:         disk3s1s1
   Device Node:               /dev/disk3s1s1
   Whole:                     No
   Part of Whole:             disk3

   Volume Name:               Macintosh HD
   Mounted:                   Yes
   Mount Point:               /

   Partition Type:            41504653-0000-11AA-AA11-00306543ECAC
   File System Personality:   APFS
   Type (Bundle):             apfs
   Name (User Visible):       APFS
   Owners:                    Enabled

   OS Can Be Installed:       No
   Booter Disk:               disk3s2
   Recovery Disk:             disk3s3
   Media Type:                Generic
   Protocol:                  Apple Fabric
   SMART Status:              Verified
   Volume UUID:               DC0B9406-4DDE-4A3B-B63A-53CD93A2E08F
   Disk / Partition UUID:     DC0B9406-4DDE-4A3B-B63A-53CD93A2E08F

   Disk Size:                 245.1 GB (245107195904 Bytes) (exactly 478724992 512-Byte-Units)
   Device Block Size:         4096 Bytes
   ...

To configure the benchmark we need to write an initialization file, benchmark.ini. This specifies that storage operations will be asynchronous, will not be cached, and so on. It also makes use of some of the information about our disk from above.

$ cat benchmark.ini
[global]
ioengine=posixaio      ; Use asynchronous I/O for "Better performance"
direct=1               ; Do not use buffered IO operations
bs=4k                  ; Block size of 4 KiB, matching the APFS block size (above)
size=1G                ; File size of 1 GiB per job
runtime=60             ; Run for 60 seconds
time_based             ; Ensure the test runs for the full duration
group_reporting        ; Aggregate results across jobs

[job1]
rw=randrw              ; Perform random reads and writes
rwmixread=80           ; 80% reads, 20% writes
iodepth=32             ; Queue depth for asynchronous I/O
numjobs=1              ; A single process will perform reads and writes
filename=/...          ; A file the user has write access to below the APFS mount point (above)

Results

After configuring the benchmark, it can be run with the straightforward invocation of fio below.

$ fio benchmark.ini
job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=32
fio-3.39
Starting 1 process
Jobs: 1 (f=1): [m(1)][100.0%][r=92.8MiB/s,w=23.8MiB/s][r=23.8k,w=6084 IOPS][eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=23272: Wed Mar 19 10:42:33 2025
  read: IOPS=28.1k, BW=110MiB/s (115MB/s)(6598MiB/60003msec)
    slat (nsec): min=0, max=45000, avg=346.23, stdev=700.18
    clat (usec): min=80, max=72139, avg=446.26, stdev=688.56
     lat (usec): min=80, max=72140, avg=446.61, stdev=688.57
    clat percentiles (usec):
     |  1.00th=[  212],  5.00th=[  247], 10.00th=[  265], 20.00th=[  285],
     | 30.00th=[  297], 40.00th=[  314], 50.00th=[  330], 60.00th=[  343],
     | 70.00th=[  363], 80.00th=[  392], 90.00th=[  453], 95.00th=[ 1303],
     | 99.00th=[ 2507], 99.50th=[ 3851], 99.90th=[ 4228], 99.95th=[ 4490],
     | 99.99th=[29492]
   bw (  KiB/s): min=20796, max=130828, per=100.00%, avg=112743.83, stdev=15552.87, samples=119
   iops        : min= 5199, max=32707, avg=28185.59, stdev=3888.20, samples=119
  write: IOPS=7046, BW=27.5MiB/s (28.9MB/s)(1652MiB/60003msec); 0 zone resets
    slat (nsec): min=0, max=47000, avg=492.36, stdev=849.72
    clat (usec): min=90, max=69969, avg=482.18, stdev=679.77
     lat (usec): min=90, max=69973, avg=482.68, stdev=679.81
    clat percentiles (usec):
     |  1.00th=[  245],  5.00th=[  281], 10.00th=[  297], 20.00th=[  314],
     | 30.00th=[  330], 40.00th=[  343], 50.00th=[  359], 60.00th=[  375],
     | 70.00th=[  396], 80.00th=[  424], 90.00th=[  494], 95.00th=[ 1450],
     | 99.00th=[ 2769], 99.50th=[ 3916], 99.90th=[ 4228], 99.95th=[ 4490],
     | 99.99th=[23725]
   bw (  KiB/s): min= 5346, max=32456, per=100.00%, avg=28222.54, stdev=3855.00, samples=119
   iops        : min= 1336, max= 8114, avg=7055.30, stdev=963.78, samples=119
  lat (usec)   : 100=0.01%, 250=4.63%, 500=86.82%, 750=1.83%, 1000=0.67%
  lat (msec)   : 2=3.29%, 4=2.49%, 10=0.25%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%
  cpu          : usr=7.90%, sys=7.99%, ctx=1539978, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=51.1%, 16=48.9%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=98.9%, 8=1.1%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=1688969,422787,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=110MiB/s (115MB/s), 110MiB/s-110MiB/s (115MB/s-115MB/s), io=6598MiB (6918MB), run=60003-60003msec
  WRITE: bw=27.5MiB/s (28.9MB/s), 27.5MiB/s-27.5MiB/s (28.9MB/s-28.9MB/s), io=1652MiB (1732MB), run=60003-60003msec

Input/Output Operations Per Second (IOPS)

IOPS measure the number of IO operations completed per second over the course of the benchmark. We see 28,100 read IOPS and 7,046 write IOPS, which matches the benchmark's 80% read mix (79.95% reads, to be precise). In total, we can expect this MacBook to do roughly 35,000 IOPS under this workload.

For reads and writes of a fixed size, which is the case in this benchmark, IOPS provide the same information as bandwidth, which tracks throughput in bytes. The benchmark reported a combined bandwidth of 137.5 MiB per second. This works out to roughly the number of IOPS times the block size: 35,000 * 4 KiB ≈ 137 MiB per second.

If the benchmark were instead writing chunks smaller than the filesystem block size, you could expect bandwidth to drop. For example, with 2 KiB chunks each operation moves half as much data while still costing an IO operation, so we should expect roughly half the bandwidth of our benchmark. Verifying this is left as an exercise for the reader.

Latency

There are three measures of latency found in the report:

  1. Submission latency (slat): The time it takes to submit the IO operation to the OS IO queue.
  2. Completion latency (clat): The time it takes for the OS to process the IO operation and remove it from the queue.
  3. Total latency (lat): The sum of completion and submission latencies.

With that in mind, we can expect each read operation to take 446.61 μs on average and each write operation to take 482.68 μs on average. It is also useful to view the latency percentiles for clat. For example, the median (50th percentile) write operation takes 359 μs, well below the 482.18 μs average, so we know the latency distribution skews high: a small number of slow operations pulls the average up.
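
As a sanity check, these measures add up: for reads, the 346.23 ns average submission latency plus the 446.26 μs average completion latency gives the 446.61 μs average total latency reported above.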

How anonymous namespaces work

Sunday, March 2, 2025

In the immediately preceding post I discussed how the linker enforces that a symbol is only accessible from within a single translation unit. In that post I emphasized the static keyword because it is the approach shared between C and C++. A more idiomatic approach in C++ is to use an anonymous namespace. Usage of the static keyword for this purpose was actually deprecated in the C++ standard and later undeprecated. An example making use of an anonymous namespace is below:

static int count = 0;

// instead of static void incrementCount
namespace {
    void incrementCount() {
        count++;
    }
}

// Force the compiler not to skip incrementCount
void callIncrementCount() { incrementCount(); }

In light of the preceding post, a reasonable question to ask could be: "How does this work?" There are not many satisfactory answers available. Going straight to the primary source of authoritative but unsatisfactory answers on C++, cppreference.com states:

[An anonymous namespace definition] is treated as a definition of a namespace with unique name ... The unique name is unique over the entire program, but within a translation unit each [anonymous namespace] definition maps to the same unique name: multiple [anonymous namespace] definitions in the same scope denote the same [anonymous namespace].

Pardon the punctuation in the quote. cppreference.com also refers to anonymous namespaces as unnamed namespaces, which is a term that I have never heard used in practice. Beyond that, the definition is also incomplete. To see this, we can use the same approach as in the preceding post.

$ clang++ -c lib.cpp
$ nm lib.o
0000000000000000 T __Z18callIncrementCountv
0000000000000068 b __ZL5count
0000000000000014 t __ZN12_GLOBAL__N_114incrementCountEv
0000000000000000 t ltmp0
0000000000000068 b ltmp1
0000000000000028 s ltmp2

The line 0000000000000014 t __ZN12_GLOBAL__N_114incrementCountEv shows the symbol's address, type, and name, respectively. The lowercase type t indicates that the incrementCount function is not an exported symbol, unlike when it was not wrapped in an anonymous namespace.

Interestingly, this is different than if we were to use a regular namespace. For example:

static int count = 0;

namespace foo {
    void incrementCount() {
        count++;
    }
}

void callIncrementCount() { foo::incrementCount(); }

The corresponding symbols are below:

$ clang++ -c lib.cpp
$ nm lib.o
0000000000000014 T __Z18callIncrementCountv
0000000000000068 b __ZL5count
0000000000000000 T __ZN3foo14incrementCountEv
0000000000000000 t ltmp0
0000000000000068 b ltmp1
0000000000000028 s ltmp2

So, interestingly, anonymous namespaces do not just assign a unique name to a symbol; they also stop the symbol from being exported by the object file.
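
To see the practical consequence, consider a hypothetical second translation unit, main.cpp, that declares incrementCount and tries to call it:

// main.cpp (hypothetical second translation unit)

// Declare incrementCount in the hope that it is defined elsewhere
void incrementCount();

int main(int argc, char** argv) {
    incrementCount();
    return 0;
}

Linking main.cpp against lib.o should fail with an undefined symbol error: the declaration refers to __Z14incrementCountv, while the definition inside the anonymous namespace is the local symbol __ZN12_GLOBAL__N_114incrementCountEv, so the linker never matches the two.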

Another fun thing that I discovered while trying to prettify symbols is that extern "C" silently changes the behavior of anonymous namespaces, while producing an error when combined with the static keyword. The example:

static int count = 0;

namespace {
    extern "C" void incrementCount() {
        count++;
    }
}

void callIncrementCount() { incrementCount(); }

Compiles to the symbols:

$ clang++ -c lib.cpp
$ nm lib.o
0000000000000014 T __Z18callIncrementCountv
0000000000000068 b __ZL5count
0000000000000000 T _incrementCount
0000000000000000 t ltmp0
0000000000000068 b ltmp1
0000000000000028 s ltmp2

Whereas the example:

static int count = 0;

extern "C" static void incrementCount() {
    count++;
}

Fails to compile with the error:

clang++ -c lib.cpp
lib.cpp:3:12: error: cannot combine with previous 'extern' declaration specifier
    3 | extern "C" static void incrementCount() {
      |            ^
1 error generated.

This is kind of a toy example, but as I said, it is fun. The nice thing about the results of this post is that you can use anonymous namespaces as a drop-in replacement for the static keyword and trust that they behave the same for pretty much all intents and purposes (outside of the extern "C" interaction above).

How linkage works

Saturday, March 1, 2025

Sometimes you want to limit access to a symbol, for example a variable, to a single translation unit. The approach shared between C and C++ is to declare the symbol as static within that translation unit. An example is shown below:

// In lib.cpp

// Prevent count from being accessed outside of this translation unit
static int count = 0;

...

// In main.cpp

// Tell the compiler that count is defined in another translation unit 
extern int count;

int main(int argc, char** argv) {
    return count;
}

This example will throw an error when attempting to compile it. Specifically:

$ clang++ -c lib.cpp
$ clang++ main.cpp lib.o
Undefined symbols for architecture arm64:
  "_count", referenced from:
      _main in main-4c6fc0.o
ld: symbol(s) not found for architecture arm64
clang++: error: linker command failed with exit code 1 (use -v to see invocation)

I have probably read errors like this thousands of times, but I have never actually thought about the fact that this error is generated from the linker and not the compiler. How does the linker know that the C or C++ code indicates that the symbol should not be accessed from outside of the translation unit? Let us take a look at some examples.

First let us take an example with a non-static function:

static int count = 0;

void incrementCount() {
    count++;
}

We can compile this example and then show the symbols present in the corresponding object file. These object files are what the linker sees, so their symbol tables determine what can be accessed from other translation units.

$ clang++ -c lib.cpp
$ nm lib.o
0000000000000000 T __Z14incrementCountv
0000000000000038 b __ZL5count
0000000000000000 t ltmp0
0000000000000038 b ltmp1
0000000000000018 s ltmp2

The line 0000000000000000 T __Z14incrementCountv shows the symbol's address, type, and name, respectively. The type T indicates that the incrementCount function is part of the text section of the program, and the fact that the type is uppercase indicates that the symbol is exported.

If we consider another example with a static function:

static int count = 0;

static void incrementCount() {
    count++;
}

// Force the compiler not to skip incrementCount
void callIncrementCount() {
    incrementCount();
}

We can then follow the same compilation process and show the symbols as we did previously:

$ clang++ -c lib.cpp
$ nm lib.o
0000000000000000 T __Z18callIncrementCountv
0000000000000014 t __ZL14incrementCountv
0000000000000068 b __ZL5count
0000000000000000 t ltmp0
0000000000000068 b ltmp1
0000000000000028 s ltmp2

The line 0000000000000014 t __ZL14incrementCountv is similar to what we saw previously, with the exception that the type t is now lowercase. This means the symbol is no longer exported, though it still belongs to the text section of the object file.

Not only is this interesting, but it also explains where the terminology internal linkage and external linkage comes from. Internal linkage is when a symbol cannot be accessed from outside of the translation unit in which it is defined. External linkage, the default behavior, is when a symbol can be accessed from any translation unit.

C++ Trait Bounds

Sunday, October 16, 2022

High-performance generic programming is a killer feature in both Rust and C++. In each language the compiler can generate a concrete implementation of a generic program, which results in zero runtime overhead. However, the Rust compiler also allows you to specify the functionality that a concrete type must possess in order to be used in a generic context. This safety feature is referred to as a trait bound, and we can use similar ideas in C++ to achieve the same safety. Consider the following Rust code:

trait Shape {
    fn draw(self);
}

struct Circle;

impl Shape for Circle {
    fn draw(self) {
        ...
    }
}

We have an abstract method Shape::draw. Assume we have a generic function exec_draw that calls draw on the passed parameter. Rust requires that a bound be placed on the parameter's generic type by specifying a trait, in this case Shape, which means exec_draw can only be instantiated with concrete types that implement the Shape trait. This trait bound is shown below:

fn exec_draw<T: Shape>(c: T)
{
    T::draw(c);
}

The benefit here is that a different type, say Picasso, which does not explicitly implement Shape, cannot be passed to the exec_draw function without generating a compile-time error. This prevents accidental duck typing, and it is still a compile-time check with zero runtime overhead.

...

struct Picasso;

impl Picasso {
    fn draw(self) {
        ...
    }
}

...

fn main() {
    // Works as expected
    let circle = Circle;
    exec_draw(circle);

    // Generates compiler error "the trait `Shape` is not implemented for `Picasso`"
    let picasso = Picasso;
    exec_draw(picasso);
}

By default, the C++ compiler does not provide a similar check. See a comparable implementation in C++ below:

class Shape {
public:
    virtual void draw() = 0;
};

class Circle: public Shape {
public:
    void draw() {
        ...
    }
};

class Picasso {
public:
    void draw() {
        ...
    }
};

template<class T>
void exec_draw(T& x) {
    x->draw();
}

int main(int argc, char** argv) {
    // Works as expected
    auto circle = new Circle();
    exec_draw(circle);

    // Also works; the C++ compiler doesn't care
    auto picasso = new Picasso();
    exec_draw(picasso);
}

Maybe allowing duck typing is desirable behavior, but getting the safety that we have in Rust is fairly straightforward in C++. One might be inclined to do something like the following:

void exec_draw(Shape* x) {
    x->draw();
}

Although this works, it is not guaranteed to have zero runtime overhead. In this toy program the compiler is able to devirtualize and remove the virtual function call, but as additional shapes are added to the program this optimization can fail, and the result is poorer performance.

To get the same behavior with zero runtime overhead we can go back to templates and define a class called Can_copy, with a static constraints function containing the behavior that we want to test and a constructor that assigns a function pointer to that static function. The class is used only for encapsulation. In this case, the requirement is that T must be a Shape*, a pointer to a subclass of Shape, or a type with a user-defined conversion to one of the former.

...

template<class T1, class T2> struct Can_copy {
    static void constraints(T1 a, T2 b) { T2 c = a; b = a; }
    Can_copy() { void(*p)(T1,T2) = constraints; }
};

...

template<class T>
void exec_draw(T& x) {
    // Instantiating Can_copy<T, Shape*> forces its constraints function to be
    // type checked at compile time; the temporary object itself does nothing
    Can_copy<T, Shape*>();

    x->draw();
}

int main(int argc, char** argv) {
    // Works as expected
    auto circle = new Circle();
    exec_draw(circle);

    // Generates compiler error "error: incompatible pointer types assigning to 'Shape *' from 'Picasso *'"
    auto picasso = new Picasso();
    exec_draw(picasso);
}

This approach can be straightforwardly extended to arbitrary compile-time checks on the types used in generic code.
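
As an aside, if a C++20 compiler is available, the same bound can be expressed directly as a concept. The sketch below is written under that assumption, reusing the Shape, Circle, and Picasso classes from above; ShapePointer is simply a name chosen here for the constraint.

#include <concepts>

// Satisfied by any type that converts to Shape*, for example Circle* or a
// type with a user-defined conversion to Shape*.
template<class T>
concept ShapePointer = std::convertible_to<T, Shape*>;

template<ShapePointer T>
void exec_draw(T& x) {
    x->draw();
}

int main(int argc, char** argv) {
    // Works as expected
    auto circle = new Circle();
    exec_draw(circle);

    // Fails the ShapePointer constraint at compile time
    auto picasso = new Picasso();
    exec_draw(picasso);
}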

Note: This example is drawn from Bjarne Stroustrup's C++ Style and Technique FAQ. A key difference is that the approach is discussed there as a technique for generating better compiler errors, rather than for preventing accidental duck typing when using generics. Presenting the idea in the context of program safety is, in my opinion, more important.