Measuring storage performance

Wednesday, March 19, 2025

I would like to collect a baseline for the performance of a modern filesystem and SSD. While this information is freely available on the internet, it is also not difficult to collect first hand.

For my test I will be using fio and a 2022 M2 MacBook Air that I own. This MacBook uses the APFS filesystem and has 256 GB of storage. I mention the storage capacity only because the M2 MacBook Air with a 256 GB SSD was repeatedly called out for having slow IO: it has a single storage chip rather than the two 128 GB chips found in the M1 MacBook Air.

Setup

Let us start by pulling some information about the disk.

$ diskutil info /
   Device Identifier:         disk3s1s1
   Device Node:               /dev/disk3s1s1
   Whole:                     No
   Part of Whole:             disk3

   Volume Name:               Macintosh HD
   Mounted:                   Yes
   Mount Point:               /

   Partition Type:            41504653-0000-11AA-AA11-00306543ECAC
   File System Personality:   APFS
   Type (Bundle):             apfs
   Name (User Visible):       APFS
   Owners:                    Enabled

   OS Can Be Installed:       No
   Booter Disk:               disk3s2
   Recovery Disk:             disk3s3
   Media Type:                Generic
   Protocol:                  Apple Fabric
   SMART Status:              Verified
   Volume UUID:               DC0B9406-4DDE-4A3B-B63A-53CD93A2E08F
   Disk / Partition UUID:     DC0B9406-4DDE-4A3B-B63A-53CD93A2E08F

   Disk Size:                 245.1 GB (245107195904 Bytes) (exactly 478724992 512-Byte-Units)
   Device Block Size:         4096 Bytes
   ...

To configure the benchmark we need to write a job file, benchmark.ini. This specifies, among other things, that storage operations will be asynchronous and will not be cached. It also makes use of some of the disk information gathered above.

$ cat benchmark.ini
[global]
ioengine=posixaio      ; Use asynchronous I/O for "Better performance"
direct=1               ; Do not use buffered IO operations
bs=4k                  ; Block size of 4 KiB, matching the APFS block size (above)
size=1G                ; File size of 1 GiB per job
runtime=60             ; Run for 60 seconds
time_based             ; Ensure the test runs for the full duration
group_reporting        ; Aggregate results across jobs

[job1]
rw=randrw              ; Perform random reads and writes
rwmixread=80           ; 80% reads, 20% writes
iodepth=32             ; Queue depth for asynchronous I/O
numjobs=1              ; A single process will perform reads and writes
filename=/...          ; A file the user has write access to below the APFS mount point (above)
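
Before running anything, the job file can be validated without issuing any IO. As a quick sanity check on the configuration, fio's --parse-only flag parses the options and starts no jobs:

$ fio --parse-only benchmark.ini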

Results

After configuring the benchmark, it can be run with the straightforward fio invocation below.

$ fio benchmark.ini
job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=32
fio-3.39
Starting 1 process
Jobs: 1 (f=1): [m(1)][100.0%][r=92.8MiB/s,w=23.8MiB/s][r=23.8k,w=6084 IOPS][eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=23272: Wed Mar 19 10:42:33 2025
  read: IOPS=28.1k, BW=110MiB/s (115MB/s)(6598MiB/60003msec)
    slat (nsec): min=0, max=45000, avg=346.23, stdev=700.18
    clat (usec): min=80, max=72139, avg=446.26, stdev=688.56
     lat (usec): min=80, max=72140, avg=446.61, stdev=688.57
    clat percentiles (usec):
     |  1.00th=[  212],  5.00th=[  247], 10.00th=[  265], 20.00th=[  285],
     | 30.00th=[  297], 40.00th=[  314], 50.00th=[  330], 60.00th=[  343],
     | 70.00th=[  363], 80.00th=[  392], 90.00th=[  453], 95.00th=[ 1303],
     | 99.00th=[ 2507], 99.50th=[ 3851], 99.90th=[ 4228], 99.95th=[ 4490],
     | 99.99th=[29492]
   bw (  KiB/s): min=20796, max=130828, per=100.00%, avg=112743.83, stdev=15552.87, samples=119
   iops        : min= 5199, max=32707, avg=28185.59, stdev=3888.20, samples=119
  write: IOPS=7046, BW=27.5MiB/s (28.9MB/s)(1652MiB/60003msec); 0 zone resets
    slat (nsec): min=0, max=47000, avg=492.36, stdev=849.72
    clat (usec): min=90, max=69969, avg=482.18, stdev=679.77
     lat (usec): min=90, max=69973, avg=482.68, stdev=679.81
    clat percentiles (usec):
     |  1.00th=[  245],  5.00th=[  281], 10.00th=[  297], 20.00th=[  314],
     | 30.00th=[  330], 40.00th=[  343], 50.00th=[  359], 60.00th=[  375],
     | 70.00th=[  396], 80.00th=[  424], 90.00th=[  494], 95.00th=[ 1450],
     | 99.00th=[ 2769], 99.50th=[ 3916], 99.90th=[ 4228], 99.95th=[ 4490],
     | 99.99th=[23725]
   bw (  KiB/s): min= 5346, max=32456, per=100.00%, avg=28222.54, stdev=3855.00, samples=119
   iops        : min= 1336, max= 8114, avg=7055.30, stdev=963.78, samples=119
  lat (usec)   : 100=0.01%, 250=4.63%, 500=86.82%, 750=1.83%, 1000=0.67%
  lat (msec)   : 2=3.29%, 4=2.49%, 10=0.25%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%
  cpu          : usr=7.90%, sys=7.99%, ctx=1539978, majf=0, minf=9
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=51.1%, 16=48.9%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=98.9%, 8=1.1%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=1688969,422787,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=110MiB/s (115MB/s), 110MiB/s-110MiB/s (115MB/s-115MB/s), io=6598MiB (6918MB), run=60003-60003msec
  WRITE: bw=27.5MiB/s (28.9MB/s), 27.5MiB/s-27.5MiB/s (28.9MB/s-28.9MB/s), io=1652MiB (1732MB), run=60003-60003msec

Input/Output Operations Per Second (IOPS)

IOPS measures the number of IO operations completed per second over the course of the benchmark. We see 28,100 read IOPS and 7,046 write IOPS, which matches the configured mix of 80% reads (79.95% to be precise). Combined, we can expect this MacBook to do roughly 35,000 IOPS.
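
As a quick check of that mix, assuming python3 is available:

$ python3 -c 'print(f"{28100 / (28100 + 7046):.2%}")'
79.95%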

For reads and writes of a fixed size, which is the case in this benchmark, IOPS provides the same information as bandwidth, which tracks throughput. The benchmark reported a combined bandwidth of 137.5 MiB per second, which works out to roughly the number of IOPS times the block size, that is 35,000 * 4 KiB.
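
The same arithmetic can be checked from the shell, again assuming python3 is available; the small difference from the reported 137.5 MiB/s comes from rounding the IOPS figures above.

$ python3 -c 'print(f"{(28100 + 7046) * 4 / 1024:.1f} MiB/s")'
137.3 MiB/s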

If the benchmark instead issued operations smaller than the filesystem block size, we could expect bandwidth to drop while IOPS stays roughly the same: each operation still costs one IO but moves less data. For example, with 2 KiB operations we should expect roughly half the bandwidth measured above. This is left as an exercise to the reader.
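
One way to run that variant, assuming the same job file, is to copy it and change only the block size (macOS sed requires the empty string argument for in-place edits):

$ cp benchmark.ini benchmark-2k.ini
$ sed -i '' 's/^bs=4k/bs=2k/' benchmark-2k.ini
$ fio benchmark-2k.ini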

Latency

There are three measures of latency found in the report:

  1. Submission latency (slat): The time it takes to submit the IO operation to the OS IO queue.
  2. Completion latency (clat): The time it takes for the OS to process the IO operation and remove it from the queue.
  3. Total latency (lat): The sum of completion and submission latencies.

With that in mind, we can expect each read operation to take 446.61 μs on average and each write operation to take 482.68 μs on average. It is also useful to view the latency percentiles for clat. For example, the median (50th percentile) write operation takes 359 μs, well below the 482 μs average, so we know the latency distribution skews high: a small tail of slow operations pulls the mean up.
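
If you want to pull these numbers out programmatically rather than reading the human-readable report, fio can also emit JSON. A minimal sketch, assuming jq is installed; the exact key names can vary between fio versions, and the clat_ns values are reported in nanoseconds rather than microseconds:

$ fio --output-format=json benchmark.ini > results.json
$ jq '.jobs[0].write.clat_ns.mean, .jobs[0].write.clat_ns.percentiles["50.000000"]' results.json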