Measuring storage performance
I would like to collect a baseline for the performance of a modern filesystem and SSD. While this information is freely available on the internet, it is not difficult to collect firsthand.
For my test I will be using fio and a 2022 M2 MacBook Air that I own. This MacBook uses the APFS filesystem and has 256 GB of storage. I mention the storage capacity only because the M2 MacBook Air with a 256 GB SSD was repeatedly called out for having slow IO, since it has a single chip rather than the two 128 GB chips in the M1 MacBook Air.
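fio is not part of macOS, so it needs to be installed first; assuming Homebrew is available, that looks like:
$ brew install fio
$ fio --version
fio-3.39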
Setup
Let us start by pulling some information about the disk.
$ diskutil info /
Device Identifier: disk3s1s1
Device Node: /dev/disk3s1s1
Whole: No
Part of Whole: disk3
Volume Name: Macintosh HD
Mounted: Yes
Mount Point: /
Partition Type: 41504653-0000-11AA-AA11-00306543ECAC
File System Personality: APFS
Type (Bundle): apfs
Name (User Visible): APFS
Owners: Enabled
OS Can Be Installed: No
Booter Disk: disk3s2
Recovery Disk: disk3s3
Media Type: Generic
Protocol: Apple Fabric
SMART Status: Verified
Volume UUID: DC0B9406-4DDE-4A3B-B63A-53CD93A2E08F
Disk / Partition UUID: DC0B9406-4DDE-4A3B-B63A-53CD93A2E08F
Disk Size: 245.1 GB (245107195904 Bytes) (exactly 478724992 512-Byte-Units)
Device Block Size: 4096 Bytes
...
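The device block size reported above is the one figure we will reuse when configuring the benchmark; it can be pulled out on its own with a quick grep:
$ diskutil info / | grep "Device Block Size"
Device Block Size: 4096 Bytes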
To configure the benchmark we need to write a job file, benchmark.ini. This will specify that storage operations will be asynchronous, will not be cached, et cetera. It will also make use of some of the information about our disk from above.
$ cat benchmark.ini
[global]
ioengine=posixaio ; Use asynchronous I/O for "Better performance"
direct=1 ; Do not use buffered IO operations
bs=4k ; Block size of 4 KiB, matching the APFS block size (above)
size=1G ; File size of 1 GiB per job
runtime=60 ; Run for 60 seconds
time_based ; Ensure the test runs for the full duration
group_reporting ; Aggregate results across jobs
[job1]
rw=randrw ; Perform random reads and writes
rwmixread=80 ; 80% reads, 20% writes
iodepth=32 ; Queue depth for asynchronous I/O
numjobs=1 ; A single process will perform reads and writes
filename=/... ; A file the user has write access to below the APFS mount point (above)
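Before running the benchmark it can be worth confirming that fio accepts the job file. The --parse-only flag parses the options without starting any IO:
$ fio --parse-only benchmark.ini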
Results
After configuring the benchmark it can be run with the straightforward invocation of fio below.
$ fio benchmark.ini
job1: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=32
fio-3.39
Starting 1 process
Jobs: 1 (f=1): [m(1)][100.0%][r=92.8MiB/s,w=23.8MiB/s][r=23.8k,w=6084 IOPS][eta 00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=23272: Wed Mar 19 10:42:33 2025
read: IOPS=28.1k, BW=110MiB/s (115MB/s)(6598MiB/60003msec)
slat (nsec): min=0, max=45000, avg=346.23, stdev=700.18
clat (usec): min=80, max=72139, avg=446.26, stdev=688.56
lat (usec): min=80, max=72140, avg=446.61, stdev=688.57
clat percentiles (usec):
| 1.00th=[ 212], 5.00th=[ 247], 10.00th=[ 265], 20.00th=[ 285],
| 30.00th=[ 297], 40.00th=[ 314], 50.00th=[ 330], 60.00th=[ 343],
| 70.00th=[ 363], 80.00th=[ 392], 90.00th=[ 453], 95.00th=[ 1303],
| 99.00th=[ 2507], 99.50th=[ 3851], 99.90th=[ 4228], 99.95th=[ 4490],
| 99.99th=[29492]
bw ( KiB/s): min=20796, max=130828, per=100.00%, avg=112743.83, stdev=15552.87, samples=119
iops : min= 5199, max=32707, avg=28185.59, stdev=3888.20, samples=119
write: IOPS=7046, BW=27.5MiB/s (28.9MB/s)(1652MiB/60003msec); 0 zone resets
slat (nsec): min=0, max=47000, avg=492.36, stdev=849.72
clat (usec): min=90, max=69969, avg=482.18, stdev=679.77
lat (usec): min=90, max=69973, avg=482.68, stdev=679.81
clat percentiles (usec):
| 1.00th=[ 245], 5.00th=[ 281], 10.00th=[ 297], 20.00th=[ 314],
| 30.00th=[ 330], 40.00th=[ 343], 50.00th=[ 359], 60.00th=[ 375],
| 70.00th=[ 396], 80.00th=[ 424], 90.00th=[ 494], 95.00th=[ 1450],
| 99.00th=[ 2769], 99.50th=[ 3916], 99.90th=[ 4228], 99.95th=[ 4490],
| 99.99th=[23725]
bw ( KiB/s): min= 5346, max=32456, per=100.00%, avg=28222.54, stdev=3855.00, samples=119
iops : min= 1336, max= 8114, avg=7055.30, stdev=963.78, samples=119
lat (usec) : 100=0.01%, 250=4.63%, 500=86.82%, 750=1.83%, 1000=0.67%
lat (msec) : 2=3.29%, 4=2.49%, 10=0.25%, 20=0.01%, 50=0.01%
lat (msec) : 100=0.01%
cpu : usr=7.90%, sys=7.99%, ctx=1539978, majf=0, minf=9
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=51.1%, 16=48.9%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=98.9%, 8=1.1%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=1688969,422787,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=32
Run status group 0 (all jobs):
READ: bw=110MiB/s (115MB/s), 110MiB/s-110MiB/s (115MB/s-115MB/s), io=6598MiB (6918MB), run=60003-60003msec
WRITE: bw=27.5MiB/s (28.9MB/s), 27.5MiB/s-27.5MiB/s (28.9MB/s-28.9MB/s), io=1652MiB (1732MB), run=60003-60003msec
Input/Output Operations Per Second (IOPS)
IOPS measure the frequency of IO operations over the duration of the benchmark. We see 28,100 read IOPS and 7,046 write IOPS, which matches the benchmark's 80% read mix (79.95% reads, to be precise). In total, we can expect this MacBook to sustain roughly 35,000 IOPS for this workload.
For reads and writes of a fixed size, which is the case with this benchmark, IOPS convey the same information as bandwidth, which tracks throughput in bytes per second. The benchmark reported a combined bandwidth of 137.5 MiB per second (110 MiB/s read plus 27.5 MiB/s write), which works out to roughly the number of IOPS times the block size: 35,000 * 4 KiB ≈ 137 MiB/s.
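As a quick sanity check of that arithmetic, using the rounded figures from the report and bc on the command line:
$ echo "scale=4; 28100 / (28100 + 7046)" | bc
.7995
$ echo "scale=1; (28100 + 7046) * 4 / 1024" | bc
137.2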
If the benchmark were instead issuing IO below the filesystem block size, you could expect to see bandwidth drop: the drive performs roughly the same number of operations per second, but each operation moves fewer bytes. For example, if the benchmark wrote 2 KiB chunks, we should expect roughly half of the bandwidth measured above. This is left as an exercise to the reader.
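For anyone trying that exercise, one way to set it up is to halve the block size in benchmark.ini and rerun fio (sed -i '' is the BSD/macOS in-place syntax; the command rewrites the bs line, comment included):
$ sed -i '' 's/^bs=4k.*/bs=2k ; Block size of 2 KiB, half the APFS block size/' benchmark.ini
$ fio benchmark.ini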
Latency
There are three measures of latency found in the report:
- Submission latency (slat): The time it takes to submit the IO operation to the OS IO queue.
- Completion latency (clat): The time it takes for the OS to process the IO operation and remove it from the queue.
- Total latency (lat): The sum of submission and completion latencies.
With that in mind, we can expect each read operation to take 446.61 μs on average and each write operation to take 482.68 μs on average. It is also useful to view the latency percentiles for clat. For example, the median (50th percentile) write operation completes in 359 μs, well below the average, so we know the latency distribution skews high.