Previously we looked at how the POSIX_FADV_DONTNEED hint influences the Linux page cache when doing IO via a filesystem. Here we take a look at two more filesystem hints: POSIX_FADV_RANDOM and POSIX_FADV_SEQUENTIAL.
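fio can exercise these hints directly. A minimal sketch, assuming a reasonably recent fio build in which fadvise_hint accepts string values (check fio --cmdhelp=fadvise_hint on your version); the file path is a placeholder:

fio --name=fadv-random --filename=/testfile --rw=read --bs=128k --fadvise_hint=random --time_based --runtime=30

The same job with --fadvise_hint=sequential (or the default behaviour) makes a useful comparison point for readahead.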
Using fio to read from Linux buffer-cache
Sometimes we want to read from the Linux buffer cache rather than the underlying device using fio. There are a couple of gotchas that might trip you up. Thankfully fio provides the required workarounds.
TL;DR
To get this to work as expected (reads are serviced from the buffer cache), the best way is to use the option invalidate=0 in the fio job file.
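As a minimal sketch (the file path, sizes and runtime here are my assumptions, not from the original post), a buffered read job that leaves the page cache intact might look like this:

[global]
# buffered IO (direct=0 is the default) so reads can be satisfied from the page cache
direct=0
# do not invalidate cached pages for this file before the run
invalidate=0
filename=/tmp/fiocachefile
size=1G
bs=128k
rw=read
time_based
runtime=30
[cached-read]

On the first pass the data still has to come from the device; subsequent runs (or a priming read beforehand) should then be served from DRAM.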
fio versions < 3.3 may show inflated random write performance
TL;DR
If your storage system implements inline compression, small IO size random write results with time_based and runtime may be inflated on fio versions < 3.3, because those versions generate unexpectedly compressible data when using fio's default data pattern. Although unintuitive, performance can often increase when compression is enabled, especially if the bottleneck is on the storage media, replication, or a combination of both.
Therefore, if you are comparing performance results generated with fio < 3.3 and fio >= 3.3, the random write performance on the same storage platform may appear reduced with the more recent fio versions.
fio-3.3 was released in December 2017, but older fio versions are still in use, particularly on distributions with long term support (LTS). For instance, Ubuntu 16, which is supported until 2026, ships with fio-2.2.10.
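If you are stuck on an older build it is worth checking the version, and as a possible workaround (my suggestion, not something from the original post) explicitly randomising the write buffers so the default pattern is never used:

fio --version
# refill_buffers regenerates buffer contents with random data on every submit
fio --name=randwrite --rw=randwrite --bs=8k --direct=1 --time_based --runtime=60 --refill_buffers --filename=/dev/sdX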
Specifying Drive letters with fio for Windows.
fio on Windows
Download pre-compiled fio binary for Windows
Example fio Windows job file, single drive
This will create a 1GB file called fiofile on the F:\ drive in Windows and then read it back. Notice that the specification is “Drive letter” “Backslash” “Colon” “Filename” – in fio terms we are “escaping” the : which fio traditionally uses as a file separator.
[global]
bs=1024k
size=1G
time_based
runtime=30
rw=read
direct=1
iodepth=8
[job1]
filename=F\:fiofile
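To run it, save the job to a file (the name below is hypothetical) and invoke fio.exe from a command prompt in the directory where fio is installed:

fio.exe fio-windows.job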
Hunting for bandwidth on a consumer NVMe drive
The Samsung SSD 970 EVO 500GB claims a sequential read bandwidth of 3400 MB/s; this is the story of trying to achieve that number.
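A reasonable starting point for that kind of test is sketched below (device name, ioengine and queue depth are my assumptions, not necessarily what the post settles on): large sequential reads, plenty of outstanding IO and the page cache bypassed.

fio --name=seqread --filename=/dev/nvme0n1 --rw=read --bs=1m --ioengine=libaio --iodepth=32 --direct=1 --time_based --runtime=30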
Beware of tiny working-set-sizes when testing storage performance.
I was recently asked to investigate why Nutanix storage was not as fast as a competing solution in a PoC environment. When I looked at the output from diskspd, the data didn’t quite make sense.
Using rwmixread and rate_iops in fio
Creating a mixed read/write workload with fio can be a bit confusing. Assume we want to create a fixed rate workload of 100 IOPS split 70:30 between reads and writes.
TL;DR
Specify the rate directly with rate_iops=<read-rate>,<write-rate>; do not try to combine rwmixread with rate_iops. For the example above, use:
rate_iops=70,30
Additionally, older versions of fio exhibit problems when using the poisson rate process (rate_process=poisson) together with rate_iops. fio version 3.7, which I was using, did not exhibit the problem.
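Putting the TL;DR into a complete job file might look like the sketch below (device, block size and runtime are illustrative assumptions):

[global]
filename=/dev/sdX
rw=randrw
bs=8k
direct=1
ioengine=libaio
time_based
runtime=60
[fixed-rate-70-30]
# 70 read IOPS and 30 write IOPS - no rwmixread needed
rate_iops=70,30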
Understanding fio norandommap and randrepeat parameters
The parameters norandommap and randrepeat significantly change the way that repeated random IO workloads are executed, and can meaningfully change the results of an experiment due to the way that caching works on most storage systems.
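As a sketch of the two options in use (device and job parameters are assumptions): randrepeat=0 gives a different random sequence on every run, so a repeat run is not an exact replay of the first and is less likely to be served from cache, while norandommap lets fio pick offsets without tracking coverage, so some blocks may be hit more than once and others not at all.

fio --name=randread --rw=randread --bs=8k --direct=1 --randrepeat=0 --norandommap --filename=/dev/sdX --time_based --runtime=60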
Identifying Optane drives in Linux
How to identify Optane drives in Linux using lspci.
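As a rough sketch (exact device strings vary by model), NVMe devices, including Optane, show up in lspci as Non-Volatile memory controllers:

lspci | grep -i "non-volatile"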
Microsoft diskspd Part 3. Oddities and FAQ
Tips and tricks for using diskspd, especially useful for those familiar with tools like fio.
Microsoft diskspd. Part 2 How to bypass NTFS Cache.
How to ensure performance testing with diskspd is stressing the underlying storage devices, not the OS filesystem.
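As a sketch based on the diskspd documentation (flags and target path are my assumptions, not lifted from the post), -Sh disables both the software (file system) cache and hardware write caching:

diskspd.exe -b8K -d60 -o32 -t4 -r -Sh F:\testfile.dat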
Microsoft diskspd. Part 1 Preparing to test.
How to install and setup diskspd before starting your first performance tests and avoiding wrong results due to null byte issues.
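A sketch of preparing a target while avoiding the null-byte pitfall (sizes, flags and path are assumptions; check the diskspd documentation for -c and -Z): create the test file and supply a random-data write source buffer so the writes are not simply runs of zeroes.

diskspd.exe -c10G -b8K -d60 -t4 -o32 -w100 -Z1M F:\testfile.dat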
How to identify NVMe drive types and test throughput
Why does my SSD not issue 1MB IOs?
First things First
Why do we tend to use 1MB IO sizes for throughput benchmarking?
To achieve the maximum throughput on a storage device, we will usually use a large IO size to maximize the amount of data transferred per IO request. The idea is to make the ratio of data transferred to IO requests as large as possible, reducing the CPU overhead of each request so we can get as close to the device bandwidth as possible. To take advantage of pre-fetching, and to reduce the need for head movement on rotational devices, a sequential pattern is used.
For historical reasons, many storage testers will use a 1MB IO size for sequential testing. A typical fio command line might look something like this:
fio --name=read --bs=1m --direct=1 --filename=/dev/sda
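One way to see what is actually happening (a sketch using standard Linux tooling, not necessarily how the post answers the question) is to check the block layer's request size limits in sysfs and watch the average request size the device really receives while the job runs:

# maximum request size the kernel will issue to this device, in KB
cat /sys/block/sda/queue/max_sectors_kb
cat /sys/block/sda/queue/max_hw_sectors_kb
# average request size reaching the device (column is avgrq-sz or areq-sz depending on the sysstat version)
iostat -x 2 /dev/sda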
How to identify SSD types and measure performance.
The real-world achievable SSD performance will vary depending on factors like IO size, queue depth and even CPU clock speed. It’s useful to know what the SSD is capable of delivering in the actual environment in which it’s used. I always start by looking at the performance claimed by the manufacturer. I use these figures to bound what is achievable. In other words, treat the manufacturer specs as “this device will go no faster than…”.
Identify SSD
Start by identifying the exact SSD type by using lsscsi. Note that the disks we are going to test are connected by ATA transport type, therefore the maximum queue depth that each device will support is 32.
# lsscsi
[1:0:0:0] cd/dvd QEMU QEMU DVD-ROM 2.5+ /dev/sr0
[2:0:0:0] disk ATA SAMSUNG MZ7LM1T9 404Q /dev/sda
[2:0:1:0] disk ATA SAMSUNG MZ7LM1T9 404Q /dev/sdb
[2:0:2:0] disk ATA SAMSUNG MZ7LM1T9 404Q /dev/sdc
[2:0:3:0] disk ATA SAMSUNG MZ7LM1T9 404Q /dev/
The marketing name for these Samsung SSDs is “SSD 850 EVO 2.5″ SATA III 1TB”.
Identify device specs
The spec sheet for this SSD claims the following performance characteristics.
Workload (Max) | Spec | Measured |
Sequential Read (QD=8) | 540 MB/s | 534 MB/s |
Sequential Write (QD=8) | 520 MB/s | 515 MB/s |
Read IOPS 4KB (QD=32) | 98,000 | 80,000 |
Write IOPS 4KB (QD=32) | 90,000 | 67,000 |
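For reference, a sketch of how the 4KB random read row might be measured with fio (device path, ioengine and runtime are my assumptions; the sequential rows follow the same pattern with bs=1m, rw=read or rw=write and iodepth=8):

fio --name=randread4k --filename=/dev/sda --rw=randread --bs=4k --ioengine=libaio --iodepth=32 --direct=1 --time_based --runtime=60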
Paper: A Nine year study of filesystem and storage benchmarking
A 2007 paper that still has a lot to say on the subject of benchmarking storage and filesystems. It is primarily aimed at researchers and developers, but it is relevant to anyone about to embark on a benchmarking effort.
- Use a mix of macro and micro benchmarks
- Understand what you are testing, cached results are fine – as long as that is what you had intended.
The authors are clear on why benchmarks remain important:
“Ideally, users could test performance in their own settings using real workloads. This transfers the responsibility of benchmarking from author to user. However, this is usually impractical because testing multiple systems is time consuming, especially in that exposing the system to real workloads implies learning how to configure the system properly, possibly migrating data and other settings to the new systems, as well as dealing with their respective bugs.”
We cannot expect end-users to be experts in benchmarking. It is our duty as experts to provide the tools (benchmarks) that enable users to make purchasing decisions without requiring years of benchmarking expertise.
Storage Bus Speeds 2018
Storage bus speeds with example storage endpoints.
Bus | Lanes | End-Point | Theoretical Bandwidth (MB/s) | Note |
SAS-3 | 1 | HBA <-> Single SATA Drive | 600 | SAS3<->SATA 6Gbit |
SAS-3 | 1 | HBA <-> Single SAS Drive | 1200 | SAS3<->SAS3 12Gbit |
SAS-3 | 4 | HBA <-> SAS/SATA Fanout | 4800 | 4 Lane HBA to Breakout (6 SSD)[2] |
SAS-3 | 8 | HBA <-> SAS/SATA Fanout | 8400 | 8 Lane HBA to Breakout (12 SSD)[1] |
PCIe-3 | 1 | N/A | 1000 | Single Lane PCIe3 |
PCIe-3 | 4 | PCIe <-> SAS HBA or NVMe | 4000 | Enough for Single NVMe |
PCIe-3 | 8 | PCIe <-> SAS HBA or NVMe | 8000 | Enough for SAS-3 4 Lanes |
PCIe-3 | 40 | PCIe Bus <-> Processor Socket | 40000 | Xeon direct connect to PCIe Bus |
Notes
All figures here are the theoretical maximums for the busses using rough/easy calculations for bits/s<->bytes/s. Enough to figure out where the throughput bottlenecks are likely to be in a storage system.
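As a worked example of the rough conversion: treating each byte on the wire as roughly ten bits (eight data bits plus encoding and protocol overhead), a 6 Gbit/s SATA link works out to about 600 MB/s and a 12 Gbit/s SAS-3 link to about 1200 MB/s, which is where the per-lane figures in the table come from.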
- SATA devices contain a single SAS/SATA port (connection), and even when they are connected to a SAS3 HBA, the SATA protocol limits each SSD device to ~600MB/s (single port, 6Gbit)
- SAS devices may be dual ported (two connections to the device from the HBA(s)) – each with a 12Gbit connection giving a potential bandwidth of 2x12Gbit == 2.4Gbyte/s (roughly) per SSD device.
- An NVMe device directly attached to the PCIe bus has access to a bandwidth of 4GB/s by using 4 PCIe lanes – or 8GB/s using 8 PCIe lanes. On current Xeon processors, a single socket attaches to 40 PCIe lanes directly (see diagram below) for a total bandwidth of 40GB/s per socket.
- https://en.wikipedia.org/wiki/Serial_ATA#SATA_3_Gbit/s_and_SATA_6_Gbit/s
- https://www.nextplatform.com/2017/07/14/system-bottleneck-shifts-pci-express/
- https://www.hardwarezone.com.sg/feature-understanding-intel-x99-high-performance-computing-platform
- I first started down the road of finally coming to grips with all the different busses and lane types after reading this excellent LSI paper. I omitted the SAS-2 figures from this article since modern systems use SAS-3 exclusively.
LSI SAS PCI Bottlenecks (PDF): https://www.n0derunner.com/wp-content/uploads/2018/04/LSI-SAS-PCI-Bottlenecks.pdf
Intel Processor & PCI connections
The return of misaligned IO
We have started seeing misaligned partitions on Linux guests running certain HDFS distributions. How these partitions became misaligned is a bit of a mystery, because the only way I know how to do this on Linux is to create a partition using the old DOS format (using -c=dos and -u=cylinders).
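For what it is worth, a quick way to check whether an existing partition is aligned (device names here are placeholders):

# a partition start sector that is a multiple of 2048 (1MiB on 512-byte sectors) is aligned;
# old DOS/cylinder layouts typically start at sector 63
fdisk -l /dev/sdb
# parted can check a partition against the device's reported optimal IO size
parted /dev/sdb align-check opt 1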
High Response time and low throughput in vCenter performance charts.
Often we are presented with a vCenter screenshot, and an observation that there are “high latency spikes”. In the example, the response time is indeed quite high – around 80ms.
Creating compressible data with fio.
Today I used fio to create some compressible data to test on my Nutanix nodes. I ended up using the following fio params to get what I wanted.
buffer_compress_percentage=50 refill_buffers buffer_pattern=0xdeadbeef
- buffer_compress_percentage does what you’d expect and specifies how compressible the data is
- refill_buffers is required to make the above compression percentage apply across the whole file. In other words, I want the entire file to be compressible by the buffer_compress_percentage amount
- buffer_pattern This is a big one. Without setting this pattern, fio will use null bytes to achieve compressibility, and Nutanix, like many other storage vendors, will suppress runs of zeros, so the data reduction will mostly come from zero suppression rather than from compression.
Much of this is well explained in the README for the latest version of fio.
Also note: older versions of fio do not support many of the fancy data creation flags, but will not alert you to the fact that they are being ignored. I spent quite a bit of time wondering why my data was not compressed, until I downloaded and compiled the latest fio.
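Putting those parameters together, a complete job file might look like the sketch below (block size, file size, filename and the write pattern are illustrative assumptions; the three data-pattern options are the ones discussed above):

[global]
bs=64k
size=4G
rw=write
direct=1
# target roughly 50% compressible data
buffer_compress_percentage=50
# regenerate buffer contents on every submit so the whole file follows the pattern
refill_buffers
# non-zero pattern so data reduction is not just zero suppression
buffer_pattern=0xdeadbeef
[compressible-write]
filename=/testfile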