Often we are presented with a vCenter screenshot, and an observation that there are “high latency spikes”. In the example, the response time is indeed quite high – around 80ms.
Creating compressible data with fio.
Today I used fio to create some compressible data to test on my Nutanix nodes. I ended up using the following fio params to get what I wanted.
buffer_compress_percentage=50 refill_buffers buffer_pattern=0xdeadbeef
- buffer_compress_percentage does what you’d expect and specifies how compressible the data is
- refill_buffers is required to make the above compression percentage apply across the whole file. In other words, I want the entire file to be compressible by the buffer_compress_percentage amount
- buffer_pattern This is a big one. Without setting this pattern, fio will use null bytes to achieve compressibility, and Nutanix, like many other storage vendors, suppresses runs of zeros, so the data reduction would come mostly from zero suppression rather than from compression.
Much of this is well explained in the README for the latest version of fio.
Also note that older versions of fio do not support many of these data-creation flags, but will not alert you to the fact that they are being ignored. I spent quite a bit of time wondering why my data was not compressed, until I downloaded and compiled the latest fio.
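For reference, here is a minimal job file along the lines of what I ended up with. The file name, size and block size are just placeholders for illustration – the three options above are the part that matters:

[global]
bs=1m
rw=write
direct=1
size=4g
refill_buffers
buffer_compress_percentage=50
buffer_pattern=0xdeadbeef

[compressible-fill]
filename=/data/fio-compressible-file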
Code Red – The healthcare.gov story.
A downtime classic: for several months in 2013, the troubles of one particular website were front-page news across the US. Full Story from Time Magazine (PDF)
Cache behavior – How long will it take to fill my cache?
When benchmarking filesystems or storage, we need to understand the caching effects. Most often this involves filling the cache and reaching steady state. But how long will it take to fill a cache of a given size? The answer depends of course on the size of the cache, the IO size and the IO rate. So, to simplify, let's just say that a cache consists of some number of entries. For instance, a 4GB cache would have 1 million 4KB entries. In my example this is simply a 1M entry cache.
In terms of time to fill the cache, it’s simpler to think about how many entries will need to be read before the cache is filled.
For a random workload, it will be more than 1M “reads”. Let’s see why.
The first read will be inserted into the cache. The second read will probably be inserted into the cache too, but there is a small (1 in 1,000,000) chance that it is already in the cache, since the reads are random. As the cache gets fuller, the chance of a given read already being present in the cache increases. As a result it will take a lot more than 1 million reads to populate the entire cache with a random read workload.
The question is this: is it possible to predict how many “reads” it will take to fill the cache?
The experiment.
In this experiment, we create an array to represent the cache. It has 1M entries. Then, using a random number generator, we simulate the workload and measure how long it takes to populate the cache.
Results
After 1,000,000 “reads” there are 633,000 positive entries (entries that have data in them). So what happened to the other 367,000? The 367,000 represent cache “hits” on an existing entry. Since the read “workload” is 100% random, there is some chance that a subsequent read will be for an entry that is already cached. Over the life of 1,000,000 reads around 37% are for an entry that is already cached.
After 2,000,000 reads the cache contains 864,000 entries. Another 1,000,000 reads brings it to 950,000.
The fuller that the cache becomes, the fewer new entries are added. Intuitively this makes sense because as the cache becomes more full, more of the “random reads” are satisfied by an existing cache entry.
In my experiments it takes about 17,000,000 “reads” to ensure that every cache entry is filled in a 1M entry cache. Here are the data, iteration by iteration (each iteration is another 1,000,000 reads).
Iteration | Positive Entries | Empty Slots |
1 | 631998 | 368002 |
2 | 864334 | 135666 |
3 | 950184 | 49816 |
4 | 981630 | 18370 |
5 | 993266 | 6734 |
6 | 997577 | 2423 |
7 | 999080 | 920 |
8 | 999660 | 340 |
9 | 999879 | 121 |
10 | 999951 | 49 |
11 | 999985 | 15 |
12 | 999996 | 4 |
13 | 999998 | 2 |
14 | 999998 | 2 |
15 | 999999 | 1 |
16 | 999999 | 1 |
17 | 1000000 | 0 |
18 | 1000000 | 0 |
- For 500,000 Entries it takes 15 iterations to fill all the entries.
- For 2,000,000 Entries it takes 19 iterations to fill all the entries.
Interestingly, the ratio of positive to empty entries after one iteration is always about 0.632:0.368
- 0.368 is roughly 1/e
- 0.632 is roughly 1-(1/e).
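That is no coincidence. If reads are uniformly random across an N-entry cache, the chance that any particular entry is still empty after n reads is (1 - 1/N)^n, which is approximately e^(-n/N). So after n = N reads roughly 1/e ≈ 0.368 of the entries are still empty and 1 - 1/e ≈ 0.632 are filled. After 2N reads the filled fraction is about 1 - e^(-2) ≈ 0.865, and after 3N reads about 1 - e^(-3) ≈ 0.950, which matches the table above almost exactly. Filling the very last entry is the classic coupon-collector problem: the expected number of reads is N times the Nth harmonic number, roughly N ln N, or about 14 million for a 1M entry cache, and any single run can take a few million more than that.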
SQL*Server on Nutanix. Force backups to HDD.
As an experiment, I wanted to (a) create an HDD-only container, and (b) measure the bandwidth I could achieve when backing up the SQL DB. This was performed on a standard hybrid platform with only 4 HDDs in the node.
First create a container, but add the special options “sequential-io-priority-order=DAS-SATA random-io-priority-order=DAS-SATA”, which means that all IO will be directed to the HDDs only. This also means that data on this container will never be migrated up to the hot tier. That is just fine for a backup that will hopefully never be read, and if it is, only once, sequentially.
ncli> ctr create name=cold-only sequential-io-priority-order=DAS-SATA random-io-priority-order=DAS-SATA sp-name=all
ncli> datastore create name=cold ctr-name=cold-only
Next create a vDisk in that container – this disk will contain the SQL Server backup data.
Format and initialize the drive.
Add backup targets to the drive. Adding multiple targets increases throughput because SQL Server will generate 1-2 outstanding IO’s per target. I created 16 total targets (these are just files).
The first backup is a little slow (~64MB/s) because we’re creating the files. The second (and subsequent) backups go faster, at around 120 MB/s, writing directly to the HDD spindles on a single node with 4 HDDs.
This backup stream drives around 25MB/s per HDD spindle on the Nutanix node. On a larger platform with more spindles we could easily drive 500MB/s (20 spindles at 25MB/s each), and still skip SSD by writing directly to HDD.
Completed backup:
Things to know when using vdbench.
Recently I found that vdbench was not giving me the amount of outstanding IO that I had intended to configure using the “threads=N” parameter. It turned out that on Linux, most of the filesystems (ext2, ext3 and ext4) do not support concurrent directIO, although they do support directIO. This was a bit of a shock coming from Solaris, which has had concurrent directIO since 2001.
All the Linux filesystems I tested allow multiple outstanding IO’s if the IO is submitted using asynchronous IO (AKA asyncIO or AIO) but not when using multiple writer threads (except XFS). Unfortunately vdbench does not allow AIO since it tries to be platform agnostic.
fio however does allow either threads or AIO to be used and so that’s what I used in the experiments below.
The column fio QD is the amount of outstanding IO, or Queue Depth, that is intended to be passed to the storage device. The column iostat QD is the actual Queue Depth seen by the device. The iostat QD is not “8” because the response time is so low that fio cannot issue the IO’s quickly enough to maintain the intended queue depth. The final column is the number of fio threads/processes, counted with “ps -efT | grep fio | wc -l”.
Device | fio QD | fio QD Type | direct | iostat QD | fio thread count |
/dev/sd | 8 | libaio | Yes | 7 | 5 |
/dev/sd | 8 | Threads | Yes | 7 | 12 |
ext2 fs (mke2fs) | 8 | Threads | Yes | 1 | 12 |
ext2 fs (mke2fs) | 8 | libaio | Yes | 7 | 5 |
ext3 (mkfs -t ext3) | 8 | Threads | Yes | 1 | 12 |
ext3 (mkfs -t ext3) | 8 | libaio | Yes | 7 | 5 |
ext4 (mkfs -t ext4) | 8 | Threads | Yes | 1 | 12 |
ext4 (mkfs -t ext4) | 8 | libaio | Yes | 7 | 5 |
xfs (mkfs -t xfs) | 8 | Threads | Yes | 7 | 12 |
xfs (mkfs -t xfs) | 8 | libaio | Yes | 7 | 5 |
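If you want to watch the achieved queue depth for yourself, something along these lines works – the device name is just a placeholder, and depending on your sysstat version the queue-depth column is labelled avgqu-sz or aqu-sz:

iostat -x sdb 3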
At any rate, all is not lost – using raw devices (/dev/sdX) will give concurrent directIO, as will XFS. These issues are well known by Linux DB guys, and I found interesting articles from Percona and Kevin Closson after I finally figured out what was going on with vdbench.
fio “scripts”
For the “threads” case.
[global]
bs=8k
ioengine=sync
iodepth=8
direct=1
time_based
runtime=60
numjobs=8
size=1800m

[randwrite-threads]
rw=randwrite
filename=/a/file1
For the “aio” case
[global]
bs=8k
ioengine=libaio
iodepth=8
direct=1
time_based
runtime=60
size=1800m

[randwrite-aio]
rw=randwrite
filename=/a/file1
SATA on Nutanix. Some experimental data.
The question of why Nutanix uses SATA drives comes up sometimes, especially from customers who have experienced very poor performance using SATA on traditional arrays.
I can understand this anxiety. In my time at NetApp we exclusively used SAS or FC-AL drives for performance test work. At the time there was a huge difference in performance between SCSI and SATA. Even a few short years ago, FC typically spun at 15K RPM whereas SATA was stuck at about 5K RPM, so it suffered 3X the rotational delay.
These days SAS and SATA are both available in 7200 RPM configurations, and these are the type we use in standard Nutanix nodes. In fact the SATA drives that we use are marketed by Seagate as “Nearline SAS” or NL-SAS, mainly to differentiate them from the consumer-grade SATA drives found in cheap laptops. There are hundreds of SAS vs SATA articles on the web, so I won’t go over the theoretical/historical arguments.
SATA in Hybrid/Tiered Storage
In a Nutanix cluster the “heavy lifting” of IO is mainly done by the SSD’s, leaving the SATA drives to service the few remaining IO’s that miss the SSD tier. Under moderate load the SATA spindles do pretty well, and since the SATA $/GB is only 60% of SAS, SATA seems like a good choice for mostly-cold data.
Let’s Experiment.
From a performance perspective, I decided to run a few experiments to see just how well SATA performs. In the test, the SATA drives are Nutanix standard drives “ST91000640NS” (Seagate, priced around $150). The comparable SAS drives are the same form-factor (2.5 inch) “AL13SEB900” (Toshiba, priced at about $250 USD). The SAS drives spin at 10K RPM. Both drives hold around 1TB.
There are three experiments per drive type, to reveal the impact of seek times. This is achieved using the “filesize” parameter of fio, which determines the LBA range to read (1g, 100g and 1000g for the three working-set sizes). One thing to note is that I use a queue-depth of one, so IOPS can be calculated as simply 1/Response-Time (converted to seconds); for example, 12.5 ms per IO works out to 80 IOPS.
[global]
bs=8k
rw=randread
iodepth=1
ioengine=libaio
time_based
runtime=10
direct=1
filesize=1g

[randread]
filename=/dev/sdf1
Random Distribution. SATA Vs SAS
Working Set Size | 7.2K RPM SATA Response Time (ms) | 10K RPM SAS Response Time (ms) |
1 GB | 5.5 | 4 |
100 GB | 7.5 | 4.5 |
1000 GB | 12.5 | 7 |
Zipf Distribution. SATA Only.
Working Set Size | Response Time (ms) |
1000G | 8.5 |
Somewhat intuitively, as the working-set (and therefore the seek distance) gets larger, the difference between “Real SAS” and “NL-SAS/SATA” gets wider. This is intuitive because with a 1GB working-set the seek-time is close to zero, so only the rotational delay (based on RPM) is a factor. In fact the difference in response time is about the same as the difference in rotational speed (1:1.3).
Also, just for fun, I used the “random_distribution=zipf” option in fio to test the response time when reading across the entire range of the disk, but with a “hotspot” (zipf) rather than a uniform random spread, since fully uniform random reads across an entire disk are pretty unrealistic.
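For reference, the zipf run just adds the distribution option to the same job file; the theta value shown here (1.2) is an illustrative placeholder rather than necessarily the exact value I used:

[global]
bs=8k
rw=randread
iodepth=1
ioengine=libaio
time_based
runtime=10
direct=1
filesize=1000g
random_distribution=zipf:1.2

[randread-zipf]
filename=/dev/sdf1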
In this “realistic” case, reading across the entire disk, the SATA drives shipped with Nutanix nodes are capable of an 8.5 ms response time at around 125 IOPS per spindle.
Conclusion
The performance difference between SAS and SATA is often over-stated. At moderate loads SATA performs well enough for most use-cases. Even when delivering fully random IO over the entirety of the disk, SATA can deliver an 8K read in less than 15ms. Using a more realistic (not 100% random) access pattern the response time is < 10ms.
For a properly sized Nutanix implementation, the intent is to service most IO from Flash. It’s OK to generate some work on HDD from time-to-time even on SATA.
Impact of Paravirtual SCSI driver VS LSI Emulation with Data.
TL;DR Comparison of Paravirtual SCSI vs emulated SCSI, with measurements. PVSCSI gives measurably better response times at high load.
During a performance debugging session, I noticed that the response time on two of the SCSI devices was much higher than the others (Linux host under VMware ESX). The difference was unexpected since all the devices were part of the same stripe doing a uniform synthetic workload.
The immediate observation is that queue length is higher, as is wait time. All these devices reside on the same back-end storage so I am looking for something else. When I traced back the devices it turned out that the “slow devices” were attached to LSI emulated controllers in ESX. Whereas the “fast devices” are attached to para-virtual controllers.
I was surprised to see how much difference using para virtual (PV) SCSI drivers made to the guest response time once IOPS started to ramp up. In these plots the y-axis is iostat “await” time. The x-axis is time (each point is a 3 second average).
PVSCSI = Grey Dots
LSI Emulated SCSI = Red Dots
Lower is better.
Each plot is from a workload which uses a different offered IO rate. The offered rates are 8,000, 9,000 and 10,000; the storage is able to meet these rates even though latency increases, because there is a lot of outstanding IO. The workload is mixed read/write with bursts.
After converting sdh and sdi to PV SCSI the response time is again uniform across all devices.
SuperScalin’: How I learned to stop worrying and love SQL Server on Nutanix.
TL;DR It’s pretty easy to get 1M SQL TPM running a TPC-C like workload on a single Nutanix node. Use 1 vDisk for log files and 6 vDisks for data files. SQL Server needs enough CPU and RAM to drive it: I used 16 vCPU’s and 64GB of RAM.
Running database servers on Nutanix is an increasing trend and DBA’s are naturally skeptical about moving their DB’s to new platforms. I recently had the chance to run some DB benchmarks on a couple of nodes in our lab. My goal was to achieve 1M SQL transactions per node, and have that be linearly scalable across multiple nodes.
It turned out to be ridiculously easy to generate decent numbers using SQL Server. As a Unix and Oracle old-timer it was a shock to me, just how simple it is to throw up a SQL server instance. In this experiment, I am using Windows Server 2012 and SQL-Server 2012.
For the test DB I provision 1 Disk for the SQL log files, and 6 disks for the data files. Temp and the other system DB files are left unchanged. Nothing is tuned or tweaked on the Nutanix side, everything is setup as per standard best practices – no “benchmark specials”.
Load is being generated by HammerDB configured to run the OLTP database workload. I get a little over 1 million SQL transactions per minute (TPM) on a single Nutanix node. The scaling is more-or-less linear, yielding 4.2 million TPM with 4 Nutanix nodes, which fit in a single 2U chassis. Each node is running both the DB itself and the shared storage using NDFS. I stopped at 6 nodes, because that’s all I had access to at the time.
The thing that blew me away in this was just how simple it had been. Prior to using SQL server, I had been trying to set up Oracle to do the same workload. It was a huge effort that took me back to the 1990’s, configuring kernel parameters by hand – just to stand up the DB. I’ll come back to Oracle at a later date.
My SQL Server is configured with 16 vCPU’s and 64GB of RAM, so that the SQL Server VM itself has as many resources as possible and is not the bottleneck.
I use a handful of flags on SQL Server. In SQL terminology these are known as trace flags, which are set in the SQL console (I used “DBCC TRACESTATUS” to display them). They are fairly standard and are mentioned in our best practice guide.
One thing I did change from the norm was to set the target recovery time to 240 seconds, rather than letting SQL Server determine the recovery time dynamically. I found that in the benchmarking scenario SQL Server would not do any background flushing at all, and then would suddenly checkpoint a huge amount of data, which caused the TPM to fluctuate wildly. With the recovery time hard-coded to 240 seconds, the background page flusher keeps up with the incoming workload and does not need to issue huge checkpoints. My guess is that in real (non-benchmark) conditions SQL Server waits for the incoming work to drop off and issues the checkpoint at that time. Since my benchmark never backs off, SQL Server eventually has to issue the checkpoint.
Lord Kelvin Vs the IO blender
One of the characteristics of a successful storage system for virtualized environments is that it must handle the IO blender. Put simply, when lots of regular looking workloads are virtualized and presented to the storage, their regularity is lost, and the resulting IO stream starts to look more and more random.
This is very similar to the way that synthesisers work – they take multiple regular sine waves of varying frequencies and add them together to get a much more complex sound.
http://msp.ucsd.edu/techniques/v0.11/book-html/node14.html#fig01.08
That’s all pretty awesome for making cool space noises, but not so much when presented to the storage OS. Without the ability to detect regularity, things like caching, pre-fetching and any kind of predictive algorithm break down.
That pre-fetch is never going to happen.
In Nutanix NOS we treat each of these sine waves (workloads) individually, never letting them get mixed together. NDFS knows about vmdk’s or vhdx disks – and so by keeping the regular workloads separate we can still apply all the usual techniques to keep the bits flowing, even at high loads and disparate workload mixes that cause normal storage systems to fall over in a steaming heap of cache misses and metadata chaos.
Multiple devices/jobs in fio
If your underlying filesystems/devices have different response times (e.g. some devices are cached or are on SSD, while others are on spinning disk), then the behavior of fio can be quite different depending on how the fio config file is specified. Typically there are two approaches:
1) Have a single “job” that has multiple devices
2) Make each device a “job”
With a single job, the iodepth parameter is the total iodepth for the job (not per device). If multiple jobs are used (with one device per job) then the iodepth value is per device.
Option 1 (a single job) results in [roughly] equal IO across disks regardless of response time. This is like having a volume manager or RAID device, in that the overall op rate is limited by the slowest device.
For example, notice that even though the wait/response times are quite uneven (ranging from 0.8 ms to 1.5 ms) the r/s rate is quite even. You will notice, though, that the queue size is very variable, so as to achieve similar throughput in the face of uneven response times.
To get this sort of behavior use the following fio syntax. We allow fio to use up to 128 outstanding IO’s to distribute amongst the 8 “files” or in this case “devices”. In order to maintain the maximum throughput for the overall job, the devices with slower response times have more outstanding IO’s than the devices with faster response times.
[global]
bs=8k
iodepth=128
direct=1
ioengine=libaio
randrepeat=0
group_reporting
time_based
runtime=60
filesize=6G

[job1]
rw=randread
filename=/dev/sdb:/dev/sda:/dev/sdd:/dev/sde:/dev/sdf:/dev/sdg:/dev/sdh:/dev/sdi
name=random-read
The second option gives uneven throughput, because each device is linked to a separate job and so is completely independent. The iodepth parameter applies to each device, so every device has 16 outstanding IO’s. The throughput (r/s) is directly tied to the response time of the specific device that the job is working on, so response times that are 10x faster generate throughput that is 10x higher. For simulating real workloads this is probably not what you want.
For instance when sizing workingset and cache, the disks that have better throughput may dominate the cache.
[global]
bs=8k
iodepth=16
direct=1
ioengine=libaio
randrepeat=0
group_reporting
time_based
runtime=60
filesize=2G

[job1]
rw=randread
filename=/dev/sdb
name=raw=random-read

[job2]
rw=randread
filename=/dev/sda
name=raw=random-read

[job3]
rw=randread
filename=/dev/sdd
name=raw=random-read

[job4]
rw=randread
filename=/dev/sde
name=raw=random-read

[job5]
rw=randread
filename=/dev/sdf
name=raw=random-read

[job6]
rw=randread
filename=/dev/sdg
name=raw=random-read

[job7]
rw=randread
filename=/dev/sdh
name=raw=random-read

[job8]
rw=randread
filename=/dev/sdi
name=raw=random-read
Here be Zeroes
Many storage devices/filesystems treat blocks containing nothing but zeros in a special way, often short-circuiting reads from the back-end. This is normally a good thing but this behavior can cause odd results when benchmarking. This typically comes up when testing against storage using raw devices that have been thin provisioned.
In this example, I have several disks attached to my Linux virtual machine. Some of these disks contain data, but some of them have never been written to.
When I run an fio test against the disks, we can clearly see that the response time is better for some than for others. Here’s the fio output…
and here is the output of iostat -x
The devices sdf, sdg and sdh are thin provisioned disks that have never been written to. Their read response times are much lower, even though the actual storage is identical.
There are a few ways to detect that the data being read is all zeros.
Firstly, use a simple tool like unix “od” or “hd” to dump out a small section of the disk device and see what it contains. In the example below I just take the first 1000 bytes and check to see if there is any data.
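Something along these lines does the job – dd pulls the first 1000 bytes off the device and od prints them, so a disk full of zeros shows up immediately (the device name is whatever you want to check):

dd if=/dev/sdf bs=1000 count=1 2>/dev/null | od -c | head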
Secondly, see if your storage/filesystem has a way to show that it read zeros from the device. NDFS has a couple of ways of doing that. The easiest is to look at the 2009:/latency page and look for the stage “FoundZeroes”.
If your storage is returning zeros and so making your benchmarking problematic, you will need to get some data onto the disks! Normally I just do a large sequential write with whatever benchmarking tool that I am using. Both IOmeter and fio will write “junk” to disks when writing.
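As a sketch, an fio job along these lines will lay down non-zero data across the whole device before you start testing – the device name and block size are placeholders, and refill_buffers simply makes sure every write gets a fresh, non-zero buffer:

[global]
bs=1m
rw=write
direct=1
ioengine=libaio
iodepth=8
refill_buffers

[fill]
filename=/dev/sdf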
Designing a scaleout storage platform.
I was speaking to one of our developers the other day, and he pointed me to the following paper: SEDA: An Architecture for Well-Conditioned, Scalable Internet Services as an example of the general philosophy behind the design of the Nutanix Distributed File System (NDFS).
Although the paper uses examples of both a webserver and a gnutella client, the philosophies are relevant to a large-scale distributed filesystem. In the case of NDFS we are serving disk blocks to clients which happen to be virtual machines. One trade-off that is true in both cases is that some single-stream, lightly-loaded latency is traded away in exchange for scalability. However at load, the response time is generally better than that of a system designed for low latency first and then scaled up in an attempt to achieve high throughput.
At Nutanix we often talk about web-scale architectures, and this paper gives a pretty solid idea of what that might mean in concrete terms.
FWIW, according to Google Scholar the paper has been cited 937 times, including by the Cassandra paper – and Cassandra is how we store filesystem meta-data in a distributed fashion.