Cache behavior – How long will it take to fill my cache?

When benchmarking filesystems or storage, we need to understand the caching effects. Most often this involves filling the cache and reaching steady state. But how long will it take to fill a cache of a given size? The answer depends of course on the size of the cache, the IO size and the IO rate. So, to simpify let’s just say that a cache consists of some number of entries. For instance a 4GB cache would have 1 million 4KB entries. In my example this is simply a 1M entry cache.

In terms of time to fill the cache, it’s simpler to think about how many entries will need to be read before the cache is filled.

For a random workload, it will be more than 1M “reads”. Let’s see why.

The first read will be inserted into the cache, the second read will probably be inserted into the cache, but there is a small (1/1000000) chance that the second read will actually be already in the cache since it’s random. As the cache gets fuller – the chances of a given read already being present in cache increases. As a result it will take a lot more than 1 million reads to populate the entire cache with a random read workload.

The question is this. Is is possible to predict, how many “reads” it will take to fill the cahe?

The experiment.

In this experiment, we create an array to represent the cache. It has 1M entries. Then using a random number generator, simulate the workload and measure how long it takes to populate the cahche.

Results

After 1,000,000 “reads” there are 633,000 positive entries (entries that have data in them). So what happened to the other 367,000? The 367,000 represent cache “hits” on an existing entry. Since the read “workload” is 100% random, there is some chance that a subsequent read will be for an entry that is already cached. Over the life of 1,000,000 reads around 37% are for an entry that is already cached.

After 2,0000,000 reads the cache contains 864,000 entries. Another 1,000,000 reads yields 950,000.

The fuller that the cache becomes, the fewer new entries are added. Intuitively this makes sense because as the cache becomes more full, more of the “random reads” are satisfied by an existing cache entry.

In my experiments it takes about 17,000,000 “reads” to ensure that every cache entry is filled in a 1M entry cache. Here are the data for 19 runs.

Cachefullness

Iteration Positive Entries Empty Slots 1 631998 368002 2 864334 135666 3 950184 49816 4 981630 18370 5 993266 6734 6 997577 2423 7 999080 920 8 999660 340 9 999879 121 10 999951 49 11 999985 15 12 999996 4 13 999998 2 14 999998 2 15 999999 1 16 999999 1 17 1000000 0 18 1000000 0
  • For 500,000 Entries it takes 15 iterations to fill all the entries.
  • For 2,000,000 Entries it takes 19 iterations to fill all the entries.

Interestingly, the ratio of positive to empty entries after one iteration is always about 0.632:0.368

  • 0.368 is roughly 1/e
  • .632 is roughly 1-(1/e).

Things to know when using vdbench.

Recently I found that vdbench was not giving me the amount of outstanding IO that I had intended to configure by using the “threads=N” parameter. It turned out that with Linux, most of the filesystems (ext2, ext3 and ext4) do not support concurrent directIO, although they do support directIO. This was a bit of a shock coming from Solaris which had concurrent directIO since 2001.

All the Linux filesystems I tested allow multiple outstanding IO’s if the IO is submitted using asynchronous IO (AKA asyncIO or AIO) but not when using multiple writer threads (except XFS). Unfortunately vdbench does not allow AIO since it tries to be platform agnostic.

fio however does allow either threads or AIO to be used and so that’s what I used in the experiments below.

The column fio QD is the amount of outstanding IO, or Queue Depth that is intended to be passed to the storage device. The column iostat QD is the actual Queue Depth seen by the device. The iostat QD is not “8” because the response time is so low that fio cannot issue the IO’s quickly enough to maintain the intended queue depth.

Device
fio QD
fio QD Type
direct
iostat QD
 ps -efT | grep fio | wc -l
/dev/sd
8
libaio
Yes
7
5
/dev/sd
8
Threads
Yes
7
12
ext2 fs (mke2fs)
8
Threads
Yes
1
12
ext2 fs (mke2fs)
8
libaio
Yes
7
5
ext3 (mkfs -t ext3)
8
Threads
Yes
1
12
ext3 (mkfs -t ext3)
8
libaio
Yes
7
5
ext4 (mkfs -t ext4)
8
Threads
Yes
1
12
ext4 (mkfs -t ext4)
8
libaio
Yes
7
5
xfs (mkfs -t xfs)
8
Threads
Yes
7
12
xfs (mkfs -t xfs)
8
libaio
Yes
7
5

At any rate, all is not lost – using raw devices (/dev/sdX) will give concurrent directIO, as will XFS. These issues are well known by Linux DB guys, and I found interesting articles from Percona and Kevin Closson after I finally figured out what was going on with vdbench.

fio “scripts”

For the “threads” case.

[global]
bs=8k
ioengine=sync
iodepth=8
direct=1
time_based
runtime=60
numjobs=8
size=1800m

[randwrite-threads]
rw=randwrite
filename=/a/file1

For the “aio” case

[global]
bs=8k
ioengine=libaio
iodepth=8
direct=1
time_based
runtime=60
size=1800m


[randwrite-aio]
rw=randwrite
filename=/a/file1

Impact of Paravirtual SCSI driver VS LSI Emulation with Data.

TL;DR  Comparison of Paravirtual SCSI Vs Emulated SCSI in with measurements.  PVSCSI gives measurably better response times at high load. 

During a performance debugging session, I noticed that the response time on two of the SCSI devices was much higher than the others (Linux host under vmware ESX).  The difference was unexpected since all the devices were part of the same stripe doing a uniform synthetic workload.

iostat output from the system under investigation.

 

The immediate observation is that queue length is higher, as is wait time.  All these devices reside on the same back-end storage so I am looking for something else.  When I traced back the devices it turned out that the “slow devices” were attached to LSI emulated controllers in ESX.  Whereas the “fast devices” are attached to para-virtual controllers.

I was surprised to see how much difference using para virtual (PV) SCSI drivers made to the guest response time once IOPS started to ramp up.  In these plots the y-axis is iostat “await” time.  The x-axis is time (each point is a 3 second average).

PVSCSI = Gey Dots
LSI Emulated SCSI = Red Dots
Lower is better.

 

Each plot is from a workload which  uses a different offered IO rate.  The offered rates are   8000,9000 and 10,000 the storage is able to meet the rates even though latency increases because there is a lot of outstanding IO.  The workload is mixed read/write with bursts.

8K IOPS9K IOPS10K IOPS

 

After converting sdh and sdi to PV SCSI the response time is again uniform across all devices.

10K IOPS PV

Multiple devices/jobs in fio

If your underlying filesystem/devices have different response times (e.g. some devices are cached – or are on SSD) and others are on spinning disk, then the behavior of fio can be quite different depending on how the fio config file is specified.  Typically there are two approaches

1) Have a single “job” that has multiple devices

2) Make each device a “job”

With a single job, the iodepth parameter will be the total iodepth for the job (not per device) .  If multiple jobs are used (with one device per job) then the iodepth value is per device.

Option 1 (a single job) results in [roughly] equal IO across disks regardless of response time.  This is like  having a volume manager or RAID device, in that the overall oprate is limited by the slowest device.

For example, notice that even though the wait/response times are quite uneven (ranging from 0.8 ms to 1.5ms) the r/s rate is quite even.  You will notice though the that queue size is very variable so as to achieve similar throughput in the face of uneven response times.

Screen Shot 2014-06-17 at 5.34.55 PM

To get this sort of behavior use the following fio syntax. We allow fio to use up to 128 outstanding IO’s to distribute amongst the 8 “files” or in this case “devices”. In order to maintain the maximum throughput for the overall job, the devices with slower response times have more outstanding IO’s than the devices with faster response times.

[global]
bs=8k
iodepth=128
direct=1
ioengine=libaio
randrepeat=0
group_reporting
time_based
runtime=60
filesize=6G

[job1]
rw=randread
filename=/dev/sdb:/dev/sda:/dev/sdd:/dev/sde:/dev/sdf:/dev/sdg:/dev/sdh:/dev/sdi
name=random-read

The second option, give an uneven throughput because each device is linked to a separate job, and so is completely independent.  The  iodepth parameter is specific to each device, so every device has 16 outstanding IO’s.  The throughput (r/s) is directly tied to the response time of the specific device that it is working on.  So response times that are 10x faster generate throughput that is 10x faster.  For simulating real workloads this is probably not what you want.

For instance when sizing workingset and cache, the disks that have better throughput may dominate the cache.

Screen Shot 2014-06-17 at 5.32.52 PM

[global]
bs=8k
iodepth=16
direct=1
ioengine=libaio
randrepeat=0
group_reporting
time_based
runtime=60
filesize=2G

[job1]
rw=randread
filename=/dev/sdb
name=raw=random-read
[job2]
rw=randread
filename=/dev/sda
name=raw=random-read
[job3]
rw=randread
filename=/dev/sdd
name=raw=random-read
[job4]
rw=randread
filename=/dev/sde
name=raw=random-read
[job5]
rw=randread
filename=/dev/sdf
name=raw=random-read
[job6]
rw=randread
filename=/dev/sdg
name=raw=random-read
[job7]
rw=randread
filename=/dev/sdh
name=raw=random-read
[job8]
rw=randread
filename=/dev/sdi
name=raw=random-read

Here be Zeroes

zero-cup

Many storage devices/filesystems treat blocks containing nothing but zeros in a special way, often short-circuiting reads from the back-end.  This is normally a good thing but this behavior can cause odd results when benchmarking.  This typically comes up when testing against storage using raw devices that have been thin provisioned.

In this example, I have several disks attached to my linux virtual machine.  Some of these disks contain data, but some of them have never been written to.

When I run an fio test against the disks, we can clearly see that the response time is better for some than for others.  Here’s the fio output…

fio believes that it is issuing the same workload to all disks.
fio believes that it is issuing the same workload to all disks.

and here is the output of iostat -x

Screen Shot 2014-06-12 at 4.29.23 PM

The devices sdf, sdg and sdh are thin provisioned disks that have never been written to. The read response times are be much lower.  Even though the actual storage is identical.

There are a few ways to detect that the data being read is all zero’s.

Firstly use a simple tool like unix “od” or “hd”  to dump out a small section of the disk device and see what it contains.  In the example below I just take the first 1000 bytes and check to see if there is any data.

Screen Shot 2014-06-12 at 4.58.15 PM

Secondly, see if your storage/filesystem has a way to show that it read zeros from the device.  NDFS has a couple of ways of doing that.  The easiest is to look at the 2009:/latency page and look for the stage “FoundZeroes”.

 

If your storage is returning zeros and so making your benchmarking problematic, you will need to get some data onto the disks!  Normally I just do a large sequential write with whatever benchmarking tool that I am using.  Both IOmeter and fio will write “junk” to disks when writing.