
Scale factor to workingset size lookup for tiny databases
A series of videos showing how to install, run, modify and analyze HCI clusters with the Nutanix X-ray tool
How to identify Optane drives in Linux using lspci.
Use the following SQL to drop the tables and indexes in the HammerDB TPC-H schema, so that you can re-load it.
Tips and tricks for using diskspd, especially useful for those familiar with tools like fio
How to ensure performance testing with diskspd is stressing the underlying storage devices, not the OS filesystem.
How to install and set up diskspd before starting your first performance tests, and how to avoid wrong results due to null-byte issues.
Many storage performance testers are familiar with vdbench, and wish to use it to test Hyper-Converged (HCI) performance. To accurately performance test HCI you need to deploy workloads on all HCI nodes. However, deploying multiple VMs and coordinating vdbench can be tricky, so with X-ray we provide an easy way to run vdbench at scale. Here’s how to do it.

To achieve the maximum throughput on a storage device, we usually use a large IO size to maximize the amount of data transferred per IO request. The idea is to make the ratio of data transferred to IO requests as large as possible, reducing the per-request CPU overhead so we can get as close to the device bandwidth as possible. To take advantage of pre-fetching, and to reduce the need for head movement on rotational devices, a sequential pattern is used.
For historical reasons, many storage testers will use a 1MB IO size for sequential testing. A typical fio command line might look something like this.
fio --name=read --bs=1m --direct=1 --filename=/dev/sda
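For a fuller sequential-read test, a hedged sketch along the same lines (the device path /dev/sda, libaio engine, and queue depth are assumptions; adjust for your system):

```shell
# Sequential read, large IOs, modest queue depth to approach device bandwidth.
# WARNING: double-check --filename; a write test against the wrong device destroys data.
fio --name=seqread --rw=read --bs=1m --iodepth=8 --ioengine=libaio \
    --direct=1 --runtime=60 --time_based --filename=/dev/sda
```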

The real-world achievable SSD performance will vary depending on factors like IO size, queue depth and even CPU clock speed. It’s useful to know what the SSD is capable of delivering in the actual environment in which it’s used. I always start by looking at the performance claimed by the manufacturer. I use these figures to bound what is achievable. In other words, treat the manufacturer specs as “this device will go no faster than…”.
Start by identifying the exact SSD type by using lsscsi. Note that the disks we are going to test are connected via the ATA transport type, so the maximum queue depth that each device will support is 32.
# lsscsi
[1:0:0:0] cd/dvd QEMU QEMU DVD-ROM 2.5+ /dev/sr0
[2:0:0:0] disk ATA SAMSUNG MZ7LM1T9 404Q /dev/sda
[2:0:1:0] disk ATA SAMSUNG MZ7LM1T9 404Q /dev/sdb
[2:0:2:0] disk ATA SAMSUNG MZ7LM1T9 404Q /dev/sdc
[2:0:3:0] disk ATA SAMSUNG MZ7LM1T9 404Q /dev/
The marketing name for these Samsung SSDs is "SSD 850 EVO 2.5″ SATA III 1TB".
The spec sheet for this SSD claims the following performance characteristics.
| Workload (Max) | Spec | Measured |
|---|---|---|
| Sequential Read (QD=8) | 540 MB/s | 534 MB/s |
| Sequential Write (QD=8) | 520 MB/s | 515 MB/s |
| Read IOPS 4KB (QD=32) | 98,000 | 80,000 |
| Write IOPS 4KB (QD=32) | 90,000 | 67,000 |
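As a quick sanity check on the gap, the measured-to-spec ratio can be computed in the shell (figures as above; the 82% result is just this arithmetic, not a new measurement):

```shell
# Read IOPS: measured 80,000 vs spec 98,000 -> roughly 82% of the claimed max
spec=98000
measured=80000
pct=$(awk -v m="$measured" -v s="$spec" 'BEGIN { printf "%.0f", m / s * 100 }')
echo "Measured 4KB read IOPS is ${pct}% of spec"
```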
How to install Prometheus on OS-X
$ cd /Users/gary.little/Downloads/prometheus-2.16.0-rc.0.darwin-amd64
$ ./prometheus

Prometheus itself does not do much apart from monitoring itself. To do anything useful we have to add a scraper/exporter module. The easiest thing to do is add the exporter that monitors OS-X itself; as on Linux, the OS exporter is simply called “node exporter”.
Start by downloading the pre-compiled darwin node exporter from prometheus.io
$ cd /Users/gary.little/Downloads/node_exporter-0.18.1.darwin-amd64
$ ./node_exporter
INFO[0000] Starting node_exporter (version=0.18.1, branch=HEAD, revision=3db77732e925c08f675d7404a8c46466b2ece83e) source="node_exporter.go:156"
INFO[0000] Build context (go=go1.11.10, user=root@4a30727bb68c, date=20190604-16:47:36) source="node_exporter.go:157"
INFO[0000] Enabled collectors: source="node_exporter.go:97"
INFO[0000] - boottime source="node_exporter.go:104"
INFO[0000] - cpu source="node_exporter.go:104"
INFO[0000] - diskstats source="node_exporter.go:104"
INFO[0000] - filesystem source="node_exporter.go:104"
INFO[0000] - loadavg source="node_exporter.go:104"
INFO[0000] - meminfo source="node_exporter.go:104"
INFO[0000] - netdev source="node_exporter.go:104"
INFO[0000] - textfile source="node_exporter.go:104"
INFO[0000] - time source="node_exporter.go:104"
INFO[0000] Listening on :9100 source="node_exporter.go:170"
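With node_exporter listening on :9100, Prometheus needs a scrape job pointing at it. A minimal fragment to add under the `scrape_configs:` section of prometheus.yml (the job name and target address are assumptions; node_exporter's default port is 9100):

```yaml
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
```

After restarting Prometheus, the node_* metrics should appear in the expression browser.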
Some versions of HammerDB (e.g. 3.2) may induce imbalanced NUMA utilization with SQL Server.
This can easily be observed with Resource Monitor. When NUMA imbalance occurs, one of the NUMA nodes will show much larger utilization than the other.
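Resource Monitor aside, a hedged way to spot the skew from inside SQL Server is to count workers per NUMA node via the scheduler DMV (the sqlcmd invocation and instance name are assumptions for illustration):

```shell
# Heavy skew of workers toward one parent_node_id suggests NUMA imbalance
sqlcmd -S localhost -E -Q "SELECT parent_node_id, SUM(current_workers_count) AS workers
  FROM sys.dm_os_schedulers WHERE status = 'VISIBLE ONLINE' GROUP BY parent_node_id"
```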

The cause and fix are well documented on this blog. In short, HammerDB issues a short-lived connection for every persistent connection. This causes SQL Server's round-robin allocation to send all the persistent worker threads to a single NUMA node! To resolve this issue, simply comment out line #212 in the driver script.

If successful you will immediately see that the NUMA nodes are more balanced. Whether this results in more/better performance will depend on exactly where the bottleneck is.

How to avoid bottlenecks in the client generator when measuring database performance with HammerDB
An X-ray workload for measuring application density
The Vertica vioperf tool is used to determine whether the storage you are planning on using is fast enough to feed the Vertica database. When I initially ran the tool, the IO performance reported by the tool, and confirmed by iostat, was much lower than I expected for the storage device (a 6Gbit SATA device capable of around 500MB/s read and write).
The vioperf tool runs on a Linux host or VM and can be pointed at any filesystem, just like fio or vdbench.
Simple execution of vioperf writing to the location /vertica
vioperf --thread-count=8 --duration=120s /vertica
Unlike traditional IO generators, vioperf does not allow you to specify the working-set size. The amount of data written is simply 1MB × achieved IO rate × runtime. So fast storage with long run-times will need a lot of capacity, otherwise the tool simply fills the partition and crashes!
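Since the tool fills whatever it is given, it is worth estimating the capacity needed up front. A quick sketch with assumed figures (a device sustaining 500 MB/s over a 120 s run):

```shell
rate_mb_s=500      # assumed sustained write rate
runtime_s=120      # the --duration value
needed_mb=$((rate_mb_s * runtime_s))
echo "Capacity needed: ${needed_mb} MB (~$((needed_mb / 1024)) GB)"
```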
The primary metric is MB/s per core. The idea is that you give one thread per core in the system, though there is nothing stopping you from using whatever --thread-count value you like.
Although the measure is throughput, the primary metric (throughput/core) does not improve just by adding concurrency. Concurrency is generated purely by the number of threads, and since the measure of goodness is throughput per core (or per thread), it is not possible to create throughput from concurrency alone.
Compared to fio, the reported throughput is lower for the same device and same degree of concurrency. Vertica continually writes and extends the files, so there is some filesystem work going on, whereas fio is typically overwriting an existing file. If you observe iostat during the vioperf run you will see that the IO size to disk is different from what an fio run will generate. Again, this is because vioperf is continually extending the file(s) being written and so needs to update filesystem metadata quite frequently. These small metadata updates skew the average IO size lower.
Notice the avgrq-sz is 1024 sectors (512KB), which is the maximum transfer size that this drive supports.
fio --filename=/samsung/vertica/file --size=5g --bs=1m --ioengine=libaio --iodepth=1 --rw=write --direct=1 --name=samsung --create_on_open=0
avg-cpu: %user %nice %system %iowait %steal %idle
4.16 0.00 3.40 0.00 0.00 92.43
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 920.00 0.00 471040.00 1024.00 1.40 1.53 0.00 1.53 1.02 93.80
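The avgrq-sz conversion above can be checked with shell arithmetic (iostat reports the value in 512-byte sectors):

```shell
avgrq_sectors=1024
avg_kb=$((avgrq_sectors * 512 / 1024))   # sectors -> bytes -> KB
echo "Average request size: ${avg_kb} KB"
```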
First, we see that iostat reports much lower disk throughput than we achieved with fio for the same offered workload (1MB IO size with 1 outstanding IO, i.e. 1 thread).
Also notice that although vioperf issues 1MB IOs (which we can see from strace), iostat does not report the same 1024-sector transfers that we see when we run iostat during an fio run (as above).
In the vioperf case, the small metadata writes needed to continually extend the file cause a smaller average IO size than overwriting an existing file. Perhaps that is the cause of the lower performance?
./vioperf --duration=300s --thread-count=1 /samsung/vertica
avg-cpu: %user %nice %system %iowait %steal %idle
8.77 0.13 2.38 5.26 0.00 83.46
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 627.00 0.00 223232.00 712.06 1.02 1.63 0.00 1.63 0.69 43.20
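Dividing the reported write bandwidth by the write rate gives the average write size, confirming it is well under the 512KB we saw with fio (figures from the iostat output above):

```shell
# 223232 kB/s over 627 writes/s -> roughly 356 kB per write on average
avg_kb=$(awk 'BEGIN { printf "%.0f", 223232 / 627 }')
echo "Average write size: ${avg_kb} kB"
```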
strace -f ./vioperf --duration=300s --thread-count=1 --disable-crc /samsung/vertica
...
[pid 1350] write(6, "v\230\242Q\357\250|\212\256+}\224\270\256\273\\\366k\210\320\\\330z[\26[\6&\351W%D"..., 1048576) = 1048576
[pid 1350] write(6, "B\2\224\36\250\"\346\241\0\241\361\220\242,\207\231.\244\330\3453\206'\320$Y7\327|5\204b"..., 1048576) = 1048576
[pid 1350] write(6, "\346r\341{u\37N\254.\325M'\255?\302Q?T_X\230Q\301\311\5\236\242\33\1)4'"..., 1048576) = 1048576
[pid 1350] write(6, "\5\314\335\264\364L\254x\27\346\3251\236\312\2075d\16\300\245>\256mU\343\346\373\17'\232\250n"..., 1048576) = 1048576
[pid 1350] write(6, "\272NKs\360\243\332@/\333\276\2648\255\v\243\332\235\275&\261\37\371\302<\275\266\331\357\203|\6"..., 1048576) = 1048576
[pid 1350] write(6, "v\230\242Q\357\250|\212\256+}\224\270\256\273\\\366k\210\320\\\330z[\26[\6&\351W%D"..., 1048576) = 1048576
...
However, look closely and you will notice that the %user is higher than fio for a lower IO rate AND the disk is not 100% busy. That seems odd.
Finally, we disable the CRC checking (which vioperf does by default) to get a higher throughput, more similar to what we see with fio.
It turns out that the lower performance was not due to the smaller IO sizes (and additional filesystem work) but was caused by the CRC checking that the tool does to simulate the Vertica application.
./vioperf --duration=300s --thread-count=1 --disable-crc /samsung/vertica
avg-cpu: %user %nice %system %iowait %steal %idle
8.77 0.13 2.38 5.26 0.00 83.46
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 0.00 0.00 627.00 0.00 223232.00 712.06 1.02 1.63 0.00 1.63 0.69 43.20
Do database workloads benefit from data locality?
TL;DR – Some modern Linux distributions use a newer method of machine identification (/etc/machine-id) which, when combined with DHCP, can result in duplicate IP addresses when cloning VMs, even when the VMs have unique MAC addresses.
To resolve, do the following ( remove file, run the systemd-machine-id-setup command, reboot):
# rm /etc/machine-id
# systemd-machine-id-setup
# reboot
When hypervisor management tools make clones of virtual machines, the tools usually make sure to create a unique MAC address for every clone. Combined with DHCP, this is normally enough to boot the clones and have them receive a unique IP. Recently, when I cloned several Bitnami guest VMs which are based on Debian, I started to get duplicate IP addresses on the clones. The issue can be resolved manually by following the above procedure.
To create a VM template to clone from, which will generate a new machine-id for every clone, simply create an empty /etc/machine-id file (do not rm the file, otherwise the machine-id will not be regenerated):
# echo "" | tee /etc/machine-id
The machine-id man page is a well written explanation of the implementation and motivation.
With Ubuntu 24, this method did not work – I had to resort to using dhclient:
apt install isc-dhcp-client   # install the ISC DHCP client
dhclient -r                   # release the current lease
dhclient ens3                 # request a fresh lease on interface ens3