Recently I wanted to measure the impact of NUMA placement and Hugepages on the performance of postgres running in a VM on a Nutanix node. To do this I needed to drive postgres to do real transactions but have very little jitter/noise from the filesystem and storage. After reading a lot of blogs I came up with a process and set of postgres.conf
tuneables that allowed me to run the HammerDB TPROC-C workload (TPC-C like) with very low run-to-run variation of around 0.3% (standard deviation/mean).
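The variation figure is simply the standard deviation divided by the mean (coefficient of variation) of the NOPM reported by each run. A minimal sketch of the calculation, assuming the per-run NOPM values have been collected one per line into a file (nopm.txt is a hypothetical name):
# Compute mean, standard deviation and stddev/mean from one NOPM value per line
awk '{ sum += $1; sumsq += $1 * $1; n++ }
     END {
       mean = sum / n
       sd = sqrt(sumsq / n - mean * mean)
       printf "mean=%.0f stddev=%.0f variation=%.2f%%\n", mean, sd, 100 * sd / mean
     }' nopm.txt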
The tunings are not meant to represent best practices, and running repeatedly without manually vacuuming or doing a restore will create problems because I am disabling autovacuum (see this discussion with HammerDB author Steve Shaw here and here).
Results
I have put the benchmark results below – but the main point of this post is to discuss the method which allows me to generate very repeatable postgres benchmark results where I can drive the CPU/Memory to be the limiting bottleneck. The screenshot below shows 5 runs back-to-back. From top to bottom the output shows
- SQL commits per minute
- Database VM CPU usage per core
- Memory bandwidth (from Intel PCM running on the AHV hypervisor host)
- Database VM IO rates
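For reference, the memory bandwidth and IPC numbers come from Intel PCM sampled on the AHV host. The exact binary names and options depend on the PCM version installed, but the invocation is along these lines:
# On the AHV host (assumes Intel PCM is installed; binary names vary by version)
pcm 1          # per-socket IPC, sampled every second
pcm-memory 1   # per-socket memory bandwidth, sampled every second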
Benchmark results
Using the method below I was able to measure the impact of NUMA and Hugepages with some confidence. IPC in this table is "Instructions per Cycle" (higher is better) and indicates how much CPU time is spent waiting on memory vs. executing instructions: the higher the number, the less time (relatively) is spent waiting on memory accesses.
NUMA   | HUGEPAGE | NOPM   | NOPM %  | IPC  | IPC %
-------+----------+--------+---------+------+---------
Local  | ON       | 800831 | 100.00% | 0.75 | 100.00%
Remote | ON       | 764983 | 95.52%  | 0.68 | 90.67%
Local  | OFF      | 771149 | 96.29%  | 0.70 | 93.33%
Remote | OFF      | 716298 | 89.44%  | 0.62 | 82.67%
End to end walk-through
We’re going to create a tiny database (32GB initially) on a VM with plenty of memory (I used 74GB for my Linux VM). The ideal is to have enough memory to fit the database working set into memory twice: once in the Linux page cache and once in the postgres buffer cache. I found this worked well since the postgres buffer cache is populated from the Linux page cache; it is not read directly into the postgres cache in the way that Oracle does it.
To additionally create a 32GB postgres buffer cache, set shared_buffers = 32768MB in postgres.conf.
I will use a 2 minute warmup and 5 minute measured benchmark run. I am using HammerDB 4.11 on a separate VM to drive the workload/send the transactions to the database server.
List of databases
  Name   |  Owner   | Encoding | Size  | Tablespace
---------+----------+----------+-------+------------
 tpcc300 | postgres | UTF8     | 38 GB | pgxfs
1 Enlarge the WAL filesystem / WAL limits and disable checkpoints
With a small database the overwrite/write rate can be relatively high. For a 300WH DB with 100 vUsers (not using all-warehouses) the log write rate was about 95MB/s, so during a single run we will generate (95MB/s * 60s * (2 minute warmup + 5 minute run)) a minimum of ~40GB of database log writes. By sizing the WAL to hold at least 40GB you will avoid checkpoints during the run (a quick sanity check of the arithmetic is sketched after the list below).
- Size the WAL filesystem / partition to be larger than (log write rate * duration)
- Size the postgres postgresql.conf value max_wal_size to be roughly the same size as the WAL filesystem
- Enable the logging of checkpoints so that if any do occur you will see them in the postgres log
- Set the checkpoint timeout to be longer than the benchmark run, and ensure that checkpoints don't start too soon before the timeout/fullness limits (checkpoint_completion_target = 0.9)
- Issue a checkpoint; before doing the benchmark run
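As a quick sanity check of the sizing arithmetic, here is a rough sketch (the 95MB/s rate is what I observed for this particular 300WH/100vUser configuration; substitute your own observed log write rate):
# Rough WAL sizing: observed log write rate * (warmup + measured run) in seconds
WRITE_RATE_MB=95                            # observed log write rate for this workload
DURATION_SEC=$(( (2 + 5) * 60 ))            # 2 minute warmup + 5 minute measured run
WAL_MB=$(( WRITE_RATE_MB * DURATION_SEC ))
echo "Expect at least ${WAL_MB}MB (~40GB) of WAL per run"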
WAL filesystem
Filesystem Size Used Avail Use% Mounted on
/dev/sdd1 256G 49G 208G 20% /pgwal
postgres settings
max_wal_size = 250GB
min_wal_size = 1GB
log_checkpoints = on
checkpoint_timeout = 60min
checkpoint_completion_target = 0.9
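With log_checkpoints = on, any checkpoint that sneaks into the run window shows up in the server log, so it is easy to verify after each run. For example (the log path here is an assumption; use whatever your log_directory points at):
# Check whether any checkpoints started during the run
# /pgdata/log/postgresql.log is a hypothetical path
grep "checkpoint starting" /pgdata/log/postgresql.log | tail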
2 Disable autovacuum and manually vacuum / analyze before the run
Vacuuming is a critical part of running a postgres DB, but for low-variation benchmark results we want to stop vacuuming from occurring during the run. To do this, disable autovacuum and instead issue a vacuum;
command before starting the run. Also update the planner/optimizer statistics by running analyze.
I do all this with a script (shown below).
postgres settings
Turn autovacuum off and enable logging in case something weird happens – that way you will see messages in the postgres log.
autovacuum = off
log_autovacuum_min_duration = 0
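With autovacuum off it is worth confirming, just before each run, that the manual vacuum and analyze actually completed. One way to check, assuming psql can reach the database:
# Show when each table was last vacuumed / analyzed (manual or auto)
psql -d tpcc300 -c "SELECT relname, last_vacuum, last_autovacuum, last_analyze
                    FROM pg_stat_user_tables ORDER BY relname;"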
3 Eliminate background writes
Even without the checkpoint process running, postgres will try to avoid having too many dirty buffers in memory. The issue here is that when this flush/sync operation occurs there is quite a lot of CPU and disk activity (both in postgres and then in the Linux kernel), which again causes a dip in performance that we want to avoid. If the database fits entirely within the postgres cache this setting is less critical.
bgwriter_lru_maxpages = 0 # max buffers written/round, 0 disables
bgwriter_flush_after = 0 # measured in pages, 0 disables
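You can confirm the background writer really is idle during a run by sampling pg_stat_bgwriter before and after; buffers_clean should not move when bgwriter_lru_maxpages = 0 (the exact column layout varies a little across postgres versions):
# buffers_clean should stay flat across the run if the bgwriter is disabled
psql -d tpcc300 -c "SELECT buffers_clean, maxwritten_clean, buffers_alloc
                    FROM pg_stat_bgwriter;"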
4 Allow async disk writes
In my testing I am trying to make the DB CPU bound, so I am OK with doing the IO asynchronously. These settings trade crash safety for lower IO overhead, so you may wish to do the same or not depending on your goals.
fsync = off # flush data to disk for crash safety
synchronous_commit = off # synchronization level;
5 The rest
Here are my changes in one location with a few additional ones thrown in for free!
huge_pages = off # on, off, or try
work_mem = 128MB # min 64kB
maintenance_work_mem = 1GB # min 1MB
bgwriter_lru_maxpages = 0 # max buffers written/round, 0 disables
bgwriter_flush_after = 0 # measured in pages, 0 disables
fsync = off # flush data to disk for crash safety
synchronous_commit = off # synchronization level;
checkpoint_timeout = 60min # range 30s-1d
checkpoint_completion_target = 0.9 # checkpoint target duration, 0.0 - 1.0
max_wal_size = 250GB
min_wal_size = 1GB
random_page_cost = 1.1 # same scale as above
cpu_tuple_cost = 0.03 # same scale as above
effective_cache_size = 64GB
log_min_duration_statement = 10 # -1 is disabled, 0 logs all statements
autovacuum = off # Enable autovacuum subprocess? 'on'
autovacuum_max_workers = 10 # max number of autovacuum subprocesses
autovacuum_vacuum_cost_limit = 3000 # default vacuum cost limit for
lc_messages = 'C' # locale for system error message
log_checkpoints = on
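After editing postgresql.conf and restarting, it is worth confirming that the settings actually took effect, since some of them only apply after a full restart. A quick check via pg_settings:
# List settings read from postgresql.conf and flag any still waiting on a restart
psql -d tpcc300 -c "SELECT name, setting, pending_restart
                    FROM pg_settings
                    WHERE source = 'configuration file'
                    ORDER BY name;"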
6 Use a script
I set up my benchmark with a bash script which also restores a fresh DB. Then I use hammerdbcli
on a separate machine to connect to the database server and run the TPC-C like workload.
Database setup script
#!/bin/bash
DB=tpcc300
TABLESPACE=pgxfs
DB_BACKUP_FILE=/xfs-1-disk/backups/tpcc-300wh-parallel-16
PSQL=/usr/bin/psql
echo "Dropping Database $DB"
time /usr/bin/dropdb $DB
sleep 1
echo "Creating database $DB on tablespace $TABLESPACE"
time /usr/bin/createdb --tablespace $TABLESPACE $DB
sleep 1
echo "Restoring database $DB from $DB_BACKUP_FILE"
echo time /usr/bin/pg_restore -d $DB -j 16 $DB_BACKUP_FILE
time /usr/bin/pg_restore -d $DB -j 16 $DB_BACKUP_FILE
sleep 1
echo "Running vaccum on $DB"
echo time $PSQL -c vacuum -d $DB
time $PSQL -c vacuum -d $DB
echo "Running analyze on $DB"
echo time $PSQL -c analyze -d $DB
time $PSQL -c analyze -d $DB
echo "Running checkpoint on $DB"
echo time $PSQL -c checkpoint -d $DB
time $PSQL -c checkpoint -d $DB
echo "DB $DB Ready to go"
HammerDB client script
Call this from hammerdbcli
using source <scriptname>
dbset db pg
dbset bm TPC-C
diset connection pg_host linuxtest
diset connection port 5432
#diset tpcc pg_partition false
#diset tpcc pg_superuser postgres
#diset tpcc pg_superuserpass postgres
#diset tpcc pg_defaultdbase postgres
diset tpcc pg_user tpcc
diset tpcc pg_pass tpcc
diset tpcc pg_dbase tpcc300
diset tpcc pg_allwarehouse false
diset tpcc pg_driver timed
diset tpcc pg_rampup 2
diset tpcc pg_duration 5
diset tpcc pg_storedprocs false
diset tpcc pg_count_ware 300
diset tpcc pg_num_vu 100
diset tpcc pg_vacuum false
print dict
loadscript
vudestroy
vuset vu 100
vuset delay 0
vuset repeat 0
vuset showoutput 0
vuset logtotemp 1
vucreate
vurun
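I normally source this interactively, but hammerdbcli can also run a script unattended, which is handy for scripting back-to-back runs (pg_run.tcl is a hypothetical filename; depending on your HammerDB version you may need a waittocomplete after vurun so the CLI waits for the timed run to finish):
# Run the workload script unattended from the load-generator VM
./hammerdbcli auto pg_run.tcl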
Blogs/references
Here are some of the many blogs I read to get to this point:
- Tips for HammerDB
- DBA Kevlar (tuning for Silk storage) (Tuning synchronous_commit and max_wal_senders) try sync commit == off
- Google Omni Alloy/Postgres DB benchmarking guide
- GigaOm Postgres Price/Performance testing
- Citus/Postgres Benchmarking (Video)